Changes from 1 commit
Commits
45 commits
4abfcaa
Add native Iceberg storage support using PyIceberg and DuckDB
tommy-ca Jan 13, 2026
0093113
feat(offline-store): Complete Iceberg offline store Phase 2 implement…
tommy-ca Jan 14, 2026
b9659ad
feat(online-store): Complete Iceberg online store Phase 3 implementation
tommy-ca Jan 14, 2026
7042b0d
docs: Complete Iceberg documentation Phase 4
tommy-ca Jan 14, 2026
8ce4bd8
fix: Phase 5.1 - Fix offline/online store bugs from code audit
tommy-ca Jan 14, 2026
d54624a
feat: Phase 5.2-5.4 - Complete Iceberg integration tests, examples, a…
tommy-ca Jan 14, 2026
2c35063
docs: Update plan.md with Phase 5 completion and Phase 6 roadmap
tommy-ca Jan 14, 2026
d804d79
docs: Update design specs with final statistics and create implementa…
tommy-ca Jan 14, 2026
80b6ab3
docs: Complete Phase 6 - Final review and production readiness
tommy-ca Jan 14, 2026
eca8bc6
docs: Add comprehensive project completion summary
tommy-ca Jan 14, 2026
ed29614
docs: Add comprehensive lessons learned and project closure
tommy-ca Jan 14, 2026
6d440e9
docs: Add comprehensive documentation index and navigation guide
tommy-ca Jan 14, 2026
da09162
fix: Final robust fixes for Iceberg storage integration
tommy-ca Jan 15, 2026
69f0750
docs(specs): streamline Iceberg plan Phase 6 summary
tommy-ca Jan 15, 2026
3b8f2e2
docs(specs): update Iceberg offline store final details
tommy-ca Jan 15, 2026
850a89d
docs(specs): update Iceberg online store final details
tommy-ca Jan 15, 2026
f877d15
docs(specs): fix Iceberg quickstart config examples
tommy-ca Jan 15, 2026
a171cb9
docs(specs): remove stale Iceberg online store status section
tommy-ca Jan 15, 2026
56e51ee
docs(specs): add Iceberg production readiness hardening backlog
tommy-ca Jan 15, 2026
a1dce29
docs(reference): align Iceberg offline store examples with config
tommy-ca Jan 15, 2026
c0c5627
fix(online-store): project columns and align entity_hash partitions
tommy-ca Jan 15, 2026
363e26d
feat(offline-store): validate IcebergSource configuration
tommy-ca Jan 15, 2026
02ba04d
docs: mark Iceberg stores beta and define certified matrix
tommy-ca Jan 15, 2026
637224d
docs(specs): align Iceberg spec dependencies with implementation
tommy-ca Jan 15, 2026
0df1cb2
fix(offline-store): configure DuckDB for S3 endpoints
tommy-ca Jan 15, 2026
87f306c
examples: add Iceberg REST+MinIO certification smoke test
tommy-ca Jan 15, 2026
5496feb
docs: add Iceberg certification checklist and Make targets
tommy-ca Jan 15, 2026
0dda4fa
chore: make Iceberg smoke targets uv-native
tommy-ca Jan 15, 2026
f4ce843
docs(examples): switch Iceberg workflow to uv run
tommy-ca Jan 15, 2026
0bba23e
fix(examples): create iceberg-local data directories
tommy-ca Jan 15, 2026
3282530
chore(make): add Iceberg certification target
tommy-ca Jan 15, 2026
7a955e2
chore(examples): ignore iceberg-local output data
tommy-ca Jan 15, 2026
30e2a2b
docs(specs): update Iceberg hardening schedule
tommy-ca Jan 15, 2026
d36083a
fix(iceberg): critical security and correctness fixes for Iceberg stores
tommy-ca Jan 16, 2026
18f4539
test(iceberg): add comprehensive tests for critical bug fixes
tommy-ca Jan 16, 2026
82baff6
fix(iceberg): resolve P0 critical security issues and additional impr…
tommy-ca Jan 16, 2026
4b638b7
docs(solutions): add security solution for SQL injection and credenti…
tommy-ca Jan 16, 2026
4cc3a88
docs(planning): add rescheduled work plan for remaining P1/P2 issues
tommy-ca Jan 16, 2026
92941a0
docs(summary): add comprehensive session summary
tommy-ca Jan 16, 2026
e1ed1fa
fix(iceberg): resolve Session 1 P1 issues and add TTL validation
tommy-ca Jan 16, 2026
29f1522
docs(todos): verify and close Session 2 issues
tommy-ca Jan 17, 2026
c49ae25
docs(session): update summary with Sessions 1-2 completion
tommy-ca Jan 17, 2026
b1c148d
docs(completion): add comprehensive Sessions 1-2 completion summary
tommy-ca Jan 17, 2026
d7b1634
perf(iceberg): add catalog connection caching to online store
tommy-ca Jan 17, 2026
13e92fc
docs(session): add Session 3 completion summary
tommy-ca Jan 17, 2026
docs: Complete Iceberg documentation Phase 4
- Add comprehensive user guide: docs/reference/offline-stores/iceberg.md
- Add performance guide: docs/reference/online-stores/iceberg.md
- Add quickstart tutorial: docs/specs/iceberg_quickstart.md
- Update design specs with implementation status
- Update plan.md with Phase 4 completion

Phase 4 documentation complete. Full Iceberg storage support documented.

Documentation includes:
- Installation with UV native workflow (uv sync --extra iceberg)
- Configuration examples for REST, Glue, Hive, SQL catalogs
- Partition strategies and performance tuning guides
- Production deployment patterns (S3, GCS, Azure)
- Monitoring, troubleshooting, and best practices
- Quickstart tutorials for local and production setup

Key features:
- UV native commands throughout (never pip/pytest/python directly)
- Functionality matrices for offline and online stores
- Performance comparison tables (Iceberg vs Redis/SQLite)
- Complete configuration reference
- End-to-end workflow examples

Total documentation: 5 files, 1448+ lines
tommy-ca committed Jan 14, 2026
commit 7042b0d4930848292d182b538264b7c28549d0f9
295 changes: 295 additions & 0 deletions docs/reference/offline-stores/iceberg.md
@@ -0,0 +1,295 @@
# Iceberg offline store

## Description

The Iceberg offline store provides native support for [Apache Iceberg](https://iceberg.apache.org/) tables using [PyIceberg](https://py.iceberg.apache.org/). It offers a modern, open table format with ACID transactions, schema evolution, and time travel capabilities for feature engineering at scale.

**Key Features:**
* Native Iceberg table format support via PyIceberg
* Hybrid read strategy: Copy-on-Write (COW) and Merge-on-Read (MOR) optimization
* Point-in-time correct joins using DuckDB SQL engine
* Support for multiple catalog types (REST, Glue, Hive, SQL)
* Schema evolution and versioning
* Efficient metadata pruning for large tables
* Compatible with data lakes (S3, GCS, Azure Blob Storage)

**Read Strategy:**
* **COW Tables** (no deletes): Direct Parquet reading via DuckDB for maximum performance
* **MOR Tables** (with deletes): In-memory Arrow table loading for data correctness

Entity dataframes can be provided as a Pandas dataframe or SQL query.
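For example, an entity dataframe can be built with pandas; the commented-out retrieval call is a sketch that assumes a configured `FeatureStore` and the `driver_hourly_stats` feature view used elsewhere in this page:

```python
from datetime import datetime, timezone

import pandas as pd

# Entity dataframe: one row per (entity key, event timestamp) to look up
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            datetime(2026, 1, 10, 12, tzinfo=timezone.utc),
            datetime(2026, 1, 11, 12, tzinfo=timezone.utc),
            datetime(2026, 1, 12, 12, tzinfo=timezone.utc),
        ],
    }
)

# With a configured FeatureStore instance (illustrative, not run here):
# training_df = store.get_historical_features(
#     entity_df=entity_df,
#     features=["driver_hourly_stats:conv_rate"],
# ).to_df()
print(entity_df.shape)  # (3, 2)
```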

## Getting started

To use this offline store, install the Iceberg dependencies:

```bash
uv sync --extra iceberg
```

Or if using pip:
```bash
pip install 'feast[iceberg]'
```

This installs:
* `pyiceberg[sql,duckdb]>=0.8.0` - Native Iceberg table operations
* `duckdb>=1.0.0` - SQL engine for point-in-time joins

## Example

### Basic Configuration (REST Catalog)

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: rest
  catalog_name: feast_catalog
  uri: http://localhost:8181
  warehouse: s3://my-bucket/warehouse
  namespace: feast
online_store:
  type: sqlite
  path: data/online_store.db
```
{% endcode %}

### AWS Glue Catalog Configuration

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: glue
  catalog_name: feast_catalog
  warehouse: s3://my-bucket/warehouse
  namespace: feast
  storage_options:
    s3.region: us-west-2
    s3.access-key-id: ${AWS_ACCESS_KEY_ID}
    s3.secret-access-key: ${AWS_SECRET_ACCESS_KEY}
online_store:
  type: dynamodb
  region: us-west-2
```
{% endcode %}

### Local Development (SQL Catalog)

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: sql
  catalog_name: feast_catalog
  uri: sqlite:///data/iceberg_catalog.db
  warehouse: data/warehouse
  namespace: feast
online_store:
  type: sqlite
  path: data/online_store.db
```
{% endcode %}

## Configuration Options

The full set of configuration options is available in `IcebergOfflineStoreConfig`:

| Option | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| `type` | str | Yes | - | Must be `feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore` |
| `catalog_type` | str | No | `"rest"` | Type of Iceberg catalog: `rest`, `glue`, `hive`, `sql` |
| `catalog_name` | str | No | `"feast_catalog"` | Name of the Iceberg catalog |
| `uri` | str | No | - | Catalog URI (required for REST/SQL catalogs) |
| `warehouse` | str | No | `"warehouse"` | Warehouse path (S3/GCS/local path) |
| `namespace` | str | No | `"feast"` | Iceberg namespace for feature tables |
| `storage_options` | dict | No | `{}` | Additional storage configuration (e.g., S3 credentials) |

## Data Source Configuration

To use Iceberg tables as feature sources:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64
from feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg_source import (
    IcebergSource,
)

# Entity definition shown for completeness
driver = Entity(name="driver", join_keys=["driver_id"])

# Define an Iceberg data source
my_iceberg_source = IcebergSource(
    name="driver_stats",
    table="feast.driver_hourly_stats",  # namespace.table_name
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Use it in a Feature View
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=my_iceberg_source,
    ttl=timedelta(days=1),
)
```

## Functionality Matrix

The set of functionality supported by offline stores is described in detail [here](overview.md#functionality).
Below is a matrix indicating which functionality is supported by the Iceberg offline store.

| | Iceberg |
| :----------------------------------------------------------------- | :----- |
| `get_historical_features` (point-in-time correct join) | yes |
| `pull_latest_from_table_or_query` (retrieve latest feature values) | yes |
| `pull_all_from_table_or_query` (retrieve a saved dataset) | yes |
| `offline_write_batch` (persist dataframes to offline store) | yes |
| `write_logged_features` (persist logged features to offline store) | yes |

Below is a matrix indicating which functionality is supported by `IcebergRetrievalJob`.

| | Iceberg |
| ----------------------------------------------------- | ----- |
| export to dataframe | yes |
| export to arrow table | yes |
| export to arrow batches | no |
| export to SQL | no |
| export to data lake (S3, GCS, etc.) | no |
| export to data warehouse | no |
| export as Spark dataframe | no |
| local execution of Python-based on-demand transforms | yes |
| remote execution of Python-based on-demand transforms | no |
| persist results in the offline store | yes |
| preview the query plan before execution | no |
| read partitioned data | yes |

To compare this set of functionality against other offline stores, please see the full [functionality matrix](overview.md#functionality-matrix).

## Performance Considerations

### Read Optimization

The Iceberg offline store automatically selects the optimal read strategy:

* **COW Tables**: Direct Parquet file reading via DuckDB for maximum performance
* **MOR Tables**: In-memory Arrow table loading to handle delete files correctly

This hybrid approach balances performance and correctness based on table characteristics.
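The decision can be sketched as follows. Iceberg snapshot summaries carry counters such as `total-delete-files`; the helper below models the choice with a plain dict and is illustrative only (the function name is hypothetical, not Feast's internal API):

```python
# Hedged sketch of the COW/MOR read-strategy decision, using the
# "total-delete-files" counter found in Iceberg snapshot summaries.
def choose_read_strategy(snapshot_summary: dict) -> str:
    """Return "cow" when the snapshot has no delete files, else "mor"."""
    delete_files = int(snapshot_summary.get("total-delete-files", 0))
    # No delete files: read Parquet data files directly via DuckDB (fast path).
    # Delete files present: load via Arrow so deletes are applied correctly.
    return "cow" if delete_files == 0 else "mor"

print(choose_read_strategy({"total-delete-files": 0}))  # cow
print(choose_read_strategy({"total-delete-files": 3}))  # mor
```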

### Metadata Pruning

Iceberg's metadata layer enables efficient partition pruning and file skipping:

```python
# Iceberg automatically prunes partitions and data files based on filters;
# no full table scan is required for filtered queries.
historical_features = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
    ],
)
```

### Best Practices

1. **Partition Strategy**: Use appropriate partitioning (by date, entity, etc.) for your access patterns
2. **Compaction**: Periodically compact small files to maintain read performance
3. **Catalog Selection**: Use REST catalog for production, SQL catalog for local development
4. **Storage Credentials**: Store sensitive credentials in environment variables, not in YAML

## Catalog Types

### REST Catalog (Recommended for Production)

```yaml
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: rest
  catalog_name: feast_catalog
  uri: http://iceberg-rest:8181
  warehouse: s3://data-lake/warehouse
```

### AWS Glue Catalog

```yaml
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: glue
  catalog_name: feast_catalog
  warehouse: s3://data-lake/warehouse
  storage_options:
    s3.region: us-west-2
```

### Hive Metastore

```yaml
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: hive
  catalog_name: feast_catalog
  uri: thrift://hive-metastore:9083
  warehouse: s3://data-lake/warehouse
```

### SQL Catalog (Local Development)

```yaml
offline_store:
  type: feast.infra.offline_stores.contrib.iceberg_offline_store.iceberg.IcebergOfflineStore
  catalog_type: sql
  catalog_name: feast_catalog
  uri: sqlite:///data/iceberg_catalog.db
  warehouse: data/warehouse
```

## Schema Evolution

Iceberg supports schema evolution natively. When feature schemas change, Iceberg handles:
* Adding new columns
* Removing columns
* Renaming columns
* Type promotions (e.g., int32 to int64)

Changes are tracked in the metadata layer without rewriting data files.
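
As a pseudocode-style sketch of how such a change might look through PyIceberg's schema-evolution API (verify the exact method names against the PyIceberg documentation before use):

```python
# Sketch only: add a column to an existing Iceberg feature table.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import LongType

catalog = load_catalog("feast_catalog")
table = catalog.load_table("feast.driver_hourly_stats")

# Metadata-only change; no data files are rewritten
with table.update_schema() as update:
    update.add_column("night_trips", LongType())
```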

## Time Travel

Leverage Iceberg's time travel capabilities for reproducible feature engineering:

```python
# Illustrative sketch using PyIceberg directly; during point-in-time joins the
# offline store selects the appropriate snapshot automatically.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("feast_catalog")
table = catalog.load_table("feast.driver_hourly_stats")

# Inspect the snapshot history, then read the table as of an older snapshot
first = table.history()[0]
old_data = table.scan(snapshot_id=first.snapshot_id).to_arrow()
```

## Limitations

* **Write Path**: Currently uses append-only writes (no upserts/deletes)
* **Export Formats**: Limited to dataframe and Arrow table exports
* **Remote Execution**: Does not support remote on-demand transforms
* **Spark Integration**: Direct Spark dataframe export not yet implemented

## Resources

* [Apache Iceberg Documentation](https://iceberg.apache.org/docs/latest/)
* [PyIceberg Documentation](https://py.iceberg.apache.org/)
* [Iceberg Table Format Specification](https://iceberg.apache.org/spec/)
* [Feast Iceberg Examples](https://github.com/feast-dev/feast/tree/master/examples)