Merged
78 commits
d2c637d
feat: Add MongoDB offline store (ibis-based PIT join, v1 alpha)
caseyclements Mar 3, 2026
77f243e
refactor: improve MongoDB offline store code quality
caseyclements Mar 4, 2026
bb134a2
Started work on full Mongo/MQL implementation. Kept MongoDBOfflineSto…
caseyclements Mar 9, 2026
46e67b7
refactor: rename alpha to preview, clarify MQL pipeline comments
caseyclements Mar 17, 2026
d0c00ca
Added unit tests for offline store retrieval, requiring docker and py…
caseyclements Mar 17, 2026
c940d05
Added test of multiple feature views and compound join keys
caseyclements Mar 17, 2026
37fad38
Initial implementation of native single-collection offline store
caseyclements Mar 17, 2026
67c39d4
Added DriverInfo to MongoDBClients
caseyclements Mar 18, 2026
7987405
Optimized MQL. Applied FV-level TTL
caseyclements Mar 18, 2026
adf1fb0
filter TTL by relevant FVs only, cautiously reset df index; add creat…
caseyclements Mar 18, 2026
ed1571e
Updated docstrings
caseyclements Mar 18, 2026
c09faee
Lazy index creation via _get_client_and_ensure_indexes
caseyclements Mar 18, 2026
91e939c
Add performance benchmarks comparing Ibis vs Native MongoDB offline s…
caseyclements Mar 18, 2026
e4bfc31
Refactor Native get_historical_features: replace with fetch+pandas join
caseyclements Mar 18, 2026
6218fa8
Refactor get_historical_features with chunked processing for large en…
caseyclements Mar 19, 2026
b7ffd84
Optimize Native get_historical_features: reuse client, increase batch…
caseyclements Mar 19, 2026
affdc2d
Remove duplicate MongoDBOfflineStoreNative from mongodb.py
caseyclements Mar 19, 2026
afef4fd
Consolidate mongodb_source.py into mongodb.py
caseyclements Mar 19, 2026
f273744
Rename mongodb_offline_store to mongodb, use One/Many naming convention
caseyclements Mar 19, 2026
df00e47
Add README.md documenting MongoDB offline store implementations
caseyclements Mar 20, 2026
83d063b
Rename mongodb/ to mongodb_offline_store/, organize tests
caseyclements Mar 20, 2026
977c240
Update docstring in benchmark.py
caseyclements Mar 20, 2026
166d151
Update README to show created_at tie-breaker in Many schema
caseyclements Mar 20, 2026
fe24361
Update README index recommendations for Many implementation
caseyclements Mar 20, 2026
398110c
Add auto-create index to MongoDBOfflineStoreMany
caseyclements Mar 20, 2026
2feb97d
Update benchmark.py to use One/Many naming convention
caseyclements Mar 20, 2026
9624311
Add comprehensive module docstring to mongodb_many.py
caseyclements Mar 20, 2026
5c93a51
Add Feature Freshness and Schema Evolution docs to mongodb_many.py
caseyclements Mar 20, 2026
fd70c13
Add MongoDB DataSourceCreators for universal Feast tests
caseyclements Mar 20, 2026
b4a0260
Add .secrets.baseline
caseyclements Mar 20, 2026
9efd700
Addressed PR comment: join_keys = get_expected_join_keys(project, fea…
caseyclements Apr 14, 2026
5d67b3f
Adds tests scenario that not using offline_utils.get_expected_join_ke…
caseyclements Apr 14, 2026
f889e10
Tests revealed possible name collision in pandas.merge_asof
caseyclements Apr 15, 2026
fcbd609
Add further (Large) benchmark tests
caseyclements Apr 15, 2026
7786393
Upgrades from Devin comments. Class cache _index_initialized; get_exp…
caseyclements Apr 15, 2026
b8fa01c
Addressed PR comments
caseyclements Apr 15, 2026
6db7cce
Apply lower bound via max(TTL) when all feature viewws in a chunk hav…
caseyclements Apr 15, 2026
596b126
Add created_at to compound index so that materialization is correct i…
caseyclements Apr 15, 2026
c0173e0
Handdle numpy scalers in _serialize_entity_key_from_row as suggested.
caseyclements Apr 15, 2026
1f80f47
Add persist and tests
caseyclements Apr 15, 2026
daaaf32
Remove accidentally included design notes.
caseyclements Apr 15, 2026
3a78f9d
Fix entity key serialization: per-FV join key types and numpy 2.0 compat
caseyclements Apr 16, 2026
44abd92
Add offline_write_batch to MongoDBOfflineStoreOne
caseyclements Apr 16, 2026
63fab1b
mongodb_one: clarify pipeline sort rationale and avoid sparse-column …
caseyclements Apr 16, 2026
90dd224
Add mongodb_native.py: initial MQL-based offline store (pre-refactor …
caseyclements Apr 17, 2026
0733688
Refactor mongodb_native: Atlas-first $documents+$lookup PIT join
caseyclements Apr 17, 2026
47f1040
Add unit tests for MongoDBOfflineStoreNative
caseyclements Apr 17, 2026
84ac27b
Add cross-implementation equivalence suite (test_cross.py)
caseyclements Apr 17, 2026
8346f28
Add benchmark_sweep.py: four-dimensional scaling suite across all thr…
caseyclements Apr 17, 2026
448a698
Add mongodb_agg offline store — $match+$sort+$group, O(log P) without…
caseyclements Apr 19, 2026
810b7d0
Vectorize agg scoring path, add upfront index build, ignore design/
caseyclements Apr 19, 2026
2a56c13
Adds offline_write_batch
caseyclements Apr 20, 2026
7e35e6a
Adds detail to handling of K in benchmarks.
caseyclements Apr 20, 2026
fcbb8e1
Adds missing typing.
caseyclements Apr 21, 2026
6e8f502
Consolidate MongoDB offline store to single implementation
caseyclements Apr 23, 2026
a6dca86
Fixes strict_pit_false unit test.
caseyclements Apr 23, 2026
055d238
Fix MongoDB offline store: projection keying, TTL bounds, field mappi…
caseyclements Apr 24, 2026
136dc09
Fix MongoDB test DataSourceCreator: implement create_logged_features_…
caseyclements Apr 24, 2026
f7e6230
Fix pd.isna() ValueError on list/array features in offline_write_batch
caseyclements Apr 24, 2026
9d9f7f1
Merge branch 'master' into FEAST-OfflineStore-INTPYTHON-297
caseyclements Apr 24, 2026
274e6f8
Fix bool/int type inference order in get_table_column_names_and_types
caseyclements Apr 24, 2026
fcb92da
Fix mongodb_to_feast_value_type to accept type strings from get_table…
caseyclements Apr 24, 2026
23b6788
Sort join keys in _serialize_entity_key_from_row for consistent entit…
caseyclements Apr 24, 2026
26b5153
Resolve .secrets.baseline merge conflict with master
caseyclements Apr 24, 2026
5227804
Add mongodb to CI extras so pymongo is installed in CI
caseyclements Apr 24, 2026
2905b80
Remove MongoDB from universal test parametrization
caseyclements Apr 24, 2026
a9b9e8f
Regenerate pixi lockfile after pymongo addition to ci extras
caseyclements Apr 24, 2026
f558aa5
Fix scoring_path heuristic: check entity uniqueness per-FV, not globally
caseyclements Apr 28, 2026
349c5f1
Fix offline_write_batch: use original join key names for entity seria…
caseyclements Apr 28, 2026
8f8de11
Fix pull_latest and pull_all to return join key columns
caseyclements Apr 28, 2026
9de5548
Fix scoring_path: require homogeneous timestamps to prevent data loss
caseyclements Apr 29, 2026
c34b6cd
Fix training path: sort fv_df by created_at to break event_timestamp …
caseyclements Apr 29, 2026
1d2a1c9
Clean up stale docstrings: remove references to MongoDBOfflineStoreOn…
caseyclements Apr 29, 2026
02e457c
Clean up stale docstrings: remove references to MongoDBOfflineStoreOn…
caseyclements Apr 29, 2026
ae5256f
Added driver metadata to clients
caseyclements Apr 30, 2026
f1996fc
Update .secrets.baseline
caseyclements Apr 30, 2026
b3b7563
Remove preview warnings from MongoDB offline store
caseyclements Apr 30, 2026
4a43815
Merge branch 'master' into FEAST-OfflineStore-INTPYTHON-297
caseyclements Apr 30, 2026
2 changes: 1 addition & 1 deletion .secrets.baseline
@@ -1539,5 +1539,5 @@
}
]
},
- "generated_at": "2026-04-17T13:31:24Z"
+ "generated_at": "2026-04-30T13:56:37Z"
}
6 changes: 3 additions & 3 deletions pixi.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -153,7 +153,7 @@ test = [
]

ci = [
- "feast[test, aws, azure, cassandra, clickhouse, couchbase, delta, docling, duckdb, elasticsearch, faiss, gcp, ge, go, grpcio, hazelcast, hbase, ibis, image, k8s, mcp, milvus, mssql, mysql, openlineage, opentelemetry, oracle, spark, trino, postgres, pytorch, qdrant, rag, ray, redis, singlestore, snowflake, sqlite_vec]",
+ "feast[test, aws, azure, cassandra, clickhouse, couchbase, delta, docling, duckdb, elasticsearch, faiss, gcp, ge, go, grpcio, hazelcast, hbase, ibis, image, k8s, mcp, milvus, mongodb, mssql, mysql, openlineage, opentelemetry, oracle, spark, trino, postgres, pytorch, qdrant, rag, ray, redis, singlestore, snowflake, sqlite_vec]",
"build",
"virtualenv==20.23.0",
"dbt-artifacts-parser",
@@ -0,0 +1,72 @@
# MongoDB Offline Store

This offline store lets you train models and run batch scoring directly from MongoDB.
All feature views share a single collection (`feature_history`). Reads use
MongoDB aggregation pipelines with a compound index, so per-entity cost is
O(log n_observations) regardless of collection size, and K feature views with the same
entity key collapse into one round-trip instead of K (a single round-trip overall if all your feature views share one entity key).

## Schema

All feature views share one collection (default: `feature_history`), discriminated by the `feature_view` field.

```javascript
// Collection: feature_history
{
  "entity_id": Binary("..."),           // Serialized entity key (bytes)
  "feature_view": "driver_stats",       // Discriminator
  "features": {                         // Nested subdocument
    "trips_today": 5,
    "rating": 4.8
  },
  "event_timestamp": ISODate("2024-01-15T10:00:00Z"),
  "created_at": ISODate("2024-01-15T10:00:01Z")
}
```
## Index

The store creates one compound index lazily on first use. This index supports every query the store issues.

```javascript
db.feature_history.createIndex({
  "entity_id": 1,
  "feature_view": 1,
  "event_timestamp": -1,
  "created_at": -1
})
```
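
From Python, the same index could be declared through PyMongo. The sketch below is an assumption about how such a helper might look, not the store's actual code: `INDEX_KEYS` and `ensure_feature_history_index` are hypothetical names, and `collection` is expected to be a `pymongo.collection.Collection`.

```python
# Hypothetical helper; the store's real index-creation code may differ.
# Key order mirrors the createIndex call above: 1 = ascending, -1 = descending.
INDEX_KEYS = [
    ("entity_id", 1),
    ("feature_view", 1),
    ("event_timestamp", -1),
    ("created_at", -1),
]


def ensure_feature_history_index(collection):
    """Create the compound index if it is missing and return its name.

    `collection` is assumed to be a pymongo Collection. Calling
    create_index again with an identical specification is a no-op.
    """
    return collection.create_index(INDEX_KEYS)
```
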
## Configuration

```yaml
offline_store:
    type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStore
    connection_string: mongodb://localhost:27017
    database: feast
    collection: feature_history  # optional, default: feature_history
```

## Key Features

**Query-collapse**: Feature views that share the same join key set are grouped into a single MongoDB aggregation round-trip instead of one per feature view. This reduces round-trips from K to the number of unique join key signatures, often one.
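
As a rough illustration of the grouping idea, the sketch below buckets feature views by their join key set. It is not the store's code: `group_by_join_keys` is a hypothetical helper, and the feature view names other than `driver_stats` are made up for the example.

```python
from collections import defaultdict


def group_by_join_keys(feature_views):
    """Group feature view names by their (order-insensitive) join key set.

    `feature_views` maps a feature view name to its list of join keys.
    Each resulting group could be served by one aggregation round-trip.
    """
    groups = defaultdict(list)
    for name, join_keys in feature_views.items():
        signature = tuple(sorted(join_keys))
        groups[signature].append(name)
    return dict(groups)


views = {
    "driver_stats": ["driver_id"],
    "driver_ratings": ["driver_id"],              # hypothetical view
    "trip_stats": ["driver_id", "customer_id"],   # hypothetical view
}
groups = group_by_join_keys(views)
# Three feature views collapse into two join key signatures.
print(len(views), "feature views ->", len(groups), "round-trips")
```
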

**Scoring path**: When `entity_df` contains unique entity IDs, a `$match + $sort + $group` pipeline performs server-side deduplication, returning at most one document per `(entity_id, feature_view)`. The compound index makes per-entity cost O(log n_observations).
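
A pipeline of this shape might look like the sketch below, written as plain Python dictionaries so it can be inspected without a database. The stage contents are assumptions based on the schema and index above, not the store's actual pipeline, and `build_scoring_pipeline` is a hypothetical name.

```python
def build_scoring_pipeline(entity_ids, feature_views):
    """Return a $match + $sort + $group pipeline as plain Python dicts."""
    return [
        # Restrict the scan to the requested entities and feature views.
        {"$match": {
            "entity_id": {"$in": list(entity_ids)},
            "feature_view": {"$in": list(feature_views)},
        }},
        # Newest first within each (entity_id, feature_view);
        # created_at breaks event_timestamp ties.
        {"$sort": {
            "entity_id": 1,
            "feature_view": 1,
            "event_timestamp": -1,
            "created_at": -1,
        }},
        # Keep only the first (newest) document of each group.
        {"$group": {
            "_id": {"entity_id": "$entity_id", "feature_view": "$feature_view"},
            "features": {"$first": "$features"},
            "event_timestamp": {"$first": "$event_timestamp"},
        }},
    ]


pipeline = build_scoring_pipeline([b"key-1", b"key-2"], ["driver_stats"])
# With PyMongo this would be run as: collection.aggregate(pipeline)
```

The sort keys follow the compound index order, which is what would let MongoDB satisfy `$match` and `$sort` from the index rather than sorting in memory.
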

**Training path**: When `entity_df` contains repeated entity IDs at different timestamps, the `$group` stage is omitted and `pandas.merge_asof`, whose join runs in compiled code, performs the per-row point-in-time joins.
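
The following is a minimal sketch of a point-in-time join with `pandas.merge_asof`, using made-up data. The column name `feature_timestamp` is an assumption chosen here to keep the two time columns distinct; the store's internal column names may differ.

```python
import pandas as pd

# Entity rows: the same driver requested at two different timestamps.
entity_df = pd.DataFrame({
    "driver_id": [1, 1],
    "event_timestamp": pd.to_datetime(
        ["2024-01-15 09:00", "2024-01-15 11:00"], utc=True
    ),
})

# Feature observations fetched for that driver (illustrative values).
feature_df = pd.DataFrame({
    "driver_id": [1, 1],
    "feature_timestamp": pd.to_datetime(
        ["2024-01-15 08:00", "2024-01-15 10:00"], utc=True
    ),
    "trips_today": [3, 5],
})

# merge_asof requires both frames to be sorted by their time key.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="driver_id",
    direction="backward",  # latest observation at or before each request
)
print(joined[["event_timestamp", "trips_today"]])
```

Each entity row receives the most recent observation at or before its own timestamp, so the 09:00 request sees the 08:00 observation and the 11:00 request sees the 10:00 one.
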

**`strict_pit`**: `get_historical_features` accepts a `strict_pit` keyword argument (default `True`). With `strict_pit=True`, which is safe for training, features from documents whose timestamp is strictly after the entity request timestamp are returned as `NULL`. Set `strict_pit=False` for real-time inference where you always want the most recent observation.
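
One way to picture the `strict_pit` behaviour is as a mask applied after the join. The sketch below is an assumption about the semantics described above, not the store's implementation; `apply_strict_pit` and the `feature_timestamp` column are hypothetical.

```python
import pandas as pd


def apply_strict_pit(df, feature_cols, strict_pit=True):
    """Null out features observed strictly after the request timestamp.

    `df` is assumed to hold the request time in `event_timestamp` and
    the matched document's time in `feature_timestamp`.
    """
    if not strict_pit:
        return df
    out = df.copy()
    in_future = out["feature_timestamp"] > out["event_timestamp"]
    for col in feature_cols:
        # Series.where keeps values where the condition holds, else NaN.
        out[col] = out[col].where(~in_future)
    return out


rows = pd.DataFrame({
    "event_timestamp": pd.to_datetime(
        ["2024-01-15 09:00", "2024-01-15 11:00"], utc=True
    ),
    "feature_timestamp": pd.to_datetime(
        ["2024-01-15 10:00", "2024-01-15 10:00"], utc=True
    ),
    "rating": [4.8, 4.8],
})
strict = apply_strict_pit(rows, ["rating"])
loose = apply_strict_pit(rows, ["rating"], strict_pit=False)
```
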


## Writing Data

Use `write_to_offline_store`, which calls the store's `offline_write_batch`, to write feature observations:

```python
store.write_to_offline_store(feature_view_name, df)
```

Documents are appended; `pull_latest` and the scoring path select the latest `event_timestamp` at read time and break ties by the highest `created_at`.

## Memory Behaviour

The store filters by entity key in `$match` rather than loading the entire collection. Memory usage is bounded by the number of unique entity IDs × documents per entity, not the total collection size.
@@ -0,0 +1,8 @@
import feast.version

try:
    from pymongo.driver_info import DriverInfo

    DRIVER_METADATA = DriverInfo(name="Feast", version=feast.version.get_version())
Member

is this used somewhere ? probably can be removed

Contributor Author

@ntkathole Thanks for catching this! It is passed to the client so that we can follow how many clusters are using the feast integration. Without it, we'd have no data. I added a commit for this.

except ImportError:
    DRIVER_METADATA = None  # type: ignore[assignment]