-
Notifications
You must be signed in to change notification settings - Fork 1.3k
feat: MongoDB offline store #6138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
ntkathole
merged 78 commits into
feast-dev:master
from
caseyclements:FEAST-OfflineStore-INTPYTHON-297
Apr 30, 2026
Merged
Changes from all commits
Commits
Show all changes
78 commits
Select commit
Hold shift + click to select a range
d2c637d
feat: Add MongoDB offline store (ibis-based PIT join, v1 alpha)
caseyclements 77f243e
refactor: improve MongoDB offline store code quality
caseyclements bb134a2
Started work on full Mongo/MQL implementation. Kept MongoDBOfflineSto…
caseyclements 46e67b7
refactor: rename alpha to preview, clarify MQL pipeline comments
caseyclements d0c00ca
Added unit tests for offline store retrieval, requiring docker and py…
caseyclements c940d05
Added test of multiple feature views and compound join keys
caseyclements 37fad38
Initial implementation of native single-collection offline store
caseyclements 67c39d4
Added DriverInfo to MongoDBClients
caseyclements 7987405
Optimized MQL. Applied FV-level TTL
caseyclements adf1fb0
filter TTL by relevant FVs only, cautiously reset df index; add creat…
caseyclements ed1571e
Updated docstrings
caseyclements c09faee
Lazy index creation via _get_client_and_ensure_indexes
caseyclements 91e939c
Add performance benchmarks comparing Ibis vs Native MongoDB offline s…
caseyclements e4bfc31
Refactor Native get_historical_features: replace with fetch+pandas join
caseyclements 6218fa8
Refactor get_historical_features with chunked processing for large en…
caseyclements b7ffd84
Optimize Native get_historical_features: reuse client, increase batch…
caseyclements affdc2d
Remove duplicate MongoDBOfflineStoreNative from mongodb.py
caseyclements afef4fd
Consolidate mongodb_source.py into mongodb.py
caseyclements f273744
Rename mongodb_offline_store to mongodb, use One/Many naming convention
caseyclements df00e47
Add README.md documenting MongoDB offline store implementations
caseyclements 83d063b
Rename mongodb/ to mongodb_offline_store/, organize tests
caseyclements 977c240
Update docstring in benchmark.py
caseyclements 166d151
Update README to show created_at tie-breaker in Many schema
caseyclements fe24361
Update README index recommendations for Many implementation
caseyclements 398110c
Add auto-create index to MongoDBOfflineStoreMany
caseyclements 2feb97d
Update benchmark.py to use One/Many naming convention
caseyclements 9624311
Add comprehensive module docstring to mongodb_many.py
caseyclements 5c93a51
Add Feature Freshness and Schema Evolution docs to mongodb_many.py
caseyclements fd70c13
Add MongoDB DataSourceCreators for universal Feast tests
caseyclements b4a0260
Add .secrets.baseline
caseyclements 9efd700
Addressed PR comment: join_keys = get_expected_join_keys(project, fea…
caseyclements 5d67b3f
Adds tests scenario that not using offline_utils.get_expected_join_ke…
caseyclements f889e10
Tests revealed possible name collision in pandas.merge_asof
caseyclements fcbd609
Add further (Large) benchmark tests
caseyclements 7786393
Upgrades from Devin comments. Class cache _index_initialized; get_exp…
caseyclements b8fa01c
Addressed PR comments
caseyclements 6db7cce
Apply lower bound via max(TTL) when all feature viewws in a chunk hav…
caseyclements 596b126
Add created_at to compound index so that materialization is correct i…
caseyclements c0173e0
Handdle numpy scalers in _serialize_entity_key_from_row as suggested.
caseyclements 1f80f47
Add persist and tests
caseyclements daaaf32
Remove accidentally included design notes.
caseyclements 3a78f9d
Fix entity key serialization: per-FV join key types and numpy 2.0 compat
caseyclements 44abd92
Add offline_write_batch to MongoDBOfflineStoreOne
caseyclements 63fab1b
mongodb_one: clarify pipeline sort rationale and avoid sparse-column …
caseyclements 90dd224
Add mongodb_native.py: initial MQL-based offline store (pre-refactor …
caseyclements 0733688
Refactor mongodb_native: Atlas-first $documents+$lookup PIT join
caseyclements 47f1040
Add unit tests for MongoDBOfflineStoreNative
caseyclements 84ac27b
Add cross-implementation equivalence suite (test_cross.py)
caseyclements 8346f28
Add benchmark_sweep.py: four-dimensional scaling suite across all thr…
caseyclements 448a698
Add mongodb_agg offline store — $match+$sort+$group, O(log P) without…
caseyclements 810b7d0
Vectorize agg scoring path, add upfront index build, ignore design/
caseyclements 2a56c13
Adds offline_write_batch
caseyclements 7e35e6a
Adds detail to handling of K in benchmarks.
caseyclements fcbb8e1
Adds missing typing.
caseyclements 6e8f502
Consolidate MongoDB offline store to single implementation
caseyclements a6dca86
Fixes strict_pit_false unit test.
caseyclements 055d238
Fix MongoDB offline store: projection keying, TTL bounds, field mappi…
caseyclements 136dc09
Fix MongoDB test DataSourceCreator: implement create_logged_features_…
caseyclements f7e6230
Fix pd.isna() ValueError on list/array features in offline_write_batch
caseyclements 9d9f7f1
Merge branch 'master' into FEAST-OfflineStore-INTPYTHON-297
caseyclements 274e6f8
Fix bool/int type inference order in get_table_column_names_and_types
caseyclements fcb92da
Fix mongodb_to_feast_value_type to accept type strings from get_table…
caseyclements 23b6788
Sort join keys in _serialize_entity_key_from_row for consistent entit…
caseyclements 26b5153
Resolve .secrets.baseline merge conflict with master
caseyclements 5227804
Add mongodb to CI extras so pymongo is installed in CI
caseyclements 2905b80
Remove MongoDB from universal test parametrization
caseyclements a9b9e8f
Regenerate pixi lockfile after pymongo addition to ci extras
caseyclements f558aa5
Fix scoring_path heuristic: check entity uniqueness per-FV, not globally
caseyclements 349c5f1
Fix offline_write_batch: use original join key names for entity seria…
caseyclements 8f8de11
Fix pull_latest and pull_all to return join key columns
caseyclements 9de5548
Fix scoring_path: require homogeneous timestamps to prevent data loss
caseyclements c34b6cd
Fix training path: sort fv_df by created_at to break event_timestamp …
caseyclements 1d2a1c9
Clean up stale docstrings: remove references to MongoDBOfflineStoreOn…
caseyclements 02e457c
Clean up stale docstrings: remove references to MongoDBOfflineStoreOn…
caseyclements ae5256f
Added driver metadata to clients
caseyclements f1996fc
Update .secrets.baseline
caseyclements b3b7563
Remove preview warnings from MongoDB offline store
caseyclements 4a43815
Merge branch 'master' into FEAST-OfflineStore-INTPYTHON-297
caseyclements File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1539,5 +1539,5 @@ | |
| } | ||
| ] | ||
| }, | ||
| "generated_at": "2026-04-17T13:31:24Z" | ||
| "generated_at": "2026-04-30T13:56:37Z" | ||
| } | ||
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
72 changes: 72 additions & 0 deletions
72
sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/README.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| # MongoDB Offline Store | ||
|
|
||
| This offline store lets you train models and run batch scoring directly from it. | ||
| All feature views share a single collection (`feature_history`). Reads use | ||
| MongoDB aggregation pipelines with a compound index, so per-entity cost is | ||
| O(log n_observations) regardless of collection size, and K feature views with the same | ||
| entity key collapse into one round-trip instead of K (1 if your data shares a unique id.) | ||
|
|
||
| ## Schema | ||
|
|
||
| All feature views share one collection (default: `feature_history`), discriminated by the `feature_view` field. | ||
|
|
||
| ```javascript | ||
| // Collection: feature_history | ||
| { | ||
| "entity_id": Binary("..."), // Serialized entity key (bytes) | ||
| "feature_view": "driver_stats", // Discriminator | ||
| "features": { // Nested subdocument | ||
| "trips_today": 5, | ||
| "rating": 4.8 | ||
| }, | ||
| "event_timestamp": ISODate("2024-01-15T10:00:00Z"), | ||
| "created_at": ISODate("2024-01-15T10:00:01Z") | ||
| } | ||
| ``` | ||
| ## Index | ||
|
|
||
| The store creates one compound index lazily on first use. This index supports every query issued.. | ||
|
|
||
| ```javascript | ||
| db.feature_history.createIndex({ | ||
| "entity_id": 1, | ||
| "feature_view": 1, | ||
| "event_timestamp": -1, | ||
| "created_at": -1 | ||
| }) | ||
|
|
||
| ``` | ||
| ## Configuration | ||
|
|
||
| ```yaml | ||
| offline_store: | ||
| type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStore | ||
| connection_string: mongodb://localhost:27017 | ||
| database: feast | ||
| collection: feature_history # optional, default: feature_history | ||
| ``` | ||
|
|
||
| ## Key Features | ||
|
|
||
| **Query-collapse** — Feature views that share the same join key set are grouped into a single MongoDB aggregation round-trip instead of one per feature view. Reduces round-trips from K to the number of unique join key signatures, often one. | ||
|
|
||
| **Scoring path** — When `entity_df` contains unique entity IDs, a `$match + $sort + $group` pipeline performs server-side deduplication returning at most one document per `(entity_id, feature_view)`. The compound index makes per-entity cost O(log n_obs). | ||
|
|
||
| **Training path** — When `entity_df` contains repeated entity IDs at different timestamps, the `$group` stage is omitted and `pandas.merge_asof` performs per-row point-in-time joins optimized in C. | ||
|
|
||
| **`strict_pit`** — `get_historical_features` accepts a `strict_pit` keyword argument (default `True`). With `strict_pit=True` (default, safe for training), documents whose timestamp is strictly after the entity request timestamp are returned as `NULL`. Set `strict_pit=False` for real-time inference where you always want the most recent observation. | ||
|
|
||
|
|
||
| ## Writing Data | ||
|
|
||
| Use `offline_write_batch` (called automatically by `feast materialize`) to write feature observations: | ||
|
|
||
| ```python | ||
| store.write_to_offline_store(feature_view_name, df) | ||
| ``` | ||
|
|
||
| Documents are appended; `pull_latest` and the scoring path select the highest `created_at` at read time. | ||
|
|
||
| ## Memory Behaviour | ||
|
|
||
| The store filters by entity key in `$match` rather than loading the entire collection. Memory usage is bounded by the number of unique entity IDs × documents per entity, not the total collection size. |
8 changes: 8 additions & 0 deletions
8
sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/__init__.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| import feast.version | ||
|
|
||
| try: | ||
| from pymongo.driver_info import DriverInfo | ||
|
|
||
| DRIVER_METADATA = DriverInfo(name="Feast", version=feast.version.get_version()) | ||
| except ImportError: | ||
| DRIVER_METADATA = None # type: ignore[assignment] | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this used somewhere ? probably can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ntkathole Thanks for catching this! It is passed to the client so that we can follow how many clusters are using the feast integration. Without it, we'd have no data. I added a commit for this.