docs: Add MongoDB data-source and offline-store reference documentation #6351
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| # MongoDB source (contrib) | ||
|
|
||
| ## Description | ||
|
|
||
| MongoDB data sources are [MongoDB](https://www.mongodb.com/) collections that can be used as a source for feature data. The `MongoDBSource` points at a MongoDB collection and provides the metadata Feast needs to read historical features from the offline store's collection. | ||
|
|
||
| ## Examples | ||
|
|
||
| Defining a MongoDB source: | ||
|
|
||
| ```python | ||
| from feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb import ( | ||
| MongoDBSource, | ||
| ) | ||
|
|
||
| driver_stats_source = MongoDBSource( | ||
| name="driver_stats", | ||
| timestamp_field="event_timestamp", | ||
| created_timestamp_column="created_at", | ||
| ) | ||
| ``` | ||
|
|
||
| The `name` field becomes the `feature_view` discriminator stored in every document in the `feature_history` collection. | ||
|
|
||
| Configuration options such as `connection_string`, `database`, and `collection` are inherited from the offline store configuration in `feature_store.yaml`. | ||
|
|
||
| The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBSource). | ||
|
|
||
| ## Vector Search | ||
|
|
||
| The MongoDB online store supports [Atlas Vector Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/), enabling similarity search over feature embeddings stored in MongoDB Atlas. This is powered by the `$vectorSearch` aggregation stage and requires MongoDB Atlas (or the `mongodb/mongodb-atlas-local` Docker image for local development). | ||
|
|
||
| See [PR #6344](https://github.com/feast-dev/feast/pull/6344) for full implementation details. | ||
|
|
||
| ### Configuration | ||
|
|
||
| Enable vector search in your `feature_store.yaml`: | ||
|
|
||
| ```yaml | ||
| project: my_project | ||
| provider: local | ||
| online_store: | ||
| type: mongodb | ||
| connection_string: mongodb+srv://<user>:<pass>@cluster.mongodb.net | ||
|
Contributor
You'll see in CI, but this may need |
||
| vector_enabled: true | ||
| similarity: cosine # cosine | euclidean | dotProduct | ||
| vector_index_wait_timeout: 60 # seconds to wait for index to become queryable | ||
| vector_index_wait_poll_interval: 1.0 # seconds between polls | ||
| ``` | ||
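The two `vector_index_wait_*` settings describe a poll-until-ready loop. The following is a minimal sketch of that pattern, not the store's actual code: `get_status` is a hypothetical callable standing in for whatever the store uses to read the index status, and returning `False` on timeout is an assumption.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],  # hypothetical: returns the current index status
    timeout: float = 60.0,          # mirrors vector_index_wait_timeout
    poll_interval: float = 1.0,     # mirrors vector_index_wait_poll_interval
) -> bool:
    """Poll get_status() until it returns "READY" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == "READY":
            return True
        if time.monotonic() >= deadline:
            return False  # assumption: the caller decides how to handle a timeout
        time.sleep(poll_interval)


# Usage with a stand-in status source that becomes ready on the third poll.
statuses = iter(["PENDING", "BUILDING", "READY"])
became_ready = wait_until_ready(lambda: next(statuses), timeout=5.0, poll_interval=0.01)
```

A shorter `poll_interval` notices readiness sooner at the cost of more status requests; `timeout` bounds how long `feast apply` could block in this sketch.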
|
|
||
| ### Defining a Feature View with Vector Index | ||
|
|
||
| Mark embedding fields with `vector_index=True` and specify `vector_length`: | ||
|
|
||
| ```python | ||
| from feast import Entity, FeatureView, Field, FileSource | ||
| from feast.types import Array, Float32, Int64, String | ||
| from datetime import timedelta | ||
|
|
||
| item_embeddings = FeatureView( | ||
| name="item_embeddings", | ||
| entities=[Entity(name="item_id", join_keys=["item_id"])], | ||
| schema=[ | ||
| Field( | ||
| name="embedding", | ||
| dtype=Array(Float32), | ||
| vector_index=True, | ||
| vector_length=384, | ||
| vector_search_metric="cosine", | ||
| ), | ||
| Field(name="title", dtype=String), | ||
| Field(name="item_id", dtype=Int64), | ||
| ], | ||
| source=FileSource(path="items.parquet", timestamp_field="event_timestamp"), | ||
| ttl=timedelta(hours=24), | ||
| ) | ||
| ``` | ||
|
|
||
| When `feast apply` (or `store.update()`) runs with `vector_enabled=True`, Atlas vector search indexes are automatically created for any field with `vector_index=True`. Indexes are also automatically dropped when feature views are removed. | ||
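For orientation, an Atlas Vector Search index for the `embedding` field above could be described roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the index name follows the `<feature_view>__<field>__vs_index` scheme described on this page, and the `path` value is a guess at where the embedding lives in the stored document.

```python
from typing import Any, Dict


def vector_index_name(feature_view: str, field: str) -> str:
    # Naming scheme from this page: <feature_view>__<field>__vs_index
    return f"{feature_view}__{field}__vs_index"


def vector_index_definition(path: str, num_dimensions: int, similarity: str) -> Dict[str, Any]:
    # Shape of an Atlas Vector Search index definition with a single vector field.
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,  # assumption: document path of the embedding
                "numDimensions": num_dimensions,  # corresponds to vector_length
                "similarity": similarity,  # cosine | euclidean | dotProduct
            }
        ]
    }


name = vector_index_name("item_embeddings", "embedding")
definition = vector_index_definition("embedding", 384, "cosine")
```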
|
|
||
| ### Retrieving Documents via Vector Search | ||
|
|
||
| Use `retrieve_online_documents_v2()` to perform similarity search: | ||
|
|
||
| ```python | ||
| from feast import FeatureStore | |
| store = FeatureStore(repo_path=".") | |
| results = store.retrieve_online_documents_v2( | ||
| config=repo_config, | ||
| table=item_embeddings, | ||
| requested_features=["embedding", "title"], | ||
| embedding=[0.1, 0.2, ...], # query vector | ||
| top_k=5, | ||
| ) | ||
|
|
||
| # Each result is an (event_timestamp, entity_key_proto, feature_dict) tuple. | |
|
Contributor
The result of FeatureStore's version is Given the change, I suggest reviewing the comments below. They are about the implementation in MongoDB, which IS what they want to hear about (how it works), just not specific to the function above. |
||
| # feature_dict includes a synthetic "distance" key with the vector search score. | ||
| for ts, entity_key, features in results: | ||
| print(features["title"].string_val, features["distance"].float_val) | ||
| ``` | ||
|
|
||
| ### How It Works | ||
|
|
||
| - **Index creation**: `update()` creates an Atlas vector search index named `<feature_view>__<field>__vs_index` for each vector-indexed field. It waits for the index to reach `READY` status before proceeding. | ||
| - **Query execution**: `retrieve_online_documents_v2()` builds a `$vectorSearch` aggregation pipeline with `numCandidates = max(top_k * 10, 100)` and the specified `limit`. | ||
| - **Score**: Results include a `distance` field populated from `$meta: "vectorSearchScore"`. | ||
| - **BSON compatibility**: Query vectors are coerced to native Python floats to avoid numpy serialization issues. | ||
| - **Idempotency**: Calling `update()` multiple times will not duplicate indexes. | ||
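The query-execution, score, and BSON points above can be sketched as a small pipeline builder. This is an illustration rather than the store's code: the function name, the `path` argument, and the use of a `$project` stage to expose the score are assumptions.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    index_name: str,
    path: str,  # assumption: document path of the embedding field
    query_vector: Sequence[Any],
    top_k: int,
    requested_fields: Sequence[str],
) -> List[Dict[str, Any]]:
    # Coerce to built-in floats so numpy scalars never reach the BSON encoder.
    vector = [float(x) for x in query_vector]
    # Keep the requested fields and expose the search score as "distance".
    projection: Dict[str, Any] = {name: 1 for name in requested_fields}
    projection["distance"] = {"$meta": "vectorSearchScore"}
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": vector,
                "numCandidates": max(top_k * 10, 100),
                "limit": top_k,
            }
        },
        {"$project": projection},
    ]


pipeline = build_vector_search_pipeline(
    "item_embeddings__embedding__vs_index",
    "embedding",
    [0.1, 0.2, 0.3],
    top_k=5,
    requested_fields=["embedding", "title"],
)
```

With `top_k=5` the floor of 100 candidates applies; the `top_k * 10` term only takes over once `top_k` exceeds 10.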
|
|
||
| ## Supported Types | ||
|
|
||
| MongoDB data sources support all eight primitive types (`bytes`, `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`) and their corresponding array types. Complex types such as `Map` and `Struct` are preserved through the MongoDB document model. | ||
| For a comparison against other batch data sources, please see [here](overview.md#functionality-matrix). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| # MongoDB offline store (contrib) | ||
|
|
||
| ## Description | ||
|
|
||
| The MongoDB offline store provides support for reading [MongoDBSource](../data-sources/mongodb.md). | ||
| * Uses a single shared collection with a compound index for all FeatureViews, distinguished by a `feature_view` discriminator field. | ||
| * Entity dataframes can be provided as a Pandas dataframe. The offline store converts entity identifiers into serialized entity keys for efficient lookup against the collection. | ||
|
|
||
|
Contributor
Skip the description. |
||
| ## Getting started | ||
|
|
||
| To use this offline store, run `pip install 'feast[mongodb]'`. | |
|
|
||
| ## Example | ||
|
|
||
| {% code title="feature_store.yaml" %} | ||
| ```yaml | ||
| project: my_project | ||
| registry: data/registry.db | ||
| provider: local | ||
| offline_store: | ||
| type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStore | ||
| connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" # pragma: allowlist secret | ||
| database: feast | ||
| collection: feature_history | ||
| online_store: | ||
| type: mongodb | ||
| connection_string: "mongodb+srv://user:pass@cluster.mongodb.net" # pragma: allowlist secret | ||
| database_name: feast_online_store | ||
| collection_suffix: latest | ||
| client_kwargs: {} | ||
| ``` | ||
| {% endcode %} | ||
|
|
||
| The full set of configuration options is available in [MongoDBOfflineStoreConfig](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb.MongoDBOfflineStoreConfig). | ||
|
|
||
| ## Data Model | ||
|
|
||
| The offline store uses a single shared collection (by default `feature_history`) that stores append-only historical feature rows for all feature views. Each document represents one observation of one entity for one FeatureView at a specific event timestamp: | ||
|
|
||
| ```json | ||
| { | ||
| "entity_id": "Binary(...)", | ||
| "feature_view": "driver_stats", | ||
| "event_timestamp": "ISODate(2024-01-15T12:00:00Z)", | ||
| "created_at": "ISODate(2024-01-15T12:01:00Z)", | ||
| "features": { | ||
| "conv_rate": 0.72, | ||
| "acc_rate": 0.91, | ||
| "avg_daily_trips": 14 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Key properties: | ||
|
|
||
| * **Append-only**: Historical data is treated as immutable; corrections are written as new rows with newer `created_at` timestamps rather than in-place updates. | ||
| * **Time-series friendly**: `event_timestamp` represents when the feature value was observed; `created_at` is used as a tie-breaker when multiple observations share the same event timestamp. | ||
| * **Feature grouping by FeatureView**: `feature_view` identifies which FeatureView the row belongs to, so a single collection can host multiple FVs. | ||
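To make the append-only and tie-breaker properties concrete, here is a pure-Python sketch of how a correction row wins over the original observation. The `latest_as_of` helper is hypothetical and only mirrors the ordering described above (newest `event_timestamp`, then newest `created_at`); it is not the store's implementation.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def latest_as_of(rows: List[Dict[str, Any]], as_of: datetime) -> Optional[Dict[str, Any]]:
    """Return the row visible at `as_of`: newest event_timestamp, then newest created_at."""
    candidates = [r for r in rows if r["event_timestamp"] <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r["event_timestamp"], r["created_at"]))


def utc(*args: int) -> datetime:
    return datetime(*args, tzinfo=timezone.utc)


rows = [
    # Original observation.
    {"event_timestamp": utc(2024, 1, 15, 12), "created_at": utc(2024, 1, 15, 12, 1),
     "features": {"conv_rate": 0.72}},
    # Correction: same event_timestamp, newer created_at, written as a new row.
    {"event_timestamp": utc(2024, 1, 15, 12), "created_at": utc(2024, 1, 16, 9),
     "features": {"conv_rate": 0.75}},
]
winner = latest_as_of(rows, utc(2024, 1, 20))
```

Both rows stay in the collection; the correction is selected only because its `created_at` is newer.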
|
|
||
| A single compound index supports all major query patterns: | ||
|
|
||
| ``` | ||
| (entity_id ASC, feature_view ASC, event_timestamp DESC, created_at DESC) | ||
| ``` | ||
|
|
||
| This index enables efficient range scans over entities and feature views, while ensuring that the most recent observation per `(entity_id, feature_view)` is seen first during aggregation. The index is created lazily on first use and cached per connection string. | ||
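A rough sketch of the lazy, cached index creation described above. The `ensure_index` helper and the module-level cache are assumptions for illustration; `FakeCollection` is a stand-in so the example runs without a MongoDB server, and with PyMongo the same key list can be passed to `Collection.create_index`.

```python
from typing import Any, List, Set, Tuple

# 1 = ascending, -1 = descending (the values of pymongo.ASCENDING / pymongo.DESCENDING).
INDEX_KEYS: List[Tuple[str, int]] = [
    ("entity_id", 1),
    ("feature_view", 1),
    ("event_timestamp", -1),
    ("created_at", -1),
]

_indexed: Set[str] = set()  # connection strings whose index was already ensured


def ensure_index(connection_string: str, collection: Any) -> bool:
    """Create the compound index on first use; return True if a call was made."""
    if connection_string in _indexed:
        return False
    collection.create_index(INDEX_KEYS)
    _indexed.add(connection_string)
    return True


class FakeCollection:
    """Stand-in that records calls instead of talking to a server."""

    def __init__(self) -> None:
        self.calls: List[List[Tuple[str, int]]] = []

    def create_index(self, keys: List[Tuple[str, int]]) -> str:
        self.calls.append(keys)
        return "_".join(f"{k}_{d}" for k, d in keys)


coll = FakeCollection()
first = ensure_index("mongodb://localhost:27017", coll)
second = ensure_index("mongodb://localhost:27017", coll)
```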
|
|
||
| ## Key Optimizations | ||
|
|
||
| * **Scoring vs. training paths**: When each entity appears only once in `entity_df` (scoring/inference: one feature lookup per entity), a server-side `$group` stage with the `$first` accumulator efficiently returns the single latest value per entity. When the same entity appears at multiple timestamps (training: building a dataset with many historical snapshots per entity), the store retrieves all candidate rows and uses `pd.merge_asof` to select the correct point-in-time value for each request timestamp. | |
| * **Two-level chunking**: `CHUNK_SIZE` (50,000 rows) controls the size of intermediate DataFrames in memory; `MONGO_BATCH_SIZE` (10,000 entity IDs) limits the query size sent to MongoDB. | ||
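The scoring path and the entity-ID batching can be sketched together as follows. The helper names and exact stage layout are assumptions; the sketch only shows one plausible `$match` / `$sort` / `$group` pipeline per batch of at most `MONGO_BATCH_SIZE` entity IDs.

```python
from typing import Any, Dict, Iterator, List, Sequence

MONGO_BATCH_SIZE = 10_000  # entity IDs per query, as described above


def batched(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def latest_per_entity_pipeline(
    feature_view: str, entity_ids: Sequence[Any]
) -> List[Dict[str, Any]]:
    """Scoring path: one newest document per entity via $sort + $group/$first."""
    return [
        {"$match": {"feature_view": feature_view, "entity_id": {"$in": list(entity_ids)}}},
        # Newest observation first within each entity, created_at as tie-breaker.
        {"$sort": {"entity_id": 1, "event_timestamp": -1, "created_at": -1}},
        {"$group": {"_id": "$entity_id", "doc": {"$first": "$$ROOT"}}},
    ]


# 25,000 serialized entity keys (placeholder bytes) split into three queries.
entity_ids = [f"id-{i}".encode() for i in range(25_000)]
pipelines = [
    latest_per_entity_pipeline("driver_stats", batch)
    for batch in batched(entity_ids, MONGO_BATCH_SIZE)
]
```

Capping the `$in` list keeps each query document small; the sort order matches the compound index so `$first` sees the most recent row per entity.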
|
|
||
| ## Functionality Matrix | ||
|
|
||
| The set of functionality supported by offline stores is described in detail [here](overview.md#functionality). | ||
| Below is a matrix indicating which functionality is supported by the MongoDB offline store. | ||
|
|
||
| | | MongoDB | | ||
| | :----------------------------------------------------------------- | :------ | | ||
| | `get_historical_features` (point-in-time correct join) | yes | | ||
| | `pull_latest_from_table_or_query` (retrieve latest feature values) | yes | | ||
| | `pull_all_from_table_or_query` (retrieve a saved dataset) | yes | | ||
| | `offline_write_batch` (persist dataframes to offline store) | yes | | ||
| | `write_logged_features` (persist logged features to offline store) | no | | ||
|
|
||
| Below is a matrix indicating which functionality is supported by `MongoDBRetrievalJob`. | ||
|
|
||
| | | MongoDB | | ||
| | ----------------------------------------------------- | ------- | | ||
| | export to dataframe | yes | | ||
| | export to arrow table | yes | | ||
| | export to arrow batches | no | | ||
| | export to SQL | no | | ||
| | export to data lake (S3, GCS, etc.) | no | | ||
| | export to data warehouse | no | | ||
| | export as Spark dataframe | no | | ||
| | local execution of Python-based on-demand transforms | yes | | ||
| | remote execution of Python-based on-demand transforms | no | | ||
| | persist results in the offline store | yes | | ||
| | preview the query plan before execution | no | | ||
| | read partitioned data | no | | ||
|
|
||
| To compare this set of functionality against other offline stores, please see the full [functionality matrix](overview.md#functionality-matrix). | ||
Should this line be included? It links back to code..