|
| 1 | +# MLflow Integration |
| 2 | + |
| 3 | +Feast provides **native integration** with [MLflow](https://mlflow.org/) for automatic feature lineage tracking alongside ML experiments. When enabled, every feature retrieval is logged to the active MLflow run. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +- **Which features did this model use?** -- auto-logged on every `get_historical_features()` / `get_online_features()` call |
| 8 | +- **Which feature service should I use to serve this model?** -- resolved from model URI via `store.mlflow.resolve_features()` |
| 9 | +- **Can I reproduce the exact training data?** -- entity DataFrame saved as an MLflow artifact |
| 10 | +- **Which models break if I change a feature view?** -- reverse index via the Feast UI `/api/mlflow-feature-usage` endpoint |
| 11 | +- **When was the feature store last updated?** -- `feast apply` and `feast materialize` logged to a separate ops experiment |
| 12 | + |
| 13 | +### Capabilities |
| 14 | + |
| 15 | +| Capability | How | |
| 16 | +|---|---| |
| 17 | +| Auto-log feature metadata | Tags on every retrieval inside an active MLflow run | |
| 18 | +| Entity DataFrame archival | `entity_df.parquet` artifact for full reproducibility | |
| 19 | +| Model registration with lineage | `feast.feature_service` tag propagated to model versions | |
| 20 | +| Training-to-prediction linkage | `store.mlflow.load_model()` links prediction runs back to training runs | |
| 21 | +| Model-to-feature resolution | Map any model URI back to its Feast feature service | |
| 22 | +| Operation audit trail | `feast apply` / `feast materialize` logged to `{project}-feast-ops` | |
| 23 | +| `store.mlflow` API | Single entry point — zero `import mlflow`, zero client objects | |
| 24 | +| Feast UI integration | Per-feature-view usage stats and registered model associations | |
| 25 | + |
| 26 | +## Installation |
| 27 | + |
| 28 | +MLflow is an optional dependency: |
| 29 | + |
| 30 | +```bash |
| 31 | +pip install feast[mlflow] |
| 32 | +``` |
| 33 | + |
| 34 | +## Configuration |
| 35 | + |
| 36 | +Add the `mlflow` section to your `feature_store.yaml`: |
| 37 | + |
| 38 | +```yaml |
| 39 | +project: my_project |
| 40 | +registry: data/registry.db |
| 41 | +provider: local |
| 42 | +online_store: |
| 43 | +  type: sqlite |
| 44 | +  path: data/online_store.db |
| 45 | + |
| 46 | +mlflow: |
| 47 | +  enabled: true |
| 48 | +  tracking_uri: http://127.0.0.1:5000  # optional, falls back to MLFLOW_TRACKING_URI env var |
| 49 | +  auto_log: true  # default |
| 50 | +  auto_log_entity_df: false  # default |
| 51 | +  entity_df_max_rows: 100000  # default |
| 52 | +  log_operations: false  # default |
| 53 | +  ops_experiment_suffix: "-feast-ops"  # default |
| 54 | +``` |
| 55 | + |
| 56 | +### Configuration options |
| 57 | + |
| 58 | +| Option | Type | Default | Description | |
| 59 | +|--------|------|---------|-------------| |
| 60 | +| `enabled` | bool | `false` | Master switch for the entire integration | |
| 61 | +| `tracking_uri` | string | *(none)* | MLflow tracking server URI. Falls back to `MLFLOW_TRACKING_URI` env var, then MLflow default (`./mlruns`) | |
| 62 | +| `auto_log` | bool | `true` | Automatically log feature metadata on every retrieval when an active MLflow run exists | |
| 63 | +| `auto_log_entity_df` | bool | `false` | Save the entity DataFrame as `entity_df.parquet` artifact on historical retrieval | |
| 64 | +| `entity_df_max_rows` | int | `100000` | Skip entity DataFrame artifact upload for DataFrames exceeding this limit | |
| 65 | +| `log_operations` | bool | `false` | Log `feast apply` and `feast materialize` to a separate MLflow experiment | |
| 66 | +| `ops_experiment_suffix` | string | `"-feast-ops"` | Suffix appended to project name for the operations experiment | |
| 67 | + |
| 68 | +### Tracking URI resolution |
| 69 | + |
| 70 | +The tracking URI is resolved in this order: |
| 71 | + |
| 72 | +1. `tracking_uri` field in `feature_store.yaml` |
| 73 | +2. `MLFLOW_TRACKING_URI` environment variable |
| 74 | +3. MLflow's default (`./mlruns` local directory) |
| 75 | + |
| 76 | +This means you can omit `tracking_uri` from the YAML and set `MLFLOW_TRACKING_URI` in your environment instead. If neither is set, MLflow falls back to its default local `./mlruns` directory. |
| 77 | + |
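The precedence above can be sketched as a small pure function. This is an illustration only, not Feast's implementation: `resolve_tracking_uri` and `DEFAULT_TRACKING_URI` are hypothetical names, and the environment is passed in as a mapping so the behavior is easy to check.

```python
import os
from typing import Mapping, Optional

# Local directory that MLflow uses when no tracking URI is configured.
DEFAULT_TRACKING_URI = "./mlruns"


def resolve_tracking_uri(
    yaml_uri: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: apply the documented precedence order."""
    if yaml_uri:
        # 1. Explicit `tracking_uri` from feature_store.yaml wins.
        return yaml_uri
    env_uri = env.get("MLFLOW_TRACKING_URI")
    if env_uri:
        # 2. Otherwise use the environment variable.
        return env_uri
    # 3. Otherwise fall back to MLflow's local default.
    return DEFAULT_TRACKING_URI
```

Passing `None` for `yaml_uri` models a `feature_store.yaml` that omits the field.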
| 78 | +## What gets logged |
| 79 | + |
| 80 | +### Tags on retrieval runs |
| 81 | + |
| 82 | +When `auto_log: true` and an active MLflow run exists, each `get_historical_features()` or `get_online_features()` call records: |
| 83 | + |
| 84 | +| Tag | Example | Description | |
| 85 | +|-----|---------|-------------| |
| 86 | +| `feast.project` | `my_project` | Feast project name | |
| 87 | +| `feast.retrieval_type` | `historical` / `online` | Type of feature retrieval | |
| 88 | +| `feast.feature_service` | `driver_activity_v1` | Auto-resolved feature service name (if matched) | |
| 89 | +| `feast.feature_views` | `driver_hourly_stats` | Comma-separated feature view names | |
| 90 | +| `feast.feature_refs` | `driver_hourly_stats:conv_rate,...` | All feature references | |
| 91 | +| `feast.entity_count` | `200` | Number of entities in the request | |
| 92 | +| `feast.feature_count` | `5` | Number of features retrieved | |
| 93 | + |
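When reading these tags back from a run, it can help to group `feast.feature_refs` by feature view. The sketch below assumes the tag value is a comma-separated list of `view:feature` references, as the example column suggests. `group_feature_refs` is a hypothetical helper, and the feature names in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_feature_refs(tag_value: str) -> Dict[str, List[str]]:
    """Hypothetical helper: split a feast.feature_refs tag by feature view."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ref in tag_value.split(","):
        ref = ref.strip()
        if not ref:
            continue  # tolerate empty segments such as a trailing comma
        view, _, feature = ref.partition(":")
        grouped[view].append(feature)
    return dict(grouped)
```

For example, `group_feature_refs("driver_hourly_stats:conv_rate,driver_hourly_stats:acc_rate")` groups both features under the `driver_hourly_stats` key.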
| 94 | +### Metrics |
| 95 | + |
| 96 | +| Metric | Example | Description | |
| 97 | +|--------|---------|-------------| |
| 98 | +| `feast.job_submission_sec` | `0.4321` | Feature retrieval duration in seconds | |
| 99 | + |
| 100 | +### Artifacts |
| 101 | + |
| 102 | +When `auto_log_entity_df: true` and the entity DataFrame does not exceed `entity_df_max_rows` rows: |
| 103 | + |
| 104 | +| Artifact | Description | |
| 105 | +|----------|-------------| |
| 106 | +| `entity_df.parquet` | Full entity DataFrame used in the retrieval | |
| 107 | + |
| 108 | +When a model is logged via `store.mlflow.log_model()`: |
| 109 | + |
| 110 | +| Artifact | Description | |
| 111 | +|----------|-------------| |
| 112 | +| `feast_features.json` | JSON list of feature references the model was trained on | |
| 113 | + |
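The condition for uploading `entity_df.parquet` can be written as a simple predicate. This is a sketch under one assumption: following the options table ("skip ... for DataFrames exceeding this limit"), a DataFrame with exactly `entity_df_max_rows` rows is still uploaded. The function name is hypothetical and the real check may differ at the boundary.

```python
def should_archive_entity_df(
    row_count: int,
    auto_log_entity_df: bool,
    entity_df_max_rows: int = 100_000,
) -> bool:
    """Hypothetical predicate: should entity_df.parquet be uploaded?"""
    if not auto_log_entity_df:
        # Archival is opt-in; the default configuration never uploads.
        return False
    # Assumption: only frames strictly larger than the limit are skipped.
    return row_count <= entity_df_max_rows
```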
| 114 | +### Entity DataFrame metadata |
| 115 | + |
| 116 | +Regardless of `auto_log_entity_df`, the following metadata is logged when present: |
| 117 | + |
| 118 | +| Tag / Param | When | Description | |
| 119 | +|-------------|------|-------------| |
| 120 | +| `feast.entity_df_type` | Always | `dataframe`, `sql`, or `range` | |
| 121 | +| `feast.entity_df_rows` | DataFrame input | Row count | |
| 122 | +| `feast.entity_df_columns` | DataFrame input | Column names | |
| 123 | +| `feast.entity_df_query` | SQL input | The SQL query string | |
| 124 | +| `feast.start_date` / `feast.end_date` | Range-based input | Date range | |
| 125 | + |
| 126 | +### Operation logs |
| 127 | + |
| 128 | +When `log_operations: true`, `feast apply` and `feast materialize` create self-contained runs in the `{project}{ops_experiment_suffix}` experiment (with the default suffix, `my_project-feast-ops` for a project named `my_project`): |
| 129 | + |
| 130 | +**Apply runs:** |
| 131 | + |
| 132 | +| Tag / Metric | Example | |
| 133 | +|--------------|---------| |
| 134 | +| `feast.operation` | `apply` | |
| 135 | +| `feast.project` | `my_project` | |
| 136 | +| `feast.feature_views_changed` | `driver_hourly_stats,order_stats` | |
| 137 | +| `feast.feature_services_changed` | `driver_activity_v1` | |
| 138 | +| `feast.entities_changed` | `driver,restaurant` | |
| 139 | +| `feast.apply.feature_views_count` | `2` | |
| 140 | +| `feast.apply.feature_services_count` | `1` | |
| 141 | +| `feast.apply.entities_count` | `2` | |
| 142 | + |
| 143 | +**Materialize runs:** |
| 144 | + |
| 145 | +| Tag / Metric | Example | |
| 146 | +|--------------|---------| |
| 147 | +| `feast.operation` | `materialize` / `materialize_incremental` | |
| 148 | +| `feast.project` | `my_project` | |
| 149 | +| `feast.materialize.feature_views` | `driver_hourly_stats` | |
| 150 | +| `feast.materialize.start_date` | `2024-01-01T00:00:00` | |
| 151 | +| `feast.materialize.end_date` | `2024-01-02T00:00:00` | |
| 152 | +| `feast.materialize.duration_sec` | `12.3456` | |
| 153 | + |
| 154 | +## Usage |
| 155 | + |
| 156 | +### Automatic logging (zero code) |
| 157 | + |
| 158 | +With the configuration above, feature metadata is logged automatically whenever there is an active MLflow run. No explicit `import mlflow` is needed — just use `store.mlflow`: |
| 159 | + |
| 160 | +```python |
| 161 | +from feast import FeatureStore |
| 162 | + |
| 163 | +store = FeatureStore(".") |
| 164 | + |
| 165 | +with store.mlflow.start_run(run_name="my_training"): |
| 166 | +    training_df = store.get_historical_features( |
| 167 | +        features=store.get_feature_service("driver_activity_v1"), |
| 168 | +        entity_df=entity_df, |
| 169 | +    ).to_df() |
| 170 | +    # The run is now tagged with feast.feature_refs, feast.feature_views, etc. |
| 171 | + |
| 172 | +    model = train(training_df) |
| 173 | +    store.mlflow.log_model(model, "model") |
| 174 | +``` |
| 175 | + |
| 176 | +No extra code needed — the tags are written automatically. |
| 177 | + |
| 178 | +### `store.mlflow` API (recommended) |
| 179 | + |
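| 180 | +`store.mlflow` is the primary way to interact with the Feast–MLflow integration. It provides Feast-enhanced versions of common MLflow operations. It does not pass other calls through; the raw `mlflow` module and the `MlflowClient` remain reachable via `store.mlflow.mlflow` and `store.mlflow.client`. A typical workflow: |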
| 181 | + |
| 182 | +```python |
| 183 | +from feast import FeatureStore |
| 184 | +from sklearn.linear_model import LogisticRegression |
| 185 | + |
| 186 | +store = FeatureStore(".") |
| 187 | + |
| 188 | +# Training |
| 189 | +with store.mlflow.start_run(run_name="v1_training"): |
| 190 | +    df = store.get_historical_features( |
| 191 | +        features=store.get_feature_service("driver_activity_v1"), |
| 192 | +        entity_df=entity_df, |
| 193 | +    ).to_df() |
| 194 | +    # X, y: feature matrix and labels derived from df (column selection omitted) |
| 195 | +    model = LogisticRegression().fit(X, y) |
| 196 | +    store.mlflow.log_model(model, "model")  # Feast-enhanced: saves feast_features.json |
| 197 | +    train_run_id = store.mlflow.active_run_id |
| 198 | + |
| 199 | +# Register model (auto-tags version with feast.feature_service) |
| 200 | +store.mlflow.register_model(f"runs:/{train_run_id}/model", "driver_model") |
| 201 | + |
| 202 | +# Prediction (auto-links to training run) |
| 203 | +with store.mlflow.start_run(run_name="prediction"): |
| 204 | +    model = store.mlflow.load_model("models:/driver_model/1") |
| 205 | +    online_features = store.get_online_features( |
| 206 | +        features=store.get_feature_service("driver_activity_v1"), |
| 207 | +        entity_rows=[{"driver_id": 1001}], |
| 208 | +    ) |
| 209 | +    predictions = model.predict(...) |
| 210 | +``` |
| 211 | + |
| 212 | +### `feast.mlflow` module API (alternative) |
| 213 | + |
| 214 | +For users who prefer a module-level import, `feast.mlflow` is a **drop-in replacement for `import mlflow`** that delegates to the same `store.mlflow` client under the hood: |
| 215 | + |
| 216 | +```python |
| 217 | +import feast.mlflow |
| 218 | +from feast import FeatureStore |
| 219 | + |
| 220 | +store = FeatureStore(".")  # auto-registers with feast.mlflow |
| 221 | + |
| 222 | +with feast.mlflow.start_run(run_name="training"): |
| 223 | +    df = store.get_historical_features(...).to_df() |
| 224 | +    feast.mlflow.log_params({"lr": "0.01"})  # plain passthrough |
| 225 | +    feast.mlflow.log_metrics({"f1": 0.85})  # plain passthrough |
| 226 | +    feast.mlflow.log_model(model, "model")  # Feast-enhanced |
| 227 | +``` |
| 228 | + |
| 229 | +#### Store resolution |
| 230 | + |
| 231 | +`feast.mlflow` resolves its `FeatureStore` in this order: |
| 232 | + |
| 233 | +1. **Explicit `feast.mlflow.init(store)`** — if called, overrides everything |
| 234 | +2. **Auto-registered** — the most recently created `FeatureStore` with `mlflow.enabled=true` registers itself automatically |
| 235 | +3. **Auto-discovery** — falls back to `FeatureStore(".")` from the current directory |
| 236 | + |
| 237 | +In most cases, simply creating a `FeatureStore(...)` is enough — no `init()` needed. |
| 238 | + |
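The three-step store resolution can be pictured as a chain of fallbacks. The sketch below is illustrative only: `resolve_store` and its parameters are hypothetical, and the store is treated as an opaque object so the example stays self-contained.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def resolve_store(
    explicit: Optional[T],
    auto_registered: Optional[T],
    discover: Callable[[], T],
) -> T:
    """Hypothetical sketch of the documented lookup order."""
    if explicit is not None:
        # 1. A store bound via feast.mlflow.init(store) overrides everything.
        return explicit
    if auto_registered is not None:
        # 2. The most recently created store with mlflow.enabled=true.
        return auto_registered
    # 3. Last resort: discover a store from the current directory.
    return discover()
```

Taking `discover` as a callable defers the potentially expensive (or failing) directory lookup until both earlier options are exhausted.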
| 239 | +#### Error handling |
| 240 | + |
| 241 | +`feast.mlflow` raises clear errors on first use if something is misconfigured: |
| 242 | + |
| 243 | +| Condition | Error | |
| 244 | +|-----------|-------| |
| 245 | +| No `feature_store.yaml` in cwd and no store created | `RuntimeError` with guidance to call `feast.mlflow.init(store)` | |
| 246 | +| `mlflow.enabled` is not set to `true` | `RuntimeError` with guidance to set `mlflow.enabled=true` | |
| 247 | +| `mlflow` pip package not installed | `ImportError` with guidance to run `pip install feast[mlflow]` | |
| 248 | + |
| 249 | +When `mlflow.enabled` is `false` (or omitted), `store.mlflow` returns `None`, allowing callers to guard with `if store.mlflow:`. The `feast.mlflow` module raises `RuntimeError` only when you attempt to use it without an enabled store. |
| 250 | + |
| 251 | +### Feast-enhanced functions |
| 252 | + |
| 253 | +These functions add automatic Feast tagging and lineage on top of their MLflow counterparts: |
| 254 | + |
| 255 | +| Function | Enhancement | |
| 256 | +|----------|-------------| |
| 257 | +| `store.mlflow.start_run(run_name, tags)` | Auto-tags run with `feast.project` | |
| 258 | +| `store.mlflow.log_model(model, path, flavor)` | Auto-attaches `feast_features.json` artifact | |
| 259 | +| `store.mlflow.register_model(model_uri, name)` | Auto-tags model version with `feast.feature_service` | |
| 260 | +| `store.mlflow.load_model(model_uri)` | Auto-tags prediction run with training lineage | |
| 261 | + |
| 262 | +**Supported model flavors for `log_model()`:** `sklearn`, `pytorch`, `xgboost`, `lightgbm`, `tensorflow`, `keras`, `pyfunc`. |
| 263 | + |
| 264 | +### Feast-only functions |
| 265 | + |
| 266 | +These are unique to the Feast integration and have no `mlflow` equivalent: |
| 267 | + |
| 268 | +| Function | Description | |
| 269 | +|----------|-------------| |
| 270 | +| `store.mlflow.resolve_features(model_uri)` | Resolve model URI to Feast feature service name | |
| 271 | +| `store.mlflow.get_training_entity_df(run_id, ...)` | Recover entity DataFrame from a past MLflow run | |
| 272 | +| `store.mlflow.log_training_dataset(df, dataset_name)` | Log a training DataFrame as an MLflow dataset input | |
| 273 | +| `store.mlflow.active_run_id` | Current active MLflow run ID (or `None`) | |
| 274 | +| `store.mlflow.client` | The underlying `MlflowClient` instance for advanced queries | |
| 275 | +| `feast.mlflow.init(store)` | Explicitly bind `feast.mlflow` module to a `FeatureStore` (optional) | |
| 276 | + |
| 277 | +### Passthrough behavior |
| 278 | + |
| 279 | +The `feast.mlflow` module delegates any attribute not listed above to the raw `mlflow` module. This means you can use `feast.mlflow` as a drop-in replacement for `import mlflow`: |
| 280 | + |
| 281 | +```python |
| 282 | +feast.mlflow.log_params(params) # passes through to mlflow.log_params |
| 283 | +feast.mlflow.log_metrics(metrics) |
| 284 | +feast.mlflow.set_tag("env", "staging") |
| 285 | +feast.mlflow.MlflowClient() |
| 286 | +``` |
| 287 | + |
| 288 | +`store.mlflow` does **not** have this passthrough — it only exposes the Feast-enhanced and Feast-only methods listed above. To access raw `mlflow` functions from `store.mlflow`, use the escape hatches: |
| 289 | + |
| 290 | +```python |
| 291 | +store.mlflow.client.log_param(run_id, "lr", "0.01") # via MlflowClient instance |
| 292 | +store.mlflow.mlflow.log_params(params) # via raw mlflow module |
| 293 | +``` |
| 294 | + |
| 295 | +### Resolve a model back to its feature service |
| 296 | + |
| 297 | +```python |
| 298 | +from feast import FeatureStore |
| 299 | + |
| 300 | +store = FeatureStore(".") |
| 301 | +fs_name = store.mlflow.resolve_features("models:/driver_model/1") |
| 302 | +# Returns: "driver_activity_v1" |
| 303 | +``` |
| 304 | + |
| 305 | +Resolution order: |
| 306 | +1. Model version tag `feast.feature_service` (set by `register_model()`) |
| 307 | +2. Training run tag `feast.feature_service` (set by auto-logging) |
| 308 | + |
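The lookup order above amounts to "model version tag first, training run tag second". A minimal sketch, assuming both tag sets are already available as plain dictionaries (fetching them from MLflow is out of scope here, and `resolve_feature_service` is a hypothetical name):

```python
from typing import Mapping, Optional

FEATURE_SERVICE_TAG = "feast.feature_service"


def resolve_feature_service(
    version_tags: Mapping[str, str],
    run_tags: Mapping[str, str],
) -> Optional[str]:
    """Hypothetical sketch: prefer the model version tag, then the run tag."""
    # 1. Tag written on the model version by register_model().
    from_version = version_tags.get(FEATURE_SERVICE_TAG)
    if from_version:
        return from_version
    # 2. Tag written on the training run by auto-logging; None if absent.
    return run_tags.get(FEATURE_SERVICE_TAG) or None
```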
| 309 | +### Reproduce training from a past run |
| 310 | + |
| 311 | +```python |
| 312 | +from feast import FeatureStore |
| 313 | + |
| 314 | +store = FeatureStore(".") |
| 315 | + |
| 316 | +entity_df = store.mlflow.get_training_entity_df(run_id="abc123") |
| 317 | + |
| 318 | +with store.mlflow.start_run(run_name="retrain_v2"): |
| 319 | +    new_df = store.get_historical_features( |
| 320 | +        features=store.get_feature_service("driver_activity_v1"), |
| 321 | +        entity_df=entity_df, |
| 322 | +    ).to_df() |
| 323 | +    model = train(new_df) |
| 324 | +    store.mlflow.log_model(model, "model") |
| 325 | +``` |
| 326 | + |
| 327 | +This requires `auto_log_entity_df: true` to have been enabled when the original run was recorded. |
| 328 | + |
| 329 | +## Feast UI integration |
| 330 | + |
| 331 | +The Feast UI server exposes three API endpoints that aggregate data from MLflow: |
| 332 | + |
| 333 | +| Endpoint | Description | |
| 334 | +|----------|-------------| |
| 335 | +| `/api/mlflow-runs` | All Feast-tagged MLflow runs with linked registered models | |
| 336 | +| `/api/mlflow-feature-usage` | Per-feature-view usage stats (run count, last used, associated models) | |
| 337 | +| `/api/mlflow-feature-models` | Reverse index of feature refs to registered models | |
| 338 | + |
| 339 | +The feature view detail page in the Feast UI displays: |
| 340 | +- **MLflow Training Runs** count and **Last Used** date in the header stats |
| 341 | +- An **MLflow Usage** panel showing training run count, relative last-used time, and a table of registered models that depend on the feature view |
| 342 | + |
| 343 | +Start the Feast UI with: |
| 344 | + |
| 345 | +```bash |
| 346 | +feast ui --host 127.0.0.1 --port 8888 |
| 347 | +``` |