feat: MLflow distributed tracing for Feast feature server#6405
feat: MLflow distributed tracing for Feast feature server#6405Vperiodt wants to merge 16 commits into
Conversation
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
There was a problem hiding this comment.
🔴 Removing detect_types from SQLite connection breaks event_ts in retrieve_online_documents_v2
The PR removes detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES from _initialize_conn at sqlite.py:705. This flag was responsible for automatically converting stored timestamps back to datetime objects via the registered converters (lines 99-101). Since datetime objects are stored as epoch integers (via the adapt_datetime_epoch adapter at line 81), removing detect_types means timestamp columns now return raw int values instead of datetime objects.
The online_read method was correctly updated (lines 278-285) to handle int timestamps. However, retrieve_online_documents_v2 was NOT updated. At line 677, the check isinstance(entity_dict[entity_key_value]["event_ts"], datetime) will always be False because event_ts is now an int, causing res_event_ts to always be None. This loses all timestamp information for document retrieval results, which flows into _retrieve_from_online_store_v2 (feature_store.py:3575) and ultimately into the API response.
(Refers to lines 677-678)
Was this helpful? React with 👍 or 👎 to provide feedback.
|
linter is failing |
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
| @@ -0,0 +1,130 @@ | |||
| # Feast MLflow Tracing | |||
jyejare
left a comment
There was a problem hiding this comment.
This PR adds comprehensive MLflow distributed tracing support to Feast, enabling automatic trace collection for feature server endpoints, cross-process trace linking via traceparent headers, and in-process feature context tagging. The implementation is well-structured with proper separation of concerns, comprehensive test coverage, and graceful degradation when dependencies are missing.
| if isinstance(ts, (int, float)): | ||
| res_ts = datetime.fromtimestamp(ts, tz=timezone.utc) | ||
| else: | ||
| res_ts = ts.replace(tzinfo=timezone.utc) | ||
| ts = cast(datetime, ts) | ||
| if ts.tzinfo is not None: | ||
| res_ts = ts.astimezone(timezone.utc) | ||
| else: |
There was a problem hiding this comment.
[Warning] Inconsistent timestamp handling may cause runtime errors
The code handles both int/float and datetime timestamp types, but the original code assumed datetime objects. This change could break existing code that expects consistent datetime handling and may cause timezone issues with numeric timestamps.
Suggested:
| if isinstance(ts, (int, float)): | |
| res_ts = datetime.fromtimestamp(ts, tz=timezone.utc) | |
| else: | |
| res_ts = ts.replace(tzinfo=timezone.utc) | |
| ts = cast(datetime, ts) | |
| if ts.tzinfo is not None: | |
| res_ts = ts.astimezone(timezone.utc) | |
| else: | |
| ts = cast(datetime, ts) | |
| if ts.tzinfo is not None: | |
| res_ts = ts.astimezone(timezone.utc) | |
| else: | |
| res_ts = ts.replace(tzinfo=timezone.utc) |
| oracle = ["ibis-framework[oracle]>=10.0.0"] | ||
| mysql = ["pymysql", "types-PyMySQL"] | ||
| openlineage = ["openlineage-python>=1.40.0"] | ||
| mlflow = [ | ||
| "mlflow>=2.14.0", | ||
| "opentelemetry-api>=1.28.0", | ||
| "opentelemetry-sdk>=1.28.0", | ||
| "opentelemetry-instrumentation-fastapi>=0.49b0", | ||
| "opentelemetry-instrumentation-httpx>=0.49b0", |
There was a problem hiding this comment.
[Suggestion] Consider adding opentelemetry-exporter-otlp-proto-http dependency
The mlflow extra includes several OpenTelemetry packages but is missing opentelemetry-exporter-otlp-proto-http which was mentioned in the pixi.lock file. This could cause issues if OTLP export is needed.
Suggested:
| oracle = ["ibis-framework[oracle]>=10.0.0"] | |
| mysql = ["pymysql", "types-PyMySQL"] | |
| openlineage = ["openlineage-python>=1.40.0"] | |
| mlflow = [ | |
| "mlflow>=2.14.0", | |
| "opentelemetry-api>=1.28.0", | |
| "opentelemetry-sdk>=1.28.0", | |
| "opentelemetry-instrumentation-fastapi>=0.49b0", | |
| "opentelemetry-instrumentation-httpx>=0.49b0", | |
| mlflow = [ | |
| "mlflow>=2.14.0", | |
| "opentelemetry-api>=1.28.0", | |
| "opentelemetry-sdk>=1.28.0", | |
| "opentelemetry-instrumentation-fastapi>=0.49b0", | |
| "opentelemetry-instrumentation-httpx>=0.49b0", | |
| "opentelemetry-exporter-otlp-proto-http>=1.28.0", | |
| ] |
| return | ||
|
|
||
| try: | ||
| from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor | ||
|
|
||
| FastAPIInstrumentor.instrument_app(app) | ||
| logger.info("FastAPI OTEL instrumentation enabled for trace propagation") | ||
| except ImportError: | ||
| logger.debug( | ||
| "opentelemetry-instrumentation-fastapi not installed; " | ||
| "cross-process trace linking disabled" | ||
| ) | ||
|
|
There was a problem hiding this comment.
[Suggestion] Improve error handling in FastAPI instrumentation
The ImportError handling is good, but it would be helpful to also catch other potential exceptions during instrumentation and provide more specific error messages.
Suggested:
| return | |
| try: | |
| from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor | |
| FastAPIInstrumentor.instrument_app(app) | |
| logger.info("FastAPI OTEL instrumentation enabled for trace propagation") | |
| except ImportError: | |
| logger.debug( | |
| "opentelemetry-instrumentation-fastapi not installed; " | |
| "cross-process trace linking disabled" | |
| ) | |
| try: | |
| from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor | |
| FastAPIInstrumentor.instrument_app(app) | |
| logger.info("FastAPI OTEL instrumentation enabled for trace propagation") | |
| except ImportError: | |
| logger.debug( | |
| "opentelemetry-instrumentation-fastapi not installed; " | |
| "cross-process trace linking disabled" | |
| ) | |
| except Exception as e: | |
| logger.warning(f"Failed to instrument FastAPI app: {e}") |
| # async exporter must send the span before that happens. | ||
| if has_traceparent: | ||
| try: | ||
| _mlflow_mod.flush_trace_async_logging() | ||
| except Exception: |
There was a problem hiding this comment.
[Warning] Potential performance impact from synchronous flush in async context
Calling flush_trace_async_logging() synchronously in an async endpoint could block the event loop and impact performance. Consider making this operation truly asynchronous or moving it to a background task.
| def feast_trace_scope() -> Iterator[FeastTraceContext]: | ||
| """Context manager that creates and cleans up a ``FeastTraceContext``.""" | ||
| ctx = FeastTraceContext() | ||
| _thread_local.feast_ctx = ctx | ||
| try: | ||
| yield ctx | ||
| finally: | ||
| ctx.clear() |
There was a problem hiding this comment.
[Suggestion] Consider adding context validation
The feast_trace_scope context manager could benefit from validation to ensure the context is properly cleaned up even if exceptions occur during setup.
Suggested:
| def feast_trace_scope() -> Iterator[FeastTraceContext]: | |
| """Context manager that creates and cleans up a ``FeastTraceContext``.""" | |
| ctx = FeastTraceContext() | |
| _thread_local.feast_ctx = ctx | |
| try: | |
| yield ctx | |
| finally: | |
| ctx.clear() | |
| @contextmanager | |
| def feast_trace_scope() -> Iterator[FeastTraceContext]: | |
| """Context manager that creates and cleans up a ``FeastTraceContext``.""" | |
| old_ctx = getattr(_thread_local, 'feast_ctx', None) | |
| ctx = FeastTraceContext() | |
| _thread_local.feast_ctx = ctx | |
| try: | |
| yield ctx | |
| finally: | |
| ctx.clear() | |
| _thread_local.feast_ctx = old_ctx |
| enable_tracing: StrictBool = True | ||
| """ bool: When True and mlflow.enabled=True, server-side API calls | ||
| create MLflow trace spans via mlflow.start_span(). Spans appear | ||
| in the MLflow UI Traces tab and support parent-child linking via | ||
| traceparent headers. Defaults to True. """ | ||
|
|
There was a problem hiding this comment.
[Nitpick] Consider more descriptive field name
The field name 'enable_tracing' could be more specific since it only controls MLflow tracing, not all tracing in Feast.
Suggested:
| enable_tracing: StrictBool = True | |
| """ bool: When True and mlflow.enabled=True, server-side API calls | |
| create MLflow trace spans via mlflow.start_span(). Spans appear | |
| in the MLflow UI Traces tab and support parent-child linking via | |
| traceparent headers. Defaults to True. """ | |
| enable_distributed_tracing: StrictBool = True | |
| """ bool: When True and mlflow.enabled=True, server-side API calls | |
| create MLflow trace spans via mlflow.start_span(). Spans appear | |
| in the MLflow UI Traces tab and support parent-child linking via | |
| traceparent headers. Defaults to True. """ |
What this PR does / why we need it:
Server-side spans: Every feature retrieval endpoint (get-online-features, retrieve-online-documents, write-to-online-store) creates an MLflow span with attributes like feature names, entity count, and project name. Enabled via
mlflow.enable_tracing: truein feature_store.yaml.Cross-process trace linking: When the caller sends a W3C
traceparentheader, the Feast span becomes a child of the caller's trace. One trace tree, two processes. The MCP server also forwardsmcp-session-idas a span attribute for session grouping.In-process LLM span tagging: When an agent uses the Feast Python SDK directly (same process), a span processor automatically tags LLM spans with which features were retrieved.
Tracing never blocks feature serving. Missing MLflow, old MLflow version, or exceptions during span creation all degrade gracefully.
Which issue(s) this PR fixes:
Checks
git commit -s)Testing Strategy
Misc