Skip to content

feat: MLflow distributed tracing for Feast feature server#6405

Open
Vperiodt wants to merge 16 commits into
feast-dev:masterfrom
Vperiodt:mcp-trace
Open

feat: MLflow distributed tracing for Feast feature server#6405
Vperiodt wants to merge 16 commits into
feast-dev:masterfrom
Vperiodt:mcp-trace

Conversation

@Vperiodt

@Vperiodt Vperiodt commented May 14, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

  1. Server-side spans: Every feature retrieval endpoint (get-online-features, retrieve-online-documents, write-to-online-store) creates an MLflow span with attributes like feature names, entity count, and project name. Enabled via mlflow.enable_tracing: true in feature_store.yaml.

  2. Cross-process trace linking: When the caller sends a W3C traceparent header, the Feast span becomes a child of the caller's trace. One trace tree, two processes. The MCP server also forwards mcp-session-id as a span attribute for session grouping.

  3. In-process LLM span tagging: When an agent uses the Feast Python SDK directly (same process), a span processor automatically tags LLM spans with which features were retrieved.

Tracing never blocks feature serving. Missing MLflow, old MLflow version, or exceptions during span creation all degrade gracefully.

Configuration:
    mlflow:
      enabled: true
      tracking_uri: "http://mlflow:5000"
      enable_tracing: true

Which issue(s) this PR fixes:

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Manual tests
  • Testing is not required for this change

Misc


Open in Devin Review

@Vperiodt Vperiodt changed the title mlflow-tracing [WIP] mlflow-tracing May 14, 2026
@Vperiodt Vperiodt changed the title [WIP] mlflow-tracing [WIP] : mlflow-tracing May 24, 2026
Vperiodt added 2 commits May 24, 2026 14:10
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Vperiodt and others added 2 commits May 25, 2026 11:59
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@Vperiodt Vperiodt changed the title [WIP] : mlflow-tracing feat: MLflow native distributed tracing for Feast feature server May 26, 2026
@Vperiodt Vperiodt marked this pull request as ready for review May 26, 2026 14:59
@Vperiodt Vperiodt requested a review from a team as a code owner May 26, 2026 14:59
devin-ai-integration[bot]

This comment was marked as resolved.

Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@Vperiodt Vperiodt changed the title feat: MLflow native distributed tracing for Feast feature server feat: MLflow distributed tracing for Feast feature server May 27, 2026

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 7 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Removing detect_types from SQLite connection breaks event_ts in retrieve_online_documents_v2

The PR removes detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES from _initialize_conn at sqlite.py:705. This flag was responsible for automatically converting stored timestamps back to datetime objects via the registered converters (lines 99-101). Since datetime objects are stored as epoch integers (via the adapt_datetime_epoch adapter at line 81), removing detect_types means timestamp columns now return raw int values instead of datetime objects.

The online_read method was correctly updated (lines 278-285) to handle int timestamps. However, retrieve_online_documents_v2 was NOT updated. At line 677, the check isinstance(entity_dict[entity_key_value]["event_ts"], datetime) will always be False because event_ts is now an int, causing res_event_ts to always be None. This loses all timestamp information for document retrieval results, which flows into _retrieve_from_online_store_v2 (feature_store.py:3575) and ultimately into the API response.

(Refers to lines 677-678)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@franciscojavierarceo

Copy link
Copy Markdown
Member

linter is failing

Vperiodt and others added 9 commits June 2, 2026 13:29
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@@ -0,0 +1,130 @@
# Feast MLflow Tracing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this official docs

@jyejare jyejare left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds comprehensive MLflow distributed tracing support to Feast, enabling automatic trace collection for feature server endpoints, cross-process trace linking via traceparent headers, and in-process feature context tagging. The implementation is well-structured with proper separation of concerns, comprehensive test coverage, and graceful degradation when dependencies are missing.

Comment on lines +278 to +284
if isinstance(ts, (int, float)):
res_ts = datetime.fromtimestamp(ts, tz=timezone.utc)
else:
res_ts = ts.replace(tzinfo=timezone.utc)
ts = cast(datetime, ts)
if ts.tzinfo is not None:
res_ts = ts.astimezone(timezone.utc)
else:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Warning] Inconsistent timestamp handling may cause runtime errors

The code handles both int/float and datetime timestamp types, but the original code assumed datetime objects. This change could break existing code that expects consistent datetime handling and may cause timezone issues with numeric timestamps.

Suggested:

Suggested change
if isinstance(ts, (int, float)):
res_ts = datetime.fromtimestamp(ts, tz=timezone.utc)
else:
res_ts = ts.replace(tzinfo=timezone.utc)
ts = cast(datetime, ts)
if ts.tzinfo is not None:
res_ts = ts.astimezone(timezone.utc)
else:
ts = cast(datetime, ts)
if ts.tzinfo is not None:
res_ts = ts.astimezone(timezone.utc)
else:
res_ts = ts.replace(tzinfo=timezone.utc)

Comment thread pyproject.toml
Comment on lines 105 to +113
oracle = ["ibis-framework[oracle]>=10.0.0"]
mysql = ["pymysql", "types-PyMySQL"]
openlineage = ["openlineage-python>=1.40.0"]
mlflow = [
"mlflow>=2.14.0",
"opentelemetry-api>=1.28.0",
"opentelemetry-sdk>=1.28.0",
"opentelemetry-instrumentation-fastapi>=0.49b0",
"opentelemetry-instrumentation-httpx>=0.49b0",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Consider adding opentelemetry-exporter-otlp-proto-http dependency

The mlflow extra includes several OpenTelemetry packages but is missing opentelemetry-exporter-otlp-proto-http which was mentioned in the pixi.lock file. This could cause issues if OTLP export is needed.

Suggested:

Suggested change
oracle = ["ibis-framework[oracle]>=10.0.0"]
mysql = ["pymysql", "types-PyMySQL"]
openlineage = ["openlineage-python>=1.40.0"]
mlflow = [
"mlflow>=2.14.0",
"opentelemetry-api>=1.28.0",
"opentelemetry-sdk>=1.28.0",
"opentelemetry-instrumentation-fastapi>=0.49b0",
"opentelemetry-instrumentation-httpx>=0.49b0",
mlflow = [
"mlflow>=2.14.0",
"opentelemetry-api>=1.28.0",
"opentelemetry-sdk>=1.28.0",
"opentelemetry-instrumentation-fastapi>=0.49b0",
"opentelemetry-instrumentation-httpx>=0.49b0",
"opentelemetry-exporter-otlp-proto-http>=1.28.0",
]

Comment on lines +313 to +325
return

try:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

FastAPIInstrumentor.instrument_app(app)
logger.info("FastAPI OTEL instrumentation enabled for trace propagation")
except ImportError:
logger.debug(
"opentelemetry-instrumentation-fastapi not installed; "
"cross-process trace linking disabled"
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Improve error handling in FastAPI instrumentation

The ImportError handling is good, but it would be helpful to also catch other potential exceptions during instrumentation and provide more specific error messages.

Suggested:

Suggested change
return
try:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
FastAPIInstrumentor.instrument_app(app)
logger.info("FastAPI OTEL instrumentation enabled for trace propagation")
except ImportError:
logger.debug(
"opentelemetry-instrumentation-fastapi not installed; "
"cross-process trace linking disabled"
)
try:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
FastAPIInstrumentor.instrument_app(app)
logger.info("FastAPI OTEL instrumentation enabled for trace propagation")
except ImportError:
logger.debug(
"opentelemetry-instrumentation-fastapi not installed; "
"cross-process trace linking disabled"
)
except Exception as e:
logger.warning(f"Failed to instrument FastAPI app: {e}")

Comment on lines +139 to +143
# async exporter must send the span before that happens.
if has_traceparent:
try:
_mlflow_mod.flush_trace_async_logging()
except Exception:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Warning] Potential performance impact from synchronous flush in async context

Calling flush_trace_async_logging() synchronously in an async endpoint could block the event loop and impact performance. Consider making this operation truly asynchronous or moving it to a background task.

Comment on lines +75 to +82
def feast_trace_scope() -> Iterator[FeastTraceContext]:
"""Context manager that creates and cleans up a ``FeastTraceContext``."""
ctx = FeastTraceContext()
_thread_local.feast_ctx = ctx
try:
yield ctx
finally:
ctx.clear()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Consider adding context validation

The feast_trace_scope context manager could benefit from validation to ensure the context is properly cleaned up even if exceptions occur during setup.

Suggested:

Suggested change
def feast_trace_scope() -> Iterator[FeastTraceContext]:
"""Context manager that creates and cleans up a ``FeastTraceContext``."""
ctx = FeastTraceContext()
_thread_local.feast_ctx = ctx
try:
yield ctx
finally:
ctx.clear()
@contextmanager
def feast_trace_scope() -> Iterator[FeastTraceContext]:
"""Context manager that creates and cleans up a ``FeastTraceContext``."""
old_ctx = getattr(_thread_local, 'feast_ctx', None)
ctx = FeastTraceContext()
_thread_local.feast_ctx = ctx
try:
yield ctx
finally:
ctx.clear()
_thread_local.feast_ctx = old_ctx

Comment on lines +63 to +68
enable_tracing: StrictBool = True
""" bool: When True and mlflow.enabled=True, server-side API calls
create MLflow trace spans via mlflow.start_span(). Spans appear
in the MLflow UI Traces tab and support parent-child linking via
traceparent headers. Defaults to True. """

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick] Consider more descriptive field name

The field name 'enable_tracing' could be more specific since it only controls MLflow tracing, not all tracing in Feast.

Suggested:

Suggested change
enable_tracing: StrictBool = True
""" bool: When True and mlflow.enabled=True, server-side API calls
create MLflow trace spans via mlflow.start_span(). Spans appear
in the MLflow UI Traces tab and support parent-child linking via
traceparent headers. Defaults to True. """
enable_distributed_tracing: StrictBool = True
""" bool: When True and mlflow.enabled=True, server-side API calls
create MLflow trace spans via mlflow.start_span(). Spans appear
in the MLflow UI Traces tab and support parent-child linking via
traceparent headers. Defaults to True. """

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants