Commit 7d1ef73

jyejare authored and ntkathole committed
docs: Operational Metrics for offline store retrieval and SOX Complaince metrics
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
1 parent 96d7169 commit 7d1ef73

4 files changed

Lines changed: 386 additions & 0 deletions

@@ -0,0 +1,386 @@
---
title: "Extending Feast Observability: Offline Store Metrics and SOX Audit Logging"
description: "Feast now captures RED metrics for offline store retrievals and emits structured SOX audit logs for both online and offline feature access, closing the observability gap between serving and training paths."
date: 2026-06-09
authors: ["Jitendra Yejare"]
---

<div class="hero-image">
<img src="/images/blog/feast-metrics-hero.png" alt="Feast Offline Store Metrics and SOX Audit Logging: Prometheus metrics for offline retrievals and structured audit logs for compliance" loading="lazy">
</div>

# Extending Feast Observability: Offline Store Metrics and SOX Audit Logging

In [our previous post](/blog/feast-feature-server-monitoring), we introduced built-in Prometheus metrics for the Feast feature server, covering the full online serving lifecycle from HTTP request handling through online store reads, on-demand feature transformations, materialization pipelines, and feature freshness tracking.

That covered the **online** path. But production ML systems don't just serve features in real time; they also build training datasets through offline store retrievals. And for teams operating in regulated environments (financial services, healthcare, government), observability isn't enough. You need an **auditable record** of who accessed what data, when, and how much.

This post covers two new capabilities added to Feast:

1. **Offline Store RED Metrics**: Prometheus counters and histograms for offline store retrieval operations (request rate, error rate, latency, row counts)
2. **SOX Audit Logging**: Structured JSON audit log entries for both online and offline feature retrieval paths, routed to a dedicated `feast.audit` logger

Together, these close the observability gap between online and offline operations and give compliance teams the structured audit trail they need.

## Offline Store Metrics: Closing the Observability Gap

The online feature server already had comprehensive metrics, but the offline store, where `get_historical_features` queries execute against your data warehouse to build training datasets, had no instrumentation. This matters because training-serving skew, stalled pipelines, and data volume anomalies can all originate in the offline path.

### The Problem

Without offline store metrics, teams faced three blind spots:

- **Silent training failures**: An offline retrieval that returns incomplete data (or errors out) produces a corrupted training dataset. Models trained on bad data degrade in production, and without metrics, there's no signal until prediction quality drops.
- **Invisible pipeline stalls**: A `get_historical_features` call that normally takes 30 seconds but suddenly takes 10 minutes looks like a "hang" from the orchestrator's perspective. No latency metrics means no alerting until the pipeline times out.
- **Data volume anomalies**: If a typical training query returns 500K rows but suddenly returns 50K, something changed upstream. Without row count tracking, this silently propagates into model training.

### How Feast Solves It

Feast now automatically captures RED metrics (Rate, Errors, Duration) for every offline store retrieval, regardless of the backend. Whether you're running against BigQuery, Redshift, Snowflake, DuckDB, or local files, you get the same three Prometheus metrics out of the box:

- **`feast_offline_store_request_total`**: Counts every retrieval, labeled by success/error. Set an alert and know immediately when training pipelines start failing.
- **`feast_offline_store_request_latency_seconds`**: Latency histogram with buckets tuned for offline workloads (`0.1s` to `10min`). Set SLOs and catch slow queries before pipelines time out.
- **`feast_offline_store_row_count`**: Row count histogram covering `100` to `5M` rows. Detect data volume anomalies before they reach model training.

Metrics collection never interferes with your queries: if the metrics path fails for any reason, your offline retrieval completes normally.

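One way to picture this best-effort behaviour is a wrapper that times the retrieval and swallows any failure in the metrics path. The following is a minimal sketch, not Feast's actual implementation: `instrumented_retrieval` and the `record` callback are hypothetical names, and `record` stands in for whatever code updates the Prometheus counter and histograms.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger(__name__)


def instrumented_retrieval(
    retrieve: Callable[[], Any],
    record: Callable[[str, float], None],
) -> Any:
    """Run `retrieve`, then report (status, duration_seconds) via `record`.

    Hypothetical helper: failures inside `record` are logged and swallowed,
    so they never change the outcome of the retrieval itself.
    """
    start = time.monotonic()
    status = "success"
    try:
        return retrieve()
    except Exception:
        status = "error"
        raise  # retrieval errors still propagate to the caller
    finally:
        try:
            record(status, time.monotonic() - start)
        except Exception:
            logger.debug("metrics recording failed; ignoring", exc_info=True)
```

With this shape, a broken metrics backend costs at most a debug log line, while a failing query is still counted with `status="error"` before its exception reaches the caller.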
```yaml
# Alert when offline retrievals start failing
- alert: FeastOfflineStoreErrors
  expr: rate(feast_offline_store_request_total{status="error"}[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: >
      Offline store retrievals are failing ({{ $value }} errors/sec).
      Training pipelines may be producing incomplete datasets.
```

## Why SOX Audit Logging Matters

Organizations subject to SOX (Sarbanes-Oxley), GDPR, HIPAA, or other regulatory frameworks need to answer questions like:

- *Who accessed customer features at 3:47 PM on March 15th?*
- *Which feature views were involved in the training dataset built yesterday?*
- *How many rows of PII-adjacent data were retrieved by the batch scoring pipeline?*

Before this change, answering these questions required parsing unstructured application logs and correlating timestamps across services. Feature stores sit at the intersection of data access and ML model behavior, yet most have no structured audit trail.

Feast now emits **structured JSON audit entries** for both online and offline retrieval paths, routed to a dedicated `feast.audit` logger that can be independently sent to your SIEM, log aggregator, or compliance sink without touching your operational log pipeline.

What makes this production-ready:

- **PII-minimized by design.** Entity key *names* are logged, not *values*. A compliance auditor sees "the ML pipeline accessed `user_id` features from `transaction_features` at 3:47 PM" without the log itself containing PII.
- **Dedicated logger.** Audit entries go to `feast.audit`, separate from the application logger. Route them to a SOX-compliant sink (Splunk, ELK with retention policies, S3 with WORM locks) independently.
- **Never breaks your serving path.** Audit logging is best-effort: a broken audit sink never affects feature serving latency or availability.
- **Zero overhead when disabled.** `audit_logging` defaults to `false`. Enable it only when you need it.

## The New Metrics

### Offline Store RED Metrics

| Metric | Type | Labels | What It Answers |
|--------|------|--------|-----------------|
| `feast_offline_store_request_total` | Counter | `method`, `status` | What is my offline retrieval throughput and error rate? |
| `feast_offline_store_request_latency_seconds` | Histogram | `method` | How long are my training data queries taking? |
| `feast_offline_store_row_count` | Histogram | `method` | How much data are my offline retrievals returning? |

The `method` label captures the retrieval type (`to_arrow`), and `status` is `success` or `error`. The latency histogram uses wide buckets tuned for offline workloads: `0.1s, 0.5s, 1s, 5s, 10s, 30s, 60s, 2min, 5min, 10min`. Offline queries can range from sub-second (small entity sets against local files) to minutes (large point-in-time joins against BigQuery or Redshift).

The row count histogram uses roughly logarithmic buckets: `100, 1K, 10K, 100K, 500K, 1M, 5M`. These cover the range from small test retrievals to production training datasets.

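To see what these boundaries mean in practice, the sketch below maps a row count to the smallest bucket upper bound (`le`) that contains it. Prometheus buckets are cumulative, so an observation is also counted in every larger bucket. The boundaries are copied from the paragraph above; `bucket_for` is a hypothetical helper, not part of Feast.

```python
from bisect import bisect_left

# Upper bounds (le) of the row count buckets listed above.
ROW_COUNT_BUCKETS = [100, 1_000, 10_000, 100_000, 500_000, 1_000_000, 5_000_000]


def bucket_for(row_count: int) -> float:
    """Return the smallest bucket upper bound >= row_count (inf if none)."""
    index = bisect_left(ROW_COUNT_BUCKETS, row_count)
    if index == len(ROW_COUNT_BUCKETS):
        return float("inf")  # only the implicit +Inf bucket contains it
    return ROW_COUNT_BUCKETS[index]


print(bucket_for(500_000))  # 500000
print(bucket_for(50_000))   # 100000
```

A drop from 500K to 50K rows moves observations from the 500K bucket to the 100K bucket, so it is visible in the histogram. A drop from 400K to 150K stays inside the same bucket and would only show up in the `_sum / _count` average.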
### SOX Audit Log Entries

**Online feature request audit entry:**

```json
{
  "event": "online_feature_request",
  "timestamp": "2026-06-07T14:42:29.739Z",
  "requestor_id": "service-account:ml-pipeline",
  "entity_keys": ["driver_id"],
  "entity_count": 5,
  "feature_views": ["driver_hourly_stats"],
  "feature_count": 3,
  "status": "success",
  "latency_ms": 12.45
}
```

**Offline feature retrieval audit entry:**

```json
{
  "event": "offline_feature_retrieval",
  "timestamp": "2026-06-07T14:42:29.739Z",
  "method": "to_arrow",
  "start_time": "2026-06-07T14:42:29.697Z",
  "end_time": "2026-06-07T14:42:29.739Z",
  "feature_views": ["driver_hourly_stats"],
  "feature_count": 3,
  "row_count": 150000,
  "status": "success",
  "duration_ms": 42.39
}
```

The entries are shown pretty-printed here for readability. In the log, each entry is a single JSON line, making it trivial to parse with `jq`, ingest into Elasticsearch, or stream to a Kafka topic for compliance processing.

**Note on accessor identity:** Online audit entries include `requestor_id`, extracted from the Feast authentication layer (SecurityManager). Offline retrievals run as direct SDK calls in the user's own process (a notebook, Airflow task, or training script), so there is no server in the middle to extract auth context. In production SOX environments, offline accessor identity is typically established at the infrastructure level: the Kubernetes service account running the job, the IAM role accessing the data warehouse, or the CI/CD pipeline identity. A future enhancement could optionally capture identity from `os.getenv("USER")` or an explicit SDK parameter.

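As an illustration of the entry shape, the sketch below builds an offline audit entry and writes it to the `feast.audit` logger as a single JSON line. The field names mirror the example above; `build_offline_audit_entry` and `emit_audit_entry` are hypothetical helpers written for this post, not Feast's internal API, and the timestamps are assumed to be timezone-aware.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("feast.audit")


def _iso(ts: datetime) -> str:
    """Format an aware datetime as UTC ISO 8601 with millisecond precision."""
    text = ts.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def build_offline_audit_entry(start, end, feature_views, feature_count, row_count, status):
    """Assemble the payload: names and counts only, never entity values."""
    return {
        "event": "offline_feature_retrieval",
        "timestamp": _iso(end),
        "method": "to_arrow",
        "start_time": _iso(start),
        "end_time": _iso(end),
        "feature_views": list(feature_views),
        "feature_count": feature_count,
        "row_count": row_count,
        "status": status,
        "duration_ms": round((end - start).total_seconds() * 1000, 2),
    }


def emit_audit_entry(entry: dict) -> None:
    """Best-effort emit: a failing audit sink must not break the caller."""
    try:
        audit_logger.info(json.dumps(entry))
    except Exception:
        pass  # audit logging is best-effort by design
```

Because `json.dumps` produces no newlines by default, each call yields exactly one log line, which is what keeps downstream tools such as `jq` simple to use.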
## Enabling the New Metrics

### YAML Configuration

Add `offline_features` and `audit_logging` to your `feature_store.yaml`:

```yaml
feature_server:
  metrics:
    enabled: true
    resource: true
    request: true
    online_features: true
    push: true
    materialization: true
    freshness: true
    offline_features: true   # NEW: Offline store RED metrics
    audit_logging: true      # NEW: SOX audit log entries
```

`offline_features` defaults to `true` when metrics are enabled (consistent with other categories). `audit_logging` defaults to `false`; it's opt-in because audit entries have a non-trivial cost (JSON serialization + I/O per request) and are only needed in regulated environments.

### CLI

When using `feast serve --metrics`, offline store metrics are enabled by default. Audit logging still requires the YAML toggle since it's opt-in.

### Routing Audit Logs

The `feast.audit` logger is a standard Python logger. Configure it like any other:

```python
import logging

# Dedicated audit logger, kept separate from the application log stream
audit_logger = logging.getLogger("feast.audit")
audit_logger.setLevel(logging.INFO)
audit_logger.propagate = False  # do not duplicate entries into the root logger

# Write each entry as-is so the file contains one raw JSON object per line
handler = logging.FileHandler("/var/log/feast/audit.log")
handler.setFormatter(logging.Formatter("%(message)s"))
audit_logger.addHandler(handler)
```

Or route to a JSON-aware sink in production:

```yaml
# logging.yaml for production
loggers:
  feast.audit:
    level: INFO
    propagate: false
    handlers: [audit_file, splunk_forwarder]
```

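The YAML fragment above references handlers that must be defined elsewhere in the same logging config. As a rough sketch of what a complete configuration could look like, the snippet below uses the standard library's `logging.config.dictConfig` with a single file handler. `configure_audit_logging` is a hypothetical helper, and the `splunk_forwarder` handler is left out because its definition depends on your forwarding setup.

```python
import logging
import logging.config


def configure_audit_logging(path: str) -> None:
    """Route `feast.audit` to its own file, one raw JSON line per record."""
    logging.config.dictConfig(
        {
            "version": 1,
            "disable_existing_loggers": False,
            # Emit the message only, so the file stays valid JSON Lines.
            "formatters": {"raw": {"format": "%(message)s"}},
            "handlers": {
                "audit_file": {
                    "class": "logging.FileHandler",
                    "filename": path,
                    "formatter": "raw",
                }
            },
            "loggers": {
                "feast.audit": {
                    "level": "INFO",
                    "propagate": False,
                    "handlers": ["audit_file"],
                }
            },
        }
    )
```

Calling `configure_audit_logging("/var/log/feast/audit.log")` once at process start would have the same effect as the imperative setup shown earlier, while keeping the routing in one declarative structure.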
## Key PromQL Queries for Offline Store

**Throughput and errors:**

```promql
# Offline retrieval rate
rate(feast_offline_store_request_total[5m])

# Offline error rate
sum(rate(feast_offline_store_request_total{status="error"}[5m]))
  / sum(rate(feast_offline_store_request_total[5m]))
```

**Latency percentiles:**

```promql
# Offline retrieval p95 latency
histogram_quantile(0.95,
  sum(rate(feast_offline_store_request_latency_seconds_bucket[5m])) by (le))

# Average offline retrieval duration
rate(feast_offline_store_request_latency_seconds_sum[5m])
  / rate(feast_offline_store_request_latency_seconds_count[5m])
```

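Percentiles computed from a histogram are estimates whose precision depends on the bucket layout. The sketch below is a simplified Python version of the linear interpolation that Prometheus applies in `histogram_quantile`, written here only to show why wide buckets give coarse answers. The function is illustrative and the sample counts used with it are made up.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into +Inf
            in_bucket = cumulative - count_below
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return float("nan")


# Made-up cumulative counts for the latency buckets (seconds) listed earlier.
sample = [(0.1, 0), (0.5, 0), (1, 0), (5, 10), (10, 40), (30, 80),
          (60, 90), (120, 96), (300, 100), (600, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, sample)
```

With these sample counts the 95th observation falls in the 60s to 120s bucket, so the estimate lands somewhere inside that range regardless of where the real values sit. Any p95 between one and two minutes is therefore only as precise as that bucket width.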
**Row count analysis:**

```promql
# Average rows per retrieval (cumulative since the counters last reset)
feast_offline_store_row_count_sum / feast_offline_store_row_count_count

# p95 row count (detect large retrievals)
histogram_quantile(0.95,
  sum(rate(feast_offline_store_row_count_bucket[5m])) by (le))
```

## Building Alerts for Offline Store

### Offline Retrieval Failures

```yaml
- alert: FeastOfflineStoreErrors
  expr: rate(feast_offline_store_request_total{status="error"}[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: >
      Offline store retrievals are failing.
      Training pipelines may be producing incomplete datasets.
```

### Slow Offline Queries

```yaml
- alert: FeastOfflineStoreSlowQuery
  expr: |
    histogram_quantile(0.95,
      sum(rate(feast_offline_store_request_latency_seconds_bucket[5m])) by (le)
    ) > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: >
      Offline store p95 latency is {{ $value | humanizeDuration }}.
      Training pipelines may be stalling.
```

### Row Count Anomaly

```yaml
- alert: FeastOfflineStoreRowCountDrop
  expr: |
    feast_offline_store_row_count_sum / feast_offline_store_row_count_count
      < 0.5 * avg_over_time(
        (feast_offline_store_row_count_sum / feast_offline_store_row_count_count)[1d:1h])
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: >
      Average rows per offline retrieval dropped by >50%.
      Possible upstream data issue.
```

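The comparison this alert performs can be restated in a few lines of Python: take the current average rows per retrieval and check whether it is below half of the mean of the hourly samples from the last day. `row_count_dropped` is a hypothetical helper for illustration only, and the numbers you feed it would come from your own Prometheus data.

```python
from statistics import mean
from typing import Sequence


def row_count_dropped(
    current_avg: float,
    baseline_avgs: Sequence[float],
    ratio: float = 0.5,
) -> bool:
    """True when the current average is below `ratio` times the baseline mean."""
    if not baseline_avgs:
        return False  # no history yet, nothing to compare against
    return current_avg < ratio * mean(baseline_avgs)
```

One caveat worth validating against your own data: `_sum / _count` on raw series averages over the whole lifetime of the process, so it reacts slowly to a sudden drop. A windowed ratio, such as `rate()` of `_sum` divided by `rate()` of `_count` over the same range, may be more sensitive.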
## The Extended Grafana Dashboard

We've extended the existing Feast Grafana dashboard with a dedicated **Offline Store** section containing six new panels:

- **Offline Store Request Rate**: Rate of offline retrievals by method and status
- **Offline Store Total Requests**: Cumulative request counts (stat panel)
- **Offline Store Retrieval Latency (p50/p95/p99)**: Latency percentile time series
- **Offline Store Row Count Distribution**: Row count percentiles over time
- **Avg Offline Retrieval Duration**: Average duration per method
- **Offline Store Error Rate**: Gauge showing current error percentage with threshold coloring

<div class="content-image">
<img src="/images/blog/offline_store_operational_metrics.png" alt="Grafana dashboard showing dedicated offline store containing six new panels" loading="lazy">
</div>

These panels sit alongside the existing online store panels, giving you a single dashboard that covers both serving paths.

For SOX compliance, a separate **Audit Trail** dashboard powered by Loki visualizes:

- **Total Audited Events**: Count of all audited access events
- **Online vs Offline Access Timeline**: Stacked time series showing access patterns
- **Offline Data Volume**: Total rows retrieved over time, flagging bulk data exports
- **Anomaly Detection**: Large row counts and slow queries that may need compliance review

<div class="content-image">
<img src="/images/blog/sox_compliance_and_access.png" alt="Grafana dashboard showing SOX compliance and access containing five new panels" loading="lazy">
</div>

- **Live Audit Log Stream**: Raw structured audit entries, expandable for investigation

<div class="content-image">
<img src="/images/blog/sox_offline_store_audit_logs.png" alt="Grafana dashboard showing audit logs for offline store" loading="lazy">
</div>

## Updated Metrics Summary

| Category | Metric | What It Answers |
|----------|--------|-----------------|
| **Online** Request | `feast_feature_server_request_total` | What is my online throughput and error rate? |
| **Online** Request | `feast_feature_server_request_latency_seconds` | What are my online p50/p99 latencies? |
| **Online** Features | `feast_online_features_entity_count` | What is my online traffic shape? |
| **Online** Store Read | `feast_feature_server_online_store_read_duration_seconds` | Is my online store the bottleneck? |
| ODFV Transform | `feast_feature_server_transformation_duration_seconds` | How expensive are my read-path transforms? |
| ODFV Transform | `feast_feature_server_write_transformation_duration_seconds` | How expensive are my write-path transforms? |
| Push | `feast_push_request_total` | Is my ingestion pipeline sending data? |
| Materialization | `feast_materialization_total` | Are my pipelines succeeding? |
| Materialization | `feast_materialization_duration_seconds` | How long do my pipelines take? |
| Freshness | `feast_feature_freshness_seconds` | How stale is the data my models are using? |
| Resource | `feast_feature_server_cpu_usage`, `feast_feature_server_memory_usage` | Is my server healthy? |
| **Offline** Request | `feast_offline_store_request_total` | What is my offline retrieval throughput? |
| **Offline** Latency | `feast_offline_store_request_latency_seconds` | How long are my training queries taking? |
| **Offline** Row Count | `feast_offline_store_row_count` | How much data are retrievals returning? |
| **Audit** | `feast.audit` logger (online) | Who requested which features, when? |
| **Audit** | `feast.audit` logger (offline) | Which training datasets were built, with how much data? |

## How to Try It

### Automated Demo

We've extended the [feast-prometheus-metrics](https://github.com/ntkathole/feast-automated-setups/tree/main/feast-prometheus-metrics) automated demo to include offline store metrics and SOX audit logging. The extended traffic generator exercises both online and offline paths:

```bash
# Clone and run
git clone https://github.com/ntkathole/feast-automated-setups.git
cd feast-automated-setups/feast-prometheus-metrics

# Run setup (uses feast from your environment)
./setup.sh

# Generate extended traffic including offline retrievals
python3 generate_traffic_extended.py \
  --url http://localhost:6566 \
  --duration 120 \
  --repo-path workspace/feast_demo/feature_repo \
  --log-dir workspace/logs
```

After traffic generation, check the audit log:

```bash
# Pretty-print the structured audit entries.
# The log holds one JSON object per line, so json.tool needs --json-lines (Python 3.8+).
python3 -m json.tool --json-lines workspace/logs/feast_audit.log

# Count by event type
cat workspace/logs/feast_audit.log | \
  python3 -c "import sys,json; events=[json.loads(l)['event'] for l in sys.stdin]; print({e:events.count(e) for e in set(events)})"
```

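For anything beyond a quick count, a small script is easier to extend than a one-liner. The sketch below tallies entries per event type and totals the rows returned by offline retrievals. `summarize_audit_log` is a hypothetical helper, and it assumes the field names shown in the audit entry examples earlier in this post.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def summarize_audit_log(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Count entries per event type and total the offline rows retrieved."""
    events: Counter = Counter()
    offline_rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        entry = json.loads(line)
        events[entry["event"]] += 1
        if entry["event"] == "offline_feature_retrieval":
            offline_rows += entry.get("row_count", 0)
    return events, offline_rows
```

To run it against the demo output, open `workspace/logs/feast_audit.log` and pass the file handle directly, since iterating over a file yields one line at a time.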
### Manual Verification

Verify offline store metrics are being emitted:

```bash
# Check the Prometheus metrics endpoint for offline store metrics
curl -s http://localhost:8000 | grep feast_offline

# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/query?query=feast_offline_store_request_total'
```

### Enable in Your Deployment

1. **Update `feature_store.yaml`**: Add `offline_features: true` and `audit_logging: true` to the metrics block
2. **Configure audit log routing**: Set up a handler for the `feast.audit` logger in your logging config
3. **Import the updated Grafana dashboard**: Add the offline store panels to your existing dashboard
4. **Set up alerts**: Start with offline retrieval failures and row count anomalies

We're excited to bring full-lifecycle observability to Feast, covering both the real-time serving path and the batch training path, and welcome feedback from the community!

---

*References:*
- *[Existing blog: Monitoring Your Feast Feature Server with Prometheus and Grafana](https://feast.dev/blog/feast-feature-server-monitoring/)*
- *[Feast Prometheus Metrics Demo](https://github.com/ntkathole/feast-automated-setups/tree/main/feast-prometheus-metrics)*