feat: Time-based chunked materialization to prevent OOM on dense datasets#6277

Open
alan-gauthier-jt wants to merge 1 commit into feast-dev:master from alan-gauthier-jt:chunked-materialize

Conversation


@alan-gauthier-jt alan-gauthier-jt commented Apr 14, 2026

What this PR does / why we need it:

Problem

FeatureStore.materialize() and FeatureStore.materialize_incremental() load the full requested time range into memory in a single pass. On production deployments with:

  • Large time windows (multi-day or multi-week backfills)
  • High-frequency event timestamps (e.g. 10-minute ETL batches, sub-minute sensor data)
  • Limited worker memory

…this causes out-of-memory (OOM) crashes that are difficult to recover from. Feast offers no built-in workaround, so users must fall back on external orchestration scripts that split the range themselves.

Solution

This PR introduces native time-based chunked materialization directly into the Feast SDK. When a chunk_size is configured, each feature view's time range is split into consecutive, non-overlapping windows of the given width, and materialize_single_feature_view is called once per window. This caps peak memory usage to the cost of a single chunk regardless of total range size.

Key design decisions:

  • Backward-compatible: no existing behaviour changes when chunk_size is not set (the default).
  • Three-tier priority for chunk size resolution: call-time argument > feature_store.yaml project default > no chunking.
  • Per-chunk registry.apply_materialization: each successfully materialized chunk immediately updates most_recent_end_time. A crash mid-run allows materialize_incremental to resume from the last committed chunk rather than reprocessing the entire range.
  • CONTINUE failure strategy: an opt-in chunk_failure_strategy: continue setting collects failed chunks and emits a summary warning instead of aborting, enabling partial-success backfills.
  • Watermark gap-safety: when CONTINUE is active, the watermark only advances through the contiguous prefix of successful chunks. If chunk N fails, chunks N+1, N+2, … still run (so the online store receives their data idempotently), but most_recent_end_time is not advanced past the failed window. The next materialize_incremental call therefore retries from the correct start time — no data is permanently lost.
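
The watermark gap-safety rule above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name `advance_watermark` and the `(start, end)` tuple representation are assumptions for clarity.

```python
def advance_watermark(chunks, failed_indices):
    """Return the highest end time the watermark may advance to under
    CONTINUE: the end of the contiguous prefix of successful chunks,
    or None if the very first chunk failed.

    chunks: ordered list of (chunk_start, chunk_end) tuples
    failed_indices: set of chunk indices that failed
    """
    watermark = None
    for i, (_, chunk_end) in enumerate(chunks):
        if i in failed_indices:
            break  # stop at the first failure; later successes don't count
        watermark = chunk_end
    return watermark
```

Chunks after the first failure may still have written data to the online store, but because the watermark stops at the failed window, the next incremental run re-covers the gap idempotently.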

Changes

| File | Change |
| --- | --- |
| `feast/feature_view_utils.py` | New `generate_time_chunks(start, end, chunk_size)` generator — pure, tested, reusable |
| `feast/repo_config.py` | New `ChunkFailureStrategy` enum; extended `MaterializationConfig` with `chunk_size` and `chunk_failure_strategy` fields |
| `feast/feature_store.py` | New `_resolve_chunk_size()` helper; `chunk_size` parameter on both `materialize()` and `materialize_incremental()`; chunked inner loops with per-chunk checkpointing |
| `feast/cli/cli.py` | New `--chunk-hours`, `--chunk-minutes`, `--chunk-seconds` flags on both `feast materialize` and `feast materialize-incremental` |
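
A minimal sketch of what `generate_time_chunks` does, consistent with the behavior described in this PR (consecutive, non-overlapping windows, with the last chunk clipped to the range end); the real implementation lives in `feast/feature_view_utils.py` and may differ in detail:

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

def generate_time_chunks(
    start: datetime, end: datetime, chunk_size: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (chunk_start, chunk_end) windows that exactly
    cover [start, end) with no gaps and no overlap; the final chunk is
    clipped to `end` when the range is not an exact multiple."""
    if chunk_size <= timedelta(0):
        raise ValueError("chunk_size must be a positive timedelta")
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + chunk_size, end)
        yield cursor, chunk_end
        cursor = chunk_end
```

Because it is a lazy generator, a multi-week backfill never materializes the full chunk list in memory either.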

Usage examples

Python SDK

from datetime import datetime, timedelta

# Call-time override
fs.materialize(
    start_date=datetime(2026, 1, 1),
    end_date=datetime(2026, 3, 1),
    chunk_size=timedelta(hours=6),  # 236 chunks instead of one giant call
)

# Incremental with chunking
fs.materialize_incremental(
    end_date=datetime.utcnow(),
    chunk_size=timedelta(minutes=30),
)

feature_store.yaml project default

materialization:
  chunk_size: 21600          # 6 hours, in seconds
  chunk_failure_strategy: continue   # skip bad chunks, warn at end

CLI flags

feast materialize 2026-01-01T00:00:00 2026-03-01T00:00:00 --chunk-hours 6
feast materialize 2026-01-01T00:00:00 2026-03-01T00:00:00 --chunk-minutes 30
feast materialize-incremental 2026-03-25T00:00:00 --chunk-hours 12
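
One plausible way the three CLI flags combine into a single `timedelta` (a sketch only — the actual flag handling is in `feast/cli/cli.py`, and the function name here is hypothetical):

```python
from datetime import timedelta
from typing import Optional

def chunk_size_from_flags(
    hours: Optional[int], minutes: Optional[int], seconds: Optional[int]
) -> Optional[timedelta]:
    """Combine --chunk-hours / --chunk-minutes / --chunk-seconds into a
    single timedelta; return None (no chunking) when no flag is given."""
    if hours is None and minutes is None and seconds is None:
        return None
    return timedelta(hours=hours or 0, minutes=minutes or 0, seconds=seconds or 0)
```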

Which issue(s) this PR fixes:

Fixes # (memory exhaustion / OOM during large-range materialization — no existing tracking issue; can be linked once opened)

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Manual tests
  • Testing is not required for this change

New unit tests added

tests/unit/test_feature_view_utils.py — 13 tests for generate_time_chunks:

  • Exact multiple of chunk size
  • Remainder chunk (last chunk smaller than chunk_size)
  • Chunk size larger than total range (single chunk)
  • Minute and second granularity
  • Contiguity and full-coverage invariants
  • Timezone-aware datetimes
  • Empty range (yields nothing)
  • Zero / negative chunk_size raises ValueError
  • Returns a lazy generator (not a list)

tests/unit/test_chunk_materialization.py — 15 tests for FeatureStore integration:

  • _resolve_chunk_size priority: call-time > config > None
  • No chunking → single materialize_single_feature_view call (backward-compatible)
  • N-hour chunk size → correct number of calls
  • Chunk boundaries are contiguous and cover the full range
  • Config chunk_size is used when no call-time override
  • Call-time override supersedes config
  • CONTINUE strategy: failed chunks are skipped with a warning; successful chunks are committed
  • CONTINUE strategy (all succeed): all chunks commit and watermark reaches end_date
  • STOP strategy (default): first failure re-raises immediately
  • registry.apply_materialization is called once per chunk (checkpointing)
  • materialize_incremental with chunk_size splits correctly
  • Regression: CONTINUE watermark does not advance past a failed chunk (data-loss prevention)

Misc

  • This feature is intentionally scoped to the existing local, Spark, and Ray compute engines without any engine-specific changes — chunking is applied at the FeatureStore layer, above the engine abstraction, so all engines benefit automatically.
  • Parallel chunk execution (running multiple chunks concurrently) is a natural follow-up but is out of scope for this PR to keep the change reviewable and testable.
  • The chunk_size field in feature_store.yaml currently accepts a timedelta integer (total seconds). A human-readable string format (e.g. "6h", "30m") is a potential follow-up using a Pydantic validator.
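
For the human-readable follow-up mentioned above, a parser along these lines could back a Pydantic `field_validator` on `MaterializationConfig`; this is a hypothetical sketch, not part of this PR:

```python
import re
from datetime import timedelta

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_chunk_size(value) -> timedelta:
    """Accept a plain integer (seconds, the current format) or a
    human-readable string such as "6h" or "30m"."""
    if isinstance(value, int):
        return timedelta(seconds=value)
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if not match:
        raise ValueError(f"invalid chunk_size: {value!r}")
    return timedelta(seconds=int(match.group(1)) * _UNITS[match.group(2)])
```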

@alan-gauthier-jt alan-gauthier-jt requested a review from a team as a code owner April 14, 2026 13:30
@alan-gauthier-jt alan-gauthier-jt changed the title feat(materialization): add time chunks management feat: Time-based chunked materialization to prevent OOM on dense datasets Apr 14, 2026
Signed-off-by: Alan Gauthier <alan.gauthier@jobteaser.com>