Skip to content

Fix duplicate filtering path in Arrow task batches#3448

Open
GayathriSrividya wants to merge 3 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault
Open

Fix duplicate filtering path in Arrow task batches#3448
GayathriSrividya wants to merge 3 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault

Conversation

@GayathriSrividya
Copy link
Copy Markdown

Closes #3272

What this changes

This PR updates the Arrow scan path in _task_to_record_batches to avoid redundant filtering when there are no positional deletes.

  • Keeps predicate pushdown in Scanner.from_fragment as the only filter path when positional_deletes is absent.
  • Applies current_batch.filter(pyarrow_filter) only in the positional-delete path, after deletes are applied.
  • Preserves empty-batch handling after both delete application and conditional filtering.

Why

The previous flow could perform an extra table-level refilter even when the scanner already applied the predicate. This change removes that stale workaround path while keeping correct behavior for positional delete scenarios.

Tests

Added regression coverage in tests/io/test_pyarrow.py:

  • test_task_to_record_batches_filter_without_positional_deletes_avoids_table_refilter
  • test_task_to_record_batches_filter_with_positional_deletes_handles_empty_batch

Validated locally:

  • python -m pytest tests/io/test_pyarrow.py -q -k "task_to_record_batches_nanos or filter_without_positional_deletes_avoids_table_refilter or filter_with_positional_deletes_handles_empty_batch"
  • make lint

Comment thread tests/io/test_pyarrow.py Outdated
Replace the two weak tests that passed on the unfixed code with proper
regression coverage:

1. test_task_to_record_batches_does_not_use_table_filter_without_positional_deletes
   Replaces pyarrow.Table with a sentinel class whose from_batches raises
   AssertionError.  A custom metaclass preserves isinstance checks so the
   rest of the code-path is unaffected.  With the fix the sentinel is never
   reached; without the fix it fires immediately, detecting the old
   pa.Table.from_batches / filter / combine_chunks / to_batches[0] path
   that caused the SIGSEGV on Apple Silicon.

2. test_task_to_record_batches_filter_applied_after_positional_deletes
   Uses data [1,2,3,4,5] with positional deletes on positions 1 and 3
   (removing values 2 and 4), then applies filter id > 2.  Expected
   result [3, 5] would be wrong if either deletes or the subsequent
   filter were skipped.
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch @GayathriSrividya This logic must have been messed up somewhere throughout refactoring.

I left one comment to avoid having two blocks for positional_deletes, let me know what you think. This will reduce the number of lines quite a bit.

Comment thread pyiceberg/io/pyarrow.py
- Combine the two positional-delete handling blocks into one: apply
  take(indices) and the row filter together inside the single
  'if positional_deletes' block, removing the redundant mid-loop
  empty-batch check (filtering an empty batch is free).
- Replace the sentinel-based regression test (which passed even when
  the fix was reverted) with a behavioral test that picks a scenario
  where the old bug produces a distinct wrong answer:
    data=[1,2,3,4], pos_delete=2 (value 3), filter=id>2
    correct: [4]   old bug (scanner pre-filters): [3,4]
  The assertion on [4] will fail against any regression.

Addresses review feedback from @ebyhr and @Fokko.
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks @GayathriSrividya 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_task_to_record_batches stale PyArrow workaround causes SIGSEGV on Apple Silicon

3 participants