[ci] test delta restart against MinIO in CI #6131
mythical-fred
left a comment
Solid CI-fixup refactor. The hand-rolled delta-log replay is replaced with DeltaTable(...).to_pyarrow_dataset().count_rows(), which reads the same active-file set the old code computed manually — so the row-count assertion semantics are preserved across both backends.
Two nits inline; not blockers.
ee71ce6 to
4fa6e1c
mythical-fred
left a comment
Nice rework — the DeltaTestLocation dataclass cleanly factors out the dual-backend storage logic, and log_json_paths() / read_text() make it easy for other tests in this dir to do delta-log inspection against either backend.
One thing from my previous review wasn't actually picked up: the schema-column check in test_modified_view_re_truncates_on_resume is still missing. The test still asserts only that the row count drops to 0 and that 30 new rows arrive — both of which would also pass if the connector simply re-truncated and re-emitted with the old schema. Now that log_json_paths() / read_text() exist, restoring it is ~5 lines. I'm not going to block on it; LGTM.
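A minimal sketch of what such a schema-column check could look like, assuming the commit files under `_delta_log/` are newline-delimited JSON actions and that the newest `metaData` action carries the current schema in `schemaString`. The function name `latest_schema_columns` is hypothetical, and how it is wired to `log_json_paths()` / `read_text()` is an assumption, since their signatures are not shown here.

```python
import json
from typing import Iterable, List, Optional


def latest_schema_columns(commit_texts: Iterable[str]) -> Optional[List[str]]:
    """Return the column names from the last metaData action seen.

    commit_texts: contents of the _delta_log/*.json commit files, oldest first.
    Returns None if no metaData action is present.
    """
    columns = None
    for text in commit_texts:
        for line in text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            meta = action.get("metaData")
            if meta is not None:
                # schemaString is itself a JSON document describing a struct type.
                schema = json.loads(meta["schemaString"])
                columns = [f["name"] for f in schema["fields"]]
    return columns
```

In the test this might be fed with something like `read_text(p) for p in sorted(log_json_paths())` (assumed call shapes), followed by an assertion that the new column is in the returned list.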
gz
left a comment
can't argue with fred saying "Nice rework"
Rework test_delta_output_restart.py to pick its storage backend at
runtime via DeltaTestLocation:
* Local runs use a `file://` URI under /tmp.
* CI runs (CI=true) use the in-cluster MinIO endpoint, so the
pipeline pod and the test runner reach the table over S3.
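The backend choice in the list above could be sketched roughly as follows. This is not the actual `DeltaTestLocation` from `python/tests/utils.py`; the class name `LocationSketch`, the function `pick_location`, the environment variables `MINIO_ENDPOINT` and `MINIO_BUCKET`, and their defaults are all hypothetical and only illustrate the `CI=true` switch.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Mapping


@dataclass(frozen=True)
class LocationSketch:
    """Where a test table lives, plus the options needed to reach it."""

    uri: str
    storage_options: Dict[str, str] = field(default_factory=dict)

    @property
    def is_s3(self) -> bool:
        return self.uri.startswith("s3://")


def pick_location(table_name: str, env: Mapping[str, str] = os.environ) -> LocationSketch:
    """Pick S3 (MinIO) when CI=true, otherwise a file:// URI under /tmp."""
    if env.get("CI", "").lower() == "true":
        # Hypothetical variable names and defaults; the real endpoint differs.
        endpoint = env.get("MINIO_ENDPOINT", "http://minio:9000")
        bucket = env.get("MINIO_BUCKET", "delta-tests")
        return LocationSketch(
            uri=f"s3://{bucket}/{table_name}",
            storage_options={
                "AWS_ENDPOINT_URL": endpoint,
                "AWS_ALLOW_HTTP": "true",
            },
        )
    return LocationSketch(uri=f"file:///tmp/{table_name}")
```

Passing `env` explicitly keeps the switch easy to exercise without mutating the process environment.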
Move shared CI helpers (DeltaTestLocation, MinIO/Kafka endpoints,
env helpers) into python/tests/utils.py. They are imported only by
the delta test, after `pytest.importorskip("deltalake")`, so jobs
that don't run delta (runtime/workload) never load deltalake.
Row counts come from `DeltaTable.count()`, which reads the per-file
numRecords stats already in the delta log.
Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
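For readers unfamiliar with the numRecords stats mentioned in the commit message, here is a rough sketch of the arithmetic: replay the `add` and `remove` actions to get the active file set, then sum each active file's `numRecords`. It is an illustration only, not a description of how `DeltaTable.count()` is implemented. It assumes every `add` action carries a `stats` JSON string and ignores checkpoints. The name `count_rows_from_log` is hypothetical.

```python
import json
from typing import Dict, Iterable


def count_rows_from_log(commit_texts: Iterable[str]) -> int:
    """Sum numRecords over files that are still active after replaying the log.

    commit_texts: contents of the _delta_log/*.json commit files, oldest first.
    """
    active: Dict[str, int] = {}
    for text in commit_texts:
        for line in text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                if not add.get("stats"):
                    raise ValueError(f"no stats recorded for {add['path']}")
                # stats is a JSON string embedded in the add action.
                active[add["path"]] = json.loads(add["stats"])["numRecords"]
            elif "remove" in action:
                active.pop(action["remove"]["path"], None)
    return sum(active.values())
```

Because only the log is read, no Parquet data file has to be opened to get the count.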
8a21006 to
e14f9ef
test_delta_output_restart.py failed in k8s because the pipeline pod and the pytest runner do not share /tmp. Check the commit message for details.
Describe Manual Test Plan
Ran locally; didn't test against MinIO.