adapters: improve ad-hoc query failure diagnostics in multihost by ryzhyk · Pull Request #6553 · feldera/feldera

ryzhyk · 2026-06-28T00:34:00Z

When a multihost ad-hoc scan fails, the underlying cause was previously lost on the pipeline side:

coordination_adhoc_scan can only report an error via HTTP status for the first batch; an error after that truncated the Arrow stream, which the coordinator observed as a generic "Unexpected End of Stream". Log the real error (with table/step/worker context) before the stream terminates.
coordination_adhoc_lease now logs, and includes in its error, the set of retained snapshot steps when the requested step's snapshot is no longer available -- showing how far the coordinator's leased step lagged the host's retained snapshots.

Adds Controller::available_snapshot_steps() for the lease diagnostics.

Describe Manual Test Plan

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

When a multihost ad-hoc scan fails, the underlying cause was previously lost on the pipeline side: - coordination_adhoc_scan can only report an error via HTTP status for the first batch; an error after that truncated the Arrow stream, which the coordinator observed as a generic "Unexpected End of Stream". Log the real error (with table/step/worker context) before the stream terminates. - coordination_adhoc_lease now logs, and includes in its error, the set of retained snapshot steps when the requested step's snapshot is no longer available -- showing how far the coordinator's leased step lagged the host's retained snapshots. Adds Controller::available_snapshot_steps() for the lease diagnostics. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

mihaibudiu · 2026-06-28T00:44:22Z

+                Ok(batch) => batch,
+                Err(error) => {
+                    error!(
+                        "ad-hoc scan of {table} (step {step}, worker {worker}): error \


The worst case would be if the partial stream was interpreted as a complete answer. Is that possible?

Seem unlikely, but I don't know if it's in theory possible. This doesn't make it worse I think. Also, this should only happen if there's a bug in the engine.

blp

Looks correct to me.

mythical-fred

Approving — the logs recover information that was genuinely getting lost.

On Mihai's question (can a truncated stream be interpreted as a complete answer?): I read this as a no, with the same confidence as before this PR. The previous code already truncated mid-stream on error — try_stream! + ? yields the Err and stops emitting bytes, which is exactly what the new explicit yield Err(error.into()); return; does. In both cases the HTTP body terminates abnormally (the actix_web streaming adapter surfaces the Err as a transport-level error, not a clean EOF), and the Arrow StreamReader on the coordinator side then sees the missing EOS continuation+length-0 marker and reports "Unexpected End of Stream" — exactly the symptom this PR is improving the diagnostics for. So this change preserves the existing wire-level behavior; it only adds server-side logging so the real cause isn't lost.

Two small nits, both optional:

available_snapshot_steps materializes the whole keyset. For the retained_steps: {retained:?} log line we only need (min, max, count) — the full Vec could be large on a busy host and we serialize it twice (once for the error! arg, once into the user-facing error string). Returning (Option<Step>, Option<Step>, usize) would be cheaper and the log/error message arguably more readable ("retained 1234 steps between 9000..10234, requested 8500"). Not a blocker; the lock is held briefly and the keyset is bounded by the snapshot table size.
Error-message duplication. The retained-steps list now appears both in the error! log and inlined into the PipelineError::AdHocQueryError text the coordinator surfaces. That's fine for one-shot diagnostics, but if the coordinator already prints/forwards the error text we'll see the same […] Vec twice. Probably worth keeping the user-visible error short (just step + "(N retained, latest=…)") and pushing the full list to the log only. Same comment, really.

Neither blocks landing.

ryzhyk requested a review from blp June 28, 2026 00:34

mihaibudiu reviewed Jun 28, 2026

View reviewed changes

blp approved these changes Jun 28, 2026

View reviewed changes

mythical-fred approved these changes Jun 28, 2026

View reviewed changes

ryzhyk added this pull request to the merge queue Jun 28, 2026

Merged via the queue into main with commit 1267b56 Jun 28, 2026
1 check passed

ryzhyk deleted the adhoc-scan-diagnostics branch June 28, 2026 04:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

adapters: improve ad-hoc query failure diagnostics in multihost#6553

adapters: improve ad-hoc query failure diagnostics in multihost#6553
ryzhyk merged 1 commit into
mainfrom
adhoc-scan-diagnostics

ryzhyk commented Jun 28, 2026

Uh oh!

mihaibudiu Jun 28, 2026

Uh oh!

ryzhyk Jun 28, 2026

Uh oh!

blp left a comment

Uh oh!

mythical-fred left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

ryzhyk commented Jun 28, 2026

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

mihaibudiu Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

ryzhyk Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

blp left a comment

Choose a reason for hiding this comment

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants