Skip to content

adapters: improve ad-hoc query failure diagnostics in multihost#6553

Merged
ryzhyk merged 1 commit into
mainfrom
adhoc-scan-diagnostics
Jun 28, 2026
Merged

adapters: improve ad-hoc query failure diagnostics in multihost#6553
ryzhyk merged 1 commit into
mainfrom
adhoc-scan-diagnostics

Conversation

@ryzhyk

@ryzhyk ryzhyk commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

When a multihost ad-hoc scan fails, the underlying cause was previously lost on the pipeline side:

  • coordination_adhoc_scan can only report an error via HTTP status for the first batch; an error after that truncated the Arrow stream, which the coordinator observed as a generic "Unexpected End of Stream". Log the real error (with table/step/worker context) before the stream terminates.

  • coordination_adhoc_lease now logs, and includes in its error, the set of retained snapshot steps when the requested step's snapshot is no longer available -- showing how far the coordinator's leased step lagged the host's retained snapshots.

Adds Controller::available_snapshot_steps() for the lease diagnostics.

Describe Manual Test Plan

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

When a multihost ad-hoc scan fails, the underlying cause was previously
lost on the pipeline side:

- coordination_adhoc_scan can only report an error via HTTP status for the
  first batch; an error after that truncated the Arrow stream, which the
  coordinator observed as a generic "Unexpected End of Stream". Log the
  real error (with table/step/worker context) before the stream terminates.

- coordination_adhoc_lease now logs, and includes in its error, the set of
  retained snapshot steps when the requested step's snapshot is no longer
  available -- showing how far the coordinator's leased step lagged the
  host's retained snapshots.

Adds Controller::available_snapshot_steps() for the lease diagnostics.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk requested a review from blp June 28, 2026 00:34
Ok(batch) => batch,
Err(error) => {
error!(
"ad-hoc scan of {table} (step {step}, worker {worker}): error \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worst case would be if the partial stream was interpreted as a complete answer. Is that possible?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem unlikely, but I don't know if it's in theory possible. This doesn't make it worse I think. Also, this should only happen if there's a bug in the engine.

@blp blp left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me.

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the logs recover information that was genuinely getting lost.

On Mihai's question (can a truncated stream be interpreted as a complete answer?): I read this as a no, with the same confidence as before this PR. The previous code already truncated mid-stream on error — try_stream! + ? yields the Err and stops emitting bytes, which is exactly what the new explicit yield Err(error.into()); return; does. In both cases the HTTP body terminates abnormally (the actix_web streaming adapter surfaces the Err as a transport-level error, not a clean EOF), and the Arrow StreamReader on the coordinator side then sees the missing EOS continuation+length-0 marker and reports "Unexpected End of Stream" — exactly the symptom this PR is improving the diagnostics for. So this change preserves the existing wire-level behavior; it only adds server-side logging so the real cause isn't lost.

Two small nits, both optional:

  • available_snapshot_steps materializes the whole keyset. For the retained_steps: {retained:?} log line we only need (min, max, count) — the full Vec could be large on a busy host and we serialize it twice (once for the error! arg, once into the user-facing error string). Returning (Option<Step>, Option<Step>, usize) would be cheaper and the log/error message arguably more readable ("retained 1234 steps between 9000..10234, requested 8500"). Not a blocker; the lock is held briefly and the keyset is bounded by the snapshot table size.

  • Error-message duplication. The retained-steps list now appears both in the error! log and inlined into the PipelineError::AdHocQueryError text the coordinator surfaces. That's fine for one-shot diagnostics, but if the coordinator already prints/forwards the error text we'll see the same […] Vec twice. Probably worth keeping the user-visible error short (just step + "(N retained, latest=…)") and pushing the full list to the log only. Same comment, really.

Neither blocks landing.

@ryzhyk ryzhyk added this pull request to the merge queue Jun 28, 2026
Merged via the queue into main with commit 1267b56 Jun 28, 2026
1 check passed
@ryzhyk ryzhyk deleted the adhoc-scan-diagnostics branch June 28, 2026 04:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants