adapters: improve ad-hoc query failure diagnostics in multihost#6553
Conversation
When a multihost ad-hoc scan fails, the underlying cause was previously lost on the pipeline side: - coordination_adhoc_scan can only report an error via HTTP status for the first batch; an error after that truncated the Arrow stream, which the coordinator observed as a generic "Unexpected End of Stream". Log the real error (with table/step/worker context) before the stream terminates. - coordination_adhoc_lease now logs, and includes in its error, the set of retained snapshot steps when the requested step's snapshot is no longer available -- showing how far the coordinator's leased step lagged the host's retained snapshots. Adds Controller::available_snapshot_steps() for the lease diagnostics. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
| Ok(batch) => batch, | ||
| Err(error) => { | ||
| error!( | ||
| "ad-hoc scan of {table} (step {step}, worker {worker}): error \ |
There was a problem hiding this comment.
The worst case would be if the partial stream was interpreted as a complete answer. Is that possible?
There was a problem hiding this comment.
Seem unlikely, but I don't know if it's in theory possible. This doesn't make it worse I think. Also, this should only happen if there's a bug in the engine.
mythical-fred
left a comment
There was a problem hiding this comment.
Approving — the logs recover information that was genuinely getting lost.
On Mihai's question (can a truncated stream be interpreted as a complete answer?): I read this as a no, with the same confidence as before this PR. The previous code already truncated mid-stream on error — try_stream! + ? yields the Err and stops emitting bytes, which is exactly what the new explicit yield Err(error.into()); return; does. In both cases the HTTP body terminates abnormally (the actix_web streaming adapter surfaces the Err as a transport-level error, not a clean EOF), and the Arrow StreamReader on the coordinator side then sees the missing EOS continuation+length-0 marker and reports "Unexpected End of Stream" — exactly the symptom this PR is improving the diagnostics for. So this change preserves the existing wire-level behavior; it only adds server-side logging so the real cause isn't lost.
Two small nits, both optional:
-
available_snapshot_stepsmaterializes the whole keyset. For theretained_steps: {retained:?}log line we only need(min, max, count)— the full Vec could be large on a busy host and we serialize it twice (once for theerror!arg, once into the user-facing error string). Returning(Option<Step>, Option<Step>, usize)would be cheaper and the log/error message arguably more readable ("retained 1234 steps between 9000..10234, requested 8500"). Not a blocker; the lock is held briefly and the keyset is bounded by the snapshot table size. -
Error-message duplication. The retained-steps list now appears both in the
error!log and inlined into thePipelineError::AdHocQueryErrortext the coordinator surfaces. That's fine for one-shot diagnostics, but if the coordinator already prints/forwards the error text we'll see the same[…]Vec twice. Probably worth keeping the user-visible error short (just step + "(N retained, latest=…)") and pushing the full list to the log only. Same comment, really.
Neither blocks landing.
When a multihost ad-hoc scan fails, the underlying cause was previously lost on the pipeline side:
coordination_adhoc_scan can only report an error via HTTP status for the first batch; an error after that truncated the Arrow stream, which the coordinator observed as a generic "Unexpected End of Stream". Log the real error (with table/step/worker context) before the stream terminates.
coordination_adhoc_lease now logs, and includes in its error, the set of retained snapshot steps when the requested step's snapshot is no longer available -- showing how far the coordinator's leased step lagged the host's retained snapshots.
Adds Controller::available_snapshot_steps() for the lease diagnostics.
Describe Manual Test Plan
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes