Skip to content

Switch ad-hoc queries to arrow_ipc format#4240

Queued
Karakatiza666 wants to merge 4 commits into
mainfrom
webconsole-arrow
Queued

Switch ad-hoc queries to arrow_ipc format#4240
Karakatiza666 wants to merge 4 commits into
mainfrom
webconsole-arrow

Conversation

@Karakatiza666

@Karakatiza666 Karakatiza666 commented Jun 27, 2025

Copy link
Copy Markdown
Contributor

Changes:

  • Switch to arrow-ipc. It also carries column types, so they are now mapped to ISO SQL types to show in the header
  • To return SQL ARRAY in ad-hoc responces, Arrow's List is used; for DeltaLake output LargeList is still used

Testing: manual, added unit test for formatting

Part of #4219: Deprecate the JSON format for ad-hoc queries

@Karakatiza666 Karakatiza666 requested a review from gz June 27, 2025 08:49
@Karakatiza666 Karakatiza666 added Web Console Related to the browser based UI javascript Pull requests that update Javascript code adhoc Issue related to ad hoc query processing labels Jun 27, 2025
@Karakatiza666 Karakatiza666 force-pushed the webconsole-arrow branch 2 times, most recently from 56b7606 to 9c0a916 Compare June 27, 2025 09:04
@gz

gz commented Jun 27, 2025

Copy link
Copy Markdown
Contributor

It doesn't look like it answers queries (I tried fraud-detection), revision 9c0a916
https://github.com/user-attachments/assets/50f2fbc1-e59f-4638-ab97-31acd2746cad

@Karakatiza666

Copy link
Copy Markdown
Contributor Author

I think you encountered this one: #4239

@gz gz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm make sure to do
some testing before merging (different pipelines with different types (variant map etc.) and queries)

@gz

gz commented Jul 2, 2025

Copy link
Copy Markdown
Contributor

@Karakatiza666 can we merge this?

@Karakatiza666

Copy link
Copy Markdown
Contributor Author

Haven't had the chance to do a few tests with complex types, hopefully tomorrow

@Karakatiza666 Karakatiza666 force-pushed the webconsole-arrow branch 2 times, most recently from 94e2e00 to 726dedf Compare July 6, 2025 18:02
@Karakatiza666

Copy link
Copy Markdown
Contributor Author

Blocked by #4287
PR functions correctly, but triggers the above bug

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Web-console behavioral changes without tests. Blocked by #4287 per author — but tests are also needed before this lands.

import BigNumber from 'bignumber.js'
import Dayjs from 'dayjs'

const arrowIpcValueToJS = <T extends DataType<Type, any>>(arrowType: Field<T>, value: any) => {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New Arrow IPC value conversion logic covers a lot of type cases (int64, timestamps, structs, maps, lists). This is exactly the kind of logic that deserves unit tests — each type case can go wrong independently. Per Gerd's directive (2026-03-04): behavioral changes require tests. Setup: npm install -D vitest @testing-library/svelte jsdom. The pure conversion functions here are ideal for unit testing without any DOM or component setup.

@mihaibudiu

Copy link
Copy Markdown
Contributor

@abhizer can we close this?

@abhizer

abhizer commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

This is for the UI, cc @Karakatiza666

gz added a commit that referenced this pull request May 13, 2026
 revert)

stream_arrow_query used a synchronous Arrow IPC StreamWriter to an
async mpsc by spawning one tokio task per std::io::Write::write call:

    let handle = TOKIO.spawn(async move { tx.send(bytes).await });
    self.handles.push(handle);

Each StreamWriter::write(&batch) makes ~6 sequential write_all calls
The spawned tasks have no ordering relation; on a multi-thread tokio
runtime they race to send into the receiver, so bytes arrive in
arbitrary order and the resulting Arrow IPC stream gets corrupted.

The fix is to not call sync Write from inside an async future at
all. stream_arrow_query now hands StreamWriter a Vec<u8> and drains
the buffer between batches via std::mem::take(writer.get_mut()), then
yields a single ordered Bytes chunk per batch. Memory cost is bounded
by one record batch; behaviour matches stream_json_query, which has
always used this shape.

ChannelWriter retains its AsyncFileWriter impl for the parquet path
(AsyncArrowWriter awaits each write future before issuing the next,
so ordering there is already safe); the racy std::io::Write impl, the
handles vec, and the cfg(test) reordering shim are all removed.

Refs: #3923 #3792 #4287 #4226 #5814 #4240
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
gz added a commit that referenced this pull request May 13, 2026
 revert)

stream_arrow_query used a synchronous Arrow IPC StreamWriter to an
async mpsc by spawning one tokio task per std::io::Write::write call:

    let handle = TOKIO.spawn(async move { tx.send(bytes).await });
    self.handles.push(handle);

Each StreamWriter::write(&batch) makes ~6 sequential write_all calls
The spawned tasks have no ordering relation; on a multi-thread tokio
runtime they race to send into the receiver, so bytes arrive in
arbitrary order and the resulting Arrow IPC stream gets corrupted.

The fix is to not call sync Write from inside an async future at
all. stream_arrow_query now hands StreamWriter a Vec<u8> and drains
the buffer between batches via std::mem::take(writer.get_mut()), then
yields a single ordered Bytes chunk per batch. Memory cost is bounded
by one record batch; behaviour matches stream_json_query, which has
always used this shape.

ChannelWriter retains its AsyncFileWriter impl for the parquet path
(AsyncArrowWriter awaits each write future before issuing the next,
so ordering there is already safe); the racy std::io::Write impl, the
handles vec, and the cfg(test) reordering shim are all removed.

Refs: #3923 #3792 #4287 #4226 #5814 #4240
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
gz added a commit that referenced this pull request May 13, 2026
 revert)

stream_arrow_query used a synchronous Arrow IPC StreamWriter to an
async mpsc by spawning one tokio task per std::io::Write::write call:

    let handle = TOKIO.spawn(async move { tx.send(bytes).await });
    self.handles.push(handle);

Each StreamWriter::write(&batch) makes ~6 sequential write_all calls
The spawned tasks have no ordering relation; on a multi-thread tokio
runtime they race to send into the receiver, so bytes arrive in
arbitrary order and the resulting Arrow IPC stream gets corrupted.

The fix is to not call sync Write from inside an async future at
all. stream_arrow_query now hands StreamWriter a Vec<u8> and drains
the buffer between batches via std::mem::take(writer.get_mut()), then
yields a single ordered Bytes chunk per batch. Memory cost is bounded
by one record batch; behaviour matches stream_json_query, which has
always used this shape.

ChannelWriter retains its AsyncFileWriter impl for the parquet path
(AsyncArrowWriter awaits each write future before issuing the next,
so ordering there is already safe); the racy std::io::Write impl, the
handles vec, and the cfg(test) reordering shim are all removed.

Refs: #3923 #3792 #4287 #4226 #5814 #4240
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz

gz commented May 13, 2026

Copy link
Copy Markdown
Contributor

@Karakatiza666 we shoudl be able to fix/merge this now

mihaibudiu pushed a commit to mihaibudiu/dbsp that referenced this pull request May 14, 2026
…eldera#4226 feldera#5814 revert)

stream_arrow_query used a synchronous Arrow IPC StreamWriter to an
async mpsc by spawning one tokio task per std::io::Write::write call:

    let handle = TOKIO.spawn(async move { tx.send(bytes).await });
    self.handles.push(handle);

Each StreamWriter::write(&batch) makes ~6 sequential write_all calls
The spawned tasks have no ordering relation; on a multi-thread tokio
runtime they race to send into the receiver, so bytes arrive in
arbitrary order and the resulting Arrow IPC stream gets corrupted.

The fix is to not call sync Write from inside an async future at
all. stream_arrow_query now hands StreamWriter a Vec<u8> and drains
the buffer between batches via std::mem::take(writer.get_mut()), then
yields a single ordered Bytes chunk per batch. Memory cost is bounded
by one record batch; behaviour matches stream_json_query, which has
always used this shape.

ChannelWriter retains its AsyncFileWriter impl for the parquet path
(AsyncArrowWriter awaits each write future before issuing the next,
so ordering there is already safe); the racy std::io::Write impl, the
handles vec, and the cfg(test) reordering shim are all removed.

Refs: feldera#3923 feldera#3792 feldera#4287 feldera#4226 feldera#5814 feldera#4240
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@Karakatiza666 Karakatiza666 marked this pull request as draft May 15, 2026 17:25
@Karakatiza666 Karakatiza666 force-pushed the webconsole-arrow branch 5 times, most recently from fada2e5 to 8255106 Compare May 19, 2026 00:52
@Karakatiza666 Karakatiza666 requested a review from mihaibudiu May 19, 2026 00:53
@Karakatiza666 Karakatiza666 enabled auto-merge May 19, 2026 12:45
@Karakatiza666 Karakatiza666 added this pull request to the merge queue May 19, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 19, 2026
@Karakatiza666

Karakatiza666 commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

Switching from LargeList to List for ad-hoc queries causes invalidation of all checksums we use in QA. I'll try to add support for LargeList to the upstream JS package first

@Karakatiza666

Copy link
Copy Markdown
Contributor Author

Tracked in apache/arrow-js#438

@Karakatiza666 Karakatiza666 force-pushed the webconsole-arrow branch 2 times, most recently from 8af2a11 to 1c0d393 Compare June 5, 2026 19:47
@Karakatiza666 Karakatiza666 added this pull request to the merge queue Jun 6, 2026
Any commits made after this event will not be merged.
@Karakatiza666 Karakatiza666 removed this pull request from the merge queue due to a manual request Jun 6, 2026
@Karakatiza666 Karakatiza666 enabled auto-merge June 6, 2026 17:28
@Karakatiza666 Karakatiza666 added this pull request to the merge queue Jun 6, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 6, 2026
@Karakatiza666 Karakatiza666 added this pull request to the merge queue Jun 6, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 6, 2026
@gz gz added this pull request to the merge queue Jun 6, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 6, 2026
Make ARRAYs serialize as arrow List for ad-hoc queries

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Karakatiza666 and others added 2 commits June 9, 2026 22:17
[web-console] Improve ad-hoc error rendering and formatting (remove message duplication)

DataFusion errors raised during query setup were not reported over HTTP:
the backend had already responded with status 200 and simply interrupted
the stream when the error was thrown.

Start the query before deciding the HTTP status. If it fails to start,
respond with HTTP 400 and the error message so the client can surface it.
If it starts cleanly, respond with 200 OK and stream the results as before.

This adds no significant delay - only a few extra milliseconds for query
planning and SQL types header generation; the initial response does not
wait for the first data bytes.

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Switch the web console's ad-hoc query transport from HTTP streaming to the
WebSocket `/v0/pipelines/{name}/query` endpoint. WebSocket reports a runtime
error raised mid-query as a dedicated text frame followed by an error close,
so errors propagate to the UI without having to be encoded into the arrow-ipc
result stream.

`adHocQuery` opens the socket, sends `{sql, format:'arrow_ipc'}`, and pipes
binary frames into a `ReadableStream<Uint8Array>` consumed by the existing
`RecordBatchReader` path. apache-arrow's ReadableStream adapter swallows a
stream error (it cancels and ends the stream rather than rethrowing), so the
byte stream is always closed cleanly and a query error is reported out of band
via `error()`; `TabAdHocQuery` checks it after draining and renders the message
(an up-front failure such as selecting from a non-materialized source produces
no arrow schema, which is now handled rather than crashing).

Browsers cannot set the `Authorization` header on a WebSocket handshake, so the
bearer token and selected tenant are sent as `feldera-bearer.*` /
`feldera-tenant.*` subprotocols (base64url). A new manager middleware,
`promote_websocket_subprotocol_auth`, promotes them back to the standard headers
ahead of the bearer-auth path, so WebSocket handshakes authenticate through the
exact same validator as every other request and API keys keep working too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Karakatiza666 Karakatiza666 enabled auto-merge June 9, 2026 18:31
@Karakatiza666 Karakatiza666 added this pull request to the merge queue Jun 9, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 9, 2026
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
@Karakatiza666 Karakatiza666 enabled auto-merge June 9, 2026 20:34
@Karakatiza666 Karakatiza666 added this pull request to the merge queue Jun 9, 2026
Any commits made after this event will not be merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Issue related to ad hoc query processing javascript Pull requests that update Javascript code Web Console Related to the browser based UI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants