
feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations #4552

Merged: waleedlatif1 merged 28 commits into staging from waleedlatif1/abuja-v1 on May 12, 2026

Conversation

@waleedlatif1
Collaborator

Summary

  • Add 5 new data-drain destinations: Google Cloud Storage, Azure Blob Storage, Google BigQuery, Snowflake, and Datadog Logs
  • Each destination implements test() + openSession()/deliver() with provider-spec-correct auth, retry, byte-accurate size guards, and abort-signal forwarding
  • Snowflake: key-pair JWT (with account-suffix stripping), 202-async polling, identifier quoting, PARSE_JSON bindings, 16 MB VARIANT guard
  • BigQuery: tabledata.insertAll with drainId-prefixed insertId dedup, partial-failure surfacing, 401 token refresh + 5xx/429 retry
  • Datadog: v2 logs intake with gzip, per-entry 1 MB + per-request 5 MB / 1000-entry guards, all sites including ap2
  • GCS: JSON API media uploads with shared retry helper, GCS-spec bucket-name validation (rejects goog/google)
  • Azure Blob: @azure/storage-blob SDK with sovereign-cloud endpointSuffix support
  • Settings UI: form specs for each destination, icons, search/UI polish
  • Docs: full destination sections in enterprise/data-drains.mdx
  • 57/57 destination tests pass; `tsc --noEmit` clean
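The shared retry behavior the summary references can be sketched as follows. `backoffWithJitter` and `parseRetryAfter` are helper names the review comments mention, but the signatures and constants here are assumptions for illustration, not the repository's actual code:

```typescript
// Full-jitter exponential backoff: random delay in [0, min(cap, base * 2^attempt)).
function backoffWithJitter(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Retry-After can be delta-seconds or an HTTP-date; normalize to milliseconds.
function parseRetryAfter(value: string | null): number | undefined {
  if (!value) return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  const dateMs = Date.parse(value);
  if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - Date.now());
  return undefined;
}
```

A retry loop would prefer the server's `Retry-After` hint when present and fall back to jittered backoff otherwise.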

Type of Change

  • New feature

Testing

Tested manually. New unit tests cover schema validation, retry paths, byte-accurate size guards, gzip, sovereign-cloud routing, and partial-failure handling per destination.
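"Byte-accurate" matters because a JavaScript string's `.length` counts UTF-16 code units, not bytes. A minimal sketch of such a guard, using the per-entry 1 MB Datadog limit as the example (the function name and constant are illustrative, not the repo's identifiers):

```typescript
const MAX_ENTRY_BYTES = 1_000_000;

// Measure the UTF-8 byte size of an entry so multi-byte characters
// (emoji, CJK, accented letters) are counted correctly.
function entryWithinLimit(entry: string, maxBytes = MAX_ENTRY_BYTES): boolean {
  return new TextEncoder().encode(entry).byteLength <= maxBytes;
}
```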

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented May 11, 2026

The latest updates on your projects.

1 Skipped Deployment: the docs project's deployment was skipped (May 12, 2026, 0:25am UTC).

@cursor

cursor Bot commented May 11, 2026

PR Summary

High Risk: this PR adds multiple new external delivery integrations (cloud storage, warehouses, observability) and expands the destination validation/SSRF and credential-handling paths, which can affect data-export reliability and security if misconfigured.

Overview
Data Drains can now export to five additional destinations: Google Cloud Storage, Azure Blob Storage, Google BigQuery, Snowflake, and Datadog Logs, including new backend destination implementations (auth, delivery, retries, size/limit guards, and test connection) plus unit test coverage.

The settings UI is extended with new destination forms/options and icons, and the API contracts/schemas are expanded to support the new destination types with significantly stricter validation (bucket/account/table identifier rules, allowlisted Azure endpoint suffixes, BigQuery/Snowflake identifiers).

Separately, S3 and webhook destinations are hardened: S3 key/metadata byte-limit checks and stronger endpoint/region/bucket validation; webhook validation tightens (SSRF-safe URL rules, reserved/safer signature header constraints, longer signing secret, bearer token sanitation) and shares common retry/backoff helpers. Docs are updated to describe all supported destinations, and a DB migration adds the new enum values.
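The webhook SSRF hardening mentioned above can be illustrated roughly like this. The PR's actual rules live in the contracts file and are stricter; this hostname-literal sketch also omits DNS resolution pinning, which a real SSRF defense needs, since a public hostname can resolve to a private address:

```typescript
// Rough illustration of SSRF-safe webhook URL checks; not the repository's code.
function isSafeWebhookUrl(raw: string): boolean {
  let parsed: URL | null;
  try {
    parsed = new URL(raw);
  } catch {
    parsed = null; // not a parseable URL at all
  }
  if (!parsed) return false;
  if (parsed.protocol !== "https:") return false; // https only
  const host = parsed.hostname.toLowerCase();
  if (host === "localhost" || host === "169.254.169.254") return false; // loopback name / cloud metadata
  if (/^(127\.|10\.|192\.168\.|169\.254\.)/.test(host)) return false; // loopback, private, link-local v4
  if (/^172\.(1[6-9]|2\d|3[01])\./.test(host)) return false; // 172.16.0.0/12
  if (host.startsWith("[")) return false; // IPv6 literals (e.g. [::1]) rejected outright
  return true;
}
```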

Reviewed by Cursor Bugbot for commit 58a35bf.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds five new data-drain destinations (GCS, Azure Blob Storage, BigQuery, Snowflake, and Datadog) alongside full API contract schemas, a settings UI, database migration, and comprehensive unit tests. All destinations implement test() + openSession()/deliver() with provider-correct auth, retry backoff, byte-accurate size guards, and abort-signal forwarding.

  • New destinations: each implemented with per-attempt timeouts, abort-signal–aware sleeps, connection-pool-safe body draining, and provider-specific constraints (Snowflake key-pair JWT + 202 async polling; BigQuery streaming inserts with dedup insertId; Datadog gzip + uncompressed/wire size guards; GCS JSON API media uploads; Azure Blob SDK with sovereign-cloud endpoint suffix allowlist).
  • Shared utilities (utils.ts): parseNdjsonObjects, parseServiceAccount/refineServiceAccountJson, backoffWithJitter, parseRetryAfter, and sleepUntilAborted are extracted and reused across all destinations.
  • Contract + UI parity: API contract schemas, runtime Zod schemas, and form isComplete guards are aligned for all destinations including the previously-fixed signingSecret >= 32, endpointSuffix sovereign-cloud field, and accountKey base64 validation.

Confidence Score: 4/5

Safe to merge after resolving the outstanding Snowflake test() validation gap flagged in the prior review round.

All issues identified in the previous review rounds appear fixed in the current HEAD. The one remaining open gap is Snowflake's test() method, which only executes SELECT 1 and does not reference the user-configured database, schema, or table. A misconfigured table passes the connection test silently and surfaces a confusing error only on the first real delivery run.

apps/sim/lib/data-drains/destinations/snowflake.ts — the test() method needs to validate the configured target table, not just connectivity.

Important Files Changed

  • apps/sim/lib/data-drains/destinations/snowflake.ts: New Snowflake destination with key-pair JWT auth, 202 async polling, and VARIANT bindings. Retry and poll loops use sleepUntilAborted, per-attempt timeouts, and body draining. The test() method only runs SELECT 1 and does not exercise the configured database/schema/table, so a misconfigured table silently passes the connection test.
  • apps/sim/lib/data-drains/destinations/bigquery.ts: New BigQuery streaming-insert destination. Uses a tables.get probe in test(), drainId-prefixed insertIds for dedup, per-row/per-request size guards, 401 token-refresh recovery, and per-attempt timeouts on both postInsertAll and the test() probe.
  • apps/sim/lib/data-drains/destinations/gcs.ts: New GCS JSON API destination. fetchWithRetry uses AbortSignal.any per-attempt timeouts, body draining, and jitter-based backoff. test() uploads and deletes a probe object to validate write access.
  • apps/sim/lib/data-drains/destinations/azure_blob.ts: New Azure Blob destination using the @azure/storage-blob SDK with per-try timeouts and an SSRF-safe endpoint-suffix allowlist. Sovereign-cloud support via endpointSuffix is wired from the API contract through to the form UI.
  • apps/sim/lib/data-drains/destinations/datadog.ts: New Datadog v2 logs intake destination with gzip, per-entry 1 MB + uncompressed 5 MB / wire 6 MB guards. NDJSON splitting handles CRLF via /\r?\n/ and sleepUntilAborted is used throughout.
  • apps/sim/lib/data-drains/destinations/utils.ts: New shared utilities: sleepUntilAborted, backoffWithJitter, parseRetryAfter, normalizePrefix, buildObjectKey, parseServiceAccount, refineServiceAccountJson, and parseNdjsonObjects. Used consistently across all five new destinations.
  • apps/sim/lib/api/contracts/data-drains.ts: API contract expanded with per-destination config and credentials schemas for all five new destinations. Azure accountKey enforces exact-88-char base64, and endpointSuffix is validated against the known-suffix allowlist.
  • apps/sim/ee/data-drains/destinations/registry.tsx: UI form specs added for all five new destinations. isComplete guards align with contract minimums; the webhook guard correctly uses >= 32. An optional endpointSuffix field is exposed for sovereign Azure clouds.
  • packages/db/migrations/0206_aromatic_veda.sql: Adds the five new enum values to the data_drain_destination Postgres type, all inserted BEFORE 'webhook'; safe additive DDL, no existing rows affected.
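Of the shared utilities above, sleepUntilAborted is the one the retry and poll loops depend on for cancellation. A plausible shape, inferred from the behavior the review describes (abort-aware sleeps); the actual implementation in utils.ts may differ:

```typescript
// Resolve after ms, or reject as soon as the AbortSignal fires, so a retry
// backoff or poll interval can be cancelled mid-sleep instead of blocking
// shutdown for the full duration.
function sleepUntilAborted(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(signal.reason ?? new Error("aborted"));
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort() {
      clearTimeout(timer); // don't leave the timer pinning the event loop
      reject(signal!.reason ?? new Error("aborted"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```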

Reviews (25). Last reviewed commit: "fix(data-drains): cap snowflake poll ret..."

@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

Audited every destination against live AWS/GCS/Azure/BigQuery/Snowflake/Datadog/webhook docs and applied spec-correctness fixes:

- S3: reserved bucket prefix amzn-s3-demo-, suffixes --x-s3/--table-s3;
  metadata byte formula excludes x-amz-meta- prefix per AWS spec
- GCS: reject -./.- adjacency; UTF-8 prefix cap; forbid .well-known/
  acme-challenge/ prefix; ASCII-only x-goog-meta-* enforcement
- BigQuery: insertId is 128 chars (not bytes); split DATASET_RE (ASCII)
  and TABLE_RE (Unicode L/M/N + connectors); UTF-8 byte cap on tableId
- Snowflake: disambiguate org-account vs legacy locator account formats;
  requestId+retry=true for idempotent retries; server-side timeout=600;
  default column DATA uppercase to match unquoted canonical form
- Azure: endpoint suffix allowlist (4 sovereign clouds); accountKey
  length(88) base64
- Webhook: url max(2048); CRLF/NUL rejection on bearer/secret/sig header parsing

- snowflake pollStatement: per-attempt timeout via AbortSignal.any, retry on 429/5xx with Retry-After + jitter
- bigquery parseNdjson error messages now 1-indexed
- consolidate parseNdjson variants into shared parseNdjsonLines/parseNdjsonObjects in utils
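A sketch of what the consolidated helper might look like, based on the behavior described in these commits (1-indexed error lines, /\r?\n/ splitting); this is not the repository's actual implementation:

```typescript
// Parse NDJSON into objects, tolerating CRLF line endings and blank lines,
// and reporting parse failures with 1-indexed line numbers so the error
// matches what a user sees in an editor.
function parseNdjsonObjects(ndjson: string): unknown[] {
  const out: unknown[] = [];
  const lines = ndjson.split(/\r?\n/); // handle both LF and CRLF input
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "") continue; // skip blanks (e.g. a trailing newline)
    try {
      out.push(JSON.parse(line));
    } catch {
      throw new Error(`Invalid JSON on line ${i + 1}`);
    }
  }
  return out;
}
```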
…ke poll double-sleep

- gcs.fetchWithRetry + bigquery.postInsertAll now use AbortSignal.any with a per-attempt timeout so a hung TCP connection cannot stall the drain
- snowflake.pollStatement skips the next interval sleep when it just slept for retry backoff
…owflake column default UI/docs

- bigquery test() probe now uses AbortSignal.any + per-attempt timeout
- bigquery insertAll retry switches to backoffWithJitter for thundering-herd avoidance
- Snowflake column placeholder + docs say DATA (uppercase) to match the code default
isComplete now requires signingSecret >= 32 to match the contract/runtime
schema so the Save button can't enable on a value that will fail server-side.
Switch Snowflake to parseNdjsonObjects so malformed rows are caught locally
with 1-indexed line numbers instead of failing the whole INSERT server-side.
Re-stringify each parsed object before binding to PARSE_JSON(?).
Drop the now-unused parseNdjsonLines helper.
- Azure: bound retryOptions on BlobServiceClient (SDK default tryTimeoutInMs is per-try unbounded; cap at 30s x 5 tries)
- Webhook contract: mirror runtime — signingSecret.max(512), bearerToken.max(4096) + CRLF/NUL refine, signatureHeader charset + CRLF/NUL refine
- S3 (lib + contract): reject bucket names with dash adjacent to dot; require https:// endpoint at the schema layer
- Snowflake: bind original NDJSON line bytes (re-stringifying a JSON.parse'd value loses bigint precision beyond 2^53-1); check pollStatement 200 body for the SQL error envelope (sqlState/code)
- Datadog: entry builder writes defaults first then user attrs then forced ddtags/message so user rows can't clobber routing fields; validate config.tags as comma-separated key:value pairs
- registry.tsx: tighten isComplete predicates to mirror contract minimums (GCS bucket >= 3, Azure containerName >= 3 / accountKey === 88, BigQuery projectId >= 6, Snowflake account >= 3)
Previous fix placed ddsource/service before ...attrs, leaving them clobberable
by a user row field. Per Datadog docs, service + ddsource pick the processing
pipeline, so a drain's routing config must not be overridable per-row. Spread
attrs first, then force all four reserved fields (ddsource, service, ddtags,
message).
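The ordering fix described above reduces to spread-order in the entry builder: user attributes first, reserved fields last, so a per-row key collision can never override drain-level routing. A sketch (field names follow the commit message; the config shape is an assumption):

```typescript
// Build a Datadog log entry. User row data is spread FIRST so the four
// reserved fields written afterwards always win on key collision.
function buildEntry(
  attrs: Record<string, unknown>,
  cfg: { source: string; service: string; tags: string },
  message: string
): Record<string, unknown> {
  return {
    ...attrs,             // user row data first...
    ddsource: cfg.source, // ...then reserved routing fields override
    service: cfg.service,
    ddtags: cfg.tags,
    message,
  };
}
```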
…ertId overflows

Truncating from the left dropped the index suffix, so any overflow would
collapse all rows in a chunk to the same insertId and BigQuery would silently
dedupe them. Path is unreachable today (UUIDs keep raw ~85 chars), but the
overflow branch is now correct: hash the prefix, keep the index intact.
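The corrected overflow branch can be illustrated like this; 128 chars is BigQuery's insertId cap as the commits state, but the function name and hashing choice here are illustrative:

```typescript
import { createHash } from "node:crypto";

const MAX_INSERT_ID_CHARS = 128;

// If prefix + index would exceed the insertId limit, hash the PREFIX but keep
// the row index intact. Truncating from the left dropped the index, which
// collapsed every row in a chunk to one dedup key.
function buildInsertId(prefix: string, index: number): string {
  const id = `${prefix}-${index}`;
  if (id.length <= MAX_INSERT_ID_CHARS) return id;
  const hashed = createHash("sha256").update(prefix).digest("hex"); // 64 hex chars
  return `${hashed}-${index}`; // well under 128 for any realistic index
}
```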
- gcs: rebuild Authorization header per attempt via buildHeaders so token
  refresh from google-auth-library kicks in if a 5xx retry crosses the
  hour-long token lifetime
- azure_blob: pin account-key regex to {0,2} trailing '=' (base64 of 64
  bytes = exactly 88 chars with up to two '=' pad chars)
- gcs: allow 1-char dot-separated bucket components (e.g. "a.bucket")
  to match GCS naming rules — overall name is 3-63 (or up to 222 with
  dots), but per-component minimum is 1 per Google's spec
- bigquery: drain the 401 response body before re-issuing the request
  with a refreshed token so undici can return the socket to the
  keep-alive pool
- snowflake: hoist getJwt() above the perAttempt timer in
  executeStatement so JWT signing doesn't eat the network budget
  (matches the order already used in pollStatement)
…suffix

The account validation rejected `<orgname>-<acctname>.<region>.<cloud>`
because `ACCOUNT_LOCATOR_RE`'s first segment forbade hyphens, while
`ACCOUNT_ORG_RE` forbade dots. `normalizeAccountForJwt` already handles
this composite form. Widen the first segment of `ACCOUNT_LOCATOR_RE` to
allow hyphens so the boundary contract and the runtime schema accept
what the JWT layer was already designed to process.
Mirrors the bigquery 401 fix. Without consuming the body before
sleeping, undici can't return the socket to the keep-alive pool, so
each retry leaks a TCP connection instead of reusing it.
…atus

Mirrors the bigquery/datadog/gcs drains. Long async statements can poll
many times against the same connection; without consuming the body
undici can't return the socket to the keep-alive pool, so each iteration
leaks a connection until GC.
… 200

- gcs: drain the body on success paths so undici can return the socket
  to the keep-alive pool
- snowflake: drain the body on synchronous 200 OK and run the same
  sqlState envelope check pollStatement already does — otherwise a
  statement-level failure that completes synchronously would silently
  return success
Same undici keep-alive issue as the prior fixes: postWithRetries
returned the Response on success without draining (callers only read
headers); the BigQuery `test()` probe returned without consuming the
body. Both now drain before returning.
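The keep-alive hygiene these commits keep returning to boils down to one rule: consume the response body before sleeping or retrying, because an unread body pins the underlying connection. A minimal sketch (URL, retry count, and backoff are placeholders, not the repo's values):

```typescript
// Retry a GET, draining each failed response so undici can return the socket
// to its keep-alive pool instead of opening a fresh TCP connection per retry.
async function fetchWithDrain(url: string, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.ok || attempt >= retries) return res;
    await res.arrayBuffer().catch(() => {}); // drain the unread body
    await new Promise((r) => setTimeout(r, 2 ** attempt * 250));
  }
}
```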
@waleedlatif1 force-pushed the waleedlatif1/abuja-v1 branch from 8adf19d to 3804d01 on May 12, 2026 at 00:18


@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!



waleedlatif1 merged commit 39f74aa into staging on May 12, 2026; 14 checks passed.
waleedlatif1 deleted the waleedlatif1/abuja-v1 branch on May 12, 2026 at 00:37.