fix(webapp): sanitize OTel attributes on ClickHouse JSON parse rejection by 0ski · Pull Request #3659 · triggerdotdev/trigger.dev

0ski · 2026-05-18T14:24:11Z

Before fix:

After fix:

changeset-bot · 2026-05-18T14:24:17Z

⚠️ No Changeset found

Latest commit: 8cc9e85

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


coderabbitai · 2026-05-18T14:24:31Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: deb9d9aa-ec33-4836-ae0d-1a37d0b0fba0

📥 Commits

Reviewing files that changed from the base of the PR and between fa0034d and 8cc9e85.

📒 Files selected for processing (8)

.server-changes/otel-attribute-utf16-sanitization.md
apps/webapp/app/utils/detectBadJsonStrings.ts
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts
apps/webapp/app/v3/otlpExporter.server.ts
apps/webapp/test/detectbadJsonStrings.test.ts
apps/webapp/test/otlpUtf16Sanitization.integration.test.ts
apps/webapp/test/sanitizeRowsOnParseError.test.ts

💤 Files with no reviewable changes (1)

apps/webapp/app/v3/otlpExporter.server.ts

✅ Files skipped from review due to trivial changes (1)

.server-changes/otel-attribute-utf16-sanitization.md

🚧 Files skipped from review as they are similar to previous changes (6)

apps/webapp/app/utils/detectBadJsonStrings.ts
apps/webapp/test/otlpUtf16Sanitization.integration.test.ts
apps/webapp/test/sanitizeRowsOnParseError.test.ts
apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts
apps/webapp/app/v3/eventRepository/clickhouseEventRepository.server.ts
apps/webapp/test/detectbadJsonStrings.test.ts

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: typecheck / typecheck
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
GitHub Check: Analyze (javascript-typescript)

Walkthrough

This PR implements UTF-16 surrogate sanitization for ClickHouse JSON parse failures in OTEL attributes. It broadens the detectBadJsonStrings function to recognize the full low-surrogate range (U+DC00–U+DFFF), introduces a sanitization module with helpers to detect ClickHouse parse errors, extract failing row indices, and recursively replace lone surrogates with a sentinel value, integrates a sanitize-and-retry mechanism into ClickhouseEventRepository that logs and drops unrecoverable batches, and provides unit and integration test coverage for both the detection logic and end-to-end recovery flow.
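The low-surrogate range matters because the third hex digit of an escaped low surrogate can be `c`, `d`, `e`, or `f`. The sketch below is only an illustration of that point, assuming the check runs over escaped JSON text; the pattern names and the `highSurrogatesArePaired` helper are hypothetical and are not taken from `detectBadJsonStrings`.

```typescript
// Hypothetical patterns for illustration; not the PR's actual regexes.
// Each tests whether a 6-character slice is an escaped low surrogate.
const narrowLow = /^\\ud[cd][0-9a-f]{2}/i; // covers only U+DC00–U+DDFF
const fullLow = /^\\ud[c-f][0-9a-f]{2}/i; // covers U+DC00–U+DFFF

// Escaped form of U+1F600 (grinning face): high D83D, low DE00.
const grinning = "\\ud83d\\ude00";

// Returns true when every escaped high surrogate in `jsonText` is
// immediately followed by an escaped low surrogate, as judged by `low`.
// Simplification: escaped backslashes (`\\\\ud83d`) are not handled.
function highSurrogatesArePaired(jsonText: string, low: RegExp): boolean {
  const high = /\\ud[89ab][0-9a-f]{2}/gi;
  let match: RegExpExecArray | null;
  while ((match = high.exec(jsonText)) !== null) {
    // The six characters right after the high surrogate escape.
    const next = jsonText.slice(match.index + 6, match.index + 12);
    if (!low.test(next)) return false;
  }
  return true;
}

// With the narrow class the valid emoji pair looks unpaired, because
// its low surrogate starts with `de`; the full class accepts it.
const flaggedByNarrow = !highSurrogatesArePaired(grinning, narrowLow);
const acceptedByFull = highSurrogatesArePaired(grinning, fullLow);
```

Under these assumptions, a narrow class both misses lone low surrogates in U+DE00–U+DFFF and reports valid pairs whose low half falls in that range.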

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete. It only contains before/after screenshots without following the provided template structure (missing issue reference, checklist, testing details, and changelog).	Complete the PR description by filling in all template sections: close issue reference, checklist items, testing steps, changelog summary, and ensure context about the fix is documented.
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: implementing UTF-16 surrogate sanitization for OTel attributes when ClickHouse JSON parsing fails.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

ClickHouse's JSONEachRow parser rejects rows containing unpaired UTF-16 surrogates (`Cannot parse JSON object here ... ParallelParsingBlock InputFormat`), losing the whole 5–10k-row batch through the scheduler's retry path. Locally reproduced with ~10 KB rows; the 100 MB size-stress error is distinct (`Size of JSON object is extremely large`), so the root cause is content quality, not size.

`ClickhouseEventRepository.#flushBatch` and `#flushLlmMetricsBatch` now retry once after sanitizing every row in the batch: any string with a lone surrogate is replaced with `"[invalid-utf16]"`. ClickHouse's `at row N` hint is logged for observability but not used to slice; its semantics under `input_format_parallel_parsing` aren't reliable, and a whole-batch scan catches multi-row poisoning in one pass.

If the retry also fails: loud error log with sample row, `permanentlyDroppedBatches` increments, return normally, because deterministic parse failures don't benefit from the scheduler's transient-retry backoff. Non-parse errors propagate unchanged.

Detection reuses `detectBadJsonStrings` via `JSON.stringify(value)`, with a latent regex bug fixed: the low-surrogate nibble matched `[cd]` instead of `[c-f]`, missing U+DE00–U+DFFF and false-flagging common emoji pairs (e.g. 😀). Healthy batches pay zero scan cost, since the check only runs when ClickHouse has already rejected.
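The whole-value scan described above can be sketched as a recursive walk over plain JSON-like data. This is a minimal illustration, not the PR's code: `hasLoneSurrogate` and `sanitizeValue` are hypothetical names, the check works on code units directly rather than through `detectBadJsonStrings`, and object keys are left untouched as a simplifying assumption.

```typescript
const SENTINEL = "[invalid-utf16]"; // replacement value named in the PR description

// True if `s` contains a high surrogate not followed by a low surrogate,
// or a low surrogate not preceded by a high surrogate.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i++; // skip the low half of a valid pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      return true;
    }
  }
  return false;
}

type Sanitized = { value: unknown; replaced: number };

// Recursively copies `value`, replacing every string that contains a lone
// surrogate with SENTINEL and counting the replacements.
// Assumes plain JSON data (strings, numbers, booleans, null, arrays, objects).
function sanitizeValue(value: unknown): Sanitized {
  if (typeof value === "string") {
    return hasLoneSurrogate(value)
      ? { value: SENTINEL, replaced: 1 }
      : { value, replaced: 0 };
  }
  if (Array.isArray(value)) {
    let replaced = 0;
    const out = value.map((item) => {
      const r = sanitizeValue(item);
      replaced += r.replaced;
      return r.value;
    });
    return { value: out, replaced };
  }
  if (value !== null && typeof value === "object") {
    let replaced = 0;
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      const r = sanitizeValue(inner);
      replaced += r.replaced;
      out[key] = r.value;
    }
    return { value: out, replaced };
  }
  return { value, replaced: 0 };
}
```

Returning a replacement count lets a caller tell "nothing to fix" apart from "fixed something", which is useful when deciding whether a retry is worth attempting.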

…ication (#3708)

## Summary

On a ClickHouse `Cannot parse JSON object` rejection, `RunsReplicationService` now sanitizes lone UTF-16 surrogates across the failing batch via the existing `sanitizeRows` helper and retries once. If the sanitizer found nothing or the retry also fails, the batch is dropped loudly with a counter increment, so the surrounding `#insertWithRetry` layer doesn't spin three more times on a deterministic failure. Non-parse errors propagate unchanged.

Mirrors the pattern from #3659 (for `ClickhouseEventRepository`): same root cause (lone UTF-16 surrogates in user-provided JSON), same recovery shape, **reusing the same shared helpers** (`sanitizeRows`, `isClickHouseJsonParseError`, `parseRowNumberFromError`).

Fixes the customer-facing symptom from [TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755): a single row's poisoned `output` JSON used to take down the `COMPLETED_SUCCESSFULLY` UPDATE events for its 50+ batch-mates, stranding them in `EXECUTING` in ClickHouse forever and inflating "Running" counts on the Tasks page. Confirmed in production this is ongoing: ~120k stale rows accumulated in a single 5-hour burst on 2026-05-18, with a smaller continuous leak before and after.

## What changed

`apps/webapp/app/services/runsReplicationService.server.ts`:

- Imports the three helpers from `~/v3/eventRepository/sanitizeRowsOnParseError.server` (no duplication; no move).
- New private `#insertWithJsonParseRecovery<T>(rows, doInsert, contextLabel, attempt)`, generic over `TaskRunInsertArray[]` and `PayloadInsertArray[]`, structurally identical to `ClickhouseEventRepository.#insertWithJsonParseRecovery`. Try → on parse error sanitize the whole batch (the `at row N` hint is logged but not used to slice, because semantics under `input_format_parallel_parsing` aren't stable) → retry once → drop with loud log if sanitizer found nothing OR retry still fails.
- `#insertTaskRunInserts` and `#insertPayloadInserts` extract a `doInsert` closure and hand it to the wrapper. Existing error logging, span recording, and `recordSpanError` are preserved inside the closure.
- New `private _permanentlyDroppedBatches = 0` counter with a public getter, for ops dashboards and tests (matches the events-repo convention). One shared counter for both insert sites; granularity comes from the `contextLabel` (`task_runs_v2` / `raw_task_runs_payload_v1`) on every log line.

`.server-changes/runs-replication-utf16-recovery.md`: release notes entry.

## Why no new tests

The shared helpers already have full unit + real-ClickHouse contract coverage from #3659 (`apps/webapp/test/sanitizeRowsOnParseError.test.ts`, `apps/webapp/test/otlpUtf16Sanitization.integration.test.ts`). The new wrapper is a line-for-line structural port. Adding a parallel integration test would require synthesizing bad data that *escapes* the preemptive `detectBadJsonStrings` check in `#prepareJson` but still trips ClickHouse, which is non-trivial without hand-crafted fixtures and wouldn't cover any new logic.

## What this does NOT do

- Doesn't touch the ~120k existing stale `EXECUTING` rows in production. That needs a reconciliation/backfill sweep (separate ticket: TRI-9755 fix #3).
- Doesn't sanitize the `error` column path (`runsReplicationService.server.ts:932 const errorData = { data: run.error };`). Reactive recovery will catch it if it ever poisons a batch, but feeding it through `#prepareJson` like `output` is a cheap follow-up.

## Test plan

- [x] `pnpm run typecheck --filter webapp`: clean
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stays at zero (or near-zero) in `/stp/trigger-app-prod/ecs/replication/service-container/process-logs`, and watch for `Sanitizing batch after ClickHouse JSON parse error` warns to confirm recovery is firing on real traffic
- [ ] Post-deploy: confirm the rate of new "EXECUTING-but-actually-COMPLETED" zombies in ClickHouse flattens (current rate ≈ tens-to-hundreds per hour platform-wide)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
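The recovery shape described in the commit message (try, sanitize on parse error, retry once, drop otherwise) can be sketched as a standalone generic function. This is an illustration under stated assumptions, not the service's private method: the free-function form, the `sanitize`, `isParseError`, and `onDrop` parameters, and the string outcome values are all hypothetical. The source does not say whether a non-parse error on the retry is handled differently, so this sketch drops on any retry failure.

```typescript
type Outcome = "inserted" | "recovered" | "dropped";

// Hypothetical standalone version of the try → sanitize → retry-once → drop flow.
async function insertWithJsonParseRecovery<T>(
  rows: T[],
  doInsert: (rows: T[]) => Promise<void>,
  sanitize: (rows: T[]) => { rows: T[]; replaced: number },
  isParseError: (err: unknown) => boolean,
  onDrop: (err: unknown) => void // e.g. log loudly and bump a counter
): Promise<Outcome> {
  try {
    await doInsert(rows);
    return "inserted";
  } catch (err) {
    // Non-parse errors propagate unchanged to the caller's retry layer.
    if (!isParseError(err)) throw err;

    const cleaned = sanitize(rows);
    if (cleaned.replaced === 0) {
      // Nothing to fix, so retrying would fail the same way.
      onDrop(err);
      return "dropped";
    }

    try {
      await doInsert(cleaned.rows); // single retry with sanitized rows
      return "recovered";
    } catch (retryErr) {
      // Assumption: any failure of the retry drops the batch.
      onDrop(retryErr);
      return "dropped";
    }
  }
}
```

Returning normally after a drop, instead of rethrowing, is what keeps an outer retry layer from repeating an insert that is expected to fail deterministically.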


devin-ai-integration Bot reviewed May 18, 2026


0ski force-pushed the oskar/fix-otel-utf16-attributes branch from fa0034d to 8cc9e85 Compare May 18, 2026 15:20

0ski marked this pull request as ready for review May 18, 2026 15:23

ericallam approved these changes May 18, 2026


0ski merged commit 02d61af into main May 18, 2026
29 checks passed

0ski deleted the oskar/fix-otel-utf16-attributes branch May 18, 2026 15:42

matt-aitken mentioned this pull request May 22, 2026

fix(webapp): recover from ClickHouse JSON parse failures in runs replication #3708

Merged

3 tasks

fix(webapp): sanitize OTel attributes on ClickHouse JSON parse rejection #3659
0ski merged 1 commit into main from oskar/fix-otel-utf16-attributes
