fix(webapp): sanitize OTel attributes on ClickHouse JSON parse rejection #3659
Walkthrough: This PR implements UTF-16 surrogate sanitization for ClickHouse JSON parse failures in OTEL attributes.
ClickHouse's JSONEachRow parser rejects rows containing unpaired UTF-16 surrogates (`Cannot parse JSON object here ... ParallelParsingBlock InputFormat`), losing the whole 5–10k-row batch through the scheduler's retry path. Locally reproduced with ~10 KB rows; the 100 MB size-stress error is distinct (`Size of JSON object is extremely large`), so the root cause is content quality, not size.

`ClickhouseEventRepository.#flushBatch` and `#flushLlmMetricsBatch` now retry once after sanitizing every row in the batch: any string with a lone surrogate is replaced with `"[invalid-utf16]"`. ClickHouse's `at row N` hint is logged for observability but not used to slice; its semantics under `input_format_parallel_parsing` aren't reliable, and a whole-batch scan catches multi-row poisoning in one pass.

If the retry also fails, the batch is dropped: a loud error log with a sample row is emitted, `permanentlyDroppedBatches` increments, and the flush returns normally, because deterministic parse failures don't benefit from the scheduler's transient-retry backoff. Non-parse errors propagate unchanged.

Detection reuses `detectBadJsonStrings` via `JSON.stringify(value)`, with a latent regex bug fixed: the low-surrogate nibble matched `[cd]` instead of `[c-f]`, missing U+DE00–U+DFFF and false-flagging common emoji pairs (e.g. 😀). Healthy batches pay zero scan cost, since the check only runs when ClickHouse has already rejected the batch.
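The whole-batch sanitization described above could look roughly like the following. This is a minimal sketch, not the PR's code: it scans UTF-16 code units directly instead of reusing `detectBadJsonStrings`, and the names `hasLoneSurrogate`, `sanitizeValue`, and `sanitizeBatch` are hypothetical. It also assumes rows are plain JSON-like data (strings, numbers, booleans, null, arrays, plain objects) and does not inspect object keys.

```typescript
// Placeholder taken from the PR description.
const PLACEHOLDER = "[invalid-utf16]";

// True if the string contains a high surrogate (U+D800..U+DBFF) that is not
// followed by a low surrogate (U+DC00..U+DFFF), or a low surrogate that is
// not preceded by a high one.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair: skip the low half
        continue;
      }
      return true; // high surrogate without a partner
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

// Recursively copies a JSON-like value, replacing any string that contains a
// lone surrogate. Assumption: no Dates, Maps, or class instances in rows.
function sanitizeValue(value: unknown, counter: { replaced: number }): unknown {
  if (typeof value === "string") {
    if (hasLoneSurrogate(value)) {
      counter.replaced++;
      return PLACEHOLDER;
    }
    return value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => sanitizeValue(item, counter));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = sanitizeValue(inner, counter);
    }
    return out;
  }
  return value;
}

// Sanitizes every row and reports how many strings were replaced, so the
// caller can tell whether a retry has any chance of succeeding.
function sanitizeBatch<T>(rows: T[]): { rows: T[]; replaced: number } {
  const counter = { replaced: 0 };
  const sanitized = rows.map((row) => sanitizeValue(row, counter) as T);
  return { rows: sanitized, replaced: counter.replaced };
}
```

The low-surrogate upper bound is the point of the regex fix: 😀 is the pair U+D83D U+DE00, so a check that only accepts low surrogates up to U+DDFF would treat its high half as unpaired.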
…ication (#3708)

## Summary

On a ClickHouse `Cannot parse JSON object` rejection, `RunsReplicationService` now sanitizes lone UTF-16 surrogates across the failing batch via the existing `sanitizeRows` helper and retries once. If the sanitizer found nothing or the retry also fails, the batch is dropped loudly with a counter increment, so the surrounding `#insertWithRetry` layer doesn't spin three more times on a deterministic failure. Non-parse errors propagate unchanged.

Mirrors the pattern from #3659 (for `ClickhouseEventRepository`): same root cause (lone UTF-16 surrogates in user-provided JSON), same recovery shape, **reusing the same shared helpers** (`sanitizeRows`, `isClickHouseJsonParseError`, `parseRowNumberFromError`).

Fixes the customer-facing symptom from [TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755): a single row's poisoned `output` JSON used to take down the `COMPLETED_SUCCESSFULLY` UPDATE events for its 50+ batch-mates, stranding them in `EXECUTING` in ClickHouse forever and inflating "Running" counts on the Tasks page. Confirmed in production that this is ongoing: ~120k stale rows accumulated in a single 5-hour burst on 2026-05-18, with a smaller continuous leak before and after.

## What changed

`apps/webapp/app/services/runsReplicationService.server.ts`:

- Imports the three helpers from `~/v3/eventRepository/sanitizeRowsOnParseError.server` (no duplication; no move).
- New private `#insertWithJsonParseRecovery<T>(rows, doInsert, contextLabel, attempt)`, generic over `TaskRunInsertArray[]` and `PayloadInsertArray[]` and structurally identical to `ClickhouseEventRepository.#insertWithJsonParseRecovery`. Try → on parse error sanitize the whole batch (the `at row N` hint is logged but not used to slice, because its semantics under `input_format_parallel_parsing` aren't stable) → retry once → drop with loud log if the sanitizer found nothing OR the retry still fails.
- `#insertTaskRunInserts` and `#insertPayloadInserts` extract a `doInsert` closure and hand it to the wrapper. Existing error logging, span recording, and `recordSpanError` are preserved inside the closure.
- New `private _permanentlyDroppedBatches = 0` counter with a public getter, for ops dashboards and tests (matches the events-repo convention). One shared counter for both insert sites; granularity comes from the `contextLabel` (`task_runs_v2` / `raw_task_runs_payload_v1`) on every log line.

`.server-changes/runs-replication-utf16-recovery.md`: release notes entry.

## Why no new tests

The shared helpers already have full unit + real-ClickHouse contract coverage from #3659 (`apps/webapp/test/sanitizeRowsOnParseError.test.ts`, `apps/webapp/test/otlpUtf16Sanitization.integration.test.ts`). The new wrapper is a line-for-line structural port. Adding a parallel integration test would require synthesizing bad data that *escapes* the preemptive `detectBadJsonStrings` check in `#prepareJson` but still trips ClickHouse, which is non-trivial without hand-crafted fixtures and wouldn't cover any new logic.

## What this does NOT do

- Doesn't touch the ~120k existing stale `EXECUTING` rows in production. That needs a reconciliation/backfill sweep (separate ticket: TRI-9755 fix #3).
- Doesn't sanitize the `error` column path (`runsReplicationService.server.ts:932 const errorData = { data: run.error };`). Reactive recovery will catch it if it ever poisons a batch, but feeding it through `#prepareJson` like `output` is a cheap follow-up.

## Test plan

- [x] `pnpm run typecheck --filter webapp`: clean
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stays at zero (or near-zero) in `/stp/trigger-app-prod/ecs/replication/service-container/process-logs`, and watch for `Sanitizing batch after ClickHouse JSON parse error` warns to confirm recovery is firing on real traffic
- [ ] Post-deploy: confirm the rate of new "EXECUTING-but-actually-COMPLETED" zombies in ClickHouse flattens (current rate ≈ tens-to-hundreds per hour platform-wide)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
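The try → sanitize → retry once → drop flow of `#insertWithJsonParseRecovery` can be sketched as below. This is an illustration under stated assumptions, not the service's code: `createRecoveringInserter`, `isJsonParseError`, `SanitizeFn`, and `Outcome` are hypothetical names, the real helpers (`sanitizeRows`, `isClickHouseJsonParseError`) may have different signatures, and the parse-error check here is a plain substring match on the error message. Rethrowing a non-parse error raised by the retry is also an assumption; the description only states that non-parse errors propagate unchanged.

```typescript
// Hypothetical shape for a sanitizer: returns cleaned rows plus a count of
// replaced strings so the caller knows whether a retry is worthwhile.
type SanitizeFn<T> = (rows: T[]) => { rows: T[]; replaced: number };
type Outcome = "inserted" | "recovered" | "dropped";

// Assumption: the rejection is recognizable from the error message text.
function isJsonParseError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("Cannot parse JSON object");
}

function createRecoveringInserter<T>(
  sanitize: SanitizeFn<T>,
  log: (message: string) => void = () => {}
) {
  let dropped = 0; // plays the role of permanentlyDroppedBatches

  async function insert(
    rows: T[],
    doInsert: (rows: T[]) => Promise<void>,
    contextLabel: string
  ): Promise<Outcome> {
    try {
      await doInsert(rows);
      return "inserted";
    } catch (err) {
      // Non-parse errors go back to the caller's normal retry handling.
      if (!isJsonParseError(err)) throw err;

      log(`[${contextLabel}] Sanitizing batch after ClickHouse JSON parse error`);
      const result = sanitize(rows);

      // Retry exactly once, and only if the sanitizer changed something.
      if (result.replaced > 0) {
        try {
          await doInsert(result.rows);
          return "recovered";
        } catch (retryErr) {
          if (!isJsonParseError(retryErr)) throw retryErr;
        }
      }

      // Deterministic failure: count it, log it, and return normally so the
      // outer retry layer does not repeat a hopeless insert.
      dropped++;
      log(`[${contextLabel}] Dropping batch of ${rows.length} rows`);
      return "dropped";
    }
  }

  return { insert, droppedCount: () => dropped };
}
```

Returning normally on a drop is the design choice that keeps an outer layer such as `#insertWithRetry` from retrying a failure that cannot succeed, while the counter and log line keep the loss visible.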
Before fix: (screenshot not captured)

After fix: (screenshot not captured)