fix(webapp): recover from ClickHouse JSON parse failures in runs replication by matt-aitken · Pull Request #3708 · triggerdotdev/trigger.dev

matt-aitken · 2026-05-22T16:13:41Z

Summary

On a ClickHouse Cannot parse JSON object rejection, RunsReplicationService now sanitizes lone UTF-16 surrogates across the failing batch via the existing sanitizeRows helper and retries once. If the sanitizer found nothing or the retry also fails, the batch is dropped loudly with a counter increment, so the surrounding #insertWithRetry layer doesn't spin three more times on a deterministic failure. Non-parse errors propagate unchanged.

Mirrors the pattern from #3659 (for ClickhouseEventRepository) — same root cause (lone UTF-16 surrogates in user-provided JSON), same recovery shape, reusing the same shared helpers (sanitizeRows, isClickHouseJsonParseError, parseRowNumberFromError).

Fixes the customer-facing symptom from TRI-9755: a single row's poisoned output JSON used to take down the COMPLETED_SUCCESSFULLY UPDATE events for its 50+ batch-mates, stranding them in EXECUTING in ClickHouse forever and inflating "Running" counts on the Tasks page. Confirmed in production this is ongoing — ~120k stale rows accumulated in a single 5-hour burst on 2026-05-18; smaller continuous leak before and after.

What changed

apps/webapp/app/services/runsReplicationService.server.ts:

Imports the three helpers from ~/v3/eventRepository/sanitizeRowsOnParseError.server (no duplication; no move).
New private #insertWithJsonParseRecovery<T>(rows, doInsert, contextLabel, attempt) — generic over TaskRunInsertArray[] and PayloadInsertArray[], structurally identical to ClickhouseEventRepository.#insertWithJsonParseRecovery. Try → on parse error sanitize the whole batch (the at row N hint is logged but not used to slice — semantics under input_format_parallel_parsing aren't stable) → retry once → drop with loud log if sanitizer found nothing OR retry still fails.
#insertTaskRunInserts and #insertPayloadInserts extract a doInsert closure and hand it to the wrapper. Existing error logging, span recording, and recordSpanError are preserved inside the closure.
New private _permanentlyDroppedBatches = 0 counter with a public getter, for ops dashboards and tests (matches the events-repo convention). One shared counter for both insert sites — granularity comes from the contextLabel (task_runs_v2 / raw_task_runs_payload_v1) on every log line.

.server-changes/runs-replication-utf16-recovery.md — release notes entry.

Why no new tests

The shared helpers already have full unit + real-ClickHouse contract coverage from #3659 (apps/webapp/test/sanitizeRowsOnParseError.test.ts, apps/webapp/test/otlpUtf16Sanitization.integration.test.ts). The new wrapper is a line-for-line structural port. Adding a parallel integration test would require synthesizing bad data that escapes the preemptive detectBadJsonStrings check in #prepareJson but still trips ClickHouse — non-trivial without hand-crafted fixtures and wouldn't cover any new logic.

What this does NOT do

Doesn't touch the ~120k existing stale EXECUTING rows in production. That needs a reconciliation/backfill sweep (separate ticket — TRI-9755 fix [Docs] Add initial files and configuration #3).
Doesn't sanitize the error column path (runsReplicationService.server.ts:932 const errorData = { data: run.error };). Reactive recovery will catch it if it ever poisons a batch, but feeding it through #prepareJson like output is a cheap follow-up.

Test plan

pnpm run typecheck --filter webapp — clean
Post-deploy: confirm permanentlyDroppedBatches counter stays at zero (or near-zero) in /stp/trigger-app-prod/ecs/replication/service-container/process-logs, and watch for Sanitizing batch after ClickHouse JSON parse error warns to confirm recovery is firing on real traffic
Post-deploy: confirm the rate of new "EXECUTING-but-actually-COMPLETED" zombies in ClickHouse flattens (current rate ≈ tens-to-hundreds per hour platform-wide)

🤖 Generated with Claude Code

…ication On a `Cannot parse JSON object` rejection, sanitize lone UTF-16 surrogates across the batch via the existing `sanitizeRows` helper and retry once. Drop the batch loudly if the sanitizer found nothing or the retry also fails, so the surrounding `#insertWithRetry` layer doesn't spin on a deterministic failure. Non-parse errors propagate unchanged. Mirrors the pattern shipped for ClickhouseEventRepository in #3659 — same root cause (lone surrogates in user-provided JSON), same recovery shape, same shared helpers. Fixes the customer-facing symptom from TRI-9755: one row's bad output JSON used to kill the COMPLETED updates for its 50+ batch-mates, stranding them in EXECUTING in ClickHouse forever and inflating "Running" counts on the Tasks page. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

changeset-bot · 2026-05-22T16:13:49Z

⚠️ No Changeset found

Latest commit: bea811a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-05-22T16:14:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7dbd6035-d58e-4ed6-a0a5-4c3536fd9424

📥 Commits

Reviewing files that changed from the base of the PR and between 8957a01 and bea811a.

📒 Files selected for processing (1)

apps/webapp/app/services/runsReplicationService.server.ts

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: typecheck / typecheck
GitHub Check: Analyze (javascript-typescript)

🧰 Additional context used

📓 Path-based instructions (7)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

apps/webapp/app/services/runsReplicationService.server.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved otherwise, code splitting is needed for performance, or the module must be loaded conditionally at runtime.
Import from @trigger.dev/core using subpaths only - never import from the root.
When writing Trigger.dev tasks, always import from @trigger.dev/sdk. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob.
Add agentcrumbs markers (// @Crumbs or `#region `@crumbs) as you write code, not just when debugging. They stay on the branch throughout development and are stripped by agentcrumbs strip before merge.

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Use named constants for sentinel/placeholder values (e.g. const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

apps/webapp/**/*.server.ts: Never use request.signal for detecting client disconnects. Use getRequestAbortSignal() from app/services/httpAsyncStorage.server.ts instead, which is wired directly to Express res.on('close') and fires reliably
Access environment variables via env export from app/env.server.ts. Never use process.env directly
Always use findFirst instead of findUnique in Prisma queries. findUnique has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). findFirst is never batched and avoids this entire class of issues

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.{js,jsx,ts,tsx,json,md,yml,yaml}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting must be enforced using Prettier before committing

Files:

apps/webapp/app/services/runsReplicationService.server.ts

🧠 Learnings (9)

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-03-26T09:02:07.973Z

Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-04-20T14:50:16.440Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:224-231
Timestamp: 2026-04-20T14:50:16.440Z
Learning: In Trigger.dev’s replication services (e.g., sessionsReplicationService.server.ts and runsReplicationService.server.ts), the “acknowledge-before-flush” behavior is intentional. The `_latestCommitEndLsn` should be updated at Postgres commit time and acknowledged on a periodic interval (via methods like `#acknowledgeLatestTransaction`) without waiting for ClickHouse batch flush to complete. Reviewers should not flag this as a durability/ordering bug; it is an established project-wide at-least-once delivery trade-off used across both runs and sessions replication services.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-04-20T15:08:49.959Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:204-215
Timestamp: 2026-04-20T15:08:49.959Z
Learning: For replication services in `apps/webapp/app/services/*ReplicationService.server.ts`, keep the `ConcurrentFlushScheduler` deduplication key shape consistent across the related services (e.g., sessions vs runs) by using the same `${item.event}_${item.session.id}` / `${item.event}_${item.run.id}` pattern. If the key format ever needs to change (such as keying only by session/run id), make the update in all related replication services together—never in just one—so deduplication behavior stays aligned across services.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-05T09:38:02.512Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-12T21:04:05.815Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.

Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.

Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

🔇 Additional comments (5)

apps/webapp/app/services/runsReplicationService.server.ts (5)

41-45: LGTM!

137-144: LGTM!

Also applies to: 300-303

680-707: LGTM!

879-884: LGTM!

Also applies to: 913-918

945-1020: LGTM!

Walkthrough

This PR adds UTF-16 JSON parse error recovery to RunsReplicationService. When ClickHouse JSONEachRow inserts fail with parse errors, the service now sanitizes lone UTF-16 surrogates across the batch via the shared sanitizeRows helper, retries once, and if recovery fails, drops the batch permanently with logging and counter tracking. Non-parse errors continue to propagate unchanged. Both task-run and payload insert paths route through a new #insertWithJsonParseRecovery wrapper that handles this flow to prevent infinite retry loops on deterministic failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The linked issue `#3` is about documentation setup for docs.trigger.dev, which is completely unrelated to the ClickHouse JSON parse recovery changes in RunsReplicationService.	The linked issue does not correspond to this PR's objectives. Link the correct issue(s) related to TRI-9755 or the runs replication JSON parse recovery instead.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and specifically describes the main change: adding JSON parse error recovery to the RunsReplicationService for ClickHouse failures.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering summary, changes, motivation, test approach, and implementation details; however, the provided template checklist is not filled out.
Out of Scope Changes check	✅ Passed	All code changes are focused on ClickHouse JSON parse error recovery in RunsReplicationService; release notes addition is within scope.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/tri-9755-runs-replication-silently-drops-batches-on-bad-output-json

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

apps/webapp/app/services/runsReplicationService.server.ts (1)

958-961: 💤 Low value

Consider extracting error message extraction to a small helper.

The same pattern for safely extracting the message from an unknown error appears twice. A tiny helper would reduce duplication and make the intent clearer.

♻️ Optional: Extract helper

+function getErrorMessage(err: unknown): string {
+  if (typeof err === "object" && err !== null && "message" in err) {
+    return String((err as { message?: unknown }).message ?? "");
+  }
+  return String(err);
+}
+
 async `#insertWithJsonParseRecovery`<T extends object>(
   ...
 ): Promise<...> {
   try {
     return { kind: "inserted", insertResult: await doInsert() };
   } catch (firstError) {
     if (!isClickHouseJsonParseError(firstError)) throw firstError;

-    const firstMessage =
-      typeof firstError === "object" && firstError !== null && "message" in firstError
-        ? String((firstError as { message?: unknown }).message ?? "")
-        : String(firstError);
+    const firstMessage = getErrorMessage(firstError);
     ...
     try {
       return { kind: "sanitized", insertResult: await doInsert() };
     } catch (retryError) {
       if (!isClickHouseJsonParseError(retryError)) throw retryError;

       this._permanentlyDroppedBatches += 1;
-      const retryMessage =
-        typeof retryError === "object" && retryError !== null && "message" in retryError
-          ? String((retryError as { message?: unknown }).message ?? "")
-          : String(retryError);
+      const retryMessage = getErrorMessage(retryError);

Also applies to: 999-1002

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/webapp/app/services/runsReplicationService.server.ts` around lines 958 -
961, Extract the repeated safe-error-to-string logic into a small helper (e.g.,
function getErrorMessage(err: unknown): string) and replace the inline
extraction used for firstMessage (which inspects firstError) and the similar
block at the other occurrence (lines ~999-1002) with calls to this helper; the
helper should handle null/undefined, check for object with a message property,
and return a string fallback so callers like firstMessage =
getErrorMessage(firstError) remain concise and type-safe.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/webapp/app/services/runsReplicationService.server.ts`:
- Around line 958-961: Extract the repeated safe-error-to-string logic into a
small helper (e.g., function getErrorMessage(err: unknown): string) and replace
the inline extraction used for firstMessage (which inspects firstError) and the
similar block at the other occurrence (lines ~999-1002) with calls to this
helper; the helper should handle null/undefined, check for object with a message
property, and return a string fallback so callers like firstMessage =
getErrorMessage(firstError) remain concise and type-safe.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ee3836b1-9dd1-48b3-b10d-31d3b24f2389

📥 Commits

Reviewing files that changed from the base of the PR and between 0d4891a and 8957a01.

📒 Files selected for processing (2)

.server-changes/runs-replication-utf16-recovery.md
apps/webapp/app/services/runsReplicationService.server.ts

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: typecheck / typecheck
GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp

🧰 Additional context used

📓 Path-based instructions (7)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

apps/webapp/app/services/runsReplicationService.server.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved otherwise, code splitting is needed for performance, or the module must be loaded conditionally at runtime.
Import from @trigger.dev/core using subpaths only - never import from the root.
When writing Trigger.dev tasks, always import from @trigger.dev/sdk. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob.
Add agentcrumbs markers (// @Crumbs or `#region `@crumbs) as you write code, not just when debugging. They stay on the branch throughout development and are stripped by agentcrumbs strip before merge.

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Use named constants for sentinel/placeholder values (e.g. const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

apps/webapp/**/*.server.ts: Never use request.signal for detecting client disconnects. Use getRequestAbortSignal() from app/services/httpAsyncStorage.server.ts instead, which is wired directly to Express res.on('close') and fires reliably
Access environment variables via env export from app/env.server.ts. Never use process.env directly
Always use findFirst instead of findUnique in Prisma queries. findUnique has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). findFirst is never batched and avoids this entire class of issues

Files:

apps/webapp/app/services/runsReplicationService.server.ts

**/*.{js,jsx,ts,tsx,json,md,yml,yaml}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting must be enforced using Prettier before committing

Files:

apps/webapp/app/services/runsReplicationService.server.ts

🧠 Learnings (10)

📚 Learning: 2026-05-14T14:54:39.095Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3545
File: .server-changes/agent-view-sessions.md:10-10
Timestamp: 2026-05-14T14:54:39.095Z
Learning: In the `trigger.dev` repository, do not flag inconsistent dot vs slash notation in route/path strings inside `.server-changes/*.md` files. These markdown files are consumed verbatim into the changelog, so the mixed notation (e.g., `resources.orgs.../runs.$runParam/...`) is intentional and should be preserved as-is.

Applied to files:

.server-changes/runs-replication-utf16-recovery.md

📚 Learning: 2026-03-22T13:26:12.060Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-03-22T19:24:14.403Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-18T08:21:27.694Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-03-26T09:02:07.973Z

Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-04-20T14:50:16.440Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:224-231
Timestamp: 2026-04-20T14:50:16.440Z
Learning: In Trigger.dev’s replication services (e.g., sessionsReplicationService.server.ts and runsReplicationService.server.ts), the “acknowledge-before-flush” behavior is intentional. The `_latestCommitEndLsn` should be updated at Postgres commit time and acknowledged on a periodic interval (via methods like `#acknowledgeLatestTransaction`) without waiting for ClickHouse batch flush to complete. Reviewers should not flag this as a durability/ordering bug; it is an established project-wide at-least-once delivery trade-off used across both runs and sessions replication services.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-04-20T15:08:49.959Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:204-215
Timestamp: 2026-04-20T15:08:49.959Z
Learning: For replication services in `apps/webapp/app/services/*ReplicationService.server.ts`, keep the `ConcurrentFlushScheduler` deduplication key shape consistent across the related services (e.g., sessions vs runs) by using the same `${item.event}_${item.session.id}` / `${item.event}_${item.run.id}` pattern. If the key format ever needs to change (such as keying only by session/run id), make the update in all related replication services together—never in just one—so deduplication behavior stays aligned across services.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-05T09:38:02.512Z

Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2026-05-12T21:04:05.815Z

Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.

Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.

Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

🔇 Additional comments (8)

.server-changes/runs-replication-utf16-recovery.md (2)

6-20: LGTM!

1-21: ⚡ Quick win

Prettier formatting compliant for .server-changes/runs-replication-utf16-recovery.md
npx prettier --check .server-changes/runs-replication-utf16-recovery.md reports “All matched files use Prettier code style!”, and the changelog content is clear and comprehensive.

apps/webapp/app/services/runsReplicationService.server.ts (6)

41-45: LGTM!

137-144: LGTM!

300-303: LGTM!

859-882: LGTM!

894-918: LGTM!

920-955: LGTM!

Also applies to: 966-1018

…rics Devin caught this: when #insertWithJsonParseRecovery drops a batch (sanitizer no-op, or sanitize-retry still hit a parse error), #insertTaskRunInserts was previously converting `{kind: "dropped"}` to `undefined`, so #insertWithRetry saw `[null, undefined]` (no error) and #flushBatch ticked `_taskRunsInsertedCounter` / `_payloadsInsertedCounter` for rows that never landed in ClickHouse. Fix: return the recovery wrapper's outcome straight through. #flushBatch now reads the outcome and only increments the success counter when both `!err` AND `outcome?.kind !== "dropped"`. Matches the pattern in ClickhouseEventRepository where the caller explicitly bails on `outcome.kind === "dropped"` before downstream success work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This comment was marked as resolved.

Sign in to view

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(webapp): recover from ClickHouse JSON parse failures in runs replication#3708

fix(webapp): recover from ClickHouse JSON parse failures in runs replication#3708
matt-aitken wants to merge 2 commits into
mainfrom
feature/tri-9755-runs-replication-silently-drops-batches-on-bad-output-json

matt-aitken commented May 22, 2026

Uh oh!

changeset-bot Bot commented May 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 22, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

matt-aitken commented May 22, 2026

Summary

What changed

Why no new tests

What this does NOT do

Test plan

Uh oh!

changeset-bot Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading