fix(replay): Recover when rrweb recording silently stops #21409
Draft
billyvg wants to merge 1 commit into
rrweb recording could die while the rest of the Replay integration kept working (breadcrumbs/network/console continued and segments kept flushing), so it looked like recording just froze.

Every rrweb event flows through the `getHandleRecordingEmit` callback, which had no try/catch. Because our rrweb `errorHandler` returns `undefined`, rrweb re-throws anything that escapes the callback and can tear down recording or leave the mutation buffer permanently locked.

Separately, the buffer->session conversion tears down rrweb and restarts it; if that restart failed (rrweb's `record()` can throw or silently return `undefined`), `_stopRecording` was left unset with no recovery while the integration stayed enabled.

- Wrap the emit callback so an exception can never escape back into rrweb.
- Detect a missing recorder after `startRecording()` and surface it.
- Add a bounded flush-time watchdog that restarts a recorder which died while the integration is still enabled and unpaused.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
rrweb session-replay recording could stop entirely while the rest of the Replay integration kept working — breadcrumbs, network, and console events continued and segments kept flushing, so it looked like "recording just froze." This makes recording resilient so a single transient failure can no longer permanently kill it.
Root cause
Every rrweb event type flows through the single `getHandleRecordingEmit` callback, which had no `try/catch`. Because Sentry's rrweb `errorHandler` returns `undefined`, rrweb re-throws any error that escapes the callback (`callbackWrapper`), which can tear down recording or leave the mutation buffer permanently locked.

Separately, the buffer→session conversion (`sendBufferedReplayOrFlush`) tears down rrweb and restarts it. If that restart failed (rrweb's own `record()` swallows internal errors and returns `undefined`, and `startRecording()` swallowed throws), `_stopRecording` was left unset with no recovery while `_isEnabled` stayed `true`. Result: all rrweb stops, breadcrumbs keep flowing.

Changes
- `handleRecordingEmit.ts`: wrap the emit callback body in `try/catch` → `handleException`; never let an exception escape back into rrweb.
- `replay.ts` `startRecording()`: detect a missing recorder (`record()` threw or returned `undefined`), surface it, and track a bounded failure count.
- `replay.ts`: add `_ensureRecordingIsRunning()`, a flush-time watchdog that restarts a recorder which died while the integration is enabled and unpaused (bounded to avoid hot-looping).

Includes unit + integration tests, including a real-rrweb reproduction, covering the escape, the recovery, and the retry bound.
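For reviewers who want the shape of the two mechanisms without reading the diff, here is a minimal sketch of the pattern described above. It is not the code in this PR: `guardEmit`, `RecorderHost`, `RecordingWatchdog`, and the default bound of 3 are hypothetical names and values chosen for illustration, and resetting the failure count after a successful restart is an assumption.

```typescript
// Illustrative sketch only: every name below is hypothetical, not the PR's code.
type RecordingEvent = { type: number; timestamp: number; data?: unknown };
type EmitHandler = (event: RecordingEvent, isCheckout?: boolean) => void;

/** Wrap an emit handler so nothing it throws can propagate back into the recorder. */
function guardEmit(handler: EmitHandler, onError: (error: unknown) => void): EmitHandler {
  return (event, isCheckout) => {
    try {
      handler(event, isCheckout);
    } catch (error) {
      try {
        onError(error);
      } catch {
        // The reporter itself must not throw back into the recorder.
      }
    }
  };
}

/** The minimal surface the watchdog needs from the integration (assumed). */
interface RecorderHost {
  isEnabled(): boolean;
  isPaused(): boolean;
  hasRecorder(): boolean;
  startRecording(): void; // may throw, or return while hasRecorder() is still false
}

class RecordingWatchdog {
  private failures = 0;
  private readonly host: RecorderHost;
  private readonly maxFailures: number;

  constructor(host: RecorderHost, maxFailures = 3) {
    this.host = host;
    this.maxFailures = maxFailures;
  }

  /** Call at flush time. Returns true only if the host is active and a recorder is present. */
  ensureRunning(): boolean {
    if (!this.host.isEnabled() || this.host.isPaused()) return false;
    if (this.host.hasRecorder()) {
      this.failures = 0;
      return true;
    }
    // Bound the retries so a permanently broken recorder cannot hot-loop.
    if (this.failures >= this.maxFailures) return false;
    try {
      this.host.startRecording();
    } catch {
      // A throw is treated like a silent failure and counted below.
    }
    if (this.host.hasRecorder()) {
      this.failures = 0;
      return true;
    }
    this.failures++;
    return false;
  }
}
```

Checking for the recorder after the restart attempt, rather than relying on a thrown error, covers both failure modes named above: `record()` throwing and `record()` returning `undefined`.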
Follow-up
A separate fix in the `@sentry-internal/rrweb` fork (wrap `takeFullSnapshot`'s lock/unlock in `try/finally`, and unlock on the `!node` early-return) will close the "stuck locked" hazard at the source.

🤖 Generated with Claude Code
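The follow-up can be pictured with a small stand-in. This is a sketch under assumptions, not rrweb's actual `takeFullSnapshot`: `BufferLike` and `snapshotWithLock` are hypothetical, and a `null` return from the snapshot function stands in for the `!node` early return.

```typescript
// Hypothetical stand-in for a lockable mutation buffer; not rrweb's real class.
class BufferLike {
  private locked = false;

  lock(): void {
    this.locked = true;
  }

  unlock(): void {
    this.locked = false;
  }

  isLocked(): boolean {
    return this.locked;
  }
}

/**
 * Lock every buffer, run the snapshot, and always unlock afterwards.
 * The `finally` block runs on a normal return, on an early `null` return,
 * and when `snapshot` throws, so no path can leave a buffer locked.
 */
function snapshotWithLock<T>(buffers: BufferLike[], snapshot: () => T | null): T | null {
  buffers.forEach((buffer) => buffer.lock());
  try {
    return snapshot();
  } finally {
    buffers.forEach((buffer) => buffer.unlock());
  }
}
```

With this shape, an exception thrown by the snapshot still propagates to the caller, but the buffers are unlocked before it does.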