v3: checkpoint and reliability improvements #1198

Merged
nicktrn merged 66 commits into main from v3/checkpoint-reliability
Jul 3, 2024
@nicktrn nicktrn commented Jul 3, 2024

Tasks should now be much more robust and resilient to reconnects during crucial operations and other failure scenarios.

The coordinator now receives dynamic configuration from the webapp, so the checkpoint threshold can be set in one central place. Other settings could use the same mechanism in the future.

Checkpoint thresholds have been unified: in all cases, checkpoints now correctly happen when the delay or wait time is >= the threshold.
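
As a rough illustration of the unified rule, the check reduces to a single inclusive comparison. This is only a sketch: `CheckpointConfig`, `thresholdMs`, and `shouldCheckpoint` are hypothetical names, not the identifiers used in this PR, and the 30 second value is an arbitrary assumption.

```typescript
// Hypothetical shape for the centrally managed setting; not the PR's actual type.
interface CheckpointConfig {
  thresholdMs: number;
}

// Inclusive comparison: a wait exactly equal to the threshold also checkpoints.
function shouldCheckpoint(waitMs: number, config: CheckpointConfig): boolean {
  return waitMs >= config.thresholdMs;
}

// Assumed example value, for illustration only.
const exampleConfig: CheckpointConfig = { thresholdMs: 30_000 };
const checkpointAtThreshold = shouldCheckpoint(30_000, exampleConfig);
const checkpointBelowThreshold = shouldCheckpoint(29_999, exampleConfig);
```

Using one helper for both delays and waits is what keeps the boundary case (wait equal to the threshold) consistent everywhere.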

Task runs now have to signal checkpointable state prior to ALL checkpoints. This ensures flushing always happens.

All important socket.io RPCs are now retried with backoff. Actions that rely on checkpoints are replayed if the run has not been checkpointed and restored as expected, e.g. after a reconnect.
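
Retrying with backoff could look roughly like the sketch below. It is a generic pattern, not the code in this PR: `retryWithBackoff`, `backoffDelayMs`, the base and maximum delays, and the attempt count are all assumptions, and `call` stands in for whatever function performs the RPC and rejects on failure or timeout.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to maxMs (assumed values).
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `call` until it resolves or maxAttempts is reached, sleeping between attempts.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  sleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      // No sleep after the final failed attempt; the error is rethrown below.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

The `sleep` parameter is injectable so the retry loop can be exercised without real delays.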

Other changes:

  • Fix retry check in shared queue
  • Fix env var sync spinner
  • Heartbeat between retries
  • Fix retry prep
  • Fix prod worker no tasks detection
  • Fail runs above MAX_TASK_RUN_ATTEMPTS
  • Additional debug logs in all places
  • Prevent crashes due to failed socket schema parsing
  • Remove core-apps barrel
  • Upgrade socket.io-client to fix an ACK memleak
  • Additional index failure logs
  • Prevent message loss during reconnect
  • Prevent burst of heartbeats on reconnect
  • Prevent crash on failed cleanup
  • Handle at-least-once lazy execute message delivery
  • Handle uncaught entry point exceptions
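
With at-least-once delivery the same message can arrive more than once, so the receiver has to tolerate duplicates. The sketch below shows one common way to do that, deduplicating by message ID. It is an assumption about the general approach, not the implementation in this PR, and `MessageDeduplicator` and `handleOnce` are hypothetical names.

```typescript
// Tracks message IDs that have already been handled (in memory only).
class MessageDeduplicator {
  private readonly seen = new Set<string>();

  // Runs `handler` the first time an ID is seen; returns false for duplicates.
  handleOnce(messageId: string, handler: () => void): boolean {
    if (this.seen.has(messageId)) {
      return false;
    }
    this.seen.add(messageId);
    handler();
    return true;
  }
}

// Usage: the second delivery of the same ID is ignored.
const deduplicator = new MessageDeduplicator();
let executions = 0;
const firstDelivery = deduplicator.handleOnce("run_1", () => { executions += 1; });
const secondDelivery = deduplicator.handleOnce("run_1", () => { executions += 1; });
```

An in-memory set is the simplest form; a long-lived process would need to bound or expire the stored IDs.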

changeset-bot Bot commented Jul 3, 2024

🦋 Changeset detected

Latest commit: 53e13ee

The changes in this PR will be included in the next version bump.


@nicktrn nicktrn merged commit 14c2bdf into main Jul 3, 2024
@nicktrn nicktrn deleted the v3/checkpoint-reliability branch July 8, 2024 11:06