v3: checkpoint and reliability improvements #1198

Merged: 66 commits, Jul 3, 2024

@nicktrn (Collaborator) commented Jul 3, 2024

Tasks should now be much more robust and resilient to reconnects during crucial operations and other failure scenarios.

The coordinator now receives dynamic configuration from the webapp, which means it's possible to set the checkpoint threshold in one central place. This could be used by other settings in the future.

Checkpoint thresholds have been unified: in all cases, checkpoints will now correctly happen if the delay or wait time is >= the threshold.
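To illustrate the unified rule, here is a minimal sketch of an inclusive threshold check. The function name, the millisecond units, and the 30-second value are assumptions made for the example, not the coordinator's actual code or default.

```typescript
// Hypothetical helper: the name and millisecond units are assumptions,
// not taken from the coordinator's implementation.
function shouldCheckpoint(waitMs: number, thresholdMs: number): boolean {
  // Inclusive comparison: a wait exactly equal to the threshold checkpoints.
  return waitMs >= thresholdMs;
}

// Illustrative threshold only; the real value comes from the webapp's config.
const exampleThresholdMs = 30_000;

const atThreshold = shouldCheckpoint(30_000, exampleThresholdMs); // true
const belowThreshold = shouldCheckpoint(29_999, exampleThresholdMs); // false
```

Using one helper for both delays and waits is what keeps the comparison consistent across code paths.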

Task runs now have to signal checkpointable state prior to ALL checkpoints. This ensures flushing always happens.

All important socket.io RPCs will now be retried with backoff. Actions relying on checkpoints will be replayed if we haven't been checkpointed and restored as expected, e.g. after reconnect.
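Retrying with backoff could look roughly like the sketch below. It is a generic exponential-backoff wrapper around any promise-returning call, not the code from this PR; `retryWithBackoff`, `RetryOptions`, and the injectable `sleep` are hypothetical names introduced for illustration.

```typescript
// Hypothetical types and names; this is a generic sketch, not the PR's code.
type SleepFn = (ms: number) => Promise<void>;

const defaultSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first (should be >= 1)
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number; // upper bound on any single delay
  sleep?: SleepFn; // injectable so tests do not have to wait
}

async function retryWithBackoff<T>(
  rpc: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await rpc();
    } catch (error) {
      lastError = error;
      if (attempt === opts.maxAttempts) break;
      // Double the delay after each failure, capped at maxDelayMs.
      const delay = Math.min(
        opts.baseDelayMs * 2 ** (attempt - 1),
        opts.maxDelayMs
      );
      await sleep(delay);
    }
  }

  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

A caller would wrap the acknowledged emit in `rpc`, so a dropped connection surfaces as a rejected promise and triggers the next attempt. Adding jitter to the delay is a common refinement that is left out here for brevity.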

Other changes:

  • Fix retry check in shared queue
  • Fix env var sync spinner
  • Heartbeat between retries
  • Fix retry prep
  • Fix prod worker no tasks detection
  • Fail runs above MAX_TASK_RUN_ATTEMPTS
  • Additional debug logs in all places
  • Prevent crashes due to failed socket schema parsing
  • Remove core-apps barrel
  • Upgrade socket.io-client to fix an ACK memleak
  • Additional index failure logs
  • Prevent message loss during reconnect
  • Prevent burst of heartbeats on reconnect
  • Prevent crash on failed cleanup
  • Handle at-least-once lazy execute message delivery
  • Handle uncaught entry point exceptions
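
The at-least-once delivery item in the list above implies that the same execute message can arrive more than once, so the receiver has to be idempotent. One common way to get there is to deduplicate on a stable key, sketched below. The message shape (`runId`, `attemptNumber`) and the `ExecuteDeduplicator` class are assumptions for illustration and are not taken from this PR.

```typescript
// Hypothetical message shape; the field names are assumptions.
interface ExecuteMessage {
  runId: string;
  attemptNumber: number;
}

class ExecuteDeduplicator {
  // Keys of messages that have already been handled.
  private readonly seen = new Set<string>();

  // Runs `execute` the first time a message is seen and returns true.
  // Redelivered duplicates are skipped and return false.
  handle(
    message: ExecuteMessage,
    execute: (message: ExecuteMessage) => void
  ): boolean {
    const key = `${message.runId}:${message.attemptNumber}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    execute(message);
    return true;
  }
}
```

A long-lived process would also need to evict old keys, which this sketch does not do.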

changeset-bot bot commented Jul 3, 2024

🦋 Changeset detected

Latest commit: 53e13ee

The changes in this PR will be included in the next version bump.

@nicktrn nicktrn merged commit 14c2bdf into main Jul 3, 2024
0 of 2 checks passed
@nicktrn nicktrn deleted the v3/checkpoint-reliability branch July 8, 2024 11:06