v3: checkpoint and reliability improvements #1198

Merged 66 commits, Jul 3, 2024
Commits
7a8852f
only checkpoint retries with delays greater than threshold
nicktrn Jun 14, 2024
55544cc
rename checkpoint threshold env var
nicktrn Jun 14, 2024
319a69f
Merge branch 'main' into v3/fix-retry-checkpoints
nicktrn Jun 18, 2024
8271ed7
log task monitor ignores
nicktrn Jun 18, 2024
3180113
crash runs with unbounded attempts
nicktrn Jun 18, 2024
ccae1e9
fix retry check in shared queue consumer
nicktrn Jun 18, 2024
e3e372e
add missing stop for env var sync spinner
nicktrn Jun 18, 2024
441bfef
prod entry point refactor
nicktrn Jun 18, 2024
eb030bf
missing awaits
nicktrn Jun 18, 2024
ca36ed3
more verbose prod flush and exit logs
nicktrn Jun 18, 2024
eac1a83
reduce checkpoint support logs
nicktrn Jun 18, 2024
a1012a8
heartbeat while checkpointing between retries
nicktrn Jun 19, 2024
186a6c8
dynamic coordinator config
nicktrn Jun 19, 2024
a72d5a2
measure lazy attempt creation time in prod
nicktrn Jun 19, 2024
ea0286a
simplify delay threshold
nicktrn Jun 19, 2024
680e060
Merge branch 'main' into v3/fix-retry-checkpoints
nicktrn Jun 25, 2024
7a04d3f
heartbeat clarifications
nicktrn Jun 25, 2024
b51c86c
crash run if it doesn't reach checkpointable state
nicktrn Jun 25, 2024
341aee4
require dynamic config threshold
nicktrn Jun 25, 2024
caa405a
fix retry prep, await previous worker kill
nicktrn Jun 25, 2024
674108d
unify wait mechanics
nicktrn Jun 26, 2024
85e0d86
fix prod worker without tasks error
nicktrn Jun 26, 2024
6bd2f85
ensure worker is ready to be checkpointed for dependency waits
nicktrn Jun 26, 2024
75ab5a3
improve worker attempt creation logging
nicktrn Jun 26, 2024
affae2c
prevent crashes caused by failed socket schema parsing
nicktrn Jun 26, 2024
9adf788
fix dynamic imports in v3 catalog
nicktrn Jun 26, 2024
37e210a
clarify attempt retry mechanics
nicktrn Jun 26, 2024
e56d31f
move backoff helper to core-apps
nicktrn Jun 26, 2024
eae644f
remove core-apps barrel file
nicktrn Jun 26, 2024
450de54
add backoff execute with callback
nicktrn Jun 27, 2024
16ea380
deprecate non-lazy attempt messages
nicktrn Jun 27, 2024
b2a2d98
update socket.io-client to v4.7.5
nicktrn Jun 27, 2024
7f82b33
fix socket.io types for emits with timeout
nicktrn Jun 27, 2024
f9ad254
retry all the things
nicktrn Jun 27, 2024
8f9c6ce
remove todo
nicktrn Jun 27, 2024
2143a9f
fix retry restores
nicktrn Jun 27, 2024
a1eaf3b
improve index failure logs
nicktrn Jun 27, 2024
e122aaf
retry incomplete dependency waits
nicktrn Jun 27, 2024
c8136f1
fix checkpoint in-progress detection
nicktrn Jun 27, 2024
b6a31e3
prevent losing messages during reconnect
nicktrn Jun 28, 2024
df382a4
checkpoint when greater or equal to threshold
nicktrn Jun 28, 2024
d81a023
improve handling of duration wait edge cases
nicktrn Jun 28, 2024
a2fd940
add ready for lazy attempt replay
nicktrn Jun 28, 2024
4aa5eaa
retry attempt completion
nicktrn Jun 28, 2024
272ee35
allow failing runs with unfriendly run id
nicktrn Jun 28, 2024
6b22b2f
fix min max jitter
nicktrn Jun 28, 2024
964c56c
cancel checkpoints on run failure
nicktrn Jun 28, 2024
a9b82f9
improve attempt creation errors
nicktrn Jun 28, 2024
c6d2087
prevent crashing run on failed cleanup
nicktrn Jun 28, 2024
f856272
handle at-least-once execute lazy attempt delivery
nicktrn Jun 28, 2024
f79364d
Merge branch 'main' into v3/checkpoint-reliability
nicktrn Jul 1, 2024
e38c8c0
log exit code on prepare for retry
nicktrn Jul 1, 2024
147d7cf
fix timeout promise
nicktrn Jul 1, 2024
dcb00df
Merge branch 'main' into v3/checkpoint-reliability
nicktrn Jul 1, 2024
22902df
mark some things
nicktrn Jul 1, 2024
0fca83e
chaos monkey superpowers
nicktrn Jul 2, 2024
c31f4c7
refactor checkpointer
nicktrn Jul 2, 2024
b4632ca
set chaos monkey defaults
nicktrn Jul 2, 2024
402bd4b
less chaos
nicktrn Jul 3, 2024
4d3b973
fix backoff
nicktrn Jul 3, 2024
7d2e5a4
handle uncaught entry point exceptions
nicktrn Jul 3, 2024
79f2c8b
only replay rpcs on true reconnects
nicktrn Jul 3, 2024
97875b9
allow resume unless final run status
nicktrn Jul 3, 2024
33c0396
Merge branch 'main' into v3/checkpoint-reliability
nicktrn Jul 3, 2024
ea20f88
add changeset
nicktrn Jul 3, 2024
53e13ee
small fixes
nicktrn Jul 3, 2024
30 changes: 30 additions & 0 deletions .changeset/mighty-eggs-grab.md
@@ -0,0 +1,30 @@
---
"@trigger.dev/core-apps": patch
"trigger.dev": patch
"@trigger.dev/core": patch
---

Tasks should now be much more robust and resilient to reconnects during crucial operations and other failure scenarios.

Task runs now have to signal checkpointable state prior to ALL checkpoints. This ensures flushing always happens.

All important socket.io RPCs will now be retried with backoff. Actions relying on checkpoints will be replayed if we haven't been checkpointed and restored as expected, e.g. after reconnect.

Other changes:

- Fix retry check in shared queue
- Fix env var sync spinner
- Heartbeat between retries
- Fix retry prep
- Fix prod worker no tasks detection
- Fail runs above `MAX_TASK_RUN_ATTEMPTS`
- Additional debug logs in all places
- Prevent crashes due to failed socket schema parsing
- Remove core-apps barrel
- Upgrade socket.io-client to fix an ACK memleak
- Additional index failure logs
- Prevent message loss during reconnect
- Prevent burst of heartbeats on reconnect
- Prevent crash on failed cleanup
- Handle at-least-once lazy execute message delivery
- Handle uncaught entry point exceptions
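
The "retried with backoff" behaviour described in the changeset can be illustrated with a minimal sketch. This is not the helper the PR moved into core-apps: the names `backoffDelay` and `retryWithBackoff`, the option names, and the choice of full-jitter exponential backoff are all assumptions made for illustration.

```typescript
// Hypothetical options; none of these names come from the PR.
type BackoffOptions = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Injectable so tests can be deterministic; defaults to Math.random.
  random?: () => number;
  // Injectable so tests do not have to wait; defaults to a setTimeout-based sleep.
  sleep?: (ms: number) => Promise<void>;
};

// Full-jitter exponential backoff: a delay drawn from [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  opts: { baseDelayMs: number; maxDelayMs: number; random?: () => number }
): number {
  const cap = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  const random = opts.random ?? Math.random;
  return Math.floor(random() * cap);
}

// Calls fn until it resolves or maxAttempts is reached, sleeping between failures.
async function retryWithBackoff<T>(
  fn: (attempt: number) => Promise<T>,
  opts: BackoffOptions
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;

  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }

  throw lastError;
}
```

In a setup like the one the changeset describes, `fn` would wrap a single socket.io emit that waits for an acknowledgement. Capping the delay keeps late attempts from waiting unboundedly, and the jitter spreads out retries from many workers that reconnect at the same moment.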
2 changes: 1 addition & 1 deletion .env.example
@@ -71,7 +71,7 @@ COORDINATOR_SECRET=coordinator-secret # generate the actual secret with `openssl
# OBJECT_STORE_BASE_URL="https://{bucket}.{accountId}.r2.cloudflarestorage.com"
# OBJECT_STORE_ACCESS_KEY_ID=
# OBJECT_STORE_SECRET_ACCESS_KEY=
-# RUNTIME_WAIT_THRESHOLD_IN_MS=10000
+# CHECKPOINT_THRESHOLD_IN_MS=10000

# These control the server-side internal telemetry
# INTERNAL_OTEL_TRACE_EXPORTER_URL=<URL to send traces to>
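
For context on how a value like `CHECKPOINT_THRESHOLD_IN_MS` might be used, here is a minimal sketch based only on the commit messages above ("checkpoint when greater or equal to threshold"). The function names are hypothetical, and treating the example value of 10000 as a fallback is an assumption, not something the diff confirms.

```typescript
// Hypothetical helper: parse the raw env value, falling back when it is missing or invalid.
function parseThresholdMs(raw: string | undefined, fallbackMs: number): number {
  if (raw === undefined || raw.trim() === "") {
    return fallbackMs;
  }
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallbackMs;
}

// Waits at or above the threshold are worth a checkpoint; shorter waits stay in-process.
function shouldCheckpoint(delayMs: number, thresholdMs: number): boolean {
  return delayMs >= thresholdMs;
}
```

A caller would pass `process.env.CHECKPOINT_THRESHOLD_IN_MS` as `raw`. The comparison is inclusive, so a 10000 ms wait against a 10000 ms threshold results in a checkpoint.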
3 changes: 1 addition & 2 deletions apps/coordinator/package.json
@@ -21,8 +21,7 @@
"execa": "^8.0.1",
"nanoid": "^5.0.6",
"prom-client": "^15.1.0",
-"socket.io": "4.7.4",
-"socket.io-client": "4.7.4"
+"socket.io": "4.7.4"
},
"devDependencies": {
"@types/node": "^18",
95 changes: 95 additions & 0 deletions apps/coordinator/src/chaosMonkey.ts
@@ -0,0 +1,95 @@
import type { Execa$ } from "execa";
import { setTimeout as timeout } from "node:timers/promises";

class ChaosMonkeyError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ChaosMonkeyError";
  }
}

export class ChaosMonkey {
  private chaosEventRate = 0.2;
  private delayInSeconds = 45;

  constructor(private enabled = false) {
    if (this.enabled) {
      console.log("🍌 Chaos monkey enabled");
    }
  }

  static Error = ChaosMonkeyError;

  enable() {
    this.enabled = true;
    console.log("🍌 Chaos monkey enabled");
  }

  disable() {
    this.enabled = false;
    console.log("🍌 Chaos monkey disabled");
  }

  async call({
    $,
    throwErrors = true,
    addDelays = true,
  }: {
    $?: Execa$<string>;
    throwErrors?: boolean;
    addDelays?: boolean;
  } = {}) {
    if (!this.enabled) {
      return;
    }

    const random = Math.random();

    if (random > this.chaosEventRate) {
      // Don't interfere with normal operation
      return;
    }

    const chaosEvents: Array<() => Promise<any>> = [];

    if (addDelays) {
      chaosEvents.push(async () => {
        console.log("🍌 Chaos monkey: Add delay");

        if ($) {
          await $`sleep ${this.delayInSeconds}`;
        } else {
          await timeout(this.delayInSeconds * 1000);
        }
      });
    }

    if (throwErrors) {
      chaosEvents.push(async () => {
        console.log("🍌 Chaos monkey: Throw error");

        if ($) {
          await $`false`;
        } else {
          throw new ChaosMonkey.Error("🍌 Chaos monkey: Throw error");
        }
      });
    }

    if (chaosEvents.length === 0) {
      console.error("🍌 Chaos monkey: No events selected");
      return;
    }

    const randomIndex = Math.floor(Math.random() * chaosEvents.length);

    const chaosEvent = chaosEvents[randomIndex];

    if (!chaosEvent) {
      console.error("🍌 Chaos monkey: No event found");
      return;
    }

    await chaosEvent();
  }
}
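
To show how a fault injector of this kind could guard a critical operation, here is a simplified, self-contained sketch. `MiniChaos`, `InjectedFault` and `createCheckpoint` are hypothetical stand-ins and are not part of the PR. The injectable `random` function is also an assumption, added so the behaviour can be tested deterministically.

```typescript
class InjectedFault extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InjectedFault";
  }
}

// Simplified stand-in for the class above: it only throws, it never delays.
class MiniChaos {
  private enabled: boolean;
  private rate: number;
  private random: () => number;

  constructor(enabled: boolean, rate: number, random: () => number = Math.random) {
    this.enabled = enabled;
    this.rate = rate;
    this.random = random;
  }

  call(): void {
    if (!this.enabled) {
      return;
    }
    // Same comparison as the original: draws above the rate leave the operation alone.
    if (this.random() > this.rate) {
      return;
    }
    throw new InjectedFault("injected fault");
  }
}

// Hypothetical critical operation that tolerates injected faults.
function createCheckpoint(chaos: MiniChaos): "created" | "failed" {
  try {
    chaos.call();
    return "created";
  } catch (error) {
    // Only swallow injected faults; anything else is a real bug and is rethrown.
    if (error instanceof Error && error.name === "InjectedFault") {
      return "failed";
    }
    throw error;
  }
}
```

Checking the error's `name` lets the caller tell injected failures apart from genuine ones, which is presumably why the original exposes its error class as `ChaosMonkey.Error`.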