[core] Extend flow route duration to "max" and fail runs where replay takes too long #1567

VaguelySerious wants to merge 3 commits into main
Conversation
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
🦋 Changeset detected. Latest commit: daba9ae. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results: ❌ some tests failed

Failed tests: 🌍 Community Worlds (60 failed): mongodb (3 failed), redis (2 failed), turso (55 failed)

Details by category:
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
📊 Benchmark Results (💻 Local Development; result tables not captured)

- workflow with no steps
- workflow with 1 / 10 / 25 / 50 sequential steps
- Promise.all with 10 / 25 / 50 concurrent steps
- Promise.race with 10 / 25 / 50 concurrent steps
- workflow with 10 / 25 / 50 sequential data payload steps (10KB)
- workflow with 10 / 25 / 50 concurrent data payload steps (10KB)
- Stream benchmarks (includes TTFB metrics): workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)

Summary: "Fastest Framework by World" and "Fastest World by Framework", winner determined by most benchmark wins.

❌ Some benchmark jobs failed. Check the workflow run for details.
Deployment failed with the following error:
TooTallNate left a comment:
Good idea — giving the runtime a way to fail a run gracefully before the platform hard-kills the function is a clear reliability win, and bumping the flow route to maxDuration: 'max' is the right complement. A couple of issues to address before merging.
Summary:
- Blocking: Stale comment references wrong maxDuration value
- Blocking: Changeset says "300s" but the constant is 240s
- Non-blocking: process.exit(1) is a first in the runtime path — consider unref() as a defensive measure
- Non-blocking: On Hobby plan (60s max), the 240s timeout never fires — the platform kills the function first with no run_failed event recorded. This is probably acceptable (VQS retries will eventually hit MAX_DELIVERIES_EXCEEDED), but worth documenting.
> // --- Replay timeout guard ---
> // If the replay takes longer than the timeout, fail the run and exit.
> // This must be lower than the function's maxDuration (180s) to ensure
Blocking: This comment says maxDuration (180s) but the flow route is now maxDuration: 'max' (this very PR changes it). The comment should be updated to reflect reality — something like:
> // This must be lower than the function's maxDuration to ensure
> // the failure is recorded before the platform kills the function.
> // With maxDuration: 'max', the platform limit depends on the plan
> // (e.g. 300s on Pro). 240s leaves at least 60s of headroom on Pro.
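For context, the flow route change this review refers to presumably boils down to a Next.js route segment config like the sketch below (the file path is illustrative, not the actual file; per the PR title, the platform accepts the string 'max' here):

```typescript
// app/api/flow/route.ts (illustrative path, not the actual file)
// The platform resolves 'max' to the plan's maximum Fluid compute
// duration, e.g. 300s on Pro but only 60s on Hobby, which is why the
// 240s replay timeout can never fire on Hobby.
export const maxDuration = 'max';
```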
> ---
> "@workflow/next": patch
> ---
>
> Increase flow route limit to max fluid duration and fail run if a single replay exceeds 300s
Blocking: The changeset says "exceeds 300s" but REPLAY_TIMEOUT_MS is 240_000 (240s). Should be:
> Increase flow route limit to max fluid duration and fail run if a single replay exceeds 240s
>     // Best effort — process exits regardless
>   }
>   process.exit(1);
> }, REPLAY_TIMEOUT_MS);
Non-blocking: This is the first use of process.exit(1) in the runtime (as opposed to CLI code). It's acceptable here since the timeout is a last resort, but consider adding replayTimeout.unref() after this line (or on line 196 itself). An unref'd timer won't keep the Node.js event loop alive, so if the normal workflow path completes but the .finally(() => clearTimeout(...)) somehow doesn't execute (e.g., an uncaught exception in the promise chain), the process can still exit naturally instead of hanging for 240s.
This is purely defensive — the .finally() should always run in practice — but it's cheap and eliminates a class of potential hangs.
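As a sketch of the suggested pattern (names here are illustrative, not the actual runtime code): an unref'd last-resort timer alongside the normal clearTimeout cleanup.

```typescript
// Illustrative sketch of the replay-timeout guard discussed above; the
// real runtime wiring (event posting, process.exit) differs.
const REPLAY_TIMEOUT_MS = 240_000;

function guardReplay<T>(
  replay: Promise<T>,
  onTimeout: () => void, // e.g. post run_failed, then process.exit(1)
  timeoutMs: number = REPLAY_TIMEOUT_MS,
): Promise<T> {
  const timer = setTimeout(onTimeout, timeoutMs);
  // Defensive: an unref'd timer never keeps the event loop alive on its
  // own, so a process whose real work has finished can still exit
  // naturally even if the finally() below is somehow skipped.
  timer.unref();
  // Normal cleanup path: cancel the timer once the replay settles.
  return replay.finally(() => clearTimeout(timer));
}
```

Note that unref() only affects whether the timer keeps the process alive; it does not cancel the timer, so the clearTimeout in finally() is still required.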
> // This must be lower than the function's maxDuration to ensure the
> // timeout handler has time to post the run_failed event before the platform
> // kills the function.
> export const REPLAY_TIMEOUT_MS = 240_000;
Non-blocking: Worth noting that on the Hobby plan, maxDuration: 'max' resolves to 60s, so the 240s timeout will never fire — the platform hard-kills the function first, leaving no run_failed event. The run will eventually fail via MAX_QUEUE_DELIVERIES (48 retries), but there won't be a clear REPLAY_TIMEOUT error code.
This is probably acceptable — Hobby plan workflows are expected to be short-lived — but a short comment here explaining the plan-dependent behavior would help future readers. For example:
> // Note: On plans where maxDuration < REPLAY_TIMEOUT_MS (e.g., Hobby at 60s),
> // the platform will kill the function before this fires. In that case, VQS
> // retries handle the failure via MAX_QUEUE_DELIVERIES.