[core] [world] Lazy run creation on start#1537
Conversation
🦋 Changeset detectedLatest commit: 7b51efd The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests💻 Local Development (1 failed)nitro-stable (1 failed):
📦 Local Production (3 failed)nitro-stable (1 failed):
nuxt-stable (1 failed):
sveltekit-stable (1 failed):
🌍 Community Worlds (63 failed)mongodb (4 failed):
redis (3 failed):
turso (56 failed):
Details by Category✅ ▲ Vercel Production
❌ 💻 Local Development
❌ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
…creation Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
| // Track not-found retries separately: when run_created fails and the | ||
| // resilient start path hasn't created the run yet, runs.get throws | ||
| // WorkflowRunNotFoundError. We retry up to 3 times with back-off | ||
| // (1s, 3s, 6s = 10s total) to give the queue time to deliver. |
There was a problem hiding this comment.
This isn't optimal. If just the initial call fails but the start ends up succeeding within seconds, it's nice that the user doesn't notice, but in other cases, when you really have the wrong run ID, your calls shouldn't be waiting for multiple seconds. What's the right UX here?
| ### Base64 encoding for queue transport | ||
|
|
||
| `Uint8Array` values (the serialized workflow input) don't survive JSON serialization | ||
| through the queue — they get corrupted to `{0: 72, 1: 101, ...}` objects. The `runInput` |
There was a problem hiding this comment.
This is a hack, trying to remove this right now by devaluing binary, but might not be the right approach. WDYT?
TooTallNate
left a comment
There was a problem hiding this comment.
This is an ambitious PR — resilient start, TTFB optimization via preloaded events, and the parallel start() flow are all solid ideas. The design doc in resilient-start.mdx is thorough and well-structured. However, there are several correctness and policy issues that need addressing before this is mergeable.
Summary
Blocking:
- Changeset uses
minorfor@workflow/core— must bepatchper repo policy - Postgres preloaded events missing
eventData ||= eventDataJsonlegacy fallback - Base64 encode/decode is a brittle hack — should use a binary-native queue transport (CBOR) instead
- Base64 decode in runtime.ts is overly broad — will corrupt non-binary string inputs
- Postgres resilient start: run + run_created event not in a transaction
EntityConflictErrorsilently dropped in catch block — behavior change needs validation
Non-blocking:
- World-local filesystem race condition on concurrent resilient starts (acceptable for dev-only backend)
- Already-running path returns no preloaded events (by design, documented in design doc)
- Duplicate
startedAtcheck after try/catch block - PR has open TODOs in the description (base64 hack,
await getRunUX for 404s)
| "@workflow/world-vercel": patch | ||
| "@workflow/world-local": patch | ||
| "@workflow/world": patch | ||
| "@workflow/core": minor |
There was a problem hiding this comment.
Blocking: AGENTS.md states: "All changes should be marked as 'patch'. Never use 'major' or 'minor' modes."
This should be:
"@workflow/core": patch
| .from(Schema.events) | ||
| .where(eq(Schema.events.runId, effectiveRunId)) | ||
| .orderBy(Schema.events.eventId); | ||
| allEvents = eventRows.map((e) => EventSchema.parse(compact(e))); |
There was a problem hiding this comment.
Blocking: Missing eventData ||= eventDataJson legacy fallback — same bug as in the extracted PR #1569.
Every other event-reading path in this file applies this fallback (events.get, events.list, events.listByCorrelationId). Without it, legacy events stored only in the payload (jsonb) column will have eventData: undefined when preloaded, causing silent data loss during replay.
| allEvents = eventRows.map((e) => EventSchema.parse(compact(e))); | |
| allEvents = eventRows.map((e) => { | |
| e.eventData ||= e.eventDataJson; | |
| return EventSchema.parse(compact(e)); | |
| }); |
| const encodedInput = | ||
| workflowArguments instanceof Uint8Array | ||
| ? btoa(String.fromCharCode(...workflowArguments)) | ||
| : workflowArguments; |
There was a problem hiding this comment.
Blocking: The base64 encoding/decoding approach is a known hack (flagged in the PR TODOs and the author's own comment on the design doc). Rather than shipping this and fixing it later, VQS supports configurable transports — we should implement a binary-native transport (e.g. CBOR) so Uint8Array values survive serialization without the base64 tax.
Beyond the performance overhead, the current approach has a correctness issue: the decode side in runtime.ts uses typeof runInput.input === 'string' as a heuristic for "this was base64-encoded binary." But if dehydrateWorkflowArguments ever returns a plain string (e.g. v1Compat path, or a future serialization format), it will be incorrectly decoded as base64, producing garbage binary data. The encode/decode contract is implicit and fragile.
If a binary transport is not feasible in this PR's timeline, at minimum add a discriminant field (e.g. inputEncoding: 'base64' | 'raw') to RunInputSchema so the decode side can distinguish encoded binary from a plain string value.
| ? { | ||
| eventData: { | ||
| input: | ||
| typeof runInput.input === 'string' |
There was a problem hiding this comment.
Blocking: This decodes ANY string as base64 binary. If runInput.input is a plain string value (not base64-encoded binary), atob() will either throw or produce garbage.
The encode side in start.ts only base64-encodes when workflowArguments instanceof Uint8Array, but this decode side has no way to distinguish "base64-encoded Uint8Array" from "a string that was always a string." The typeof === 'string' check is not a reliable discriminant.
See my comment on start.ts for the recommended fix (binary transport or discriminant field).
| executionContext: runInputData.executionContext, | ||
| }, | ||
| specVersion: effectiveSpecVersion, | ||
| }); |
There was a problem hiding this comment.
Blocking: The run insert (L411, onConflictDoNothing) and this event insert are not in a transaction. If the run insert succeeds but the event insert fails (transient DB error), you'll have a run entity with no run_created event in the log — breaking the event-sourcing invariant.
Since this is the resilient start fallback (already a degraded path), a missing run_created event could cause subtle issues downstream (e.g., observability gaps, replay inconsistencies). These two operations should be wrapped in a transaction:
await drizzle.transaction(async (tx) => {
const [createdRun] = await tx.insert(Schema.runs)...
if (createdRun) {
await tx.insert(events)...
currentRun = { ... };
}
});| `Workflow run "${runId}" has no "startedAt" timestamp` | ||
| ); | ||
| } | ||
| } catch (err) { |
There was a problem hiding this comment.
Blocking (behavior change): The old code caught both EntityConflictError and RunExpiredError here. The new code only catches RunExpiredError. This means if events.create('run_started') throws EntityConflictError (e.g., duplicate eventId from a concurrent request), it will now propagate to the queue handler and cause a retry — previously it was silently consumed.
Is this intentional? The design doc says already-running returns { run } without throwing, but EntityConflictError can come from other sources (e.g., DB unique constraint on the event ID). If intentional, add a comment explaining why EntityConflictError is no longer expected here. If not, it should be re-added.
| throw err; | ||
| } | ||
|
|
||
| if (!workflowRun.startedAt) { |
There was a problem hiding this comment.
Non-blocking: This workflowRun.startedAt check is duplicated — it already exists at line 230 inside the try block. Since workflowRun doesn't change between the two checks, this second one is unreachable (the first would have already thrown WorkflowRuntimeError, caught by the else if (err instanceof WorkflowRuntimeError) branch). You can remove this one.
| // Track not-found retries separately: when run_created fails and the | ||
| // resilient start path hasn't created the run yet, runs.get throws | ||
| // WorkflowRunNotFoundError. We retry up to 3 times with back-off | ||
| // (1s, 3s, 6s = 10s total) to give the queue time to deliver. |
There was a problem hiding this comment.
Non-blocking: The author's own comment here flags the UX concern — this is listed as an open TODO in the PR description. The 10s total wait (1s + 3s + 6s) is reasonable for the resilient start path, but for the "wrong run ID" case, the user pays 10s of latency before getting an error.
One option: start() could set a flag on the Run instance (e.g. this.resilientStart = true) when run_created failed, and only retry on WorkflowRunNotFoundError when that flag is set. That way, runs that were created normally fail fast on 404.
| createdAt: now, | ||
| updatedAt: now, | ||
| }; | ||
| await writeJSON( |
There was a problem hiding this comment.
Non-blocking: If two concurrent run_started calls arrive for the same non-existent run, both will see currentRun === null and both will call writeJSON here. Depending on writeJSON's semantics, the second write may overwrite the first or throw. Since world-local is for local development only, this is acceptable — just noting the race condition for awareness.
This PR:
startresilient: ensures that as long as the queue is up, the world storage layer being down will not affect run creations, only defer themrun_created, and postingrun_startedcalls into oneSee the live docs
TODO:
await getRunUX for 404s