Conversation
🦋 Changeset detectedLatest commit: 89828d1 The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
TL;DR — Fixes three independent bugs in durable workflow mode: user-scoped MCP credentials were hardcoded to Key changes
Summary | 6 files | 2 commits | base: User-scoped credentials in durable workflow steps
The
Resilient MCP tool loading with
|
…loading Three fixes for tool approval and MCP tool loading in durable execution mode: 1. Propagate userId and credentialStoreRegistry to durable workflow steps. The durable buildAgentForStep was hardcoding userId: undefined and credentialStoreRegistry: undefined, which prevented user-scoped MCP tools (e.g. Notion with OAuth) from resolving credentials. Classic mode worked because generateTaskHandler passed both values. Now both AgentExecutionPayload and AgentExecutionStepPayload carry userId, and buildAgentForStep instantiates a CredentialStoreRegistry for credential resolution. 2. Use Promise.allSettled for MCP tool loading so a single MCP server failure (auth error, timeout) no longer crashes the entire agent. Failed tools are logged and skipped; the agent continues with the remaining tools. 3. Write error to stream on generation failure in callLlmStep. Previously, when agent.generate() threw (e.g. MCP connection failure), the durable stream received zero bytes and the client hung indefinitely. Now the catch block writes an error operation to the stream before rethrowing.
cce2511 to
0359e36
Compare
There was a problem hiding this comment.
Clean, well-scoped fix for three distinct durable-mode bugs. All three changes are correct and the error handling is properly structured.
userId + credentialStoreRegistry propagation — The root cause (hardcoded undefined values in buildAgentForStep) is fixed at all three injection points: getMcpToolById, new Agent(...), and buildTransferRelationConfig. The userId extraction via getUserIdFromContext at the route level and its threading through the payload types is consistent.
Promise.allSettled resilience — Straightforward and correct. The error logging includes tool name/id for debuggability. The VercelDataStreamHelper.complete() is idempotent via isCompleted guard; the SSEStreamHelper.complete() is not, but the catch block already wraps it safely.
Error-to-stream on generation failure — The isTerminal flag is correctly set only when the stream write succeeds, so the finally block picks the right cleanup path (close vs releaseLock). The re-throw preserves the original error for the durable workflow's markWorkflowFailedStep handler.
Claude Opus | 𝕏
There was a problem hiding this comment.
PR Review Summary
(4) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
🟠 1) mcp-tools.ts:27-44 Silent partial degradation without user notification
Issue: When MCP servers fail to connect or load tools (e.g., OAuth expired, server unreachable), the user receives no indication that their configured tools are unavailable. The agent continues with a reduced tool set, and the user may wonder why certain capabilities are missing.
Why: This is particularly problematic for user-scoped OAuth tools (like Notion) where the failure is likely due to credential issues the user could fix. The error is logged server-side so operators can debug, but the end user has no visibility into why their tools aren't working. The codebase already has an established pattern for this in AgentMcpManager.ts via reportEmptyToolSet which emits both a span and calls agentSessionManager.recordEvent to surface issues to users.
Fix: After logging the failure, emit a user-facing event similar to the reportEmptyToolSet pattern. This would require passing streamRequestId into getMcpTools or accessing it via context. Example approach:
// After the logger.error call
if (streamRequestId) {
agentSessionManager.recordEvent(streamRequestId, 'warning', ctx.config.id, {
message: `MCP tool "${mcpTools[i].name}" failed to load and is temporarily unavailable`,
code: 'mcp_tool_load_failed',
severity: 'warning',
context: { toolName: mcpTools[i].name, toolId: mcpTools[i].id },
});
}Refs:
Inline Comments:
- 🟠 Major:
mcp-tools.ts:35ERROR log level may cause alert fatigue
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
mcp-tools.ts:36-40Missing structured error context for incident correlation
💭 Consider (1) 💭
💭 1) agentExecutionSteps.ts:87 CredentialStoreRegistry created per-step
Issue: A new CredentialStoreRegistry(createDefaultCredentialStores()) is created inside buildAgentForStep, which is called for every callLlmStep and executeToolStep invocation.
Why: In contrast, the non-durable path creates the registry once at app startup via factory.ts and passes it through middleware context. While this correctly addresses the root cause (credentials were undefined), creating a new registry per step may have minor performance overhead if createDefaultCredentialStores() performs any initialization work.
Fix: This is acceptable for the fix and likely not a real performance concern for durable workflows. If it becomes an issue, consider passing the registry through the workflow payload similar to how userId is now passed.
Refs:
💡 APPROVE WITH SUGGESTIONS
Summary: This PR correctly fixes the root cause of user-scoped MCP credentials not working in durable workflow mode — the credentialStoreRegistry and userId were hardcoded to undefined. The Promise.allSettled change for resilient tool loading is a solid improvement. The main suggestion is to add user-facing feedback when tools fail to load (following the existing reportEmptyToolSet pattern) and to consider using WARN level instead of ERROR for expected operational conditions.
The changeset-bot flagged that no changeset was included — this looks like a bug fix that should have a patch changeset for @inkeep/agents-api.
Discarded (2)
| Location | Issue | Reason Discarded |
|---|---|---|
| PR description | Claims "Write error to stream on callLlmStep generation failure" | The diff shows no changes to error handling in callLlmStep — this code already existed. The PR description overstates the scope. This is not a code issue, just documentation inaccuracy. |
agentExecutionSteps.ts |
Stream error writing | Pre-existing code not changed in this PR. |
Reviewers (4)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
pr-review-errors |
1 | 1 | 0 | 0 | 0 | 0 | 0 |
pr-review-precision |
3 | 0 | 1 | 0 | 0 | 0 | 2 |
pr-review-sre |
3 | 0 | 0 | 0 | 2 | 0 | 0 |
| Total | 7 | 1 | 1 | 0 | 2 | 0 | 2 |
Note: The SRE and Errors reviewers identified overlapping issues (silent degradation, log level). Merged into single findings.
Preview URLsUse these stable preview aliases for testing this PR:
These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find. Raw Vercel deployment URLs
|
|
Re: per-step CredentialStoreRegistry — acknowledged. The stores are stateless HTTP clients (Nango, Composio, keychain), so instantiating fresh is safe. Durable steps can't share in-memory singletons across step boundaries. If stores ever gain caching/pooling, a module-level lazy singleton in step.cjs would avoid repeated instantiation. |
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Medium
🟠⚠️ Major (0) 🟠⚠️
No new Major findings in the delta.
🕐 Pending Recommendations (1)
- 🟠
mcp-tools.ts:27-44Silent partial degradation without user notification — when MCP tools fail to load, the user has no visibility; consider emitting a user-facing event viaagentSessionManager.recordEvent
💡 APPROVE WITH SUGGESTIONS
Summary: The delta commit successfully addresses both inline review findings from the prior review — log level downgraded to warn and structured context IDs added. The changeset is also properly included now. 🎉
The one remaining suggestion from the prior review (silent degradation notification) is still worth considering for future UX improvement, but isn't blocking — the current behavior (log + skip) is functional and the PR's core fixes (user-scoped credentials, resilient tool loading) are solid.
Discarded (0)
No items discarded.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator |
0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Note: Delta review — sub-reviewers not dispatched since the changes were strictly review-feedback-only (log level + context + changeset).

Summary
userIdandcredentialStoreRegistryto durable workflow steps so user-scoped MCP tools (e.g. Notion with OAuth) resolve credentials correctly — previously hardcoded toundefined, causing durable mode to silently lose all authenticated MCP toolsPromise.allSettledfor MCP tool loading so a single server failure no longer crashes the entire agent — failed tools are logged and skippedcallLlmStepgeneration failure — previously the client received zero bytes and hung indefinitely