Fix durable workflow: user-scoped MCP credentials and resilient tool loading by anubra266 · Pull Request #2989 · inkeep/agents

anubra266 · 2026-04-02T22:55:23Z

Summary

Propagate userId and credentialStoreRegistry to durable workflow steps so user-scoped MCP tools (e.g. Notion with OAuth) resolve credentials correctly — previously hardcoded to undefined, causing durable mode to silently lose all authenticated MCP tools
Use Promise.allSettled for MCP tool loading so a single server failure no longer crashes the entire agent — failed tools are logged and skipped
Write error to stream on callLlmStep generation failure — previously the client received zero bytes and hung indefinitely

changeset-bot · 2026-04-02T22:55:29Z

🦋 Changeset detected

Latest commit: 89828d1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages

Name	Type
@inkeep/agents-api	Patch
@inkeep/agents-manage-ui	Patch
@inkeep/agents-cli	Patch
@inkeep/agents-core	Patch
@inkeep/agents-email	Patch
@inkeep/agents-mcp	Patch
@inkeep/agents-sdk	Patch
@inkeep/agents-work-apps	Patch
@inkeep/ai-sdk-provider	Patch
@inkeep/create-agents	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel · 2026-04-02T22:55:29Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agents-api	Ready	Preview, Comment	Apr 2, 2026 11:11pm
agents-docs	Ready	Preview, Comment	Apr 2, 2026 11:11pm
agents-manage-ui	Ready	Preview, Comment	Apr 2, 2026 11:11pm

pullfrog · 2026-04-02T22:56:57Z

TL;DR — Fixes three independent bugs in durable workflow mode: user-scoped MCP credentials were hardcoded to undefined (breaking all authenticated MCP tools like Notion OAuth), a single MCP server failure crashed the entire agent, and LLM generation errors caused the client to hang with zero bytes instead of receiving an error.

Key changes

Propagate userId and credentialStoreRegistry through durable workflow steps — durable mode now resolves user-scoped MCP credentials the same way classic mode does, fixing authenticated tools (e.g. Notion with OAuth)
Switch MCP tool loading to Promise.allSettled — a single server failure is logged and skipped instead of taking down every tool in the agent
Write error to stream on callLlmStep generation failure — clients now receive an error operation instead of hanging indefinitely on a dead stream

_{Summary ｜ 6 files ｜ 2 commits ｜ base: main ← fix/durable-tool-approval-and-mcp-resilience}

User-scoped credentials in durable workflow steps

Before: credentialStoreRegistry and userId were hardcoded to undefined inside buildAgentForStep, so durable mode could never resolve OAuth tokens for user-scoped MCP tools.
After: Both values are extracted from the authenticated session context, threaded through the workflow payload, and passed into every credential lookup call.

The userId is pulled via getUserIdFromContext in both the chat.ts and chatDataStream.ts route handlers, added to AgentExecutionPayload, and forwarded into buildAgentForStep. A new CredentialStoreRegistry is instantiated per step so MCP tool resolution can locate the correct OAuth credentials.

agentExecutionSteps.ts · chat.ts · chatDataStream.ts · agentExecution.ts

Resilient MCP tool loading with `Promise.allSettled`

Before: Promise.all on MCP tool sets meant one failing server (e.g. expired OAuth token) rejected the entire batch — the agent started with zero tools.
After: Promise.allSettled catches individual failures, logs them at warn level with tool name, ID, and context identifiers (tenantId, projectId, agentId), and continues with the tools that loaded successfully.

mcp-tools.ts

Stream error on `callLlmStep` generation failure

Before: If the LLM generation threw inside callLlmStep, the error propagated without writing anything to the SSE stream — the client hung waiting for data that never arrived.
After: A catch block writes an errorOp to the stream and calls complete() before re-throwing, so the client always receives a termination signal.

What happens if the stream write itself fails?
A nested try/catch logs a warning and lets the original error propagate. The stream is cleaned up in the existing finally block regardless.

agentExecutionSteps.ts

^{｜ View workflow run ｜ Triggered by Pullfrog ｜ Using Claude Opus ｜ 𝕏}

…loading Three fixes for tool approval and MCP tool loading in durable execution mode: 1. Propagate userId and credentialStoreRegistry to durable workflow steps. The durable buildAgentForStep was hardcoding userId: undefined and credentialStoreRegistry: undefined, which prevented user-scoped MCP tools (e.g. Notion with OAuth) from resolving credentials. Classic mode worked because generateTaskHandler passed both values. Now both AgentExecutionPayload and AgentExecutionStepPayload carry userId, and buildAgentForStep instantiates a CredentialStoreRegistry for credential resolution. 2. Use Promise.allSettled for MCP tool loading so a single MCP server failure (auth error, timeout) no longer crashes the entire agent. Failed tools are logged and skipped; the agent continues with the remaining tools. 3. Write error to stream on generation failure in callLlmStep. Previously, when agent.generate() threw (e.g. MCP connection failure), the durable stream received zero bytes and the client hung indefinitely. Now the catch block writes an error operation to the stream before rethrowing.

pullfrog

Clean, well-scoped fix for three distinct durable-mode bugs. All three changes are correct and the error handling is properly structured.

userId + credentialStoreRegistry propagation — The root cause (hardcoded undefined values in buildAgentForStep) is fixed at all three injection points: getMcpToolById, new Agent(...), and buildTransferRelationConfig. The userId extraction via getUserIdFromContext at the route level and its threading through the payload types is consistent.

Promise.allSettled resilience — Straightforward and correct. The error logging includes tool name/id for debuggability. The VercelDataStreamHelper.complete() is idempotent via isCompleted guard; the SSEStreamHelper.complete() is not, but the catch block already wraps it safely.

Error-to-stream on generation failure — The isTerminal flag is correctly set only when the stream write succeeds, so the finally block picks the right cleanup path (close vs releaseLock). The re-throw preserves the original error for the durable workflow's markWorkflowFailedStep handler.

^{｜ View workflow run ｜ Using Claude Opus ｜ 𝕏}

agents-api/src/domains/run/workflow/steps/agentExecutionSteps.ts

claude

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) mcp-tools.ts:27-44 Silent partial degradation without user notification

Issue: When MCP servers fail to connect or load tools (e.g., OAuth expired, server unreachable), the user receives no indication that their configured tools are unavailable. The agent continues with a reduced tool set, and the user may wonder why certain capabilities are missing.

Why: This is particularly problematic for user-scoped OAuth tools (like Notion) where the failure is likely due to credential issues the user could fix. The error is logged server-side so operators can debug, but the end user has no visibility into why their tools aren't working. The codebase already has an established pattern for this in AgentMcpManager.ts via reportEmptyToolSet which emits both a span and calls agentSessionManager.recordEvent to surface issues to users.

Fix: After logging the failure, emit a user-facing event similar to the reportEmptyToolSet pattern. This would require passing streamRequestId into getMcpTools or accessing it via context. Example approach:

// After the logger.error call
if (streamRequestId) {
  agentSessionManager.recordEvent(streamRequestId, 'warning', ctx.config.id, {
    message: `MCP tool "${mcpTools[i].name}" failed to load and is temporarily unavailable`,
    code: 'mcp_tool_load_failed',
    severity: 'warning',
    context: { toolName: mcpTools[i].name, toolId: mcpTools[i].id },
  });
}

Refs:

AgentMcpManager.ts:261-301 — reportEmptyToolSet pattern

Inline Comments:

🟠 Major: mcp-tools.ts:35 ERROR log level may cause alert fatigue

🟡 Minor (1) 🟡

Inline Comments:

🟡 Minor: mcp-tools.ts:36-40 Missing structured error context for incident correlation

💭 Consider (1) 💭

💭 1) agentExecutionSteps.ts:87 CredentialStoreRegistry created per-step

Issue: A new CredentialStoreRegistry(createDefaultCredentialStores()) is created inside buildAgentForStep, which is called for every callLlmStep and executeToolStep invocation.

Why: In contrast, the non-durable path creates the registry once at app startup via factory.ts and passes it through middleware context. While this correctly addresses the root cause (credentials were undefined), creating a new registry per step may have minor performance overhead if createDefaultCredentialStores() performs any initialization work.

Fix: This is acceptable for the fix and likely not a real performance concern for durable workflows. If it becomes an issue, consider passing the registry through the workflow payload similar to how userId is now passed.

Refs:

factory.ts:49-50 — singleton registry pattern

💡 APPROVE WITH SUGGESTIONS

Summary: This PR correctly fixes the root cause of user-scoped MCP credentials not working in durable workflow mode — the credentialStoreRegistry and userId were hardcoded to undefined. The Promise.allSettled change for resilient tool loading is a solid improvement. The main suggestion is to add user-facing feedback when tools fail to load (following the existing reportEmptyToolSet pattern) and to consider using WARN level instead of ERROR for expected operational conditions.

The changeset-bot flagged that no changeset was included — this looks like a bug fix that should have a patch changeset for @inkeep/agents-api.

Discarded (2)

Location	Issue	Reason Discarded
PR description	Claims "Write error to stream on callLlmStep generation failure"	The diff shows no changes to error handling in callLlmStep — this code already existed. The PR description overstates the scope. This is not a code issue, just documentation inaccuracy.
`agentExecutionSteps.ts`	Stream error writing	Pre-existing code not changed in this PR.

Reviewers (4)

Reviewer	Returned	Main Findings	Consider	Inline Comments	Discarded
`pr-review-standards`	0	0	0	0	0
`pr-review-errors`	1	1	0	0	0
`pr-review-precision`	3	0	1	0	2
`pr-review-sre`	3	0	0	2	0
Total	7	1	1	2	2

Note: The SRE and Errors reviewers identified overlapping issues (silent degradation, log level). Merged into single findings.

agents-api/src/domains/run/agents/tools/mcp-tools.ts

github-actions · 2026-04-02T23:06:20Z

Preview URLs

Use these stable preview aliases for testing this PR:

UI: https://pr-2989-ui.preview.inkeep.com
API: https://pr-2989-api.preview.inkeep.com
API health: https://pr-2989-api.preview.inkeep.com/health

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

UI deployment: https://agents-manage-dmes05z1r.preview.inkeep.com
API deployment: https://agents-ls5vu45i1.preview.inkeep.com

…changeset

anubra266 · 2026-04-02T23:08:40Z

Re: per-step CredentialStoreRegistry — acknowledged. The stores are stateless HTTP clients (Nango, Composio, keychain), so instantiating fresh is safe. Durable steps can't share in-memory singletons across step boundaries. If stores ever gain caching/pooling, a module-level lazy singleton in step.cjs would avoid repeated instantiation.

claude

PR Review Summary

(1) Total Issues | Risk: Medium

🟠⚠️ Major (0) 🟠⚠️

No new Major findings in the delta.

🕐 Pending Recommendations (1)

🟠 mcp-tools.ts:27-44 Silent partial degradation without user notification — when MCP tools fail to load, the user has no visibility; consider emitting a user-facing event via agentSessionManager.recordEvent

💡 APPROVE WITH SUGGESTIONS

Summary: The delta commit successfully addresses both inline review findings from the prior review — log level downgraded to warn and structured context IDs added. The changeset is also properly included now. 🎉

The one remaining suggestion from the prior review (silent degradation notification) is still worth considering for future UX improvement, but isn't blocking — the current behavior (log + skip) is functional and the PR's core fixes (user-scoped credentials, resilient tool loading) are solid.

Discarded (0)

No items discarded.

Reviewers (1)

Reviewer	Returned	Main Findings	Consider	While You're Here	Inline Comments	Pending Recs	Discarded
`orchestrator`	0	0	0	0	0	1	0
Total	0	0	0	0	0	1	0

Note: Delta review — sub-reviewers not dispatched since the changes were strictly review-feedback-only (log level + context + changeset).

anubra266 force-pushed the fix/durable-tool-approval-and-mcp-resilience branch from cce2511 to 0359e36 Compare April 2, 2026 22:57

vercel bot temporarily deployed to Preview – agents-docs April 2, 2026 22:57 Inactive

pullfrog bot reviewed Apr 2, 2026

View reviewed changes

agents-api/src/domains/run/workflow/steps/agentExecutionSteps.ts Show resolved Hide resolved

vercel bot deployed to Preview – agents-manage-ui April 2, 2026 23:00 View deployment

vercel bot deployed to Preview – agents-api April 2, 2026 23:00 View deployment

claude bot reviewed Apr 2, 2026

View reviewed changes

agents-api/src/domains/run/agents/tools/mcp-tools.ts Outdated Show resolved Hide resolved

agents-api/src/domains/run/agents/tools/mcp-tools.ts Show resolved Hide resolved

github-actions bot deleted a comment from claude bot Apr 2, 2026

Address review: downgrade MCP skip log to warn, add context IDs, add …

89828d1

…changeset

claude bot reviewed Apr 2, 2026

View reviewed changes

github-actions bot deleted a comment from claude bot Apr 2, 2026

vercel bot deployed to Preview – agents-docs April 2, 2026 23:11 View deployment

vercel bot deployed to Preview – agents-manage-ui April 2, 2026 23:11 View deployment

vercel bot deployed to Preview – agents-api April 2, 2026 23:11 View deployment

amikofalvy approved these changes Apr 3, 2026

View reviewed changes

anubra266 added this pull request to the merge queue Apr 3, 2026

anubra266 removed this pull request from the merge queue due to a manual request Apr 3, 2026

anubra266 added this pull request to the merge queue Apr 3, 2026

Merged via the queue into main with commit cc56dda Apr 3, 2026
28 checks passed

anubra266 deleted the fix/durable-tool-approval-and-mcp-resilience branch April 3, 2026 22:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix durable workflow: user-scoped MCP credentials and resilient tool loading#2989

Fix durable workflow: user-scoped MCP credentials and resilient tool loading#2989
anubra266 merged 2 commits intomainfrom
fix/durable-tool-approval-and-mcp-resilience

anubra266 commented Apr 2, 2026 •

edited

Loading

Uh oh!

changeset-bot bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

vercel bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

pullfrog bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

pullfrog bot left a comment

Uh oh!

Uh oh!

claude bot left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

anubra266 commented Apr 2, 2026

Uh oh!

claude bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anubra266 commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

changeset-bot bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

vercel bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pullfrog bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key changes

User-scoped credentials in durable workflow steps

Resilient MCP tool loading with Promise.allSettled

Stream error on callLlmStep generation failure

Uh oh!

pullfrog bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

PR Review Summary

🟠⚠️ Major (2) 🟠⚠️

🟡 Minor (1) 🟡

💭 Consider (1) 💭

💡 APPROVE WITH SUGGESTIONS

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Preview URLs

Uh oh!

anubra266 commented Apr 2, 2026

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

PR Review Summary

🟠⚠️ Major (0) 🟠⚠️

🕐 Pending Recommendations (1)

💡 APPROVE WITH SUGGESTIONS

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

anubra266 commented Apr 2, 2026 •

edited

Loading

changeset-bot bot commented Apr 2, 2026 •

edited

Loading

vercel bot commented Apr 2, 2026 •

edited

Loading

pullfrog bot commented Apr 2, 2026 •

edited

Loading

Resilient MCP tool loading with `Promise.allSettled`

Stream error on `callLlmStep` generation failure

github-actions bot commented Apr 2, 2026 •

edited

Loading