Skip to content

Fix durable workflow: user-scoped MCP credentials and resilient tool loading#2989

Merged
anubra266 merged 2 commits intomainfrom
fix/durable-tool-approval-and-mcp-resilience
Apr 3, 2026
Merged

Fix durable workflow: user-scoped MCP credentials and resilient tool loading#2989
anubra266 merged 2 commits intomainfrom
fix/durable-tool-approval-and-mcp-resilience

Conversation

@anubra266
Copy link
Copy Markdown
Contributor

@anubra266 anubra266 commented Apr 2, 2026

Summary

  • Propagate userId and credentialStoreRegistry to durable workflow steps so user-scoped MCP tools (e.g. Notion with OAuth) resolve credentials correctly — previously hardcoded to undefined, causing durable mode to silently lose all authenticated MCP tools
  • Use Promise.allSettled for MCP tool loading so a single server failure no longer crashes the entire agent — failed tools are logged and skipped
  • Write error to stream on callLlmStep generation failure — previously the client received zero bytes and hung indefinitely

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Apr 2, 2026

🦋 Changeset detected

Latest commit: 89828d1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-core Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Apr 2, 2026 11:11pm
agents-docs Ready Ready Preview, Comment Apr 2, 2026 11:11pm
agents-manage-ui Ready Ready Preview, Comment Apr 2, 2026 11:11pm

Request Review

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog bot commented Apr 2, 2026

TL;DR — Fixes three independent bugs in durable workflow mode: user-scoped MCP credentials were hardcoded to undefined (breaking all authenticated MCP tools like Notion OAuth), a single MCP server failure crashed the entire agent, and LLM generation errors caused the client to hang with zero bytes instead of receiving an error.

Key changes

  • Propagate userId and credentialStoreRegistry through durable workflow steps — durable mode now resolves user-scoped MCP credentials the same way classic mode does, fixing authenticated tools (e.g. Notion with OAuth)
  • Switch MCP tool loading to Promise.allSettled — a single server failure is logged and skipped instead of taking down every tool in the agent
  • Write error to stream on callLlmStep generation failure — clients now receive an error operation instead of hanging indefinitely on a dead stream

Summary | 6 files | 2 commits | base: mainfix/durable-tool-approval-and-mcp-resilience


User-scoped credentials in durable workflow steps

Before: credentialStoreRegistry and userId were hardcoded to undefined inside buildAgentForStep, so durable mode could never resolve OAuth tokens for user-scoped MCP tools.
After: Both values are extracted from the authenticated session context, threaded through the workflow payload, and passed into every credential lookup call.

The userId is pulled via getUserIdFromContext in both the chat.ts and chatDataStream.ts route handlers, added to AgentExecutionPayload, and forwarded into buildAgentForStep. A new CredentialStoreRegistry is instantiated per step so MCP tool resolution can locate the correct OAuth credentials.

agentExecutionSteps.ts · chat.ts · chatDataStream.ts · agentExecution.ts


Resilient MCP tool loading with Promise.allSettled

Before: Promise.all on MCP tool sets meant one failing server (e.g. expired OAuth token) rejected the entire batch — the agent started with zero tools.
After: Promise.allSettled catches individual failures, logs them at warn level with tool name, ID, and context identifiers (tenantId, projectId, agentId), and continues with the tools that loaded successfully.

mcp-tools.ts


Stream error on callLlmStep generation failure

Before: If the LLM generation threw inside callLlmStep, the error propagated without writing anything to the SSE stream — the client hung waiting for data that never arrived.
After: A catch block writes an errorOp to the stream and calls complete() before re-throwing, so the client always receives a termination signal.

What happens if the stream write itself fails? A nested try/catch logs a warning and lets the original error propagate. The stream is cleaned up in the existing finally block regardless.

agentExecutionSteps.ts

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

…loading

Three fixes for tool approval and MCP tool loading in durable execution mode:

1. Propagate userId and credentialStoreRegistry to durable workflow steps.
   The durable buildAgentForStep was hardcoding userId: undefined and
   credentialStoreRegistry: undefined, which prevented user-scoped MCP
   tools (e.g. Notion with OAuth) from resolving credentials. Classic mode
   worked because generateTaskHandler passed both values. Now both
   AgentExecutionPayload and AgentExecutionStepPayload carry userId,
   and buildAgentForStep instantiates a CredentialStoreRegistry for
   credential resolution.

2. Use Promise.allSettled for MCP tool loading so a single MCP server
   failure (auth error, timeout) no longer crashes the entire agent.
   Failed tools are logged and skipped; the agent continues with the
   remaining tools.

3. Write error to stream on generation failure in callLlmStep. Previously,
   when agent.generate() threw (e.g. MCP connection failure), the durable
   stream received zero bytes and the client hung indefinitely. Now the
   catch block writes an error operation to the stream before rethrowing.
@anubra266 anubra266 force-pushed the fix/durable-tool-approval-and-mcp-resilience branch from cce2511 to 0359e36 Compare April 2, 2026 22:57
@vercel vercel bot temporarily deployed to Preview – agents-docs April 2, 2026 22:57 Inactive
Copy link
Copy Markdown
Contributor

@pullfrog pullfrog bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-scoped fix for three distinct durable-mode bugs. All three changes are correct and the error handling is properly structured.

userId + credentialStoreRegistry propagation — The root cause (hardcoded undefined values in buildAgentForStep) is fixed at all three injection points: getMcpToolById, new Agent(...), and buildTransferRelationConfig. The userId extraction via getUserIdFromContext at the route level and its threading through the payload types is consistent.

Promise.allSettled resilience — Straightforward and correct. The error logging includes tool name/id for debuggability. The VercelDataStreamHelper.complete() is idempotent via isCompleted guard; the SSEStreamHelper.complete() is not, but the catch block already wraps it safely.

Error-to-stream on generation failure — The isTerminal flag is correctly set only when the stream write succeeds, so the finally block picks the right cleanup path (close vs releaseLock). The re-throw preserves the original error for the durable workflow's markWorkflowFailedStep handler.

Pullfrog  | View workflow run | Using Claude Opus𝕏

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) mcp-tools.ts:27-44 Silent partial degradation without user notification

Issue: When MCP servers fail to connect or load tools (e.g., OAuth expired, server unreachable), the user receives no indication that their configured tools are unavailable. The agent continues with a reduced tool set, and the user may wonder why certain capabilities are missing.

Why: This is particularly problematic for user-scoped OAuth tools (like Notion) where the failure is likely due to credential issues the user could fix. The error is logged server-side so operators can debug, but the end user has no visibility into why their tools aren't working. The codebase already has an established pattern for this in AgentMcpManager.ts via reportEmptyToolSet which emits both a span and calls agentSessionManager.recordEvent to surface issues to users.

Fix: After logging the failure, emit a user-facing event similar to the reportEmptyToolSet pattern. This would require passing streamRequestId into getMcpTools or accessing it via context. Example approach:

// After the logger.error call
if (streamRequestId) {
  agentSessionManager.recordEvent(streamRequestId, 'warning', ctx.config.id, {
    message: `MCP tool "${mcpTools[i].name}" failed to load and is temporarily unavailable`,
    code: 'mcp_tool_load_failed',
    severity: 'warning',
    context: { toolName: mcpTools[i].name, toolId: mcpTools[i].id },
  });
}

Refs:

Inline Comments:

  • 🟠 Major: mcp-tools.ts:35 ERROR log level may cause alert fatigue

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: mcp-tools.ts:36-40 Missing structured error context for incident correlation

💭 Consider (1) 💭

💭 1) agentExecutionSteps.ts:87 CredentialStoreRegistry created per-step

Issue: A new CredentialStoreRegistry(createDefaultCredentialStores()) is created inside buildAgentForStep, which is called for every callLlmStep and executeToolStep invocation.

Why: In contrast, the non-durable path creates the registry once at app startup via factory.ts and passes it through middleware context. While this correctly addresses the root cause (credentials were undefined), creating a new registry per step may have minor performance overhead if createDefaultCredentialStores() performs any initialization work.

Fix: This is acceptable for the fix and likely not a real performance concern for durable workflows. If it becomes an issue, consider passing the registry through the workflow payload similar to how userId is now passed.

Refs:


💡 APPROVE WITH SUGGESTIONS

Summary: This PR correctly fixes the root cause of user-scoped MCP credentials not working in durable workflow mode — the credentialStoreRegistry and userId were hardcoded to undefined. The Promise.allSettled change for resilient tool loading is a solid improvement. The main suggestion is to add user-facing feedback when tools fail to load (following the existing reportEmptyToolSet pattern) and to consider using WARN level instead of ERROR for expected operational conditions.

The changeset-bot flagged that no changeset was included — this looks like a bug fix that should have a patch changeset for @inkeep/agents-api.

Discarded (2)
Location Issue Reason Discarded
PR description Claims "Write error to stream on callLlmStep generation failure" The diff shows no changes to error handling in callLlmStep — this code already existed. The PR description overstates the scope. This is not a code issue, just documentation inaccuracy.
agentExecutionSteps.ts Stream error writing Pre-existing code not changed in this PR.
Reviewers (4)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-errors 1 1 0 0 0 0 0
pr-review-precision 3 0 1 0 0 0 2
pr-review-sre 3 0 0 0 2 0 0
Total 7 1 1 0 2 0 2

Note: The SRE and Errors reviewers identified overlapping issues (silent degradation, log level). Merged into single findings.

@github-actions github-actions bot deleted a comment from claude bot Apr 2, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Preview URLs

Use these stable preview aliases for testing this PR:

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

@anubra266
Copy link
Copy Markdown
Contributor Author

Re: per-step CredentialStoreRegistry — acknowledged. The stores are stateless HTTP clients (Nango, Composio, keychain), so instantiating fresh is safe. Durable steps can't share in-memory singletons across step boundaries. If stores ever gain caching/pooling, a module-level lazy singleton in step.cjs would avoid repeated instantiation.

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(1) Total Issues | Risk: Medium

🟠⚠️ Major (0) 🟠⚠️

No new Major findings in the delta.

🕐 Pending Recommendations (1)

  • 🟠 mcp-tools.ts:27-44 Silent partial degradation without user notification — when MCP tools fail to load, the user has no visibility; consider emitting a user-facing event via agentSessionManager.recordEvent

💡 APPROVE WITH SUGGESTIONS

Summary: The delta commit successfully addresses both inline review findings from the prior review — log level downgraded to warn and structured context IDs added. The changeset is also properly included now. 🎉

The one remaining suggestion from the prior review (silent degradation notification) is still worth considering for future UX improvement, but isn't blocking — the current behavior (log + skip) is functional and the PR's core fixes (user-scoped credentials, resilient tool loading) are solid.

Discarded (0)

No items discarded.

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator 0 0 0 0 0 1 0
Total 0 0 0 0 0 1 0

Note: Delta review — sub-reviewers not dispatched since the changes were strictly review-feedback-only (log level + context + changeset).

@anubra266 anubra266 added this pull request to the merge queue Apr 3, 2026
@anubra266 anubra266 removed this pull request from the merge queue due to a manual request Apr 3, 2026
@anubra266 anubra266 added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit cc56dda Apr 3, 2026
28 checks passed
@anubra266 anubra266 deleted the fix/durable-tool-approval-and-mcp-resilience branch April 3, 2026 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants