Skip to content

fix: preserve hedge model redirect audit per attempt#992

Merged
ding113 merged 2 commits intodevfrom
fix/hedge-model-redirect
Apr 4, 2026
Merged

fix: preserve hedge model redirect audit per attempt#992
ding113 merged 2 commits intodevfrom
fix/hedge-model-redirect

Conversation

@ding113
Copy link
Copy Markdown
Owner

@ding113 ding113 commented Apr 4, 2026

Summary

  • Keep model redirect audit bound to the provider attempt that actually used it instead of mutating the last provider-chain item
  • Carry redirect snapshots across hedge shadow-session cloning and winner sync, and preserve loser/retry-failed audit metadata
  • Add hedge regression coverage for winner, loser, retry_failed, snapshot cloning, and stale redirect clearing

Problem

When streaming requests use the hedge-based provider race (introduced in #894), each attempt may apply a different model redirect for its assigned provider. However, the ModelRedirector.apply() was mutating the last item in the provider chain regardless of which provider attempt triggered the redirect. In a hedge scenario with shadow sessions, this caused:

  1. Cross-contamination: A shadow session's redirect would overwrite the initial provider's redirect on the shared chain
  2. Lost audit metadata: When a shadow attempt won the race (hedge_winner) or failed (retry_failed), its chain entry would lack modelRedirect since the redirect was attached to a different chain item
  3. Shared references: The currentModelRedirect snapshot was not deep-cloned when creating shadow sessions, allowing mutation to leak across attempts

Solution

Introduce a per-attempt redirect snapshot model that decouples the redirect audit from the chain's last item:

  1. ProxySession.currentModelRedirect - New snapshot field that stores the redirect info keyed by providerId, independent of chain position
  2. setCurrentModelRedirect() / getCurrentModelRedirect() / clearCurrentModelRedirect() - New session methods to manage the snapshot lifecycle
  3. attachCurrentModelRedirectToLastChainItem() - Opportunistically attaches the snapshot to the chain if the last item matches the provider; otherwise defers until the real chain entry is created
  4. getAttemptModelRedirect() in forwarder - Lazily captures and caches the redirect snapshot per StreamingHedgeAttempt, ensuring each attempt retains its own redirect
  5. Deep clone on shadow creation/sync - createStreamingShadowSession and syncWinningAttemptSession now structuredClone() the redirect snapshot so shadow and original sessions never share object references
  6. Clear stale snapshots - ModelRedirector.apply() now calls clearCurrentModelRedirect() when a provider has no matching redirect rule, preventing stale data from leaking to subsequent chain entries

Changes

Core Changes

  • session.ts (+45): Add currentModelRedirect snapshot field and accessor methods (setCurrentModelRedirect, getCurrentModelRedirect, clearCurrentModelRedirect, attachCurrentModelRedirectToLastChainItem); wire modelRedirect into addProviderToChain metadata
  • model-redirector.ts (+18/-15): Replace direct chain-mutation with snapshot-based approach; call session.setCurrentModelRedirect() and attachCurrentModelRedirectToLastChainItem() instead of mutating lastDecision; add clearCurrentModelRedirect() calls when no redirect applies
  • forwarder.ts (+55/-5): Add modelRedirect to StreamingHedgeAttempt type; introduce getAttemptModelRedirect() helper for lazy snapshot capture; attach redirect to hedge_loser_cancelled, retry_failed, hedge_winner, request_success, and error chain entries; snapshot active attempts' redirects when a winner settles; deep-clone currentModelRedirect in createStreamingShadowSession and syncWinningAttemptSession

Test Coverage

  • proxy-forwarder-hedge-first-byte.test.ts (+367): 5 new test cases:
    • Shadow session redirect isolation + winner keeps its own redirect
    • Snapshot cloning (not shared reference) between session and shadow
    • Stale redirect clearing when switching to provider without redirect rules
    • Full hedge path: winner and loser both carry correct redirect metadata
    • Full hedge path: retry_failed shadow attempt retains redirect audit

Related

Verification

  • bunx vitest run tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
  • bun run lint:fix
  • bun run lint
  • bun run typecheck
  • bun run build
  • bun run test

Greptile Summary

This PR fixes a redirect-audit correctness bug in the hedge streaming path: model redirect metadata was attached to whichever chain item happened to be last rather than to the attempt that actually performed the redirect, causing cross-contamination between concurrent shadow sessions.

The solution introduces a per-attempt currentModelRedirect snapshot on ProxySession with providerId-scoped accessors, a lazy-capture helper (getAttemptModelRedirect) that caches the snapshot per StreamingHedgeAttempt, and structuredClone at shadow-creation and winner-sync boundaries to prevent reference aliasing. Test coverage is thorough, exercising winner, loser, retry_failed, snapshot isolation, and stale-clearing paths.

Confidence Score: 5/5

Safe to merge; the snapshot model, pre-capture loop timing, and structuredClone boundaries are all correct across every hedge race path.

All findings are P2 style/hardening suggestions. The core correctness of the per-attempt redirect snapshot, the providerId-scoped fallback in addProviderToChain, and the pre-capture loop ordering (before syncWinningAttemptSession) is sound. No data-loss, security, or runtime-error risks identified.

No files require special attention; the two P2 comments on forwarder.ts are non-blocking.

Important Files Changed

Filename Overview
src/app/v1/_lib/proxy/session.ts Adds currentModelRedirect snapshot field and four lifecycle methods (set/get/clear/attach); wires modelRedirect into addProviderToChain via providerId-gated fallback
src/app/v1/_lib/proxy/model-redirector.ts Replaces direct chain-mutation with snapshot-based approach using setCurrentModelRedirect and attachCurrentModelRedirectToLastChainItem; adds clearCurrentModelRedirect on no-redirect paths to prevent stale snapshots
src/app/v1/_lib/proxy/forwarder.ts Adds modelRedirect field to StreamingHedgeAttempt, getAttemptModelRedirect lazy-capture helper with pre-capture loop in commitWinner, structuredClone on shadow create/sync, and explicit redirect propagation to all chain entry points
tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts Five new regression tests: shadow redirect isolation, snapshot clone vs reference, stale redirect clearing, full hedge winner/loser redirect audit, and retry_failed redirect retention

Sequence Diagram

sequenceDiagram
    participant S as Main Session
    participant F as Forwarder (hedge)
    participant R as ModelRedirector
    participant Sh as Shadow Session

    F->>R: apply(session, fireworks)
    R->>S: setCurrentModelRedirect(fireworks.id, fw_redirect)
    R->>S: attachCurrentModelRedirectToLastChainItem(fireworks.id)
    Note over S: chain[initial_selection].modelRedirect = fw_redirect

    F->>Sh: createStreamingShadowSession(session, minimax)
    Note over Sh: currentModelRedirect = structuredClone(fw_redirect)

    F->>R: apply(shadow, minimax)
    R->>Sh: setCurrentModelRedirect(minimax.id, mm_redirect)
    Note over Sh: attachToLastChainItem → no-op (last item id is fireworks)

    F->>F: minimax first chunk arrives → commitWinner(minimaxAttempt)
    Note over F: Pre-capture loop: getAttemptModelRedirect(fwAttempt)<br/>reads + caches fw_redirect BEFORE session sync
    F->>S: syncWinningAttemptSession(session, shadow)
    Note over S: currentModelRedirect = structuredClone(mm_redirect)

    F->>S: addProviderToChain(minimax, hedge_winner, mm_redirect)
    F->>F: abortAllAttempts → abortAttempt(fwAttempt)
    F->>S: addProviderToChain(fireworks, hedge_loser_cancelled, cached fw_redirect)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 2998-3008

Comment:
**`undefined` not cached in `getAttemptModelRedirect`**

When the attempt's provider has no redirect configured, `redirect` is `undefined` and `attempt.modelRedirect` stays unset on every call. After `syncWinningAttemptSession` overwrites `session.currentModelRedirect`, a subsequent call for a loser attempt whose `attemptSession` is the main session will re-query the now-mutated session. Correctness is preserved today by the `providerId` guard in `getCurrentModelRedirect`, but caching the absent case explicitly makes the intent clearer and guards against future refactors removing that guard.

```suggestion
    const getAttemptModelRedirect = (attempt: StreamingHedgeAttempt) => {
      if (attempt.modelRedirect !== undefined) {
        return attempt.modelRedirect;
      }

      const redirect = attempt.session.getCurrentModelRedirect(attempt.provider.id);
      // Cache even the absent case so post-sync re-queries cannot observe stale state.
      attempt.modelRedirect = redirect ? structuredClone(redirect) : null;

      return attempt.modelRedirect ?? undefined;
    };
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 3689

Comment:
**Shallow chain item copy allows mutation through shadow session**

`shadowState.providerChain = [...sourceState.providerChain]` is a shallow array copy: the `ProviderChainItem` objects themselves are shared references with the original session. `attachCurrentModelRedirectToLastChainItem` mutates `lastItem.modelRedirect` in-place, so if a shadow's provider id ever matched an existing chain item's id, both chains would be silently affected. This is prevented today by the `launchedProviderIds` uniqueness invariant, but that invariant is not encoded here. A comment documenting the assumption (or a defensive `structuredClone` of the items) would help future maintainers.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "test(proxy): cover hedge redirect audit ..." | Re-trigger Greptile

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

在代理会话与流式对冲逻辑中引入并传播“当前模型重定向(currentModelRedirect)”快照;在会话层增加管理接口,并将每个对冲尝试的 modelRedirect 延迟物化与记录到对冲审计链和阴影会话的克隆/合并流程中。

Changes

Cohort / File(s) Summary
对冲转发器
src/app/v1/_lib/proxy/forwarder.ts
StreamingHedgeAttempt 添加可选 modelRedirect 字段与惰性获取函数 getAttemptModelRedirect;在多处对冲审计/决策链条条目中包含 modelRedirect;确保清算前物化所有尝试的重定向;将 modelRedirect 传入重试失败链项。
会话状态与链路记账
src/app/v1/_lib/proxy/session.ts
新增 currentModelRedirect 状态与公有方法 setCurrentModelRedirect/clearCurrentModelRedirect/getCurrentModelRedirect/attachCurrentModelRedirectToLastChainItem;扩展 addProviderToChain(..., metadata?) 接收可选 modelRedirect 并将其持久化到 ProviderChainItem
模型重定向应用器
src/app/v1/_lib/proxy/model-redirector.ts
当无匹配重定向时调用 clearCurrentModelRedirect();不再直接变异链项,而是创建不可变 redirectInfo 快照并通过会话方法记录为“当前提供商尝试”,随后将其附加到最后链项;更新调试日志语句。
阴影会话/合并形状
src/app/v1/_lib/proxy/...
阴影/合并会话状态形状扩展为包含 `currentModelRedirect: { providerId, redirect }
单元测试
tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
新增多项测试,验证首字节对冲路径中模型重定向的隔离与克隆语义、切换到无重定向提供商时的清除行为、以及赢家/失败者和重试失败路径中 provider-chain 上的 modelRedirect 保留。

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed PR title clearly describes the main fix: preserving model redirect audit per provider attempt in hedge flows, directly addressing the core issue.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed PR description clearly relates to the changeset - it documents the model redirect audit fixes for hedge streaming paths across four modified files with specific technical implementation details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/hedge-model-redirect

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added bug Something isn't working area:provider area:session size/L Large PR (< 1000 lines) labels Apr 4, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to track and persist model redirects within proxy sessions, ensuring that redirect metadata is correctly associated with provider attempts and recorded in audit logs. It adds a currentModelRedirect snapshot to ProxySession and updates ProxyForwarder to handle redirect propagation during hedging and retries. Feedback was provided regarding a potential performance bottleneck in the loop used to capture redirects for all active attempts, suggesting an opportunity for optimization.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

🧪 测试结果

测试类型 状态
代码质量
单元测试
集成测试
API 测试

总体结果: ✅ 所有测试通过

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 2998-3009: The client_abort bulk settlement path currently calls
session.addProviderToChain(...) directly on attempts and therefore can miss
snapshotting shadow attempts' modelRedirect; update the global client-abort
listener to use the existing getAttemptModelRedirect(attempt:
StreamingHedgeAttempt) helper to ensure each attempt (including shadow ones) has
its modelRedirect populated (call getAttemptModelRedirect for each attempt
before invoking session.addProviderToChain) so the audit trail includes the
redirect snapshot; also add a hedge client_abort test that verifies shadow
attempts' modelRedirects are preserved in the settlement/audit chain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a6bbd83-37fb-49f9-aa2c-882c5633101f

📥 Commits

Reviewing files that changed from the base of the PR and between de62b49 and 7dfadb5.

📒 Files selected for processing (4)
  • src/app/v1/_lib/proxy/forwarder.ts
  • src/app/v1/_lib/proxy/model-redirector.ts
  • src/app/v1/_lib/proxy/session.ts
  • tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

No significant issues identified in this PR.

PR Size: L

  • Lines changed: 505
  • Files changed: 4
  • Split suggestions: Consider splitting into (1) ProxySession/ModelRedirector state tracking, (2) ProxyForwarder hedge audit plumbing, (3) hedge regression tests.

Review Coverage

  • Logic and correctness - Clean
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean
  • Type safety - Clean
  • Documentation accuracy - Clean
  • Test coverage - Adequate
  • Code clarity - Good

Automated review by Codex AI

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Reviewed PR #992 and applied label size/L.
  • Posted the required review summary via gh pr review --comment.
  • No inline review comments were submitted (no diff-line issues met the reporting threshold).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

🧪 测试结果

测试类型 状态
代码质量
单元测试
集成测试
API 测试

总体结果: ✅ 所有测试通过

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR introduces a well-designed per-attempt redirect snapshot model that correctly decouples model redirect audit from chain position. The snapshot-based approach, deep-cloning on shadow creation/sync, and lazy caching in getAttemptModelRedirect are all solid architectural choices. Five comprehensive test cases cover the critical paths.

One inconsistency found: the signal handler's client_abort path was not updated to include redirect metadata for shadow attempts, while the equivalent handleAttemptFailure path was fixed.

PR Size: L

  • Lines changed: 505
  • Files changed: 4

Issues Found

Category Critical High Medium Low
Logic/Bugs 0 0 1 0
Security 0 0 0 0
Error Handling 0 0 0 0
Types 0 0 0 0
Comments/Docs 0 0 0 0
Tests 0 0 0 0
Simplification 0 0 0 0

Medium Priority Issues

1. Signal handler client_abort path missing redirect metadata for shadow attempts

File: src/app/v1/_lib/proxy/forwarder.ts (~line 3539, unchanged code)

Confidence: 82

Why this is a problem: The PR adds modelRedirect: getAttemptModelRedirect(attempt) to the handleAttemptFailure CLIENT_ABORT path (line 3248), correctly capturing per-attempt redirect snapshots. However, the clientAbortSignal event handler (~line 3539) has an equivalent client_abort code path that was NOT updated. These two paths are mutually exclusive for a given attempt:

  • Signal handler path (fires first when client disconnects): adds client_abort entries for ALL unsettled attempts, then calls abortAllAttempts(undefined, "client_abort"). Since abortAttempt sets attempt.settled = true, subsequent handleAttemptFailure calls bail early.
  • handleAttemptFailure CLIENT_ABORT path (fires when an attempt's error is categorized as CLIENT_ABORT before the signal): adds entry with redirect metadata.

The signal handler uses session.addProviderToChain(attempt.provider, {...}) on the main session. The addProviderToChain fallback (this.getCurrentModelRedirect(provider.id)) reads from the main session's snapshot. For the initial attempt, this works (main session has the initial provider's redirect). But for shadow attempts (whose redirect lives on the shadow session, not the main session), the main session's getCurrentModelRedirect(shadowProvider.id) returns undefined because the providerId doesn't match.

This contradicts the PR's stated goal: "its chain entry would lack modelRedirect since the redirect was attached to a different chain item".

Suggested fix:

// In the clientAbortSignal handler (~line 3539 in PR version):
for (const attempt of Array.from(attempts)) {
  if (!attempt.settled) {
    session.addProviderToChain(attempt.provider, {
      ...attempt.endpointAudit,
      reason: "client_abort",
      attemptNumber: attempt.sequence,
      errorMessage: "Client aborted request",
      modelRedirect: getAttemptModelRedirect(attempt),  // ADD THIS
    });
  }
}

Review Coverage

  • Logic and correctness
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean
  • Type safety - Clean
  • Documentation accuracy - Clean
  • Test coverage - Adequate (5 new tests cover critical paths; signal handler client_abort path not tested)
  • Code clarity - Good

Automated review by Claude AI

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/v1/_lib/proxy/forwarder.ts (1)

96-102: ⚠️ Potential issue | 🟠 Major

把空快照也缓存下来。

当前 helper 只在拿到 redirect 时才写回 attempt.modelRedirect。如果这个 attempt 本来就没有 redirect,后续再次调用还是会重新读取 session 当前值,无法保持“每个 attempt 只截一次快照”的语义。这样后面的 hedge_loser / retry_failed / client_abort / request_success 仍然可能读到后来状态,而不是这个 attempt 当时的结果。

建议修正
 type StreamingHedgeAttempt = {
   provider: Provider;
   session: ProxySession;
   baseUrl: string;
   endpointAudit: { endpointId: number | null; endpointUrl: string };
   modelRedirect?: ProviderChainItem["modelRedirect"];
+  modelRedirectCaptured?: boolean;
   responseController: AbortController | null;
   clearResponseTimeout: (() => void) | null;
   firstByteTimeoutMs: number;
   sequence: number;
   requestAttemptCount: number;
@@
 const getAttemptModelRedirect = (attempt: StreamingHedgeAttempt) => {
-  if (attempt.modelRedirect !== undefined) {
+  if (attempt.modelRedirectCaptured) {
     return attempt.modelRedirect;
   }

   const redirect = attempt.session.getCurrentModelRedirect(attempt.provider.id);
-  if (redirect) {
-    attempt.modelRedirect = structuredClone(redirect);
-  }
-
+  attempt.modelRedirect = redirect ? structuredClone(redirect) : undefined;
+  attempt.modelRedirectCaptured = true;
   return attempt.modelRedirect;
 };

Also applies to: 2998-3009

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 96 - 102, 当前 helper 只在有
redirect 时才把 snapshot 写回 attempt.modelRedirect,导致没有 redirect 的 attempt
以后再次读取会看到后续变更;请在负责设置快照的函数(修改 StreamingHedgeAttempt 的代码路径——即把 modelRedirect 写回
attempt 的 helper,另外也修改在 2998-3009 区块的同样逻辑)中无论是否有 redirect 都显式写回一个快照:例如把
attempt.modelRedirect 赋值为 session.modelRedirect 的当前值(若无则显式赋 null 或空对象),保证每个
attempt 只截取一次快照并持久化到 attempt.modelRedirect。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/app/v1/_lib/proxy/forwarder.ts`:
- Around line 96-102: 当前 helper 只在有 redirect 时才把 snapshot 写回
attempt.modelRedirect,导致没有 redirect 的 attempt 以后再次读取会看到后续变更;请在负责设置快照的函数(修改
StreamingHedgeAttempt 的代码路径——即把 modelRedirect 写回 attempt 的 helper,另外也修改在
2998-3009 区块的同样逻辑)中无论是否有 redirect 都显式写回一个快照:例如把 attempt.modelRedirect 赋值为
session.modelRedirect 的当前值(若无则显式赋 null 或空对象),保证每个 attempt 只截取一次快照并持久化到
attempt.modelRedirect。

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d1522c2-1aaf-4840-83e7-39eb7f43cca5

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfadb5 and caf0e5d.

📒 Files selected for processing (2)
  • src/app/v1/_lib/proxy/forwarder.ts
  • tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/proxy/proxy-forwarder-hedge-first-byte.test.ts

@ding113 ding113 merged commit 7ea7ba4 into dev Apr 4, 2026
12 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in Claude Code Hub Roadmap Apr 4, 2026
@coderabbitai coderabbitai bot mentioned this pull request Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:provider area:session bug Something isn't working size/L Large PR (< 1000 lines)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant