Conversation
minpeter
commented
Jan 6, 2026
- feat: add context management with compaction and middleware support
  - Add context tracker for monitoring token usage
  - Implement auto-compaction when context threshold reached
  - Add middleware for trimming leading newlines in AI responses
  - Add /help command and model switching capabilities
  - Add includeUsage flag to friendliai client
  - Add debug logging for context usage
- Update default model from LGAI-EXAONE/K-EXAONE-236B-A23B to zai-org/GLM-4.6 and add support for aborting ongoing conversations via ESC key
  - Replace default model in documentation and code
  - Add abort functionality to Agent class with AbortController
  - Modify chat method to return aborted status
  - Update command handler to support abort signals in streaming
  - Add ESC key interrupt support in input handling
  - Implement /context and /compact commands for monitoring and managing context usage
- feat: context management follow-ups
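One of the commits above adds a middleware that trims leading newlines from AI responses. The sketch below shows one way such a middleware could be written, assuming the Vercel AI SDK's wrapLanguageModel / LanguageModelV1Middleware hooks and a friendli provider factory from @friendliai/ai-provider; it is illustrative only and is not the code from this PR.

import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";
import { friendli } from "@friendliai/ai-provider"; // assumed provider factory

// Strip leading newlines from non-streaming results; a streaming variant
// would additionally need to buffer the first text deltas before forwarding.
const trimLeadingNewlines: LanguageModelV1Middleware = {
  wrapGenerate: async ({ doGenerate }) => {
    const result = await doGenerate();
    return { ...result, text: result.text?.replace(/^\n+/, "") };
  },
};

// Wrap the provider model before handing it to the agent.
const model = wrapLanguageModel({
  model: friendli("zai-org/GLM-4.6"),
  middleware: trimLeadingNewlines,
});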
Summary of Changes: Hello @minpeter, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the agent's ability to manage conversation context, ensuring more stable and efficient long-running interactions. It introduces automatic and manual mechanisms for keeping the conversation within token limits, provides visibility into context usage, and improves user control by allowing the interruption of AI responses. These changes aim to make the agent more robust and user-friendly for extended coding sessions.
Code Review
This pull request introduces a comprehensive context management system for the agent, including token tracking, automatic context compaction, and manual controls via new commands. It also adds the ability to abort ongoing conversations with the ESC key and updates the default model. The implementation is robust, with good error handling and asynchronous control flow. I've identified a couple of areas for improvement: one to fix the logic for fallback context size estimation, and another to refactor a function for better readability. Overall, this is an excellent and significant feature addition.
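For readers following along, here is a rough sketch of the AbortController wiring the summary refers to. The Agent, chat, and abort names come from the PR description, but the body below is illustrative rather than the PR's actual code.

import { streamText, type LanguageModel, type ModelMessage } from "ai";

class Agent {
  private abortController: AbortController | null = null;

  constructor(private readonly model: LanguageModel) {}

  // Invoked by the ESC key handler to cancel the in-flight request.
  abort(): void {
    this.abortController?.abort();
  }

  // Streams a reply and reports whether the turn was aborted.
  async chat(messages: ModelMessage[]): Promise<{ aborted: boolean }> {
    this.abortController = new AbortController();
    const { signal } = this.abortController;
    try {
      const result = streamText({ model: this.model, messages, abortSignal: signal });
      for await (const chunk of result.textStream) {
        process.stdout.write(chunk);
      }
      return { aborted: signal.aborted };
    } catch (error) {
      if (signal.aborted) return { aborted: true };
      throw error;
    } finally {
      this.abortController = null;
    }
  }
}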
function shouldContinueAfterTools(messages: ModelMessage[]): boolean {
  let lastToolIndex = -1;
  for (let i = 0; i < messages.length; i += 1) {
    if (messages[i]?.role === "tool") {
      lastToolIndex = i;
    }
  }
  if (lastToolIndex === -1) {
    return false;
  }
  for (let i = lastToolIndex + 1; i < messages.length; i += 1) {
    if (assistantMessageHasText(messages[i])) {
      return false;
    }
  }
  return true;
}
The shouldContinueAfterTools function can be simplified for better readability and maintainability by using modern array methods like findLastIndex and some. The current implementation with for loops is correct but more verbose than necessary.
function shouldContinueAfterTools(messages: ModelMessage[]): boolean {
const lastToolIndex = messages.findLastIndex((msg) => msg.role === "tool");
if (lastToolIndex === -1) {
return false;
}
// Check if there is any assistant message with text after the last tool message.
const subsequentMessages = messages.slice(lastToolIndex + 1);
return !subsequentMessages.some(assistantMessageHasText);
}

export class ContextTracker {
  private readonly config: ContextConfig;
  private totalInputTokens = 0;
  private totalOutputTokens = 0;
  private stepCount = 0;
  private currentContextTokens: number | null = null;

  constructor(config: Partial<ContextConfig> = {}) {
    this.config = { ...DEFAULT_CONFIG, ...config };
  }

  setMaxContextTokens(tokens: number): void {
    this.config.maxContextTokens = tokens;
  }

  setCompactionThreshold(threshold: number): void {
    if (threshold < 0 || threshold > 1) {
      throw new Error("Compaction threshold must be between 0 and 1");
    }
    this.config.compactionThreshold = threshold;
  }

  updateUsage(usage: LanguageModelUsage): void {
    this.totalInputTokens += usage.inputTokens ?? 0;
    this.totalOutputTokens += usage.outputTokens ?? 0;
    this.stepCount++;
  }

  /**
   * Set the exact current context token count.
   */
  setContextTokens(tokens: number): void {
    this.currentContextTokens = Math.max(0, Math.round(tokens));
  }

  /**
   * Set total usage directly (useful after compaction or when loading state)
   */
  setTotalUsage(inputTokens: number, outputTokens: number): void {
    this.totalInputTokens = inputTokens;
    this.totalOutputTokens = outputTokens;
  }

  /**
   * Get estimated current context size
   * Note: This is an approximation based on accumulated usage
   */
  getEstimatedContextTokens(): number {
    // The input tokens from the last request roughly represents
    // the current context size (system prompt + conversation history)
    return this.totalInputTokens > 0
      ? Math.round(this.totalInputTokens / Math.max(this.stepCount, 1))
      : 0;
  }

  getStats(): ContextStats {
    const totalTokens =
      this.currentContextTokens ?? this.getEstimatedContextTokens();
    const usagePercentage = totalTokens / this.config.maxContextTokens;
    const shouldCompact = usagePercentage >= this.config.compactionThreshold;

    return {
      totalTokens,
      inputTokens: this.totalInputTokens,
      outputTokens: this.totalOutputTokens,
      maxContextTokens: this.config.maxContextTokens,
      usagePercentage,
      shouldCompact,
    };
  }

  shouldCompact(): boolean {
    return this.getStats().shouldCompact;
  }

  reset(): void {
    this.totalInputTokens = 0;
    this.totalOutputTokens = 0;
    this.stepCount = 0;
    this.currentContextTokens = 0;
  }

  /**
   * Called after compaction to adjust token counts
   * @param newInputTokens The token count of the compacted context
   */
  afterCompaction(newInputTokens: number): void {
    this.totalInputTokens = newInputTokens;
    this.totalOutputTokens = 0;
    this.stepCount = 1;
    this.currentContextTokens = Math.max(0, Math.round(newInputTokens));
  }

  getConfig(): ContextConfig {
    return { ...this.config };
  }
}
The current implementation of getEstimatedContextTokens calculates the average input tokens per step, which doesn't accurately reflect the current context size as the conversation grows. The comment for the function correctly states that the input tokens from the last request are a better estimate. This can be fixed by storing the last input token count in the tracker. This change improves the accuracy of the fallback context size estimation, making the tracker more robust when the precise token measurement API fails.
export class ContextTracker {
private readonly config: ContextConfig;
private totalInputTokens = 0;
private totalOutputTokens = 0;
private stepCount = 0;
private currentContextTokens: number | null = null;
private lastInputTokens = 0;
constructor(config: Partial<ContextConfig> = {}) {
this.config = { ...DEFAULT_CONFIG, ...config };
}
setMaxContextTokens(tokens: number): void {
this.config.maxContextTokens = tokens;
}
setCompactionThreshold(threshold: number): void {
if (threshold < 0 || threshold > 1) {
throw new Error("Compaction threshold must be between 0 and 1");
}
this.config.compactionThreshold = threshold;
}
updateUsage(usage: LanguageModelUsage): void {
this.totalInputTokens += usage.inputTokens ?? 0;
this.totalOutputTokens += usage.outputTokens ?? 0;
this.stepCount++;
this.lastInputTokens = usage.inputTokens ?? 0;
}
/**
* Set the exact current context token count.
*/
setContextTokens(tokens: number): void {
this.currentContextTokens = Math.max(0, Math.round(tokens));
}
/**
* Set total usage directly (useful after compaction or when loading state)
*/
setTotalUsage(inputTokens: number, outputTokens: number): void {
this.totalInputTokens = inputTokens;
this.totalOutputTokens = outputTokens;
}
/**
* Get estimated current context size
* Note: This is an approximation based on accumulated usage
*/
getEstimatedContextTokens(): number {
// The input tokens from the last request roughly represents
// the current context size (system prompt + conversation history)
return this.lastInputTokens;
}
getStats(): ContextStats {
const totalTokens =
this.currentContextTokens ?? this.getEstimatedContextTokens();
const usagePercentage = totalTokens / this.config.maxContextTokens;
const shouldCompact = usagePercentage >= this.config.compactionThreshold;
return {
totalTokens,
inputTokens: this.totalInputTokens,
outputTokens: this.totalOutputTokens,
maxContextTokens: this.config.maxContextTokens,
usagePercentage,
shouldCompact,
};
}
shouldCompact(): boolean {
return this.getStats().shouldCompact;
}
reset(): void {
this.totalInputTokens = 0;
this.totalOutputTokens = 0;
this.stepCount = 0;
this.currentContextTokens = 0;
this.lastInputTokens = 0;
}
/**
* Called after compaction to adjust token counts
* @param newInputTokens The token count of the compacted context
*/
afterCompaction(newInputTokens: number): void {
this.totalInputTokens = newInputTokens;
this.totalOutputTokens = 0;
this.stepCount = 1;
this.currentContextTokens = Math.max(0, Math.round(newInputTokens));
this.lastInputTokens = newInputTokens;
}
getConfig(): ContextConfig {
return { ...this.config };
}
}
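To make the suggestion concrete, here is a hypothetical driver loop for the tracker (not from the PR): feed each step's usage into updateUsage, log the stats, and compact once shouldCompact flips.

import type { LanguageModelUsage } from "ai";

const tracker = new ContextTracker({
  maxContextTokens: 128_000, // assumed window for the configured model
  compactionThreshold: 0.8,  // compact at 80% usage
});

function onStepFinish(usage: LanguageModelUsage): void {
  tracker.updateUsage(usage);
  const stats = tracker.getStats();
  console.debug(
    `context: ${stats.totalTokens}/${stats.maxContextTokens} tokens ` +
      `(${(stats.usagePercentage * 100).toFixed(1)}%)`
  );
  if (stats.shouldCompact) {
    // compactConversation() stands in for the PR's compaction routine and
    // should resolve to the token count of the compacted context:
    // compactConversation().then((tokens) => tracker.afterCompaction(tokens));
  }
}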
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6476bf824d
const onLine = (line: string) => {
  rl.removeListener("close", onClose);
  rl.pause();
  resolve(line);
Resume stdin before streaming to enable ESC abort
The new ESC abort handler depends on emitKeypressEvents(process.stdin), but readline.pause() also calls process.stdin.pause(), which stops data events (and thus keypress). Because stdin is paused after each line and never resumed before agent.chat, pressing ESC during streaming won't trigger agent.abort(), so the interrupt feature silently fails. Consider resuming stdin (or skipping the pause) before starting the streaming chat loop.
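A sketch of the fix Codex is suggesting (assumed wiring, not the PR's exact code): resume stdin before the streaming loop so keypress events keep flowing, and call agent.abort() on ESC.

import * as readline from "node:readline";

async function runTurn(agent: { chat: () => Promise<unknown>; abort: () => void }) {
  readline.emitKeypressEvents(process.stdin);
  if (process.stdin.isTTY) {
    process.stdin.setRawMode(true);
  }
  process.stdin.resume(); // undo the rl.pause() from the previous prompt

  const onKeypress = (_str: string, key: { name?: string } | undefined) => {
    if (key?.name === "escape") {
      agent.abort();
    }
  };
  process.stdin.on("keypress", onKeypress);
  try {
    await agent.chat();
  } finally {
    process.stdin.off("keypress", onKeypress);
  }
}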
…tool result clearing

BackgroundMemoryExtractor system (Claude Code's #1 missing feature):
- BackgroundMemoryExtractor class: periodic LLM-based memory extraction with configurable thresholds (token growth + turn count), single-flight guard, and getStructuredState() for compaction integration
- MemoryStore interface with InMemoryStore and FileMemoryStore impls
- Two built-in presets: CHAT_MEMORY_PRESET (user facts) and CODE_MEMORY_PRESET (Claude Code-style session notes)

Tool Result MicroCompact extension:
- clearToolResults option to replace old tool_result content
- keepRecentToolResults to preserve N most recent results
- clearableToolNames filter for selective clearing
- Complements existing assistant text shrinking

24 test files, 418 tests passing.
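The MemoryStore abstraction mentioned above might look roughly like the sketch below; the method names are assumptions for illustration, and only the in-memory variant is shown (the harness also ships a FileMemoryStore).

interface MemoryStore {
  load(): Promise<string | undefined>; // previously extracted memory, if any
  save(memory: string): Promise<void>; // persist the latest extraction
}

// In-memory implementation, suitable for tests and short-lived sessions.
class InMemoryStore implements MemoryStore {
  private memory: string | undefined;

  async load(): Promise<string | undefined> {
    return this.memory;
  }

  async save(memory: string): Promise<void> {
    this.memory = memory;
  }
}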
…on fixes, benchmark tasks (#94) * feat(minimal-agent): enable speculative compaction with tuned thresholds Tune compaction config for 2000-token context to support speculative compaction during multi-turn chatbot conversations: - Add speculativeStartRatio (0.75) to trigger background compaction at 750 tokens - Set explicit thresholdRatio (0.5) for blocking compaction at 1000 tokens - Lower reserveTokens from 500 to 400 (chatbot responses are shorter) - Lower keepRecentTokens from 500 to 350 (~3-4 turns preserved) - Add contextLimit to compaction config directly - Add benchmark script (benchmark.ts) for automated 30-turn memory retention testing with probe questions every 5 turns, metrics table, and ASCII context usage chart * feat(minimal-agent): upgrade to 4096 context for 82% memory retention Increase context budget from 2000 to 4096 tokens and retune thresholds to minimize compaction cycles: - contextLimit: 2000 → 4096 - reserveTokens: 400 → 512 - keepRecentTokens: 350 → 800 (~8-10 turns preserved) - thresholdRatio: 0.5 → 0.65 (blocking at 2662 tokens) - speculativeStartRatio: 0.75 → 0.8 (speculative at 2130 tokens) Benchmark results (30-turn chatbot, GLM-5): - Memory retention: 41% → 82% (14/17 probes passed) - Compaction cycles: 3 → 0 (all 30 turns fit in context) - All targeted probes (turns 5-20, 30) pass at 100% - Turn 25 comprehensive probe: 3/6 (model response quality, not context loss) * fix(minimal-agent): add temperature:0 to benchmark for reproducible results Set temperature to 0 in generateText calls to eliminate model nondeterminism. Verified 82% (14/17) retention at 4096 context is reproducible across runs. * feat(minimal-agent): add JSON output and visualization script Add --output flag to benchmark for JSON result export. Add visualize.py (matplotlib) that generates 4 charts from JSON results: - retention_curve: Memory retention % vs context size - token_usage: Context token usage over 30 turns - probe_heatmap: Per-probe recall scores - summary: 3-panel overview Usage: python3 visualize.py results/*.json --output charts/ * feat(minimal-agent): add multi-provider benchmark support (anthropic + friendli) Add --provider flag to benchmark supporting 'anthropic' and 'friendli'. Refactor callModel to accept LanguageModel interface for provider-agnostic benchmarking. Add @ai-sdk/anthropic dependency. Opus benchmark result at 4096 context: 94% (16/17) vs GLM-5's 82% (14/17). Key difference: Turn 25 comprehensive recall 6/6 (vs 3/6 with GLM-5), confirming the 82% ceiling was model response quality, not compaction. * feat(minimal-agent): optimize system prompt and compaction prompt for chatbot Replace generic system prompt with fact-retention-aware version that instructs the model to remember personal information and list ALL known facts when asked to recall. Replace code-agent-oriented compaction prompt (Files & Changes, Technical Discoveries) with chatbot-specific version that prioritizes: - User Profile extraction (all personal details as bullet points) - Conversation Highlights (topics, advice, decisions) - Current Topic (for continuity) Impact on memory retention (GLM-5): - 2000 tokens: 53% → 71% (+18pp, compaction preserves user facts) - 4096 tokens: 82% → 82% (no change, compaction not triggered) * feat(minimal-agent): extend benchmark to 50 turns with baseline comparison Extend conversation from 30 to 50 turns (10 probes total) to force compaction at 4096 context. Add --baseline flag to benchmark for A/B testing against the default code-agent compaction prompt. 
Key result at 4096 context, 50 turns: - Chatbot prompt: Turn 35 (post-compaction) scores 4/4 on pet recall - Baseline prompt: Turn 35 scores 0/4 — pet info lost in compaction - Overall: 54% vs 51% (chatbot vs baseline) * feat(minimal-agent): apply 4 compaction techniques from Claude Code analysis Upgrade CHATBOT_COMPACTION_PROMPT with techniques learned from Claude Code: 1. Analysis scratchpad: <analysis> block for think-before-summarize (harness already strips it, only <summary> content is kept) 2. All User Messages list: explicit section preserving user intent trail 3. Previous-summary fact preservation: carry forward ALL facts from prior compaction, never drop information across cycles 4. Partial compact awareness: focus summary on older messages since recent ones are preserved separately via keepRecentTokens 50-turn benchmark at 4096 context (GLM-5): - Before: 54% (20/37) - After: 62% (23/37) — +8pp improvement - Turn 40 (post-compaction recall): 2/4 → 4/4 (perfect) * feat(harness): add Circuit Breaker, MicroCompact, and Session Memory Three new shared compaction modules inspired by Claude Code's context management architecture: 1. CompactionCircuitBreaker (compaction-circuit-breaker.ts) - Tracks consecutive compaction failures, opens after 3 (configurable) - Auto-closes after cooldown period (default 60s) - Prevents infinite retry loops on irrecoverable context overflow 2. microCompactMessages (micro-compact.ts) - Pre-compaction step that shrinks old long assistant responses - Protects recent messages (configurable token window) - Preserves user messages and summary messages - Immutable: returns new array without modifying input 3. SessionMemoryTracker (session-memory.ts) - Structured key-value memory persisting across compaction cycles - Categorized facts: identity, preferences, relationships, context - getStructuredState() callback for CompactionConfig integration - extractFactsFromSummary() to parse User Profile from summaries - JSON serialization for persistence (toJSON/fromJSON) All modules exported from @ai-sdk-tool/harness. 21 test files, 399 tests passing. 
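For reference, the tuned 4096-token configuration described in the benchmark commits above can be summarized as a single config object. The field names and values come from the commit messages, but the exact CompactionConfig shape is an assumption.

const compactionConfig = {
  contextLimit: 4096,         // total token budget for the minimal agent
  reserveTokens: 512,         // head-room reserved for the next response
  keepRecentTokens: 800,      // roughly 8-10 recent turns preserved verbatim
  thresholdRatio: 0.65,       // blocking compaction around 2662 tokens
  speculativeStartRatio: 0.8, // speculative compaction around 2130 tokens
};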
* feat(harness,cea,tui,headless,minimal-agent): wire Phase 2 integration Connect all 3 Phase 1 modules into the compaction pipeline: Circuit Breaker → CompactionOrchestrator: - New optional circuitBreaker param in constructor - checkAndCompact() skips when circuit is open - recordSuccess/recordFailure on compaction outcome - manualCompact() ignores circuit breaker (user intent) - getState() exposes circuitBreakerOpen status MicroCompact → CheckpointHistory: - New microCompact option in CompactionConfig (boolean or options) - Pre-compaction step: shrinks old assistant responses before summarization - Reduces summarizer input tokens → better summary quality - COMPACTION_DEBUG logging for tokensSaved/messagesModified Session Memory → minimal-agent: - SessionMemoryTracker instance wired via getStructuredState callback - extractFactsFromSummary called on compaction completion - Structured user profile injected into every compaction prompt TUI + Headless compactionCallbacks: - Both runners now accept compactionCallbacks in config - Chains external callbacks with internal ones (both fire) - Enables minimal-agent to hook into compaction lifecycle Adaptive Thresholds → harness (from CEA): - Moved computeAdaptiveThresholdRatio, computeCompactionMaxTokens, computeSpeculativeStartRatio from CEA to harness/compaction-policy - CEA now imports from harness (backwards-compatible re-exports) * feat(minimal-agent): activate CircuitBreaker + MicroCompact Enable all 3 harness compaction features in minimal-agent: - CircuitBreaker: passed to TUI/headless via new config option - MicroCompact: enabled via microCompact: true in compaction config - SessionMemory: already wired (getStructuredState + extractFactsFromSummary) Also expose circuitBreaker option in TUI and headless runner configs so any consuming agent can pass one through. * feat(harness): add BackgroundMemoryExtractor, MemoryStore, presets + tool result clearing BackgroundMemoryExtractor system (Claude Code's #1 missing feature): - BackgroundMemoryExtractor class: periodic LLM-based memory extraction with configurable thresholds (token growth + turn count), single-flight guard, and getStructuredState() for compaction integration - MemoryStore interface with InMemoryStore and FileMemoryStore impls - Two built-in presets: CHAT_MEMORY_PRESET (user facts) and CODE_MEMORY_PRESET (Claude Code-style session notes) Tool Result MicroCompact extension: - clearToolResults option to replace old tool_result content - keepRecentToolResults to preserve N most recent results - clearableToolNames filter for selective clearing - Complements existing assistant text shrinking 24 test files, 418 tests passing. 
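The CompactionCircuitBreaker described in the commits above (open after 3 consecutive failures, auto-close after a cooldown, recordSuccess/recordFailure on each outcome) reduces to roughly the following; this is a simplified sketch, not the harness implementation.

class CompactionCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures = 3,
    private readonly cooldownMs = 60_000
  ) {}

  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      // Cooldown expired: close the circuit and allow compaction again.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.maxFailures) {
      this.openedAt = now;
    }
  }
}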
* feat: wire BackgroundMemoryExtractor + tool result clearing into all agents Integration of new harness modules into consuming packages: minimal-agent: - Replace SessionMemoryTracker with BackgroundMemoryExtractor (chat preset) - Aggressive thresholds for small context (300 tokens, 2 turns) - Fire-and-forget onTurnComplete for non-blocking extraction benchmark: - Same BME integration as agent for fair comparison - Each turn triggers extraction check CEA: - Enable tool result clearing: microCompact.clearToolResults = true - Keep 5 most recent tool results intact TUI + headless: - New onTurnComplete callback in config interface - Called after each model turn with messages + usage - Non-blocking: doesn't delay main agent loop * fix(harness): prevent BME from injecting empty template into compaction Fix BackgroundMemoryExtractor returning template text via getStructuredState before any extraction has occurred. This wasted tokens in small contexts (2000 tokens: 65% → 46% regression). Changes: - getStructuredState returns undefined until first successful extraction - Raise default thresholds: minTokenGrowth 300→500, minTurns 2→5 - Cap maxExtractionTokens at 500 for chat preset - Update test expectation for pre-extraction state * fix(minimal-agent): revert to SessionMemoryTracker, BME hurts small models BackgroundMemoryExtractor degraded retention with GLM-5 at all context sizes (2k: 65%→54%, 4k: 59%→57%). Root cause: GLM-5 produces poor quality memory extractions, and the extraction overhead wastes context. Revert minimal-agent to SessionMemoryTracker which extracts facts from compaction summaries (zero overhead, no extra LLM calls). BME remains in harness library for larger model agents (CEA with Claude) where extraction quality justifies the overhead. * fix: sync benchmark with agent config, wire CEA circuit breaker Oracle verification fixes: 1. benchmark.ts: Replace BME with SessionMemoryTracker to match index.ts - Add microCompact: true to benchmark compaction config - Add extractFactsFromSummary on compaction complete 2. CEA main.ts: Add CompactionCircuitBreaker to orchestrator - Prevents infinite compaction retry loops in production * chore: save verified benchmark artifacts (2k: 38%, 4k: 65%) Final benchmark results with synced config (SessionMemoryTracker + MicroCompact + CircuitBreaker + chatbot compaction prompts): - 2000 tokens: 38% (14/37), 3 compactions - 4096 tokens: 65% (24/37), 2 compactions These are the definitive results for the current configuration. * feat(harness,minimal-agent): real-time fact extraction from user messages Add extractFactsFromUserMessage() to SessionMemoryTracker — parses user messages for personal facts using 16 regex patterns (name, job, location, pets, family, favorites, age, etc.) with zero LLM overhead. Previously memory was empty until AFTER first compaction. Now facts are extracted on EVERY user message, so getStructuredState() provides useful context from the very first compaction. Wire into minimal-agent via onTurnComplete hook — every turn parses all user messages for facts. Also applied in benchmark.ts for fair testing. 
* fix(harness): improve fact extraction patterns - Fix name extraction capturing trailing words ('Alice and' → 'Alice') - Add pet keyword prefix for 'I have a X named Y' pattern - Add adopted/just adopted to pet detection - Add family member patterns (my sister/brother/partner X) - Add pet-related keywords to relationship category - Use top-level regex constants to avoid per-call allocation * feat(minimal-agent): extend benchmark to 80 turns, verify 4096 compaction 80-turn benchmark forces compaction at 4096 context (peak 2959 tokens). Result: 71% retention (44/62 probes), 1 compaction cycle. Turn 80 comprehensive recall scores 8/10 — remembers name, job, city, both pets, partner, sister, food, and programming language after 80 turns and compaction. Real-time fact extraction via extractFactsFromUserMessage provides structured memory to compaction prompt, preserving user identity across compaction cycles. * feat(harness): add computeContextBudget and getContextPressureLevel Close gaps 7-9 from Claude Code comparison: - computeContextBudget(): calculates effective context window by reserving tokens for compaction output (10% of context, max 20K, min 500) - ContextBudget type: autoCompactAt, warningAt, hardLimitAt, speculativeStartAt - getContextPressureLevel(): returns normal/elevated/warning/critical based on current token usage vs budget thresholds This matches Claude Code's approach of reserving tokens for the compaction API call itself (p99=17.3K measured) rather than using the raw context limit. * feat(harness): close all 12 gaps vs Claude Code context management Close every identified gap from the Claude Code comparison: Gap 1: Session Memory Compaction path - compact() checks getStructuredState() FIRST, uses it directly if available - Skips LLM summarizeFn call entirely when session memory exists - CompactionResult.compactionMethod indicates which path was used Gap 2: API Context Management (api-context-management.ts) - Provider-agnostic ContextManagementConfig interface - buildContextManagementConfig() with trigger/keep thresholds - isContextManagementSupported() helper for provider detection Gap 3: Context Collapse (context-collapse.ts) - collapseConsecutiveOps() groups sequential read/search tool results - Replaces content with '[Collapsed: N file reads]' summaries - Preserves tool_use/tool_result structure, protects recent messages Gap 4+5: Context Analysis + Suggestions - analyzeContextTokens(): per-role breakdown, tool stats, duplicate detection - generateContextSuggestions(): warnings at 80%+, tool optimization hints Gap 6: Tool Pair Validation (tool-pair-validation.ts) - adjustSplitIndexForToolPairs() prevents orphaned tool_result blocks - Integrated into CheckpointHistory split calculation Gap 7-9: Context Budget (compaction-policy.ts) - computeContextBudget(): effective window with compaction output reserve - getContextPressureLevel(): normal/elevated/warning/critical Gap 10: Circuit Breaker session scope - resetForNewSession() method - cooldownMs=0 mode for session-scoped behavior (no auto-recovery) Gap 11: Partial Compaction bidirectional - compactionDirection: 'keep-recent' | 'keep-prefix' in CompactionConfig - keep-prefix: preserves old messages, summarizes recent (cache-friendly) Gap 12: Post-Compact Restoration (post-compact-restoration.ts) - PostCompactRestorer: tracks files/skills, builds restoration message - Priority-based selection within token budget 30 test files, 457 tests, 5 packages passing. 
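The context-budget math described in gaps 7-9 above (reserve 10% of the window for the compaction call, clamped between 500 and 20K tokens) could be sketched as follows. The threshold ratios in the sketch are illustrative placeholders, since the harness derives them from its own config.

interface ContextBudget {
  hardLimitAt: number;
  autoCompactAt: number;
  warningAt: number;
  speculativeStartAt: number;
}

function computeContextBudget(contextWindow: number): ContextBudget {
  const reserve = Math.min(20_000, Math.max(500, Math.round(contextWindow * 0.1)));
  const effective = contextWindow - reserve;
  return {
    hardLimitAt: effective,
    autoCompactAt: Math.round(effective * 0.8),      // illustrative ratio
    warningAt: Math.round(effective * 0.7),          // illustrative ratio
    speculativeStartAt: Math.round(effective * 0.6), // illustrative ratio
  };
}

function getContextPressureLevel(
  tokens: number,
  budget: ContextBudget
): "normal" | "elevated" | "warning" | "critical" {
  if (tokens >= budget.hardLimitAt) return "critical";
  if (tokens >= budget.warningAt) return "warning";
  if (tokens >= budget.speculativeStartAt) return "elevated";
  return "normal";
}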
* fix: wire all gap modules into actual runtime execution paths Oracle verification found most gap modules were export+test only with no runtime callers. Wire every one into the hot path: 1. computeContextBudget → CompactionOrchestrator + CheckpointHistory threshold decisions now use effectiveContextWindow (raw - reserve) 2. analyzeContextTokens + generateContextSuggestions → TUI footer shows pressure level with color coding + optimization suggestions 3. collapseConsecutiveOps → checkpoint-history compact() pipeline runs before microCompact as pre-compaction step 4. PostCompactRestorer → minimal-agent + CEA compaction callbacks injects restoration message after successful compaction 5. resetForNewSession → called on new-session in both agents 6. adjustSplitIndexForToolPairs → keep-prefix compaction path both directions now have tool pair safety All modules are now LIVE in production code paths, not just tests. * chore: save final benchmark after all 12 gaps closed 80-turn benchmark with all Claude Code parity features active: - 2000 tokens: 60% (37/62), 0 compactions — context collapse + microCompact keep 80 turns under threshold without triggering compaction - 4096 tokens: 58% (36/62), 0 compactions — same effect at larger context * docs: add Claude Code parity matrix and benchmark results Add CONTEXT-MANAGEMENT-PARITY.md with feature-by-feature comparison showing all 12 gaps, implementation status, and runtime wiring status. Notes clarify Gap 2 (provider adapter needed) and Gap 12 (CEA-specific). Add BENCHMARK-RESULTS.md with progression from baseline (53%) through prompt optimization (62%), fact extraction (59%), to final state (60%) with reproduction commands. * refactor: remove dead api-context-management, wire BME into CEA Delete api-context-management.ts — provider-specific (Anthropic only), no runtime callers, dead code. Keep BackgroundMemoryExtractor and wire into CEA: - AgentManager creates BME with code preset in buildCompactionConfig() - Combines BME's getStructuredState() with file tracking state - onTurnComplete fires BME extraction in both headless and TUI paths - CEA uses 200K context models where BME extraction quality is high BME stays out of minimal-agent (GLM-5 too small for quality extraction). 29 test files, 450 tests passing. * feat(minimal-agent): add --bme flag to benchmark for BME A/B testing Enables BackgroundMemoryExtractor in benchmark when --bme is passed. Uses chat preset with 1000 token growth / 5 turn threshold. Replaces SessionMemoryTracker's getStructuredState with BME's. Calls BME.onTurnComplete after each model response. * feat: close final 7 gaps — round grouping, time MC, file persistence, skills, /compact, incremental BME 1. API Round Grouping — compaction split adjusted to assistant→user boundaries (within 20% distance limit), applied in both directions 2. Time-based MicroCompact — clearOlderThanMs option triggers tool result clearing based on message timestamp 3. DISABLE_AUTO_COMPACT=1 — env var skips auto compaction and speculative start while manual /compact still works 4. FileMemoryStore for BME — CEA session memory persisted to .plugsuits/sessions/{id}/session-memory.md, survives restart 5. Skill re-injection — SkillsEngine load listener tracks skills in PostCompactRestorer (priority 8), re-injected after compaction 6. /compact command — CommandAction extended with 'compact' type, TUI handles via manualCompact() on orchestrator 7. 
BME incremental updates — section-level <update> tags parsed and merged instead of full overwrite, only recent messages sent for extraction (lastExtractionMessageIndex tracking) GitHub issue #95 created for plan file re-attachment (blocked by missing plan system). 29 test files, 461 tests, 5 packages passing. * fix: apply 5 missing config items from audit 1. CEA: clearOlderThanMs: 3_600_000 (60min time-based MC) 2. MA: add /compact command to LOCAL_COMMANDS 3. MA: remove unused PostCompactRestorer (no tools to track) 4. CEA: add pressure level labels to footer ([elevated]/[WARNING]/[CRITICAL]) 5. Both: explicit compactionDirection: 'keep-recent' * fix: 3 bugs found by Oracle verification 1. CEA circuitBreaker was created but never passed to runHeadless/createAgentTUI → Now passed via circuitBreaker config option in both paths 2. MA SessionMemoryTracker not cleared on new-session → Added sessionMemoryTracker.clear() in new-session handler 3. CEA footer budget used default params instead of actual config → formatContextUsage now accepts optional reserveTokens/thresholdRatio * feat: boundary-aware SM compaction + attachment-based restoration Two major Claude Code parity upgrades: 1. Session Memory Compaction is now boundary-aware: - Tracks lastExtractionMessageIndex from BME - Keep-window rules: minKeepTokens (2000), minKeepMessages (3), maxKeepTokens (40% of context) - SM summary replaces ONLY covered messages, recent uncovered messages kept verbatim alongside the summary - adjustSplitIndexForToolPairs applied to keep boundary - CEA passes getLastExtractionMessageIndex to compaction config 2. PostCompactRestorer upgraded to attachment-based: - filterAgainstKeptMessages() deduplicates against kept context - Per-item truncation (80% of maxItemTokens + [... truncated]) - Structured XML-like tags: <restored-file>, <restored-skill> - buildRestorationMessages() returns proper message format Also: fix DISABLE_AUTO_COMPACT tests using vi.hoisted() for env mock 29 test files, 466 tests, 5 packages passing. 
* feat(headless): redesign TrajectoryEvent types for ATIF-v1.6 native compat * fix(headless): remove sessionId from ErrorEvent, fix step_id sequencing * test(headless): update existing tests for ATIF-v1.6 event format * docs(headless): update event protocol docs for ATIF-v1.6 * feat(headless): emit compaction lifecycle events via emitEvent * test(headless): add comprehensive ATIF-v1.6 event type tests Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) * feat(headless): add --atif mode with trajectory.json generation + update exports * feat(benchmark): rewrite harbor_agent.py as ultra-thin shell * test(benchmark): add ATIF trajectory validation test * feat(benchmark): add compaction stress benchmark tasks * feat(benchmark): add trajectory analysis scorer * feat(benchmark): add search-heavy compaction stress task (32K context) * refactor(benchmark): restructure tasks to Harbor v1.1 format * fix(benchmark): install Node.js in agent setup when base image lacks it * fix(benchmark): cd to /agent before running node to resolve tsx * fix(benchmark): clone plugsuits repo with current branch in Docker install * fix(benchmark): resolve Docker path traversal and trajectory.json path issues * improve(agent): strengthen path handling guidance in system prompt * fix(benchmark): cd to /agent in test.sh to match agent CWD for relative paths * fix(harness): shorten context suggestion messages to prevent TUI truncation * chore: add work/ to gitignore (benchmark test artifacts) * fix(tui): disable context suggestions in footer by default (opt-in via CONTEXT_SUGGESTIONS=1) * refactor(tui): replace process.env with config option for context suggestions, update AGENTS.md with t3-env rule * revert(tui): remove context suggestions from footer, restore v2.2.0 footer behavior * fix(tui): pass compaction callbacks via CompactionOrchestratorOptions.callbacks to fix lost callbacks The CompactionOrchestrator constructor detects circuitBreaker field and treats the second argument as CompactionOrchestratorOptions, expecting callbacks inside a nested 'callbacks' property. Previously, callbacks were spread at the top level alongside circuitBreaker, causing isCompactionOrchestratorOptions to match and extract value.callbacks (undefined). This broke: onApplied (no notice), onBlockingChange (no spinner), onJobStatus (no background indicator), and all other compaction callbacks. * fix(tui): use estimated tokens in onApplied notice to avoid stale actualUsage race * fix(tui): always run compactBeforeNextTurnIfNeeded regardless of probe success * test(harness): update tests for refreshEstimatedUsage (actualUsage never null) * fix(harness): never null actualUsage — refreshEstimatedUsage after every message mutation Replace all 7 instances of 'this.actualUsage = null' with 'this.refreshEstimatedUsage()' which computes getEstimatedTokens() + systemPromptTokens and sets actualUsage to this value. The next API probe (measureUsage) will correct it to the exact value. This eliminates the window where getCurrentUsageTokens() returns a stale or inconsistent value after compact/addMessage/clear, which caused: inaccurate footer display, stale onApplied notice values, and hard limit checks seeing wrong token counts. * fix(harness): truncate tool results when addModelMessages exceeds context budget After adding model messages, if estimated token usage exceeds the compaction threshold (contextLimit * thresholdRatio), the largest tool-result parts are progressively truncated until usage is within budget. 
This prevents context from ever exceeding the limit between compaction cycles. * test(harness): adjust tests for tool result truncation on context overflow * fix(harness): trigger tool result truncation after updateActualUsage for accurate enforcement The estimated token count underestimates actual usage by 70-90% for tool results (6 chars/token estimate vs ~3 chars/token reality). By also triggering truncateToolResultsIfOverBudget after updateActualUsage sets the real token count from the API, the truncation now operates on accurate data instead of estimates. * fix(headless): use local default for ATIF output path, mkdir -p before write Default ATIF_OUTPUT_PATH changed from /logs/agent/trajectory.json (Docker-only) to trajectory.json (works locally). Also mkdirSync the parent directory before writing to prevent ENOENT. * fix(harness): raise truncation ceiling to 90% of context limit to prevent garbled context * fix(harness): preserve tool-result output structure during truncation When truncating tool results for context budget, the output field was replaced with a plain string. If the original output was an object ({ type, value } or { text }), this broke the Vercel AI SDK message schema validation causing InvalidPromptError on the next API call. Now truncation mutates value/text fields inside the object instead of replacing the entire output. * fix(cea): dynamic tool output budget based on remaining context tokens Before each agent.stream() call, compute remaining context tokens and pass to setContextBudgetForTools(). Tool output truncation limits are now min(defaultLimit, remainingBudget/2) instead of a fixed 32KB. This prevents parallel tool calls from collectively exceeding the context limit. * revert(cea): remove dynamic tool output budget — tools must return consistent results Tool behavior should not change based on remaining context. Context enforcement is a system concern (history truncation + compaction), not a tool concern. * fix: four compaction and ATIF correctness bugs P1-1: Increment step_id only after processStream succeeds. Previously, step_id was consumed before processStream ran, so retries after mid-stream failures (NoOutputGenerated, context overflow) would skip a number, breaking sequential ATIF step_id validation. P1-2: Cap restoration payload to active context window. The post- compaction restorer now limits total tokens to 50% of remaining context budget (capped at 50K), preventing restoration from undoing compaction savings on small context windows. P2-1: Ignore duplicate-suppressed read_file outputs when caching restoration data. Suppression notices and truncation markers are now filtered out so they don't overwrite real file contents in the restoration cache. P2-2: Filter restoration items against messages that survived compaction. handleCompactionComplete now calls filterAgainstKeptMessages before building the restoration message, preventing duplicate injection of content that already exists in the post-compaction history. * chore: remove debug fetch interceptor script * fix: three compaction callback and restoration bugs P1: Wrap headless compaction callbacks in callbacks: {} option. Same issue as the TUI fix (bd91257) — circuitBreaker property causes isCompactionOrchestratorOptions to match, dropping all flattened callbacks. Headless sessions now emit ATIF compaction events and fire handleCompactionComplete for restoration. P2-1: Use getActiveMessages() instead of getAll() for restoration filtering. 
getAll() includes summarized-away messages that the model can no longer see, causing filterAgainstKeptMessages to incorrectly mark all tracked items as 'already kept'. Exposed getActiveMessages() as public on CheckpointHistory. P2-2: Restore hasExtractedAtLeastOnce when reopening existing session-memory.md. Without this flag, getStructuredState() returns undefined after mid-session config rebuild, forcing compaction to fall back to LLM summarization until a new extraction completes. * fix(harness): immutable tool result truncation with consistent inner field text extraction - Clone tool-result parts and content arrays before truncation to prevent mutation of previously exposed message snapshots - Extract inner field text (.value/.text) consistently in both collectToolResultEntries and truncateSingleToolResult so charsToFree math operates on the same text basis as token estimates - Invalidate actualUsage when systemPromptTokens changes to prevent stale usage data from masking the new system prompt cost - Add immutability tests verifying prior snapshots remain unmodified after truncation triggers * fix(harness): use boundary-based label matching in post-compact restoration - Replace naive substring includes() with textContainsLabel() that checks word boundaries before and after the match, preventing false positives like 'index.ts' matching inside 'index.tsx' - Handle dot and hyphen as continuation chars only when followed/preceded by a word char, so 'file.ts.' at end-of-sentence still matches correctly - Add setMaxTotalTokens dynamic budget tests - Add boundary matching tests for .ts/.tsx distinction, hyphenated labels, and trailing punctuation edge cases * fix(harness): auto-reset circuit breaker on cooldown expiry and track non-benign failures - Reset circuit breaker state when cooldown period has expired instead of staying open indefinitely until manual reset - Classify benign compaction failure reasons (disabled, no messages, etc.) and only record actual failures in the circuit breaker to prevent false-positive tripping from expected no-op compaction results * fix(harness): fallback to message token estimation when usage reports zero - When resolveUsageTokens returns 0, estimate tokens from the last message to avoid stalling extraction triggers indefinitely - Add updateModel() method to allow callers to swap the underlying model without recreating the entire extractor instance * fix(cea): reuse BackgroundMemoryExtractor across agent rebuilds and tune restoration budget - Cache and reuse BME instance when the store path hasn't changed, calling updateModel() instead of recreating to preserve extraction state across model/provider switches - Use conservative 0.3 restoration budget ratio when context usage source is 'estimated' to avoid over-allocating from inaccurate data * chore(cea): update benchmark event parsing for step events and default to main branch - Parse 'step' event type with source='agent' instead of legacy 'assistant' type to match current headless JSONL output format - Switch default AGENT_BRANCH from feature branch to main * feat(harness): prevent infinite compaction loops with per-turn cap and task-aware summaries Small context limits (e.g. 32k) could enter an infinite compaction loop when a user asked for broad codebase exploration: each compaction reclaimed tokens, tool calls refilled the context, and the cycle repeated until the process stalled or a blocking compaction fired at the hard limit. 
Changes: - Add per-turn cap (maxAcceptedCompactionsPerTurn, default 10) that combines accepted + ineffective compactions. When the cap is hit, no further compaction runs this turn. - Relax the acceptance gate so only fitsBudget failures reject a compaction attempt; belowTriggerThreshold/meetsMinSavings are kept as observability signals but no longer block compaction. - Track turn boundaries via notifyNewUserTurn() wired from TUI and headless runtime so the per-turn cap resets on each user turn. - Add opt-in task-aware 2-step compaction: extract the current user turn's task intent before summarizing history, then include the intent in the compacted user-turn content. Enabled in CEA (taskAwareCompaction: true) to preserve the work context and stop compaction from erasing concrete task details. - Fix CompactionCircuitBreaker.getState() consistency: extract tryTransitionToHalfOpen() so a cooldown-triggered reset doesn't mix pre-reset failures with post-reset nulls in the snapshot. - Fix isCompactionOrchestratorOptions guard missing the new maxAcceptedCompactionsPerTurn key. Includes compaction-loop-prevention.test.ts (21 tests) and compaction-integration.test.ts (7 scenarios simulating 32k-context investigations, verbose usage, and multi-turn flows) that previously produced blocking compactions and now complete with 0 blocking events. * fix(harness): silence unhandled rejections on createAgent stream promises When streamText() rejects its internal DelayedPromise fields (for example with NoOutputGeneratedError after an empty provider stream), the totalUsage promise was never awaited by downstream consumers and caused a process-level unhandledRejection crash in CEA's dev runtime. createAgent.stream() eagerly invokes four getters (finishReason, response, usage, totalUsage) to populate the AgentStreamResult. Vercel AI SDK's DelayedPromise materializes _promise on first getter call, so all four promise instances exist by the time flush() tries to reject them. Production consumers (TUI, headless, CEA wrapper) only await response, finishReason, and usage, leaving totalUsage as a floating rejected promise. Attach no-op rejection handlers (.then(undefined, swallow)) to all four promise fields before returning. The original promise instances are returned unchanged, so consumers awaiting them still receive rejections normally - the silencers only prevent Node's unhandledRejection escalation when a consumer does not await a given field. Used .then(undefined, fn) instead of .catch(fn) because the SDK types the fields as PromiseLike<T>, which does not expose .catch() at the type level. Adds per-field isolation regression tests (4 tests) plus a combined test verifying that: - Zero unhandled rejections fire when each field independently rejects - Rejections still propagate to any caller that awaits the field Mutation-verified: removing any single guard causes the corresponding per-field isolation test to fail, proving each guard is independently necessary. * fix(cea): guard continuation wrapper promise fan-out from unhandled rejections buildAgentStreamWithTodoContinuation wraps stream.finishReason in an async IIFE and then derives a new finishReason promise via .then() for the returned RunnableAgent result. When the base stream rejects, these wrapper promises form independent chains: continuationDecision, response, and the derived finishReason. Callers using Promise.all short-circuit on the first rejection, leaving the other branches unawaited and producing floating unhandled rejections. 
Attach no-op .catch() guards to the three wrapper-created promises while still returning the same instances, so consumers who do await them still receive rejections. Defense-in-depth alongside the harness createAgent silencer fix. * chore(changeset): bump plugsuits to minor for compaction loop prevention feature Harness/tui/headless remain patch since the public API additions are internal enhancements. The user-facing feature set (task-aware compaction, per-turn cap) is surfaced through CEA's opt-in configuration, so plugsuits (CEA) is the appropriate package for the minor version bump. * chore(minimal-agent): add trailing newlines to benchmark chart JSON files ultracite formatter requires trailing newlines on JSON files; CI lint was failing on 10 chart files missing them. * chore(benchmark): replace realistic fake credentials with explicit fixture markers The compaction-stress-search benchmark task seeds a fake codebase for the agent to explore. Two placeholder credentials looked realistic enough to trigger GitGuardian's secret scanner: - JWT_SECRET = "super-secret-jwt-key-2024-prod" - ADMIN_DEFAULT_PASSWORD = "admin123!@#" Replace with BENCHMARK_FIXTURE_FAKE_* markers that make the fixture nature obvious to secret scanners and future readers. * Write README and fix compaction edge cases and benchmark stability * fix: address PR #94 review feedback — deduplicate compaction config, add input validation - Remove redundant setContextLimit() calls in minimal-agent index.ts and benchmark.ts (CheckpointHistory constructor already sets contextLimit from compaction config) - Extract shared compaction constants into compaction-config.ts to prevent silent drift - Add --context-limit positive integer validation in benchmark CLI - Add --provider allowlist validation (friendli | anthropic) in benchmark CLI - Normalize thresholdRatio in computeContextBudget() to guard against invalid values * fix: address PR #94 review feedback — robustness, docs, and dependency fixes - install-agent.sh.j2: replace curl pipe with download-then-execute to avoid masking failures - compact.ts: wrap compact() in try/catch to propagate failure status - main.ts: use logical OR for ATIF_OUTPUT_PATH to handle empty strings - cea-memory-bench.sh: remove || true that swallows benchmark failures, fix grep double-zero - package.json: move @ai-sdk/anthropic to devDependencies (benchmark-only) - BENCHMARK-RESULTS.md: fix probe count (16 → 17) - compaction-orchestrator.ts: fix JSDoc default (3 → 10) - compaction-types.ts: document currently-unused rejection reason variants - benchmark/AGENTS.md: fix uniq -c expected output format * fix: update lockfile for @ai-sdk/anthropic devDependency move * fix: update compact command test to match new success message * fix: address PR #94 round-2 review feedback - checkpoint-history: scope collectToolResultEntries to active messages only - compaction-types: fix CompactionEffectiveness doc to match actual behavior - system-prompt: clarify path rule exception for generated scripts - minimal-agent: guard onTurnComplete slice against compaction-shrunk history * fix: address PR #94 round-3 review feedback - scorer.py: guard against divide-by-zero when total_prompt_tokens is 0 - env.ts: add ATIF_OUTPUT_PATH to validated env schema - main.ts: read ATIF_OUTPUT_PATH from env instead of process.env directly
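The unhandled-rejection fix described a few commits above boils down to attaching no-op rejection handlers to the stream's promise fields without replacing them. A minimal sketch of that pattern, with the field names taken from the commit message and the stream type assumed:

function silenceUnawaitedRejections(stream: {
  finishReason: PromiseLike<unknown>;
  response: PromiseLike<unknown>;
  usage: PromiseLike<unknown>;
  totalUsage: PromiseLike<unknown>;
}): void {
  const swallow = () => {
    // Intentionally empty: a consumer that awaits the field still sees the rejection.
  };
  // .then(undefined, fn) rather than .catch(fn) because PromiseLike<T> does not
  // expose .catch() at the type level.
  stream.finishReason.then(undefined, swallow);
  stream.response.then(undefined, swallow);
  stream.usage.then(undefined, swallow);
  stream.totalUsage.then(undefined, swallow);
}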