diff --git a/website/blog/2026-04-16-when-session-data-lies.md b/website/blog/2026-04-16-when-session-data-lies.md new file mode 100644 index 0000000..f8d490f --- /dev/null +++ b/website/blog/2026-04-16-when-session-data-lies.md @@ -0,0 +1,97 @@ +--- +slug: /2026-04-16-when-session-data-lies +canonical_url: https://dfberry.github.io/blog/2026-04-16-when-session-data-lies +custom_edit_url: null +sidebar_label: "2026-04-16 When session data lies" +title: "When Session Data Lies: Knowing What to Ignore in Agent Memory" +description: "Session history is powerful input for AI agents — until it isn't. Here's when to distrust it, filter it, or throw it out entirely." +published: false +tags: + - GitHub Copilot + - AI agents + - session management + - developer workflow + - Copilot CLI +keywords: + - copilot cli session trust + - ai agent adversarial input + - session data quality + - agent memory pitfalls +updated: 2026-04-16 00:00 PST +--- + +# When Session Data Lies: Knowing What to Ignore in Agent Memory + + + +> Companion post to [Exploring Copilot CLI Session Management to Improve Squad](/blog/2026-04-15-session-storage-decision-guide). That post was about what you can *gain* from session data. This one is about what you should *ignore*. + +## The Setup + +In the previous post, I argued that Copilot session data is underused telemetry — agents could mine it for tool failure rates, developer preferences, and intent-vs-outcome drift. All true. But there's a flip side: **not all session data is signal.** Some of it is noise, some is stale, and some is actively dangerous to trust. + +If you're building an agent that learns from session history, you need a filter — not just a firehose. + +## Outline + +### 1. Adversarial Strings in Session History + +- Users (and other agents) can put anything into a session — including prompt injection attempts, test payloads, and deliberately misleading instructions +- If an agent mines session transcripts to extract patterns or skills, it could ingest adversarial content as "learned behavior" +- Example: a session where someone tested SQL injection patterns — an agent that learns "the user frequently writes SQL like this" would draw exactly the wrong conclusion +- **Mitigation ideas:** Sanitization layers, treating session-mined suggestions as untrusted input (same as user input), requiring human confirmation before encoding patterns into skills or charters + +### 2. Stale Context: When the Codebase Has Moved On + +- Session data reflects the codebase *at the time of the session* — file paths change, APIs get refactored, dependencies upgrade +- An agent that says "last time you worked on this file, you used pattern X" might be referencing code that no longer exists +- The older the session, the less reliable the context +- **Mitigation ideas:** Weight recent sessions heavily, cross-reference session suggestions against current file state, expire stale session references automatically + +### 3. 
Reviews Without Session Context + +- A code reviewer looking at a PR doesn't have access to the session that produced it — they see the *output* but not the *reasoning* +- If an agent surfaces session context during review ("the author tried three approaches before landing on this one"), it could bias the reviewer toward accepting suboptimal code +- Conversely, *lacking* session context means reviewers might reject valid decisions they don't understand +- **The tension:** Session context can help or hurt reviews depending on when and how it's surfaced +- **Mitigation ideas:** Separate "why was this approach chosen" (useful) from "how many attempts did it take" (biasing). Let the author opt in to sharing reasoning, not the agent. + +### 4. Confirmation Bias from Past Sessions + +- If an agent sees you've done something the same way five times, it assumes that's your preference — even if you were wrong all five times +- Session history reinforces existing patterns, including bad ones +- **Example:** You always manually configure auth instead of using the framework's built-in auth. The agent learns this as a preference and keeps suggesting manual auth, entrenching a mistake. +- **Mitigation ideas:** Distinguish frequency from correctness, surface alternative approaches alongside learned patterns, flag patterns that contradict framework best practices + +### 5. Multi-User Confusion + +- Squad is a team tool — multiple people (and agents) contribute to the same repo +- If session data from different users gets blended, patterns become unreliable ("this repo prefers tabs" — no, *one contributor* prefers tabs) +- **Mitigation ideas:** Always scope session analysis to the current user unless explicitly asked for team patterns, label session-derived suggestions with their source + +### 6. The Ephemeral Session Problem + +- Some sessions are exploratory — the user was experimenting, prototyping, or debugging and doesn't want those patterns learned +- Not every session represents intent; some are just noise +- **Mitigation ideas:** Let users tag sessions as "exploratory" or "don't learn from this," respect session deletion as a signal, weight committed-code sessions higher than abandoned ones + +## The Filter Framework + +A decision matrix for when to trust session data: + +| Signal | Trust level | Use it for | Don't use it for | +|--------|------------|------------|-----------------| +| Tool call success/failure rates | High | Adjusting agent tool strategy | Judging code quality | +| Files touched frequently | Medium | Suggesting relevant context | Assuming ownership | +| Patterns repeated across sessions | Medium | Skill candidates | Assuming correctness | +| Single-session patterns | Low | In-session context only | Cross-session learning | +| Content of user messages | Low | Understanding intent | Extracting as training data | +| Sessions > 30 days old | Low | Historical curiosity | Current recommendations | + +## The Bottom Line + + + +Session data is powerful input — but it's *input*, not *truth*. The best agents will treat it like any other untrusted source: validate before encoding, expire what's stale, and always let the human override the pattern. 
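+
+To make "validate before encoding" concrete, here is a minimal sketch of the filter framework above as code. Every name and type here is illustrative; there is no real API for this today:
+
+```typescript
+// Illustrative only: the trust matrix above, encoded as a filter.
+type SignalKind =
+  | "tool-success-rates"
+  | "files-touched"
+  | "repeated-pattern"
+  | "single-session-pattern"
+  | "user-message-content";
+
+interface SessionSignal {
+  kind: SignalKind;
+  ageDays: number; // how old the source session is
+  userId: string;  // who produced the session
+}
+
+type Trust = "high" | "medium" | "low";
+
+function trustLevel(signal: SessionSignal, currentUser: string): Trust {
+  if (signal.ageDays > 30) return "low";           // stale: historical curiosity only
+  if (signal.userId !== currentUser) return "low"; // never blend users by default
+  switch (signal.kind) {
+    case "tool-success-rates":
+      return "high";
+    case "files-touched":
+    case "repeated-pattern":
+      return "medium";
+    default:
+      return "low"; // message content and one-off patterns stay untrusted
+  }
+}
+```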
+
+
diff --git a/website/blog/2026-04-17-agent-coordination-copilot-sdk.md b/website/blog/2026-04-17-agent-coordination-copilot-sdk.md
new file mode 100644
index 0000000..91b624c
--- /dev/null
+++ b/website/blog/2026-04-17-agent-coordination-copilot-sdk.md
@@ -0,0 +1,176 @@
+---
+slug: /2026-04-17-agent-coordination-copilot-sdk
+canonical_url: https://dfberry.github.io/blog/2026-04-17-agent-coordination-copilot-sdk
+custom_edit_url: null
+sidebar_label: "2026-04-17 Agent coordination in Copilot SDK"
+title: "Agent Coordination in Copilot CLI: What Custom Agents Like Squad Actually Are"
+description: "I dug into what a 'custom agent' really means in Copilot CLI, how the SDK handles multiple agents, and what's possible — and missing — for agent builders."
+published: false
+tags:
+  - GitHub Copilot
+  - AI agents
+  - Copilot SDK
+  - agent coordination
+  - Squad
+keywords:
+  - copilot cli custom agent
+  - copilot sdk multiple agents
+  - agent coordination patterns
+  - multi-agent copilot
+  - CustomAgentConfig
+updated: 2026-04-17 00:00 PST
+---
+
+# Agent Coordination in Copilot CLI: What Custom Agents Like Squad Actually Are
+
+
+
+> Part 3 of a series. Previously: [Exploring Copilot CLI Session Management](/blog/2026-04-15-session-storage-decision-guide) and [When Session Data Lies](/blog/2026-04-16-when-session-data-lies).
+
+## The Question
+
+I've been using [Squad](https://github.com/bradygaster/squad), an AI team framework built on top of Copilot CLI, and I realized I didn't fully understand *what Squad actually is* from Copilot's perspective. Is it a plugin? An extension? A session with a long system prompt? And when Squad spawns its team members — a lead, a tester, a backend dev — are those separate agents in Copilot's eyes, or just one agent pretending to be many?
+
+I went digging into the Copilot SDK to find out. What I found has implications for anyone building agents on top of Copilot.
+
+## Outline
+
+### 1. What Is a Custom Agent in Copilot CLI?
+
+**The file-based path:** Drop a `.github/agents/{name}.agent.md` file in your repo. It has YAML frontmatter (name, description) and a markdown body that becomes the system prompt. That's it — Copilot loads it automatically. Squad's entire coordinator is a single 84KB markdown file at `.github/agents/squad.agent.md`.
+
+**The SDK path:** The `CustomAgentConfig` interface defines an agent programmatically:
+
+```typescript
+interface CustomAgentConfig {
+  name: string;
+  displayName?: string;
+  description?: string;
+  tools?: string[] | null;              // which tools this agent can use
+  prompt: string;                       // the system prompt
+  mcpServers?: Record<string, unknown>; // agent-specific MCP servers (value type simplified here)
+  infer?: boolean;                      // available for model inference
+}
+```
+
+**Key insight:** A custom agent is really just a named system prompt + a tool/MCP scope. There's no special runtime, no container, no sandboxing. The agent IS the prompt. Everything else — coordination, memory, boundaries — is up to you.
+
+### 2. One Agent at a Time? What the CLI Actually Does
+
+In the Copilot CLI TUI, it *appears* you can only use one agent at a time. You `@squad` to activate it, and Squad takes over. But the SDK tells a more nuanced story.
+
+**The SDK exposes agent-switching RPC methods:**
+
+```typescript
+session.rpc.agent.list()        // list available agents
+session.rpc.agent.getCurrent()  // which agent is active
+session.rpc.agent.select(...)
// switch to a different agent +session.rpc.agent.deselect() // go back to default +``` + +And `SessionConfig` accepts an **array** of agents: + +```typescript +const session = await client.createSession({ + customAgents: [agentA, agentB, agentC], // all loaded, one active + onPermissionRequest: approveAll, +}); +``` + +**So the platform supports multiple agents per session** — you register several, and the active one determines the system prompt and tool scope. The CLI TUI just doesn't expose the switching UI. + +### 3. How Squad Does Multi-Agent: The Two Patterns + +Squad doesn't use the `customAgents[]` array to load its team. Instead, it uses a fundamentally different pattern — **one Copilot session per team member.** + +**Pattern A — Agent switching (SDK built-in):** +- Register multiple agents in one session +- Switch between them with `agent.select()` +- Shared context window, shared conversation history +- Like rotating who's at the helm of one boat + +**Pattern B — Session-per-agent (Squad's approach):** +- Coordinator creates separate `CopilotClient.createSession()` calls per agent +- Each agent gets its own system prompt (compiled from their charter) +- Each has its own context window, own conversation history +- Parallel execution via `Promise.allSettled()` +- Like a fleet of specialist boats, each dispatched to different waters + +**Why Squad chose Pattern B:** +- **Isolation** — a tester's context doesn't pollute the developer's context +- **Parallelism** — agents work simultaneously, not sequentially +- **Charter boundaries** — each agent's system prompt is their entire worldview +- **Error isolation** — one agent crashing doesn't take down the others + +Squad wraps this in a `SessionPool` (max 10 concurrent, 5-min idle timeout, 30-sec health checks) and an `EventBus` that gives the coordinator visibility across all running sessions. + +### 4. How Different Charters Produce Better Outcomes + +This is the part that surprised me. Squad's agents aren't just "the same model with different titles." Their charters fundamentally change what they notice, what they produce, and what they challenge. + +**Examples from real Squad interactions:** + +- **A tester agent** catches edge cases a developer agent didn't consider — not because it's smarter, but because its charter says "think about what could go wrong" while the developer's says "make it work" +- **A docs agent** forces clearer API design — it can't explain a confusing interface, so it pushes back, and the design improves +- **A lead agent** notices architectural drift across multiple agents' outputs because its charter scopes it to "coherence across the whole system" + +**The mechanism:** Each agent reads `.squad/decisions.md` before starting (shared team memory), but interprets its task through its charter's lens. Same information, different perspective. The charter acts as a cognitive filter — constraining what the agent pays attention to. + +**What this means for agent builders:** The value isn't in having more agents. It's in having agents with *different cognitive scopes*. A system prompt that says "you are a security reviewer" produces genuinely different analysis than one that says "you are a performance engineer" — even on the same code, same model, same context. + +### 5. 
What the SDK Gives You (and What's Missing) + +**What's there — building blocks for coordination:** + +| SDK Primitive | What it enables | How Squad uses it | +|---|---|---| +| `customAgents[]` | Multiple named agents per session | Not used — Squad prefers session-per-agent | +| `SystemMessageConfig` | Append or replace system prompts | Charter compilation per agent | +| `SessionHooks` | Pre/post tool use, session start/end, error handling, prompt interception | Governance layer (file guards, PII scrub, rate limits) | +| `Tool` registration | Custom tools with typed handlers | Agent-specific tool scoping | +| `mcpServers` per agent | Agent-specific external tool servers | Not yet used — opportunity | +| `InfiniteSessionConfig` | Auto-compaction for long sessions | Context management for long-running agents | +| `session.getMessages()` | Full event history of a session | Could enable cross-agent learning (not used today) | +| `client.listSessions()` | Browse/filter all sessions | Session pool management | + +**What's missing — gaps I see for agent builders:** + +1. **No agent-to-agent messaging.** Agents can't send messages to each other. Squad works around this with shared files (decisions.md, history.md), but there's no SDK primitive for "Agent A wants to tell Agent B something." You have to build your own mailbox. + +2. **No shared tool state across sessions.** If Agent A's tool call produces data that Agent B needs, there's no built-in way to pass it. Squad uses the filesystem. The SDK could offer a shared key-value store scoped to a session group. + +3. **No cross-session event streaming.** The SDK's `session.on()` only covers events within ONE session. Squad built its own `EventBus` to aggregate events across agent sessions. A built-in cross-session event subscription would make coordination much easier. + +4. **No agent composition primitives.** You can't say "run Agent A, then feed its output to Agent B" declaratively. Squad's coordinator handles this imperatively in code. A pipeline/workflow abstraction would help. + +5. **No charter-aware routing.** The SDK has no concept of "which agent is best suited for this task." Squad builds this with `routing.md` rules compiled into regex patterns. An SDK-level capability-matching system (agents declare capabilities, platform routes by match) would reduce boilerplate. + +6. **No agent identity across sessions.** When Squad's tester agent runs in session X and then again in session Y, those are unrelated sessions from the SDK's perspective. There's no "this is the same agent, continuing its work." Squad tracks this in its own registry. The SDK could support named agent instances with persistent identity. + +### 6. What I'd Tell an Agent Builder + +If you're building a custom agent on Copilot CLI today: + +**Start simple:** One `.agent.md` file gets you surprisingly far. Squad's entire coordinator — routing, casting, governance, memory — is a single markdown file. Don't over-engineer the agent registration. + +**Choose your session model early:** +- **Single session + agent switching** — simpler, shared context, good for agents that take turns +- **Session-per-agent** — isolated, parallel, better for agents that work simultaneously on different things + +**Invest in the charter, not the plumbing.** The biggest quality difference comes from well-scoped system prompts, not from clever orchestration. A tester agent with a great charter outperforms a generic agent with a sophisticated tool chain. 
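+
+To make "start simple" concrete, here's what a minimal agent file might look like. The agent name and wording below are hypothetical; the shape (YAML frontmatter plus a markdown body that becomes the system prompt) is the file-based path described in section 1:
+
+```markdown
+---
+name: test-strategist
+description: Reviews changes and proposes the tests most likely to catch regressions
+---
+
+You are the team's test strategist. For every change you review, ask what
+could go wrong, list the edge cases the author likely missed, and propose
+the smallest set of tests that would catch them. Challenge assumptions;
+do not implement features yourself.
+```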
+ +**Use hooks for governance, not coordination.** `SessionHooks` are great for guardrails (block dangerous tool calls, scrub PII, rate-limit). They're not designed for agent-to-agent communication — use shared state for that. + +**Build your own coordination layer.** The SDK gives you sessions, tools, hooks, and events within a session. Everything above that — routing, shared memory, cross-agent communication, identity — is yours to build. Squad's ~15K lines of SDK code are mostly this coordination layer. + +**Watch for platform evolution.** The `agent.list()`/`agent.select()` RPC methods and `customAgents[]` config suggest the platform is thinking about multi-agent scenarios. Features like cross-session events, agent pipelines, and capability-based routing may be coming. Build your coordination layer so it can delegate to the platform when those primitives arrive. + +## The Bottom Line + + + +A custom agent in Copilot CLI is simpler than it looks — it's a named system prompt with a tool scope. The SDK gives you enough to build coordination on top (sessions, tools, hooks, events), but coordination itself is your responsibility. Squad's approach — session-per-agent with charter-driven specialization and file-based shared memory — is one valid pattern. It won't be the only one. + +The most underappreciated part: **different charters produce genuinely different analysis.** Not because the model changes, but because the prompt changes what it pays attention to. That's the real value of multi-agent coordination — not parallelism, not scale, but *cognitive diversity applied to the same problem.* + + diff --git a/website/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk.md b/website/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk.md new file mode 100644 index 0000000..f8b6322 --- /dev/null +++ b/website/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk.md @@ -0,0 +1,284 @@ +--- +slug: /2026-04-17-choosing-multi-agent-patterns-copilot-sdk +canonical_url: https://dfberry.github.io/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk +custom_edit_url: null +sidebar_label: "2026-04-17 Choosing multi-agent patterns in Copilot SDK" +title: "Two Ways to Build Multi-Agent Systems in Copilot SDK — and When Each One Wins" +description: "The Copilot SDK gives you two patterns for multi-agent work: in-session agents and session-per-agent. I compared them side-by-side using Squad as the real-world example, and worked out when each pattern is the right call." +published: false +tags: + - GitHub Copilot + - AI agents + - Copilot SDK + - multi-agent patterns + - Squad +keywords: + - copilot sdk customAgents + - session-per-agent pattern + - multi-agent architecture + - CustomAgentConfig + - copilot cli custom agent + - agent coordination patterns +updated: 2026-04-17 00:00 PST +--- + +# Two Ways to Build Multi-Agent Systems in Copilot SDK — and When Each One Wins + + + +> Part 4 of a series. Previously: [Exploring Copilot CLI Session Management](/blog/2026-04-15-session-storage-decision-guide), [When Session Data Lies](/blog/2026-04-16-when-session-data-lies), and [Agent Coordination in Copilot CLI](/blog/2026-04-17-agent-coordination-copilot-sdk). 
+
+I've been building with [Squad](https://github.com/bradygaster/squad) — an AI team framework built on top of Copilot CLI — and I kept bumping into an architectural question that I think every agent builder will eventually face: **when you need multiple agents, do you put them in the same session or give each one their own?**
+
+The Copilot SDK supports both patterns. But it doesn't tell you when to pick which. So I dug in.
+
+## The Two Patterns
+
+The Copilot SDK (`@github/copilot-sdk`) gives you two fundamentally different ways to work with multiple agents. They look similar in config but behave very differently at runtime.
+
+### Pattern 1: In-Session Agents (`customAgents[]`)
+
+You pass an array of agent configurations when creating a session:
+
+```typescript
+const session = await client.createSession({
+  customAgents: [
+    {
+      name: "security-reviewer",
+      prompt: "You are a security expert. Review code for vulnerabilities...",
+      tools: ["grep", "view", "glob"],
+    },
+    {
+      name: "performance-reviewer",
+      prompt: "You are a performance engineer. Analyze code for bottlenecks...",
+      tools: ["grep", "view", "powershell"],
+    },
+  ],
+});
+```
+
+The SDK provides RPC methods to switch between them:
+
+```typescript
+await session.agent.list();                      // see available agents
+await session.agent.select("security-reviewer"); // activate one
+await session.agent.deselect();                  // go back to default
+```
+
+What's happening under the hood: the platform swaps the system prompt and tool scope. The conversation history stays the same. It's one person switching between different instruction manuals — same desk, same memory, one task at a time.
+
+The `CustomAgentConfig` interface is straightforward:
+
+```typescript
+interface CustomAgentConfig {
+  name: string;
+  displayName?: string;
+  description?: string;
+  tools?: string[] | null;              // null = all tools
+  prompt: string;
+  mcpServers?: Record<string, unknown>; // value type simplified here
+  infer?: boolean;                      // available for model inference
+}
+```
+
+### Pattern 2: Session-Per-Agent
+
+You create a separate session for each agent:
+
+```typescript
+const securitySession = await client.createSession({
+  model: "claude-sonnet-4",
+  systemMessage: {
+    content: securityCharterPrompt,
+  },
+  tools: securityTools,
+});
+
+const performanceSession = await client.createSession({
+  model: "claude-haiku-4.5", // cheaper model for this task
+  systemMessage: {
+    content: performanceCharterPrompt,
+  },
+  tools: performanceTools,
+});
+
+// Run in parallel
+const [securityResult, perfResult] = await Promise.allSettled([
+  securitySession.send({ prompt: "Review this PR for vulnerabilities" }),
+  performanceSession.send({ prompt: "Analyze this PR for bottlenecks" }),
+]);
+```
+
+Each agent gets its own context window, its own model, its own conversation history. They're a team of people in separate offices who communicate through a shared whiteboard.
+
+## What Squad Actually Does
+
+Squad chose Pattern 2 — session-per-agent — and built an orchestration layer on top. Here's how it works concretely.
+
+### The Adapter Layer
+
+Squad imports exactly one thing from the Copilot SDK: the `CopilotClient` class.
Everything else is wrapped behind an adapter: + +| SDK Concept | Squad Wrapper | What Squad Adds | +|---|---|---| +| `CopilotClient` | `SquadClient` | Connection lifecycle, auto-reconnect, error recovery, OpenTelemetry tracing | +| `CopilotSession` | `CopilotSessionAdapter` → `SquadSession` | Event name normalization, unsubscribe tracking | +| `SessionConfig` | `SquadSessionConfig` | Stable interface that won't break when the SDK updates | + +This adapter layer is a design decision worth noting. Squad mirrors the SDK's `CustomAgentConfig` type in its own `SquadCustomAgentConfig` — it's nearly a 1:1 copy — but uses it as a compilation target, not a runtime mechanism. + +### Charter Compilation + +Each Squad agent has a `charter.md` file that defines their identity, expertise, and boundaries. At spawn time, the charter compiler transforms this into a system prompt: + +``` +charter.md + team.md + routing.md + decisions.md + → compileCharter() + → SquadCustomAgentConfig { name, prompt, tools } + → createSession({ systemMessage: { content: prompt } }) +``` + +The charter isn't just a prompt — it includes team context (who else is on the team), routing rules (what work goes where), and active decisions (conventions the team has agreed on). Every agent starts with shared situational awareness. + +### Session Lifecycle + +Each agent goes through a managed lifecycle: + +``` +spawning → active → idle → error → destroyed +``` + +The `AgentLifecycleManager` handles: +- **Spawning**: charter compilation → model selection → session creation → initial task +- **Model selection**: per-agent, based on task type (a reviewer might get a different model than a coder) +- **Session pool**: max 10 concurrent sessions, 5-minute idle timeout, 30-second health checks +- **Error isolation**: one agent crashing doesn't take down others +- **Parallel execution**: `Promise.allSettled()` so failures don't short-circuit the batch + +### Cross-Agent Communication + +Since agents can't see each other's conversations, Squad uses two mechanisms: + +1. **Shared files**: `decisions.md`, agent `history.md` files, orchestration logs — all committed to git with `merge=union` strategy so branches combine cleanly +2. **EventBus**: real-time event aggregation across all sessions, giving the coordinator visibility into what every agent is doing + +This is the most important difference from the SDK pattern. In-session agents share context implicitly (same conversation). Session-per-agent systems need explicit communication channels. + +## Side-by-Side Comparison + +| Dimension | In-Session (`customAgents[]`) | Session-Per-Agent | +|---|---|---| +| **Context** | Shared — all agents see the full conversation | Isolated — each agent has private history | +| **Parallelism** | None — one agent active at a time | Full — agents work simultaneously | +| **Models** | Same model for all agents in the session | Different model per agent | +| **Failure** | One failure affects the whole session | Failures are isolated | +| **Communication** | Implicit (shared conversation) | Explicit (files, events, messages) | +| **Context limits** | Shared window fills up fast with many agents | Each agent manages its own limits | +| **Overhead** | Low — just config on session creation | Higher — session pool, event bus, lifecycle management | +| **Resume** | One session to resume | Multiple sessions to track and resume | + +## When Each Pattern Wins + +### Use In-Session Agents When... 
+ +**Agents share context and take turns.** The key signal is that each agent needs to see what the previous one said or did. + +**Persona switching.** "Explain this code like I'm a beginner" → "Now review it as a security expert." The security expert benefits from seeing the beginner explanation — it reveals which parts the user found confusing. + +**Guided workflows.** A multi-stage wizard where each stage builds on the last: gather requirements → generate code → review the generated code. Stage 3 needs the full history of stages 1 and 2. Breaking these into separate sessions means re-explaining everything. + +**Specialized lenses.** Different ways to look at the same artifact in a single conversation. A documentation writer and a code reviewer analyzing the same PR — the reviewer's comments inform what the writer emphasizes. + +**Lightweight delegation.** "Ask the SQL expert about this query" as a quick sub-task within a larger conversation. The SQL expert sees the surrounding context, answers, and you continue. + +**Chatbot personalities.** A support bot that can switch between friendly, technical, and escalation modes. Same conversation, different tone. The `infer` flag on `CustomAgentConfig` is designed for exactly this. + +The pattern: **one conversation, multiple perspectives.** Low overhead. Shared memory is a feature, not a limitation. + +### Use Session-Per-Agent When... + +**Agents work independently and need isolation.** The key signal is that agents would get in each other's way if they shared context. + +**Parallel workstreams.** Frontend, backend, and test code being written simultaneously. These don't need to see each other's chain-of-thought — they need to see each other's *output* (the actual files). Running them in parallel cuts wall-clock time proportionally. + +**Different model needs.** A cheap fast model for linting and formatting. An expensive reasoning model for architecture decisions. A code-generation model for implementation. The SDK's `customAgents[]` uses one model for all of them — session-per-agent lets you right-size. + +**Long-running tasks.** One agent doing a 30-minute codebase refactor shouldn't block another from answering a quick question about the README. Separate sessions mean separate timelines. + +**Adversarial review.** A code reviewer shouldn't see the author's reasoning process — just the code. Shared context would leak intent, making the review less independent. Squad's reviewer lockout protocol depends on this isolation. + +**Failure isolation.** If one agent hits a rate limit, runs out of context, or crashes, the others keep working. In a shared session, one agent's failure can corrupt the conversation for everyone. + +**Scale.** Ten agents sharing one context window would burn through tokens fast — each agent's output becomes input for the next. Ten separate sessions means ten independent context budgets. + +The pattern: **multiple conversations, coordinated outcomes.** More infrastructure to build, but each agent is autonomous. + +## The Gray Area + +Some scenarios genuinely could go either way. Here's how I'd decide: + +| Scenario | In-Session | Session-Per-Agent | Deciding Factor | +|---|---|---|---| +| Code review + apply fix | ✅ Reviewer sees code, fixer sees feedback | ✅ Blind review before revealing intent | Do you want independent review? | +| Q&A with domain experts | ✅ Quick switching, shared thread | ✅ Experts need deep independent research | How deep does each expert need to go? 
| +| Multi-file refactor | Stretches one context fast | ✅ Parallelize across files | How many files? | +| Chatbot with modes | ✅ Natural fit, shared conversation | Overkill for mode switching | Is it really "agents" or just prompt switching? | +| Document generation | ✅ If sequential (outline → draft → edit) | ✅ If parallel (each chapter independently) | Sequential or parallel? | +| CI/CD pipeline agents | One agent can't block while another runs | ✅ Each stage runs independently | Always session-per-agent | + +**The deciding question: Do the agents need to see each other's thinking, or just each other's output?** + +- If **thinking** → in-session agents (shared context is the point) +- If **output** → session-per-agent (isolation is the point) + +## What's Missing from the SDK for Both Patterns + +Having built with both (well, having studied both — Squad built the session-per-agent side), there are gaps: + +### For In-Session Agents +- **No agent-to-agent messaging.** Agent A can't say "hey agent B, what do you think?" — only the user or coordinator can switch agents. There's no `agent.delegateTo("other-agent", message)`. +- **No agent memory boundaries.** When you switch agents, the new agent sees everything. Sometimes you want compartmentalization within a shared session. +- **No lifecycle hooks per agent.** `SessionHooks` fire for the session, not per-agent. You can't run custom logic when switching *to* a specific agent. + +### For Session-Per-Agent +- **No built-in coordination primitive.** The SDK gives you sessions. Everything else — pools, event buses, shared state, lifecycle management — is yours to build. Squad built all of this. +- **No cross-session context sharing.** If agent A discovers something agent B needs, there's no SDK-level mechanism to share it. Squad uses git-committed files. Others might use a database or message queue. +- **No session grouping.** You can't tell the SDK "these 5 sessions are part of one logical task." Each session is independent. The coordinator pattern is entirely user-space. + +### For Agent Builders in General +- **No standard charter format.** Squad invented `.charter.md` with specific sections (identity, expertise, boundaries). There's no SDK convention for this. Every framework will invent its own. +- **No agent discovery.** The `customAgents[]` array is static at session creation. There's no dynamic "find me an agent that can handle X" mechanism. +- **No cost attribution per agent.** Token usage is per-session. For in-session agents, you can't easily attribute cost to each agent's turns. + +## What I'd Build Next + +If I were starting a multi-agent system on Copilot SDK today, here's the decision tree I'd follow: + +``` +Do agents need shared conversation context? +├── Yes → Use customAgents[] +│ └── Do they need to run in parallel? +│ ├── No → You're done, customAgents[] is perfect +│ └── Yes → You need session-per-agent despite the context need +│ (copy relevant context between sessions explicitly) +│ +└── No → Use session-per-agent + └── How many agents? + ├── 2-3 → Simple Promise.allSettled(), lightweight + ├── 4-10 → Build a session pool with health checks + └── 10+ → You need an event bus and probably a queue +``` + +And regardless of pattern, I'd build the adapter layer first. Squad's approach of wrapping `CopilotClient` behind stable interfaces saved them from SDK breaking changes. The SDK is pre-1.0 and moving fast — your agent code shouldn't have to move with it. 
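+
+A minimal version of that adapter might look like this. The wrapper names are hypothetical; only `createSession` and `send` come from the SDK usage shown earlier:
+
+```typescript
+import { CopilotClient } from "@github/copilot-sdk";
+
+// Stable interfaces that *you* own. The SDK never leaks past this file.
+export interface AgentSpec {
+  prompt: string; // compiled system prompt (e.g., from a charter)
+  model?: string;
+  tools?: string[];
+}
+
+export interface AgentHandle {
+  send(prompt: string): Promise<unknown>;
+}
+
+// Hypothetical wrapper: if a pre-1.0 release renames a config field,
+// only this adapter changes, not every call site in your agent code.
+export class AgentClient {
+  constructor(private readonly client: CopilotClient) {}
+
+  async spawn(spec: AgentSpec): Promise<AgentHandle> {
+    const session = await this.client.createSession({
+      model: spec.model,
+      systemMessage: { content: spec.prompt },
+      tools: spec.tools,
+    });
+    return { send: (prompt: string) => session.send({ prompt }) };
+  }
+}
+```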
+ +## The Bottom Line + +The Copilot SDK gives you the building blocks for both patterns. `customAgents[]` is the quick path when agents share a conversation. Session-per-agent is the scalable path when agents need autonomy. The SDK doesn't push you toward either one — which is both its strength (flexibility) and its gap (no guidance). + +Squad chose session-per-agent because its agents are autonomous specialists who need to work in parallel, use different models, and review each other's work without seeing each other's reasoning. That's the right call for a team simulation. If I were building a conversational assistant that occasionally calls on specialists, I'd start with `customAgents[]` and only graduate to session-per-agent when I hit the walls. + +The walls, when you hit them, are always the same: parallelism, isolation, or model diversity. If you need any of those, you need separate sessions. If you don't, keep it simple. + +--- + +*This is part of a series on building with the Copilot SDK. The patterns here are based on `@github/copilot-sdk@^0.1.32` and [Squad](https://github.com/bradygaster/squad) — both are pre-1.0 and evolving. The architectural tradeoffs, though, will outlast any specific API shape.* diff --git a/website/blog/2026-04-17-remote-control-custom-agents-from-your-phone.md b/website/blog/2026-04-17-remote-control-custom-agents-from-your-phone.md new file mode 100644 index 0000000..956053b --- /dev/null +++ b/website/blog/2026-04-17-remote-control-custom-agents-from-your-phone.md @@ -0,0 +1,262 @@ +--- +slug: /2026-04-17-remote-control-custom-agents-from-your-phone +canonical_url: https://dfberry.github.io/blog/2026-04-17-remote-control-custom-agents-from-your-phone +custom_edit_url: null +sidebar_label: "2026-04-17 Remote control custom agents from your phone" +title: "Remote Control Your Custom Agent from Your Phone with Copilot CLI --remote" +description: "Copilot CLI's --remote flag lets you steer any custom agent from your phone. Here's how to set it up, what to design for, and what I learned using it with Squad as a real-world example." +published: false +tags: + - GitHub Copilot + - AI agents + - Copilot CLI + - remote control + - custom agents + - mobile development +keywords: + - copilot cli remote + - copilot mobile + - custom agent remote control + - copilot phone + - github mobile copilot + - agent.md remote +updated: 2026-04-17 00:00 PST +--- + +# Remote Control Your Custom Agent from Your Phone + + + +> Part 5 of a series. Previously: [Exploring Copilot CLI Session Management](/blog/2026-04-15-session-storage-decision-guide), [When Session Data Lies](/blog/2026-04-16-when-session-data-lies), [Agent Coordination in Copilot CLI](/blog/2026-04-17-agent-coordination-copilot-sdk), and [Two Ways to Build Multi-Agent Systems](/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk). + +I saw Pamela Fox's LinkedIn post about Copilot CLI remote control and immediately wondered: does this work with custom agents? If I've built my own `.agent.md` — with its own system prompt, tool scoping, and domain expertise — can I steer it from my phone the same way I'd steer base Copilot? + +The `--remote` flag launched April 13, 2026. I tested it with [Squad](https://github.com/bradygaster/squad), a custom agent that coordinates an entire AI team through a single `.agent.md` file. But the takeaways apply to any custom agent you've built. + +Short answer: yes, it works. The interesting part is how to design your agent to take advantage of it. 
+ +## What `copilot --remote` Actually Does + +The `--remote` flag streams your CLI session to GitHub in real time. You get a link and a QR code. Open either one on your phone — through GitHub.com or GitHub Mobile — and you're looking at the same session, fully interactive. + +```bash +copilot --remote +``` + +From your phone you can: +- Send messages and steering commands +- Switch between plan, interactive, and autopilot mode +- Approve or deny permission requests +- Respond to `ask_user` prompts +- Stop the session entirely + +Everything stays in sync. What you type on your phone shows up in the terminal. What the agent does in the terminal shows up on your phone. Each session is private to the GitHub account that started it. + +## Setting It Up + +### Prerequisites + +1. **Update Copilot CLI** — run `/update` in an existing session, or install the latest version +2. **GitHub repository** — your working directory needs to be a GitHub repo (remote sessions use GitHub's infrastructure) +3. **GitHub Mobile beta** (optional) — for the best mobile experience, join [iOS TestFlight](https://testflight.apple.com/join/NLskzwi5) or [Google Play beta](https://play.google.com/apps/testing/com.github.android) +4. **Enterprise/Business users** — an admin needs to [enable remote control policies](https://docs.github.com/en/copilot/concepts/agents/copilot-cli/about-remote-access#administering-remote-access) + +### Starting a Remote Session with Your Custom Agent + +If you have a custom agent installed — whether it's Squad, a code reviewer, a docs generator, or anything else in `.github/agents/` — it's just: + +```bash +# Navigate to your repo with the custom agent +cd my-project + +# Start Copilot CLI with remote enabled +copilot --remote +``` + +Your custom agent loads from `.github/agents/your-agent.agent.md` the same way it always does. The `--remote` flag doesn't change agent discovery or loading — it adds the streaming layer on top. + +Once the session starts, you'll see something like: + +``` +Remote session enabled +https://github.com/your-name/your-repo/tasks/abc123 + +Press Ctrl+E to show QR code +``` + +Scan the QR code with your phone, or open the link in a browser. You're in. + +### Selecting Your Custom Agent Remotely + +When you open the session on your phone, you're in the default Copilot agent. To switch to your custom agent, use the `/agent` command: + +``` +/agent my-agent +``` + +Now everything you type goes through your agent's system prompt and tool scope. If your agent is Squad, that means talking to a coordinator that fans out work to specialists. If it's a code reviewer, it means getting reviews shaped by your `.agent.md` instructions. Whatever you built — it works the same from your phone. + +### Always-On Remote + +If you want every session to be remotely accessible, add this to `~/.copilot/config.json`: + +```json +{ + "remoteSessions": true +} +``` + +Now `copilot` (without `--remote`) still enables remote access. Use `copilot --no-remote` when you want a purely local session. + +## The Workflow: Start, Walk Away, Steer + +Here's the pattern that makes remote + custom agents useful. I'll use Squad as the example, but the workflow applies to any long-running custom agent. + +### 1. Start a long-running task from your desk + +```bash +copilot --remote +/agent squad +``` + +> "Team, refactor the authentication module to use the new token service." + +With Squad, this fans out to multiple agents working in parallel. 
With a simpler custom agent, it might be a single long-running task — a codebase migration, a comprehensive review, a test suite expansion. The point is: work that takes longer than you want to sit and watch. + +### 2. Walk away + +Your agent keeps working. You don't need to be watching. + +Keep the machine awake: + +``` +/keep-alive busy +``` + +The `busy` option prevents sleep only while Copilot is actively working. Once agents finish and the session is idle, your machine can sleep normally. Other options: `on` (never sleep), `8h` (sleep after 8 hours), `off` (normal behavior). + +### 3. Check in from your phone + +Open GitHub Mobile. Tap **Copilot**. Your session is listed under "Agent sessions." Tap to open. + +You can see what the agent has done and steer the next steps: + +``` +What's the status of the auth refactor? +``` + +Or redirect: + +``` +Skip the profile page changes — focus on the login form first. +``` + +With Squad, these messages go to the coordinator, which routes them to the right specialist agent. With a simpler custom agent, you're talking directly to it. + +### 4. Approve permissions from anywhere + +If an agent needs to run a command or access a tool that requires permission, the request shows up on your phone. Approve or deny right there — no need to walk back to your desk. + +### 5. Resume from a different machine + +If you shut down the session, Copilot gives you a resume command: + +```bash +copilot --resume=SESSION_ID --remote +``` + +Pick up right where you left off, from any machine with access to the repo. + +## Designing Your Custom Agent for Remote Use + +If you're building a custom agent, here's what I've learned about how `--remote` interacts with agent features. + +### What works seamlessly + +- **Agent discovery** — `.github/agents/*.agent.md` files load the same way locally and remotely. No changes needed. +- **Skills** — `.copilot/skills/` are available in remote sessions. +- **Tool access** — all tools your agent uses (grep, view, edit, powershell) work through the remote connection. +- **`ask_user` prompts** — these render on the phone and the user can respond. This is huge for agents that need human decisions mid-task. +- **Permission requests** — approval/denial flows work from the mobile UI. +- **Session continuity** — `--resume` preserves your agent's full conversation history. + +### What to think about + +- **Session length** — mobile connections may be intermittent. Use `/keep-alive` to ensure the host machine stays awake. The agent keeps working whether your phone is connected or not. +- **Output volume** — agents that produce a lot of terminal output (long diffs, verbose logs) can be hard to read on a small screen. Consider how your agent formats output if you expect mobile use. +- **Interaction design** — `ask_user` with structured forms (enums, booleans, multi-select) is much more phone-friendly than free-text questions. If your agent uses `ask_user` with a `requestedSchema`, the form renders as selectable options. That's way better than typing on a phone keyboard. +- **One agent at a time** — the Copilot CLI loads one custom agent per session. You switch with `/agent`. You can't have two custom agents active simultaneously in the same session. (Squad works around this by managing its own sessions internally via the Copilot SDK — see [Two Ways to Build Multi-Agent Systems](/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk) for that pattern.) 
+ +### The `ask_user` opportunity + +This is the thing I'm most excited about for agent builders. Before `--remote`, `ask_user` was a blocking call — the agent stops and waits, and you'd better be at your keyboard. Now it's a push notification on your phone. Your agent can be working through a complex task, hit a decision point, send you a structured question, and you tap your answer while waiting for coffee. + +Design your agents with this in mind: + +```typescript +// Phone-friendly: structured choices +ask_user({ + message: "The auth refactor found 3 breaking changes. How should I handle them?", + requestedSchema: { + properties: { + approach: { + type: "string", + title: "Migration approach", + enum: ["Fix all callers now", "Add deprecation warnings", "Create compatibility shim"], + default: "Add deprecation warnings" + } + }, + required: ["approach"] + } +}); +``` + +One tap vs. typing a paragraph on a phone keyboard. Think about this when you're designing agent interaction points. + +## Real-World Example: Squad as a Remote Custom Agent + +To make this concrete, here's how this plays out with [Squad](https://github.com/bradygaster/squad) — a custom agent that coordinates an entire AI team through a single `.agent.md` file. + +Squad is a good stress test for `--remote` because it's one of the more complex custom agents out there. When you tell Squad to do something, it doesn't just execute — it fans out work to specialist agents (frontend, backend, tester, lead) running in parallel, chains follow-up tasks, and manages cross-agent decisions through shared files. + +From your phone, that looks like: + +``` +> Team, refactor the authentication module. + +🏗️ Flight — reviewing requirements, defining API contract +⚛️ EECOM — updating frontend auth components +🔧 CAPCOM — creating new token service endpoint +🧪 FIDO — writing test cases from requirements +📋 Scribe — logging decisions +``` + +You can check in later, approve permissions, redirect work, or ask for status — all from the mobile UI. The coordinator handles routing your messages to the right specialist. + +Interestingly, Squad also built its own remote control (`squad start --tunnel` and `squad rc --tunnel`) before the platform feature existed, using devtunnel + WebSocket. Now that `copilot --remote` is native, the platform version is simpler for most cases — zero setup, GitHub account auth, proper mobile app. Squad's tunnel approach still has value for custom UIs or team-roster-aware interfaces, but for day-to-day use, `--remote` is the easier path. + +## What I'd Like to See Next + +A few things that would make remote + custom agents even better: + +1. **Agent-specific notifications** — "Your agent is waiting for input" as a push notification, not just visible when you open the session. + +2. **Quick actions** — pre-defined response buttons based on common `ask_user` patterns. "Approve all", "Review first", "Skip" as persistent buttons rather than typing commands. + +3. **Multi-session dashboard** — custom agents like Squad manage multiple sessions internally. Surfacing all of them in one mobile view would make remote steering much more useful. + +4. **Bandwidth-aware output** — agents could detect they're being viewed remotely and adjust verbosity. Summary on mobile, full diff on desktop. + +5. **Offline queue** — let me type responses while offline and deliver them when connectivity returns. The agent could continue working on non-blocked tasks while my response is queued. 
+ +## The Bottom Line + +`copilot --remote` turns any custom agent into a mobile-accessible tool. Whether it's a simple code reviewer or a complex multi-agent coordinator like Squad, the pattern is the same: start the task at your desk, walk away, steer from your phone. + +The setup is one flag. The interesting work is designing your agent's interaction points to be phone-friendly. Structured `ask_user` prompts, concise status updates, and clear decision points make the difference between an agent you can actually steer from your phone and one that requires a full keyboard. + +I started a Squad session from my desk and approved permission requests from Fairhaven Coffee. But the same workflow works with any custom agent — a migration tool, a review bot, a docs generator. If your agent does work that takes longer than you want to sit and watch, `--remote` is worth adding to your workflow. + +--- + +*This is part of a series on building with the Copilot SDK. Remote sessions launched April 13, 2026 in public preview via `copilot --remote`. Custom agent support works today with `.github/agents/*.agent.md` files. [Squad](https://github.com/bradygaster/squad) is one example of a custom agent framework that works with `--remote` out of the box.* diff --git a/website/blog/2026-04-17-three-layer-bug-prevention-for-custom-agents.md b/website/blog/2026-04-17-three-layer-bug-prevention-for-custom-agents.md new file mode 100644 index 0000000..c206c36 --- /dev/null +++ b/website/blog/2026-04-17-three-layer-bug-prevention-for-custom-agents.md @@ -0,0 +1,268 @@ +--- +slug: /2026-04-17-three-layer-bug-prevention-for-custom-agents +canonical_url: https://dfberry.github.io/blog/2026-04-17-three-layer-bug-prevention-for-custom-agents +custom_edit_url: null +sidebar_label: "2026-04-17 Three-layer bug prevention for custom agents" +title: "Three-Layer Bug Prevention for Custom Agents Built on Copilot SDK" +description: "A real bug in Squad — an invalid CLI flag copy-pasted across 8 files — reveals a pattern any custom agent can use to prevent bugs from spreading through AI-generated code." +published: false +tags: + - GitHub Copilot + - AI agents + - Copilot SDK + - testing + - developer workflow +keywords: + - copilot sdk bug prevention + - ai agent code quality + - tdd agents + - custom agent testing + - multi-agent bug patterns +updated: 2026-04-17 00:00 PST +--- + +# Three-Layer Bug Prevention for Custom Agents Built on Copilot SDK + + + +> Part of a series on building custom agents with Copilot SDK. See also: [Agent Coordination in Copilot CLI](/blog/2026-04-17-agent-coordination-copilot-sdk). + +## A Real Bug, Eight Files Deep + +I was reviewing a PR on [Squad](https://github.com/bradygaster/squad), an AI team framework built on Copilot CLI, when I found a bug that tells a broader story about agent-built code. + +The bug: Squad's watch and loop commands shell out to the Copilot CLI to dispatch work to agents. Every one of those commands used `--message` to pass the prompt — a flag that **doesn't exist**. The correct flag is `-p`. The result: every automated dispatch silently failed with `error: unknown option '--message'`. + +How did it happen? One developer wrote a `buildAgentCommand()` function in the first watch capability with `--message`. Then that function was copy-pasted — sometimes by humans, sometimes by AI agents — into seven more files. A second developer built the `loop` command months later, saw the existing pattern, and followed it. The bug spread because the pattern *looked* established. 
+ +The fix PR changed `--message` to `-p` in all eight files. Correct, but not structural. The same class of bug can happen again next time someone adds a watch capability. + +This got me thinking: **if you're building a custom agent on the Copilot SDK, what would prevent this class of bug?** Not just this specific flag, but the pattern — incorrect code that looks correct because it matches existing code, and spreads because agents copy what they see. + +## Why AI Agents Make This Worse + +Human developers copy-paste too, but they're more likely to: +- Google the flag before using it +- Notice when a command fails in their terminal +- Ask "wait, is this right?" when something looks unfamiliar + +AI agents are excellent pattern matchers. When an agent sees `--message` used consistently across 7 files, it treats that as a strong signal: *this is the correct pattern*. The agent doesn't verify against external documentation. It doesn't run the command to check. It copies what exists with high confidence. + +This is the **inherited bug problem**: agents amplify existing bugs by treating frequency as correctness. The more files contain the bug, the more confident the agent becomes that it's right. + +## The Three-Layer Approach + +Good engineering prevents most of this. TDD combined with extracted shared functions would have caught this specific bug on day one. But agents move fast, skip steps, and generate code across many files in parallel. You need defense in depth — layers that catch what the previous layer missed. + +Here's the pattern I'd recommend for any custom agent built on the Copilot SDK: + +### Layer 1: Code Structure — Extract and Centralize + +**The principle:** Any function that gets used in more than one place should exist in exactly one place. + +In Squad's case, `buildAgentCommand()` should live in a single shared module: + +```typescript +// cli/core/agent-command.ts — single source of truth +export function buildAgentCommand( + prompt: string, + options: { agentCmd?: string; copilotFlags?: string } +): { cmd: string; args: string[] } { + if (options.agentCmd) { + const parts = options.agentCmd.trim().split(/\s+/); + return { cmd: parts[0]!, args: [...parts.slice(1), '-p', prompt] }; + } + const args = ['-p', prompt]; + if (options.copilotFlags) { + args.push(...options.copilotFlags.trim().split(/\s+/)); + } + return { cmd: 'copilot', args }; +} +``` + +Every watch capability and the loop command imports from this one file. Now: +- A bug fix is one line, not eight +- An agent writing a new capability imports the function instead of reinventing it +- The function becomes the canonical pattern that agents copy — and it's correct + +**How the Copilot SDK helps:** When creating a session, you can scope which tools and files an agent has access to. If your agent's charter says "use `buildAgentCommand()` from `cli/core/agent-command.ts` for all CLI invocations," the agent follows that instruction. The SDK's `SystemMessageConfig` lets you embed this guidance directly in the agent's system prompt: + +```typescript +const session = await client.createSession({ + systemMessage: { + mode: 'append', + content: 'When shelling out to Copilot CLI, always use buildAgentCommand() from cli/core/agent-command.ts. Never construct CLI arguments manually.' + }, + onPermissionRequest: approveAll, +}); +``` + +**What this catches:** Duplication. One correct implementation, everywhere. 
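+
+And here's what a consumer looks like: a hypothetical new watch capability that imports the helper instead of rebuilding the arguments by hand (the import path is illustrative):
+
+```typescript
+import { execFile } from 'node:child_process';
+import { buildAgentCommand } from '../core/agent-command.js'; // illustrative path
+
+// Because the flag logic lives in exactly one module, this file cannot
+// reintroduce --message, even by copy-paste.
+export function dispatchToAgent(prompt: string, copilotFlags?: string): void {
+  const { cmd, args } = buildAgentCommand(prompt, { copilotFlags });
+  execFile(cmd, args, (error, stdout, stderr) => {
+    if (error) {
+      console.error(`dispatch failed: ${stderr || error.message}`);
+      return;
+    }
+    console.log(stdout);
+  });
+}
+```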
+ +### Layer 2: Knowledge — Encode Conventions as Agent Memory + +**The principle:** If a convention isn't written down where agents can read it, it doesn't exist for agents. + +Code extraction prevents duplication, but it doesn't prevent someone from bypassing the shared function and writing their own. You need the *reason* documented, not just the code. + +For a Squad-style team, this means writing it into `decisions.md`: + +```markdown +## 2026-04-17 — Copilot CLI invocation convention + +**Decision:** All commands that shell out to Copilot CLI must use the shared +`buildAgentCommand()` from `cli/core/agent-command.ts`. The non-interactive +prompt flag is `-p` (not `--message`, which doesn't exist). Direct CLI +invocation uses `copilot` (not `gh copilot`, which causes Windows console +window issues). + +**Rationale:** `--message` was used in 8 files for months before anyone caught +it. Copy-paste propagation made the bug look intentional. Centralizing +prevents drift. +``` + +Every agent reads this at spawn time. When an agent is tempted to write `['copilot', '--message', prompt]` inline, it sees the decision and uses the shared function instead. + +**For non-Squad custom agents:** The same principle applies through the SDK. You can use `SessionHooks` to inject conventions into every session: + +```typescript +const session = await client.createSession({ + hooks: { + onUserPromptSubmitted: async (input) => { + // Inject project conventions into every prompt + if (input.prompt.includes('shell') || input.prompt.includes('execFile')) { + return { + additionalContext: `CONVENTION: Use buildAgentCommand() for CLI invocations. The prompt flag is -p. Never use --message.` + }; + } + } + }, + onPermissionRequest: approveAll, +}); +``` + +The `onUserPromptSubmitted` hook fires before the model processes the prompt, letting you append convention reminders contextually. The agent sees the convention at exactly the moment it needs it. + +**What this catches:** Agents bypassing the shared code. Even if someone writes a new function, the convention tells them which flag to use. + +### Layer 3: Verification — Tests That Validate Reality + +**The principle:** If your tests mock the thing that's broken, they can't catch the breakage. + +The existing tests for `buildAgentCommand()` mocked `execFile` and checked that `--message` appeared in the arguments. The tests passed perfectly — they validated that the code produced the *expected wrong output*. The mock replaced reality with an assumption, and the assumption was wrong. 
+ +**TDD prevents this from the start.** Write the test before the implementation: + +```typescript +import { buildAgentCommand } from '../cli/core/agent-command.js'; + +describe('buildAgentCommand', () => { + test('uses -p flag for non-interactive prompt', () => { + const { cmd, args } = buildAgentCommand('test prompt', {}); + expect(cmd).toBe('copilot'); + expect(args).toContain('-p'); + expect(args).not.toContain('--message'); + expect(args).toContain('test prompt'); + }); + + test('respects custom agent command', () => { + const { cmd, args } = buildAgentCommand('test', { agentCmd: 'my-agent --flag' }); + expect(cmd).toBe('my-agent'); + expect(args).toContain('--flag'); + expect(args).toContain('-p'); + }); + + test('passes copilot flags through', () => { + const { args } = buildAgentCommand('test', { copilotFlags: '--model gpt-4' }); + expect(args).toContain('--model'); + expect(args).toContain('gpt-4'); + }); +}); +``` + +These tests validate the function's contract directly — no mocks, no assumptions about the external CLI. The `-p` flag is specified in the test before the implementation exists. + +**For deeper validation,** add an integration smoke test: + +```typescript +import { execFileSync } from 'node:child_process'; + +test('copilot CLI accepts -p flag', () => { + // Verify the flag is valid by checking help output + const help = execFileSync('copilot', ['--help'], { encoding: 'utf8' }); + expect(help).toContain('-p'); +}); +``` + +This test fails if the CLI ever changes its flags — catching the problem at the source, not downstream. + +**How the SDK enables this:** The `SessionHooks.onPostToolUse` hook lets you validate tool results after execution: + +```typescript +hooks: { + onPostToolUse: async (input) => { + if (input.toolName === 'powershell' && input.toolResult.resultType === 'failure') { + const output = input.toolResult.textResultForLlm; + if (output.includes('unknown option')) { + return { + additionalContext: `A CLI flag was rejected. Check the flag against the CLI's --help output before retrying. Known valid flags: -p (prompt), --model, --agent.` + }; + } + } + } +} +``` + +This hooks into the agent's tool pipeline in real time. When a command fails with "unknown option," the hook injects a correction before the agent retries — turning a runtime failure into a learning moment within the session. + +**What this catches:** The actual bug. If `-p` is wrong tomorrow, the test fails. No amount of convention documentation saves you if the external tool changes. + +## How the Three Layers Work Together + +Each layer catches what the previous one misses: + +| Layer | What it prevents | What it misses | +|-------|-----------------|----------------| +| **1. Code structure** | Duplication drift — bug appears once, not eight times | Someone bypassing the shared function | +| **2. Knowledge** | Agents ignoring conventions — the *why* is documented | The convention itself being wrong | +| **3. Verification** | Wrong conventions — tests validate against reality | Nothing, if the tests are comprehensive | + +The layers compound. 
With all three: +- The function exists in one place (structure) +- Agents know to use it (knowledge) +- Tests prove it works (verification) + +Without any one layer, bugs find a way in: +- Without structure: correct knowledge, duplicated in eight places, gradually drifting +- Without knowledge: correct function exists, but agents write their own version +- Without verification: correct function, well-documented, but the flag is wrong and nobody knows + +## Applying This to Your Custom Agent + +If you're building a custom agent with the Copilot SDK, here's the practical checklist: + +### Structure +- [ ] Extract shared utilities — anything used in 2+ places gets its own module +- [ ] Scope agent system prompts to reference shared modules: "use X from Y" +- [ ] Use `SessionConfig.availableTools` to limit which tools agents can use, reducing surface area for mistakes + +### Knowledge +- [ ] Document conventions where agents read them — system prompts, decisions files, or `onUserPromptSubmitted` hooks +- [ ] Include the *rationale*, not just the rule — agents follow "use `-p` because `--message` doesn't exist" better than "use `-p`" +- [ ] Use `SessionHooks` to inject conventions contextually (when the agent is about to do the relevant thing) + +### Verification +- [ ] Write tests for shared utilities before implementing them (TDD) +- [ ] Don't mock the thing you're testing — mock the boundaries, test the logic directly +- [ ] Add integration smoke tests for external tool invocations +- [ ] Use `onPostToolUse` hooks to detect and correct tool failures in real time + +## The Bottom Line + + + +AI agents are powerful pattern matchers, and that's exactly the problem. They copy what they see with high confidence, and they see bugs as often as they see correct code. The more an incorrect pattern appears in a codebase, the more an agent trusts it. + +The fix isn't to make agents smarter — it's to make the codebase harder to get wrong. Extract shared code so bugs can only live in one place. Document conventions so agents know the right pattern. Write tests that validate reality so wrong patterns get caught. + +Three layers. Each one simple. Together, they catch the class of bug that spreads through codebases like `--message` spread through eight files — silently, confidently, and wrong. diff --git a/website/blog/2026-04-17-understanding-any-repo-with-ai-tools.md b/website/blog/2026-04-17-understanding-any-repo-with-ai-tools.md new file mode 100644 index 0000000..ec082dd --- /dev/null +++ b/website/blog/2026-04-17-understanding-any-repo-with-ai-tools.md @@ -0,0 +1,721 @@ +--- +slug: /2026-04-17-understanding-any-repo-with-ai-tools +canonical_url: https://dfberry.github.io/blog/2026-04-17-understanding-any-repo-with-ai-tools +custom_edit_url: null +sidebar_label: "2026-04-17 Understanding any repo with AI tools" +title: "Understanding Any Repository: An AI-Powered Field Guide for Developers, PMs, and Open-Source Adopters" +description: "A practical guide to understanding unfamiliar codebases using Copilot CLI, Squad, Graphify, deep blame, and GitHub MCP — organized by what you need to know, not which tool to open." 
+published: false +tags: + - GitHub Copilot + - AI agents + - Copilot CLI + - repository understanding + - developer tools + - code archaeology + - knowledge graphs + - open source +keywords: + - understand a codebase + - copilot cli repo exploration + - graphify knowledge graph + - git deep blame + - code archaeology + - repository analysis + - custom agents + - squad ai team +updated: 2026-04-17 00:00 PST +--- + +# Understanding Any Repository: An AI-Powered Field Guide + + + +> Part 6 of a series. Previously: [Session Storage Decision Guide](/blog/2026-04-15-session-storage-decision-guide), [When Session Data Lies](/blog/2026-04-16-when-session-data-lies), [Agent Coordination in Copilot CLI](/blog/2026-04-17-agent-coordination-copilot-sdk), [Two Ways to Build Multi-Agent Systems](/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk), and [Remote Control Custom Agents](/blog/2026-04-17-remote-control-custom-agents-from-your-phone). + +--- + +You just cloned a repository you've never seen before. Maybe you're a developer picking up a teammate's project. Maybe you're a PM trying to understand what your team actually built. Maybe you're evaluating an open-source library before betting your product on it. + +The question is always the same: **"What is this, and should I trust it?"** + +There's now an ecosystem of AI-powered tools that answer that question at different depths — from a quick five-minute scan to forensic-level commit archaeology. But nobody has mapped out which tool to use when, what questions to ask at each level, or how these tools complement each other. + +This is that map. + +## The toolkit at a glance + +Before we go deep, here's the landscape. Each tool occupies a different niche: + +| Tool | What it does | Best for | Depth | +|---|---|---|---| +| **GitDiagram / RepoMapr** | Instant architecture diagrams from a GitHub URL | Evaluating repos before cloning — zero setup | Glance | +| **Copilot CLI** | Interactive Q&A with full repo context | Quick questions, code explanation, git history | Surface → Medium | +| **Copilot CLI `/fleet`** | Parallel multi-agent analysis across modules | Analyzing many areas simultaneously | Medium | +| **Sourcegraph Cody** | Cross-repo semantic search and code intelligence | "How is X handled across all our services?" | Medium → Deep | +| **Squad** (custom agent) | Multi-agent team with persistent memory and decisions | Deep architecture analysis, ongoing project understanding | Deep | +| **Graphify** | Builds a knowledge graph from code, docs, and media | Structure visualization, dependency mapping, cross-file relationships | Deep | +| **`dependency-cruiser`** | Module dependency analysis with rule enforcement | Circular deps, architectural violations (JS/TS) | Deep | +| **Deep blame** (`git blame -C -C -C`) | Line-level attribution tracing across renames and moves | "Why was this written?", "Who decided this?" | Forensic | +| **GitHub MCP** | Structured access to issues, PRs, commits, and workflows | Project history, decision archaeology, contributor patterns | Medium | +| **Copilot CLI session store** | Query past AI sessions across time and contributors | "What was worked on?", "What changed recently?" | Medium | + +You don't need all of them for every repo. The guide below is organized by **how deep you need to go**. + +--- + + + +## Setting up the tools + +### Copilot CLI + +If you have GitHub Copilot, you already have the CLI: + +```bash +copilot -p "What does this project do?" 
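+
+# Sanity check that the CLI is installed and on your PATH
+copilot --help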
+```
+
+For repos with custom agents (like Squad), use the `--agent` flag:
+
+```bash
+copilot --agent squad -p "Walk me through the architecture"
+```
+
+### Graphify
+
+Graphify works as a skill inside Copilot CLI, Claude Code, Codex, Cursor, and others:
+
+```bash
+# Inside any supported AI coding tool
+/graphify .
+
+# For deeper extraction
+/graphify . --mode deep
+
+# Install standalone
+pip install graphify
+```
+
+After running, you get three artifacts:
+- **`graph.html`** — Interactive visual graph you can click, search, and filter
+- **`GRAPH_REPORT.md`** — Human-readable summary of the architecture (god nodes, communities, surprising connections)
+- **`graph.json`** — Machine-readable graph for session-to-session reuse
+
+### Deep blame (built into Git)
+
+Standard `git blame` only shows the last person who touched a line. Deep blame traces code across renames, moves, and refactors:
+
+```bash
+# Deep blame — track lines across file moves and copies
+git blame -C -C -C -- src/core/engine.ts
+
+# Breakdown:
+# -C        detect lines moved/copied within a file
+# -C -C     detect across files in the same commit
+# -C -C -C  detect across ALL commits (expensive but thorough)
+# Note: blame follows whole-file renames automatically; use
+#       `git log --follow` when you want rename-aware history
+```
+
+### GitDiagram and RepoMapr (zero-setup visual overview)
+
+These browser-based tools generate architecture diagrams from any public GitHub repo — no cloning, no install:
+
+- **[GitDiagram](https://gitdiagram.com)** — Paste a GitHub URL → instant interactive diagram in seconds
+- **[RepoMapr](https://repomapr.com)** — Prefix any GitHub URL with `repomapr.com/` → clickable architecture map with AI chat
+
+Both are ideal for the "should I even clone this?" decision. PMs and OSS adopters: start here.
+
+### Sourcegraph Cody (cross-repo search)
+
+[Cody](https://sourcegraph.com/cody) integrates with Sourcegraph's code intelligence to search across entire organizations — not just one repo. Install it in VS Code or JetBrains:
+
+```
+# In Cody chat (VS Code extension)
+"How is user authentication handled across all our services?"
+"Show me every place we call the payments API"
+```
+
+Where Copilot CLI understands the repo you're in, Cody understands *all* the repos in your organization. For teams with microservices or monorepo-plus-satellite architectures, this is the difference between seeing one tree and seeing the forest.
+
+### dependency-cruiser (JS/TS dependency analysis)
+
+For JavaScript and TypeScript projects, [dependency-cruiser](https://github.com/sverweij/dependency-cruiser) catches circular dependencies, architectural violations, and unexpected coupling:
+
+```bash
+npm install -g dependency-cruiser
+
+# Generate a visual dependency graph
+depcruise src --include-only "^src" --output-type dot | dot -T svg > deps.svg
+
+# Validate against architectural rules
+depcruise --validate .dependency-cruiser.cjs src
+```
+
+This complements Graphify's language-agnostic approach with JS/TS-specific depth — it knows about `import`, `require`, dynamic imports, and TypeScript path aliases.
+
+### GitHub MCP
+
+If your Copilot CLI has the GitHub MCP server configured, you can query issues, PRs, and commits programmatically:
+
+```bash
+copilot -p "Search issues in this repo about authentication"
+copilot -p "Show me the last 10 PRs merged to main"
+```
+
+---
+
+
+
+## Level 0: "Should I even look at this?" (Before you clone)
+
+**Goal:** Evaluate a repository from your browser — no local setup, no cloning, no install. 
+ +Sometimes you just need a quick read on whether a repo is worth your time. Maybe you're a PM evaluating a vendor's SDK. Maybe you're an OSS adopter comparing three libraries. Maybe a teammate sent you a link and said "take a look." + +### GitDiagram — instant architecture at a glance + +Go to [gitdiagram.com](https://gitdiagram.com) and paste any public GitHub URL. In seconds you get an interactive diagram showing the repo's file structure, module relationships, and architectural layers. + +No clone. No install. No account. Just paste and read. + +### RepoMapr — architecture map with AI chat + +Prefix any GitHub URL with `repomapr.com/`: + +``` +https://repomapr.com/github.com/bradygaster/squad +``` + +You get a clickable architecture map — and you can chat with an AI about any node. Click a module, ask "What does this do?", and get an answer grounded in the actual code. + +### GitHub itself + +Don't overlook what's already on the repo page: + +```bash +# Quick health check via GitHub MCP or just browse the repo +copilot -p "How active is github.com/bradygaster/squad? Last commit, open issues, PR velocity?" +``` + +Or just scan manually: +- **Last commit date** — Is this maintained? +- **Open issues vs. closed** — Is the maintainer responsive? +- **Contributors tab** — Bus factor? +- **Dependency graph** (Insights → Dependency graph) — What does it pull in? + +### Questions to ask at Level 0 + +| Question | Where to look | +|---|---| +| "Is this maintained?" | Last commit date, issue response time | +| "How complex is this?" | GitDiagram structure, file count | +| "What does it depend on?" | GitHub dependency graph, package manifest | +| "Is the community healthy?" | Stars trend, PR merge rate, contributor count | +| "Should I clone this?" | All of the above — if 3+ are green, clone it | + +--- + + + +## Level 1: "What does this repo do?" (First 5 minutes) + +**Goal:** Get a mental model of what this project is, who it's for, and how big it is. + +### With Copilot CLI + +Start with the broadest possible questions: + +```bash +copilot -p "What does this project do? Who is it for? What's the tech stack?" +copilot -p "What are the main entry points?" +copilot -p "How is this repo organized — what's in each top-level directory?" +``` + +Copilot reads the README, package manifests, and directory structure to give you a quick orientation. + +### With Graphify + +For a structural overview that goes beyond what a README tells you: + +```bash +/graphify . +``` + +Then read `GRAPH_REPORT.md`. Look for: + +- **God nodes** — The most connected files/modules. These are the architectural load-bearing walls. If you change them, everything feels it. +- **Communities** — Natural clusters of related code. These often map to features, layers, or bounded contexts. +- **Surprising connections** — Cross-community edges that reveal hidden coupling (a UI component that directly calls a database module, for instance). + +### With Squad + +If the repo has a Squad team (`.github/agents/squad.agent.md`), the coordinator already knows the architecture: + +```bash +copilot --agent squad -p "Give me a 2-minute overview of this project" +``` + +Squad will delegate to the right specialist — a lead for architecture, a tester for quality assessment, a backend dev for API structure. + +### Questions to ask at Level 1 + +These work with any of the tools above: + +| Question | What you learn | +|---|---| +| "What does this project do in one paragraph?" | Purpose and scope | +| "What's the tech stack?" 
| Languages, frameworks, infrastructure | +| "What are the main entry points?" | Where execution starts | +| "How big is this codebase?" | Scale and complexity | +| "Is there a test suite? What's the coverage strategy?" | Quality signal | +| "What are the external dependencies?" | Supply chain and integration surface | + +**For PMs:** Focus on "Who is this for?" and "What problem does it solve?" — the tools will extract this from README, docs, and code comments even when it isn't explicitly documented. + +**For OSS adopters:** Ask "When was the last commit?" and "How many contributors are active?" — staleness is the number one risk signal. + +--- + + + +## Level 2: "How is it built?" (First hour) + +**Goal:** Understand the architecture, patterns, and how pieces connect. + +### Explore the graph interactively + +Open `graph.html` from your Graphify run. This is where structure becomes visual: + +- **Click a community** to isolate a feature area +- **Search for a module** to see everything it connects to +- **Look for bridges** — nodes that connect two otherwise separate communities. These are integration points and often the most complex code. + +### Follow the data flow + +Ask Copilot CLI to trace specific paths: + +```bash +copilot -p "Trace the request flow from the API entry point to the database" +copilot -p "What happens when a user submits a form? Walk me through every file involved." +copilot -p "Where is authentication enforced? Show me the middleware chain." +``` + +### Understand the patterns + +```bash +copilot -p "What architectural patterns does this project use? (MVC, event-driven, microservices, etc.)" +copilot -p "Is there dependency injection? How are services wired together?" +copilot -p "Where are the abstractions? What interfaces define the contracts?" +``` + +### Analyze multiple areas in parallel with `/fleet` + +Copilot CLI's `/fleet` command spawns parallel sub-agents, each analyzing a different part of the codebase simultaneously: + +```bash +copilot -p "/fleet Analyze these areas in parallel: + Track 1: Summarize the authentication and authorization approach + Track 2: Document the database schema and data access patterns + Track 3: Map the API endpoints and their request/response contracts + Track 4: Assess the test coverage strategy and gaps" +``` + +This is 3-4x faster than asking sequentially. Each track works independently and reports back. + +> **Gotcha:** `/fleet` spawns generic explore agents — it does NOT use custom agents from `.github/agents/`. Good for read-only analysis, not for charter-driven work. + +### Check dependency health (JS/TS projects) + +For JavaScript and TypeScript repos, `dependency-cruiser` reveals what the code graph alone can't — circular imports, layering violations, and forbidden dependencies: + +```bash +# Visual dependency graph +depcruise src --include-only "^src" --output-type dot | dot -T svg > deps.svg + +# Check for violations against architectural rules +depcruise --validate .dependency-cruiser.cjs src +``` + +Look for: +- **Circular dependencies** — modules that import each other, creating fragile coupling +- **Layer violations** — UI code importing database modules directly +- **Orphan modules** — files that nothing imports (dead code candidates) + +### Search across repos with Sourcegraph Cody + +If you're in an organization with multiple repos, Cody answers questions that span repositories: + +``` +"How is rate limiting implemented across our services?" 
+"Show me every GraphQL resolver that touches the user table" +"What other repos depend on this package?" +``` + +This is the only tool in this guide that can search beyond the repo you're standing in. + +### Check the build and test infrastructure + +```bash +copilot -p "How do I build this project? What are the npm scripts / make targets?" +copilot -p "How is CI/CD configured? What checks run on PRs?" +copilot -p "What's the test strategy — unit, integration, e2e? Where do tests live?" +``` + +### Query what others have been working on + +If you're using Copilot CLI's session store, you can query past sessions: + +```sql +-- What files have been edited recently across all sessions? +SELECT file_path, COUNT(*) as edit_count +FROM session_files +WHERE tool_name = 'edit' +GROUP BY file_path +ORDER BY edit_count DESC +LIMIT 20; + +-- What were recent sessions about? +SELECT summary, created_at +FROM sessions +WHERE repository LIKE '%my-repo%' +ORDER BY created_at DESC +LIMIT 10; +``` + +This tells you where the active development is — which is often where the complexity lives. + +### Role-specific questions for Level 2 + +**Developer questions:** + +| Question | Why it matters | +|---|---| +| "What's the error handling strategy?" | Tells you if failures are managed or ignored | +| "Where's the configuration loaded?" | First thing you'll need to change | +| "What's the logging approach?" | How you'll debug in production | +| "Are there feature flags?" | How changes are rolled out | +| "What are the hot paths — most-changed files in the last 3 months?" | Where you'll spend your time | + +**PM questions:** + +| Question | Why it matters | +|---|---| +| "What features were added in the last quarter?" | Velocity and direction | +| "Which areas have the most open issues?" | Pain points and tech debt | +| "What's the PR review cycle like?" | Team health signal | +| "Are there any areas with single-contributor ownership?" | Bus factor risk | + +**OSS adopter questions:** + +| Question | Why it matters | +|---|---| +| "How hard is it to contribute?" | Onboarding friction | +| "What's the breaking change history?" | Stability signal | +| "Are security advisories addressed promptly?" | Maintenance quality | +| "What's the license situation for all dependencies?" | Legal risk | + +--- + + + +## Level 3: "Why was it built this way?" (Deep blame) + +**Goal:** Understand the *decisions* behind the code — not just what it does, but why. + +This is where most tools stop. READMEs explain what. Architecture docs (when they exist) explain how. But *why* a particular approach was chosen over alternatives — that lives in git history, issues, and PR discussions. + +### The deep blame workflow + +When you find code that confuses you, here's the forensic process: + +#### Step 1: Find who wrote it and when + +```bash +# Deep blame — traces across renames and moves +git blame -C -C -C -- src/core/engine.ts +``` + +This gives you a commit SHA and author for every line. Look for lines where the commit message is interesting — merge commits, "fix:" prefixes, or references to issues. + +#### Step 2: Read the commit in context + +```bash +git show abc1234 +``` + +Or ask Copilot to explain it: + +```bash +copilot -p "Explain commit abc1234 — what problem was it solving and what approach did it take?" 
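+
+# Zooming out also helps: ask what the neighboring commits were doing
+# (free-form prompt; adjust the SHA to your repo)
+copilot -p "Summarize the commits just before and after abc1234. What was the team working on?"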
+```
+
+#### Step 3: Find the issue or PR that drove the change
+
+```bash
+# Search for the commit SHA in PRs
+copilot -p "Find the PR that contains commit abc1234"
+
+# Search issues for keywords from the commit message
+copilot -p "Search issues in this repo about 'MCP tool-loss' or 'dispatch'"
+```
+
+#### Step 4: Trace the full file history
+
+```bash
+# See every commit that touched this file, including renames
+git log --follow --oneline -- src/core/engine.ts
+
+# See the full diff history
+git log --follow -p -- src/core/engine.ts
+```
+
+#### Step 5: Check for decision records
+
+Many mature projects record architectural decisions:
+
+```bash
+# Architecture Decision Records
+ls docs/adr/ docs/decisions/ .squad/decisions.md 2>/dev/null
+
+# Ask Copilot to search for decision context
+copilot -p "Are there any architecture decision records or design docs that explain why the dispatch system uses process-per-round instead of persistent sessions?"
+```
+
+### A real example: Why doesn't Squad use `--agent` in dispatch?
+
+Here's deep blame in action. In the Squad repo, the dispatch system (`execute.ts`) spawns `copilot -p <prompt>` — a generic Copilot process with no `--agent squad` flag. That seems wrong. Let's investigate:
+
+**Step 1 — Blame the dispatch command builder:**
+
+```bash
+git blame -C -C -C -- packages/squad-cli/src/cli/commands/watch/capabilities/execute.ts
+```
+
+Reveals the function was introduced in commit `ab1333e2` (Mar 31) — and from day one, it used generic Copilot.
+
+**Step 2 — Check the commit:**
+
+```bash
+git show ab1333e2
+```
+
+It's the initial execute capability. No mention of `--agent` in the commit message.
+
+**Step 3 — Search issues:**
+
+Issue #928 (Apr 8) explains everything:
+
+> "MCP tool-loss root cause: `execFileSync` per cycle kills MCP connections. Fix: use `CopilotClient` persistent sessions."
+
+And issue #775 discovered:
+
+> "Fleet ignores custom agents — spawns generic explore agents, NOT the custom agents from `.github/agents/`."
+
+**Step 4 — The decision record:**
+
+Commit `cb413bf1` (Apr 4) added `ralph-instructions.md` with an explicit note:
+
+> "Intentionally minimal — the agent reads `.squad/ralph-instructions.md` for full instructions, matching the PS1 ralph-watch design."
+
+**Verdict:** Not a bug — a deliberate architectural choice. `--agent squad` would still kill MCP connections per round, and fleet mode ignores custom agents entirely. The team chose file-based identity as a pragmatic workaround while building toward SDK persistent sessions.
+
+Without deep blame, you'd assume it's a bug and file a PR. With deep blame, you understand the *why* and can make a better decision about what to fix.
+
+### Deep blame questions
+
+| Question | Command / Prompt |
+|---|---|
+| "Who originally wrote this function?" | `git blame -C -C -C -- <file>` |
+| "Has this file been renamed or moved?" | `git log --follow --diff-filter=R -- <file>` |
+| "What was this file called before?" | `git log --follow --name-status -- <file>` |
+| "Why was this line changed?" | `git show <sha>` → read commit message and diff |
+| "What issue drove this change?" | Search issues for commit SHA or keywords |
+| "Was this approach debated?" | Find the PR → read review comments |
+| "Was an alternative considered?" | Check for ADRs or decision records |
+| "When did this pattern start?" | `git log --all --oneline -S "pattern" -- <file>` (pickaxe search) |
+
+---
+
+
+
+## Level 4: "What's the living history?" 
(Ongoing understanding) + +**Goal:** Stay current with a repo you're invested in — catch changes, understand trends, and maintain context over time. + +### Continuous monitoring with Graphify + +```bash +# Rebuild the graph incrementally as files change +/graphify . --watch + +# Or update manually (SHA-based cache — only reprocesses changed files) +/graphify . --update +``` + +Compare `GRAPH_REPORT.md` across time to spot architectural drift — new communities forming, god nodes growing, unexpected coupling appearing. + +### Automated structure diagrams with repo-visualizer + +Add [`githubocto/repo-visualizer`](https://github.com/githubocto/repo-visualizer) as a GitHub Action to automatically generate and commit an SVG of your repo's file structure on every push: + +```yaml +# .github/workflows/visualize.yml +name: Repo Visualizer +on: push +jobs: + visualize: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: githubocto/repo-visualizer@v1 + with: + output_file: docs/repo-structure.svg + excluded_paths: node_modules,dist,.git +``` + +This gives you a living architecture diagram that updates with your code — useful for onboarding docs and README badges. + +### Session history as institutional memory + +Copilot CLI's session store accumulates understanding across every interaction anyone has with the repo: + +```sql +-- What areas got the most attention this week? +SELECT file_path, COUNT(*) as touches +FROM session_files +WHERE first_seen_at > now() - INTERVAL '7 days' +GROUP BY file_path +ORDER BY touches DESC +LIMIT 15; + +-- What questions have people been asking? +SELECT substr(user_message, 1, 120) as question, timestamp +FROM turns +WHERE session_id IN ( + SELECT id FROM sessions + WHERE repository LIKE '%my-repo%' +) +ORDER BY timestamp DESC +LIMIT 20; +``` + +### Squad orchestration logs + +If the repo uses Squad, the orchestration log (`.squad/orchestration-log/`) records every agent interaction — who was asked what, what they decided, and what they produced. This is a rich source of "what happened while I was away?" + +```bash +copilot --agent squad -p "What work was done in the last week? Summarize the orchestration log." +``` + +### GitHub MCP for trends + +```bash +copilot -p "Show me PRs merged this week. Summarize the themes." +copilot -p "What issues were opened this month? Any patterns?" +copilot -p "Who are the most active contributors right now?" +``` + +--- + + + +## Putting it all together: A real walkthrough + +Let's understand the [Squad](https://github.com/bradygaster/squad) repository from scratch — a TypeScript monorepo that implements an AI team framework. + +### Before cloning: Browser scan + +Paste `https://github.com/bradygaster/squad` into [GitDiagram](https://gitdiagram.com). In seconds you see: two packages (`squad-cli`, `squad-sdk`), a `.github/agents/` directory, a `.squad/` directory, and a `templates/` system. That's enough to know this is a monorepo with an agent framework and some kind of team/template structure. Worth cloning. + +### Minute 1: Quick scan + +```bash +copilot -p "What is this project? Summarize in 3 sentences." +``` + +> Squad is a CLI framework that creates AI agent teams for software projects. It uses GitHub Copilot CLI under the hood, with multi-agent coordination, persistent memory (decisions.md, agent histories), and automated watch/dispatch loops. It's a TypeScript monorepo with two packages: squad-cli and squad-sdk. + +### Minute 5: Structure map + +```bash +/graphify . 
--mode deep
+```
+
+The `GRAPH_REPORT.md` reveals:
+- **God nodes**: `cli-entry.ts` (CLI router), `execute.ts` (dispatch engine), `squad.agent.md` (coordinator prompt)
+- **Communities**: CLI commands, SDK abstractions, watch capabilities, template system
+- **Surprising connection**: Template files must be synchronized across 5 locations (`.squad-templates/`, `templates/`, both packages, `.github/agents/`)
+
+### Minute 30: Architecture deep-dive
+
+```bash
+copilot -p "How does the watch/dispatch system work? Trace from 'squad watch' to the actual agent invocation."
+copilot -p "What's the difference between squad-cli and squad-sdk?"
+copilot -p "How does the team casting system work?"
+```
+
+### Hour 1: Decision archaeology
+
+```bash
+git blame -C -C -C -- packages/squad-cli/src/cli/commands/watch/capabilities/execute.ts
+copilot -p "Search issues about dispatch, MCP tool-loss, or agent identity"
+```
+
+This reveals the dispatch amnesia problem (#928), the fleet agent identity gap (#775), and the deliberate `ralph-instructions.md` workaround — context that would take days to discover by reading code alone.
+
+---
+
+## Quick reference card
+
+**"I want to know X → Use Y → Example"**
+
+| I want to know... | Tool | Command or prompt |
+|---|---|---|
+| If this repo is worth cloning | GitDiagram | Paste GitHub URL at [gitdiagram.com](https://gitdiagram.com) |
+| Quick architecture overview (no install) | RepoMapr | `repomapr.com/github.com/owner/repo` |
+| What this project does | Copilot CLI | `copilot -p "What does this project do?"` |
+| How the code is structured | Graphify | `/graphify .` → read `GRAPH_REPORT.md` |
+| What the most important files are | Graphify | Look for god nodes in `graph.html` |
+| How a specific feature works | Copilot CLI | `copilot -p "Trace the auth flow from login to token storage"` |
+| Multiple areas at once | `/fleet` | `copilot -p "/fleet Track 1: auth Track 2: db Track 3: API"` |
+| How X works across all our repos | Sourcegraph Cody | `"How is auth handled across our services?"` |
+| Whether there are circular deps | dependency-cruiser | `depcruise src --output-type dot \| dot -T svg > deps.svg` |
+| Who wrote this code and why | Deep blame | `git blame -C -C -C -- <file>` → `git show <sha>` |
+| What decisions were made | GitHub MCP + blame | Search issues/PRs for keywords or commit SHAs |
+| What was worked on recently | Session store | Query `session_files` and `sessions` tables |
+| Whether this repo is maintained | GitHub MCP | Check recent PRs, issue response times, last commit |
+| How complex the dependencies are | Graphify | `/graphify . --mode deep` → check cross-community edges |
+| What the team is working on | Squad | `copilot --agent squad -p "What work happened this week?"` |
+| If a pattern was intentional | Deep blame + issues | Blame → commit → PR → review comments → ADRs |
+| How hard it is to contribute | Copilot CLI | `copilot -p "What's the contribution process? Any gotchas?"` |
+
+---
+
+> ### Beyond this guide
+>
+> This post focuses on the tools I've used hands-on. A few others worth knowing about:
+>
+> - **[Cursor](https://cursor.sh)** and **[Claude Code](https://docs.anthropic.com/en/docs/claude-code)** — IDE-based AI assistants with strong multi-file understanding. If you're not using Copilot CLI, these cover similar ground for Levels 1-2.
+> - **[Greptile](https://greptile.com)** — AI-powered codebase search that indexes massive repos. Enterprise-focused, excels at "find every place we do X" queries at scale. 
+> - **`git log --graph --oneline --all`** — The free, zero-install version of understanding branch history. Pair it with `git shortlog -sn` for contributor stats. Don't sleep on built-in Git. +> - **[githubocto/repo-visualizer](https://github.com/githubocto/repo-visualizer)** — GitHub Action that auto-generates SVG structure diagrams on every push (covered in Level 4 above). + +--- + + + +## What's next + +Understanding a repo is an ongoing process, not a one-time event. The tools keep getting better: + +- **Copilot CLI session resume** means your understanding persists across sessions — you don't start from zero every time +- **Graphify's incremental mode** means the knowledge graph evolves with the codebase +- **Squad's persistent memory** (decisions.md, agent histories) means the AI team accumulates institutional knowledge +- **SDK persistent sessions** (coming) will solve dispatch amnesia — agents that remember across rounds without workarounds + +The shift is from "read the code" to "interrogate the code." The codebase becomes a conversation partner, not a wall of text. The tools just determine how deep that conversation can go. + +--- + +*This is part 6 of a series on building with AI agents. The series started with [Session Storage Decision Guide](/blog/2026-04-15-session-storage-decision-guide) and explored [session data integrity](/blog/2026-04-16-when-session-data-lies), [agent coordination](/blog/2026-04-17-agent-coordination-copilot-sdk), [multi-agent patterns](/blog/2026-04-17-choosing-multi-agent-patterns-copilot-sdk), and [remote-controlling agents from your phone](/blog/2026-04-17-remote-control-custom-agents-from-your-phone). Each post builds on discoveries made while working with Squad — an AI team framework that became both the subject and the tool.* diff --git a/website/blog/2026-04-18-observability-for-custom-copilot-agents.md b/website/blog/2026-04-18-observability-for-custom-copilot-agents.md new file mode 100644 index 0000000..5fa8453 --- /dev/null +++ b/website/blog/2026-04-18-observability-for-custom-copilot-agents.md @@ -0,0 +1,313 @@ +--- +slug: /2026-04-18-observability-for-custom-copilot-agents +canonical_url: https://dfberry.github.io/2026-04-18-observability-for-custom-copilot-agents +custom_edit_url: null +sidebar_label: "2026-04-18 Observability for custom Copilot CLI agents" +title: "Knowing What Your Agent Team Did and Why: Observability for Custom Copilot CLI Agents" +description: "I investigated how to trace agent reasoning in custom Copilot CLI agent teams — whether you're in a live session or reviewing a PR the next morning." +published: false +tags: + - GitHub Copilot + - AI agents + - observability + - developer workflow + - Squad + - Copilot CLI +keywords: + - copilot cli observability + - ai agent reasoning + - custom agent team + - github copilot agents + - agent decision tracking + - squad observability +updated: 2026-04-18 00:00 PST +--- + +# Knowing What Your Agent Team Did and Why + + + +![A person on the Bellingham Bay boardwalk at dawn, watching fishing boats return to harbor with glowing logbooks on deck](./media/2026-04-18-observability-for-custom-copilot-agents/bellingham-bay-boardwalk-fleet.png) + +*Every boat comes back with a catch — but only the ones with logbooks can tell you where they went and why.* + +I've been using [Squad](https://github.com/bradygaster/squad), a human-led AI agent team framework built on [Copilot CLI](https://docs.github.com/en/copilot/github-copilot-in-the-cli), for a few months now. 
I set up ten agents — each with a charter, a history file, and specific skills. Some days I'm sitting at the terminal directing them. Other days I delegate work through issues and review the PRs later. + +Both ways work. But I keep running into the same question: **when I review what the team did, can I understand _why_ they did it?** + +This post is my investigation into that question — not a conclusion. It breaks down into three layers: what any team using AI agents needs to think about, what Copilot CLI provides as a platform, and what I see in Squad as a custom agent framework built on top. Things are moving fast. + +## The question that started this + +I had delegated a task through an issue: "update the content pipeline for the new API version." The next morning I had a clean PR with the right changes. But the agent had restructured one function in a way I didn't expect. I wanted to know: why this approach? What did the agent consider? What constraints drove the decision? + +The code was correct. But I couldn't trace the reasoning. + +And here's the thing: if I'd been in a live session, I could have just asked. The reasoning would have been right there in the conversation. But because I'd delegated the work, the reasoning was... somewhere. Not lost — but not connected to the PR I was reviewing. + +That's the gap I wanted to understand. The more I dig into it, the more I think it's not just a tooling problem — it's a design problem. Tools can record what happened. But whether you can actually course-correct depends on how you set up your agents to explain themselves. + +## The observability question — for any AI agent + + + +![A mushroom forager examining a chanterelle up close in the forest on the left, and studying collected specimens with field notes at a cabin table on the right](./media/2026-04-18-observability-for-custom-copilot-agents/whatcom-forager-two-modes.png) + +*Hands in the dirt, or studying what you collected — both require knowing what you're looking at.* + +This isn't specific to any platform or framework. Any system where AI agents make decisions on a developer's behalf faces the same question — whether it's a custom agent team, a CI pipeline with AI steps, or a coding assistant with increasing scope. + +Two natural patterns emerge when you work with agents: + +**Live sessions** — you're at the terminal, talking to the team. You see every decision as it happens. You can ask "why did you do that?" and get an answer immediately. You're steering. + +**Delegated work** — you set direction through an issue or a prompt, the team executes, and you review the output later. You're setting goals, reviewing results, course-correcting. + +Both are human-directed. The observability question is the same in both cases: **can you reconstruct why the team made those decisions and changed that code?** But the answer is very different depending on which pattern you used. In a live session, missing rationale is recoverable — you can ask. In delegated work, missing rationale means you're reading code without context. + +```mermaid +flowchart LR + H[Human] -->|live session| A[Agent Team] + H -->|issue/prompt| D[Delegation] + D --> A + A -->|PR, commit, decision| O[Output] + O -->|review| H +``` + +If you're a developer using AI agents for real work — or a team lead deciding whether to adopt them — this question isn't academic. It's the difference between: + +- **"I can use AI agents and stay accountable"** vs. 
**"I shipped code I can't fully explain"**
+- **"I can scale my team's output with agents"** vs. **"I scaled output but lost the ability to course-correct"**
+- **"I can onboard someone new and they can follow the reasoning trail"** vs. **"Only I know why things are the way they are, and even I'm not sure"**
+
+**For developers**, this is about code quality and review confidence. If you can't trace why an agent restructured a function, you're approving code on faith. That might work for trivial changes. It doesn't scale to anything complex.
+
+**For team leads**, the concerns compound:
+
+- **Review scalability.** Without reasoning trails, every delegated PR becomes expensive senior-review work. If delegation saves coding time but increases review time, you haven't scaled — you've moved the labor.
+- **Incident response.** When an agent causes a regression, observability is how you reconstruct what happened. Without it, your postmortem is: "the AI did something and we're not sure why."
+- **Drift detection.** Agent teams don't just fail once — they drift. Models update, prompts evolve, context shifts. Observability is how you notice when behavior changes incrementally, before it becomes a production issue.
+
+The answer is probably yes — you can use AI agents and stay accountable. But only if you design for it. That design happens at two levels: what the platform gives you, and what you build on top.
+
+## What Copilot CLI gives you
+
+
+
+![Aerial view of the Nooksack River through Whatcom County farmland with glowing sensor stations along the banks](./media/2026-04-18-observability-for-custom-copilot-agents/nooksack-river-sensor-stations.png)
+
+*Sensor stations measure what flows past them — water level, temperature, speed. Platform telemetry works the same way: it captures the signal, not the meaning.*
+
+I use Copilot CLI as my platform. Here's what it provides as of mid-April 2026:
+
+### Session persistence
+
+Every Copilot CLI session — live or delegated — lands in `~/.copilot/session-state/`. The full transcript: prompts, responses, tool calls, file changes, checkpoints. You can browse sessions with `/session` and resume any past session with `/resume`.
+
+### OpenTelemetry (shipped in v1.0.4)
+
+As of [Copilot CLI v1.0.4](https://github.com/github/copilot-cli/issues/2471), you can enable OTel instrumentation:
+
+```bash
+export COPILOT_OTEL_ENABLED=true
+# or
+export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
+```
+
+This gives you traces for agent sessions, LLM calls, and tool executions — token usage metrics, operation durations, and OTLP HTTP export with enterprise auth headers. Run `copilot help monitoring` for the full reference.
+
+### What's still missing
+
+Two open issues on [github/copilot-cli](https://github.com/github/copilot-cli) highlight gaps:
+
+- **[#2396](https://github.com/github/copilot-cli/issues/2396) — Session attribution.** Sessions don't currently record _how_ you launched them — interactive vs SDK vs headless — or which custom tool created them. If you run three different agent tools, their sessions all look identical. The issue proposes persisting `client_type` and `clientName` in session-state files.
+
+- **[#1791](https://github.com/github/copilot-cli/issues/1791) — Session history.** There's no cross-session audit view without starting the agent. The issue proposes a `copilot --history` flag for querying session history directly from the shell — spend no tokens, launch no agent.
+
+These help with "what happened" and "which session did it." 
But the platform captures sessions and telemetry — not your team's project-specific intent. That's the next layer. + +## What I see in Squad + + + +![A hiker at a trail fork near Mount Baker — one path with a detailed hand-carved sign, the other with only a generic marker](./media/2026-04-18-observability-for-custom-copilot-agents/chain-lakes-trail-fork.png) + +*A generic "Trail" sign tells you a path exists. A hand-carved one tells you where it goes and why you'd take it.* + +[Squad](https://github.com/bradygaster/squad) is the custom agent framework I use on top of Copilot CLI. Its design philosophy is that people stay accountable for priorities, approvals, and final changes while agents handle coordination and repetition. The work stays inspectable because it lives in your repo as files. + +This is where project-specific reasoning lives — the layer no platform can ship for you. Platforms capture **what happened**. They can't tell you which rationale is **project-relevant**. + +Consider these questions: + +- "Why did the agent pick Redis over a file cache?" → That depends on your team's standing decision to prefer managed services. +- "Why did the agent restructure the function?" → That depends on your charter's rule about separating API contracts from implementation. +- "Why did the agent skip the edge case in the test?" → That depends on your skill's instruction to focus on the happy path first and file follow-up issues for edge cases. + +The platform can tell you the agent called 12 tools and used 50K tokens. It can even summarize the session. But it doesn't know about your team's decisions, your project's constraints, or your agents' specific mandates. + +Here's what I see in Squad right now. The same patterns apply with system prompts, ADRs, policy files, or whatever mechanism your framework uses to scope agent behavior. + +### Charters — scope + accountability + +When something goes wrong, the first question is "who was supposed to own this?" Charters answer that before anyone has to ask. + +Each agent has a charter that defines what they own, how they work, and their boundaries. In my setup, charters look like this: + +```markdown +# Gonzo — Infrastructure Charter + +## Responsibilities +- GitHub Projects Setup +- Labels & Configuration +- GitHub Actions workflows + +## Scope Boundaries +- Does: Infrastructure, automation, GitHub platform configuration +- Doesn't: Design templates (→ Piggy), strategic decisions (→ Kermit) +``` + +When Gonzo opens a PR that changes a GitHub Action, the charter explains why Gonzo did it (it's in scope) and why Piggy didn't (it's not in Piggy's scope). The charter is both an instruction and an explanation. + +**What to add for observability:** An explicit section on how the agent should narrate their work: + +```markdown +## When producing output +- Every PR description includes: parent issue link, reasoning summary, + what was considered but rejected +- Every commit message references the source issue +- Architectural choices reference the relevant standing decision +``` + +This is just charter text. No code change. The agent reads it and follows it. + +### Decisions — the shared brain + +Individual agents forget between sessions. Standing decisions don't — they're the one file every agent reads at startup. + +`decisions.md` is the team's institutional memory. Every agent reads it at session start. 
Standing decisions shape behavior across all sessions: + +```markdown +## Prefer managed services over self-hosted +**What:** When choosing infrastructure, prefer managed/cloud services. +**Why:** Reduces operational burden. Team doesn't have on-call rotation. +**When:** 2026-03-15 +``` + +When an agent chooses Azure Cache for Redis over a local Redis container, you can trace it back to this decision. The decision is the **why**. It persists across sessions, across agents, across modes. + +**What to add for observability:** A standing decision that requires reasoning in outputs: + +```markdown +## All delegated work must include reasoning +**What:** Every PR opened from delegated work must include a "Reasoning" +section explaining key decisions and what alternatives were considered. +**Why:** The person reviewing wasn't in the session. They need context. +**When:** 2026-04-18 +``` + +### Agent history — per-persona memory + +Charters define what an agent *should* do. History captures what they *learned* doing it. + +Each agent accumulates a `history.md` — learnings from past sessions that shape future behavior. When Gonzo learns that a specific GitHub Action syntax causes failures in this repo, that goes in Gonzo's history. Next time Gonzo works on Actions, they know. + +History files serve observability because they're the answer to "has this agent dealt with this before, and what did they learn?" + +### Skills — repeatable tasks with built-in standards + +If charters define who does what and decisions define why, skills define *how* — including what the output should look like. + +Skills encode how to do specific tasks — including quality gates and output standards. A well-written skill includes what the output should look like: + +```markdown +## PR Description Format +- Reference the source issue or dispatch parent +- Include a checklist of validation criteria +- This provides traceability from the content PR back to + the engineering change +``` + +Skills are instructions AND observability policy in one file. They tell the agent what to produce and tell the reviewer what to expect. + +### Orchestration logs — the narrative bridge + +Raw session data tells you everything that happened. Orchestration logs tell you what *mattered*. + +If your agent team produces orchestration logs (structured summaries of what happened during a work session), those become the bridge between raw session data and human understanding: + +```markdown +## Orchestration Log — 2026-04-18T14:30:00 + +**Agent:** Gonzo (Infrastructure) +**Task:** Update CI pipeline for new API version +**Key decisions:** +- Chose matrix strategy over sequential jobs (faster, same coverage) +- Skipped Windows runner (no Windows-specific code in this change) +**Artifacts:** PR #847, 3 commits +**Standing decisions referenced:** "Prefer managed services", "CI must pass before review" +``` + +### The feedback loop + + + +![Cross section of an old-growth cedar trunk with tree rings, a researcher's hand annotating specific rings with a pencil](./media/2026-04-18-observability-for-custom-copilot-agents/cedar-tree-rings-history.png) + +*Every ring is a season. The annotations are what turn raw growth into a story you can read.* + +The real power isn't any single artifact — it's the loop between them. 
+ +```mermaid +flowchart TD + L[Live Session] -->|corrects agent| D[decisions.md] + D -->|reads at start| DW[Delegated Work] + DW -->|PR + reasoning| R[Human Reviews] + R -->|spots issue| L + R -->|approves| Done[✓ Done] + H[history.md] -->|recalls| DW + DW -->|learns| H +``` + +1. In a live session, you notice an agent making a choice you disagree with. You correct them. That correction becomes a decision in `decisions.md`. +2. Next time that agent (or any agent) runs — live or delegated — they read `decisions.md` and behave differently. +3. The delegated work produces a PR with a reasoning section. You review it. If the reasoning references the decision you wrote, the loop is closed. +4. If the reasoning doesn't make sense, you're back in a live session asking questions. The loop continues. + +**The human is always directing the work.** Charters, decisions, history, and skills are the mechanisms. The question is whether those mechanisms produce enough signal for you to know _when_ to course-correct — so you spend less time re-investigating and more time deciding. + +## What I'd do on Monday + + + +![A person on a covered porch overlooking Chuckanut Bay, writing a checklist in a field journal with coffee beside them](./media/2026-04-18-observability-for-custom-copilot-agents/chuckanut-bay-field-journal.png) + +*The best time to start a field journal is before you need to look something up.* + +If you're setting up a custom agent team and want observability from day one, here's where I'd start — organized by layer: + +**Any agent setup:** + +1. **Decide what "good reasoning" looks like for your project.** Before you pick tools or frameworks, know what you'd want to see when reviewing agent work. That's the bar everything else gets measured against. + +**Copilot CLI platform:** + +2. **Enable OTel now.** `COPILOT_OTEL_ENABLED=true` gets you traces immediately. Watch for session attribution ([#2396](https://github.com/github/copilot-cli/issues/2396)) and session history ([#1791](https://github.com/github/copilot-cli/issues/1791)) as they ship. + +**Custom agent layer (Squad or equivalent):** + +3. **Add observability expectations to every agent's scope definition.** Tell agents to explain their reasoning in PR descriptions, reference relevant decisions, and note what alternatives they considered. This is free — it's just configuration text. + +4. **Write a standing decision requiring reasoning in outputs.** Make it team policy, not per-agent hope. "Every PR from delegated work must include a Reasoning section." + +5. **Link outputs to inputs.** Every PR should reference its source issue. Every commit should trace to a task. The goal is: from any output, you can walk backward to the intent. + +6. **Build the feedback loop.** When you spot a reasoning gap during review, don't just fix the code — update the decision file or charter so the next session benefits. That's how the system gets smarter. + +The agents do the work. The platform records the telemetry. The custom layer captures the project-specific why. But the human defines what "good reasoning" looks like — and that's the part no platform can ship for you. + +--- + +_This is a snapshot of my investigation as of April 2026. Copilot CLI and Squad are both evolving fast. The specific features and issue numbers referenced here may have changed by the time you read this._ + +_Squad is an open-source project by [Brady Gaster](https://github.com/bradygaster/squad). 
Observability patterns referenced here also draw from [Tamir Dresher's](https://www.tamirdresher.com/blog) excellent series on scaling AI agent teams, particularly his posts on [Aspire + Squad observability](https://www.tamirdresher.com/blog/2026/03/22/aspire-squad-love), [securing agent teams](https://www.tamirdresher.com/blog/2026/03/25/securing-hardening-ai-agent-squad), and [cross-squad communication](https://www.tamirdresher.com/blog/2026/03/26/scaling-ai-part8-pathfinder)._ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/bellingham-bay-boardwalk-fleet.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/bellingham-bay-boardwalk-fleet.png new file mode 100644 index 0000000..53ddfb2 Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/bellingham-bay-boardwalk-fleet.png differ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/cedar-tree-rings-history.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/cedar-tree-rings-history.png new file mode 100644 index 0000000..0e7960e Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/cedar-tree-rings-history.png differ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chain-lakes-trail-fork.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chain-lakes-trail-fork.png new file mode 100644 index 0000000..2336c27 Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chain-lakes-trail-fork.png differ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chuckanut-bay-field-journal.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chuckanut-bay-field-journal.png new file mode 100644 index 0000000..dddf0d2 Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/chuckanut-bay-field-journal.png differ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/nooksack-river-sensor-stations.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/nooksack-river-sensor-stations.png new file mode 100644 index 0000000..5794ceb Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/nooksack-river-sensor-stations.png differ diff --git a/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/whatcom-forager-two-modes.png b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/whatcom-forager-two-modes.png new file mode 100644 index 0000000..0d8e3af Binary files /dev/null and b/website/blog/media/2026-04-18-observability-for-custom-copilot-agents/whatcom-forager-two-modes.png differ