Migrate to Copilot SDK for azd agent implementation #6883

Open
wbreza wants to merge 85 commits into main from copilot-sdk-phase1

Conversation

@wbreza (Contributor) commented Feb 25, 2026

Replace langchaingo with GitHub Copilot SDK for agent mode

Resolves #6871 #6872 #6873 #6874 #6875 | Epic #6870

Replaces langchaingo agent orchestration with GitHub Copilot SDK (copilot-sdk/go v0.1.32).

Agent API

agent, _ := factory.Create(ctx, agent.WithMode(agent.AgentModePlan))
initResult, _ := agent.Initialize(ctx)
result, _ := agent.SendMessage(ctx, prompt)
agent.Stop()

Screenshots

[Images not included: Startup, Mid Flow, Completed]

Features

  • Copilot SDK client lifecycle, session create/resume, permission hooks
  • Initialize() with model/reasoning prompts, dynamic model list
  • SelectSession() UX picker, resume via WithSessionID()
  • AgentDisplay: 15+ event types, reasoning window, intent spinner, nested subagents
  • Azure plugin auto-install, skills + MCP servers from plugin
  • Usage metrics: tokens, billing rate, premium requests, duration
  • Init flow: single prompt with azure-prepare/azure-validate skills

Config Namespace Migration

  • All config keys moved from ai.agent.* to copilot.* namespace
  • Composable constants in internal/agent/copilot/config_keys.go with ConfigRoot prefix
  • mcp.consent to copilot.consent, mcp.errorHandling.* to copilot.errorHandling.*
  • Clean break - no backward compatibility for old keys
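
As a rough illustration of the composable constants described above, keys can be built from a single root prefix. Only ConfigRoot, copilot.consent, and copilot.errorHandling come from this description; the constant names and the "model" key are hypothetical and may not match config_keys.go.

```go
package main

import "fmt"

// ConfigRoot is the namespace prefix named in the PR description.
const ConfigRoot = "copilot"

// Keys composed from ConfigRoot. The constant names are illustrative,
// and ConfigKeyModel is a hypothetical key added only for the example.
const (
	ConfigKeyConsent       = ConfigRoot + ".consent"
	ConfigKeyErrorHandling = ConfigRoot + ".errorHandling"
	ConfigKeyModel         = ConfigRoot + ".model" // hypothetical
)

func main() {
	fmt.Println(ConfigKeyConsent)
	fmt.Println(ConfigKeyErrorHandling)
	fmt.Println(ConfigKeyModel)
}
```

Composing every key from one prefix means a future namespace rename touches a single constant.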

Package Restructure

  • Deleted entire pkg/llm/ package (azure_openai, ollama, github_copilot, model_factory, manager, langchaingo deps)
  • Created internal/agent/copilot/ subpackage (copilot_client, session_config, feature flag, config keys)
  • FeatureLlm to FeatureCopilot (alpha key string stays "llm" for existing user config compat)
  • Consent commands moved: azd mcp consent to azd copilot consent
  • WithSystemMessage AgentOption for system prompt overrides

Error Middleware Streamlining

  • Replaced 4-prompt, 5-agent-call orchestration with single consent prompt + single agent interaction in plan mode
  • Troubleshooting workflow in embedded Go text templates
  • Agent-driven flow: diagnose, explain, ask user via ask_user tool, fix or show steps
  • "Always allow" and "always skip" preference persistence via copilot.errorHandling.* config keys
  • Retry-after-fix prompt before re-running failed command

UX Rendering Fixes

  • Fix WaitForIdle hang: SessionIdle events arriving before AssistantMessage are now deferred via pendingIdle flag and flushed when the message arrives, preventing indefinite hangs
  • Fix ticker vs Pause() race: Added renderGuard sync.RWMutex so Pause() write-locks to block ticker renders and wait for any in-flight render to complete, closing the TOCTOU gap that caused spinner to render over consent prompts
  • Fix consent grant persistence: PromptAndGrantConsent no longer returns config save errors as tool denials - user approval is honored immediately, persistence errors are logged but don't block execution
  • Add consent error logging: Silent error swallowing in checkUnifiedRules and permission handler now logs failures for [consent] diagnostics
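
The WaitForIdle fix above can be sketched roughly as follows. This is a simplified model, not the actual AgentDisplay code: the idleTracker type and its method names are assumptions; only the pendingIdle flag and the ordering problem come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// idleTracker defers an early idle signal until the assistant message
// arrives. All names are illustrative, not azd's or the SDK's types.
type idleTracker struct {
	mu          sync.Mutex
	gotMessage  bool
	pendingIdle bool
	idleCh      chan struct{}
}

func newIdleTracker() *idleTracker {
	return &idleTracker{idleCh: make(chan struct{}, 1)}
}

// onIdle handles a session-idle event. If no assistant message has been
// seen yet, the idle is remembered instead of being delivered.
func (t *idleTracker) onIdle() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.gotMessage {
		t.pendingIdle = true
		return
	}
	t.signal()
}

// onMessage handles an assistant message and flushes a deferred idle.
func (t *idleTracker) onMessage() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.gotMessage = true
	if t.pendingIdle {
		t.pendingIdle = false
		t.signal()
	}
}

// signal wakes a waiter without blocking if a signal is already queued.
func (t *idleTracker) signal() {
	select {
	case t.idleCh <- struct{}{}:
	default:
	}
}

func main() {
	t := newIdleTracker()
	t.onIdle()                 // idle arrives first: deferred
	fmt.Println(len(t.idleCh)) // 0
	t.onMessage()              // message arrives: deferred idle is flushed
	fmt.Println(len(t.idleCh)) // 1
}
```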

Agent Display Improvements

  • Contextual verbs: Read, Edit, Create, Search, Find, Ran, Fetched, Queried instead of generic "Ran toolname with"
  • Colored diff stats: Edit shows (+3 -1) with green/red ANSI, Create shows (+25) in green
  • Tree-style sub-detail: Shell commands show description on main line, actual command on indented sub-line; MCP tools show args as tree
  • Error display: Failed tool calls show error message on red sub-line
  • Skill/subagent display: Skills show plugin name and version, subagents show description tree and completion summary
  • Consistent whitespace: All section transitions use printSeparated for uniform blank-line spacing
  • Spinner matches completion: In-progress spinner uses same contextual verb format; description/intent preferred over raw command
  • AssistantMessage rendered in real-time by AgentDisplay (was silently cached, never printed)
  • Guard against empty/whitespace-only assistant messages and reasoning causing extra blank lines
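
A minimal sketch of the colored diff stats, assuming standard ANSI escape codes and a hypothetical diffStats helper. The actual display code may build the string differently, and omitting a zero count is an assumption based on the Create example.

```go
package main

import "fmt"

// Standard ANSI color escape sequences.
const (
	green = "\x1b[32m"
	red   = "\x1b[31m"
	reset = "\x1b[0m"
)

// diffStats renders "(+N -M)" with added lines in green and removed
// lines in red. A zero count is omitted (an assumption for this sketch).
func diffStats(added, removed int) string {
	if added == 0 && removed == 0 {
		return ""
	}
	out := "("
	if added > 0 {
		out += fmt.Sprintf("%s+%d%s", green, added, reset)
	}
	if removed > 0 {
		if added > 0 {
			out += " "
		}
		out += fmt.Sprintf("%s-%d%s", red, removed, reset)
	}
	return out + ")"
}

func main() {
	fmt.Println("Edit main.bicep", diffStats(3, 1))
	fmt.Println("Create azure.yaml", diffStats(25, 0))
}
```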

Copilot CLI Distribution

  • On-demand download following the Bicep tool pattern (internal/agent/copilot/cli.go)
  • Downloads version-pinned CLI binary from npm registry on first agent use (CLI 1.0.2 / SDK v0.1.32)
  • Caches at ~/.azd/bin/copilot-cli-{version}[.exe], override via AZD_COPILOT_CLI_PATH
  • Implements tools.ExternalTool interface (Name, InstallUrl, CheckInstalled)
  • Thread-safe via sync.Once, progress spinner during download, 200MB decompression limit
  • Deleted fragile npm node_modules path scanning and SDK bundler (~106MB binary size reduction)
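
The cache path and override rule above might be resolved along these lines. The function names are hypothetical, and using os.UserHomeDir for the ~/.azd location is an assumption; cli.go may resolve the azd config directory differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
)

// cliFileName builds the version-pinned binary name, adding .exe on Windows.
func cliFileName(version, goos string) string {
	name := "copilot-cli-" + version
	if goos == "windows" {
		name += ".exe"
	}
	return name
}

// resolveCLIPath prefers the AZD_COPILOT_CLI_PATH override and otherwise
// falls back to the cache location under the user's home directory.
func resolveCLIPath(version string) (string, error) {
	if p := os.Getenv("AZD_COPILOT_CLI_PATH"); p != "" {
		return p, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", fmt.Errorf("locating home directory: %w", err)
	}
	return filepath.Join(home, ".azd", "bin", cliFileName(version, runtime.GOOS)), nil
}

func main() {
	path, err := resolveCLIPath("1.0.2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(path)
}
```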

CI Pipeline Cleanup

  • Removed ghCopilot build tag, GhCopilotClientId/GhCopilotIntegrationId from ci-build.ps1 and 5 pipeline YAMLs
  • Removed langchaingo from go.mod

Testing

  • 80+ unit tests covering display helpers, consent persistence, usage metrics
  • Consent persistence tests verify approve-once to approve-always to auto-approve flow
  • Consent reload test verifies rules survive config file round-trip
  • E2E test, golangci-lint 0 issues

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds the GitHub Copilot SDK (v0.1.25) as a foundational dependency and creates new agent infrastructure that will eventually replace the existing langchaingo-based implementation. This is Phase 1 of a 3-phase migration, where all new code coexists alongside existing code without any deletions or modifications to current functionality.

Changes:

  • Adds GitHub Copilot SDK dependency and creates wrapper types (CopilotClientManager, SessionConfigBuilder) that bridge azd config to SDK types
  • Implements CopilotAgent and CopilotAgentFactory that satisfy the existing Agent interface for seamless future switchover
  • Creates event handling infrastructure (SessionEventLogger, SessionFileLogger) that maps SDK events to azd's existing UX patterns
  • Adds 8 new config options (ai.agent.*) for customizing agent behavior (model, tools, MCP servers, system message)

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| go.mod / go.sum | Adds copilot-sdk/go v0.1.25 and transitive dependency jsonschema-go v0.4.2 (both marked indirect) |
| resources/config_options.yaml | Adds 8 new config keys for Copilot SDK agent customization |
| pkg/llm/copilot_client.go | Manager wrapping SDK client lifecycle with azd-specific error messages |
| pkg/llm/copilot_client_test.go | Tests for client manager instantiation (2 cases) |
| pkg/llm/session_config.go | Bridge converting azd config → SDK SessionConfig with MCP server merging |
| pkg/llm/session_config_test.go | Tests for config reading, tool control, MCP merging (8 cases) |
| internal/agent/copilot_agent.go | Agent implementation using SDK Session.SendAndWait, reuses existing UX patterns |
| internal/agent/copilot_agent_factory.go | Factory creating agents with SDK client, session, hooks, event handlers |
| internal/agent/logging/session_event_handler.go | Event handlers mapping SDK events to thought channel + file logging |
| internal/agent/logging/session_event_handler_test.go | Tests for event handling, tool input extraction, composite handler (10 cases) |

@microsoft-github-policy-service microsoft-github-policy-service bot added the no-recent-activity label (identity issues with no activity) Mar 6, 2026
@wbreza wbreza force-pushed the copilot-sdk-phase1 branch from d606a7b to b3ff2be Compare March 9, 2026 18:15
@microsoft-github-policy-service microsoft-github-policy-service bot removed the no-recent-activity label (identity issues with no activity) Mar 9, 2026
@wbreza wbreza force-pushed the copilot-sdk-phase1 branch from 0c5b9fd to f4f6bd0 Compare March 11, 2026 00:26
@wbreza wbreza marked this pull request as ready for review March 11, 2026 00:37
@spboyer (Member) left a comment

Combined Code Review — GPT-5.4 + Claude Opus 4.6

Reviewed independently by two models, findings deduplicated and combined below.

Summary

| Severity | Count | Details |
| --- | --- | --- |
| 🔴 Critical/Bug | 7 | 3x nil deref in error middleware, 1x error swallowed, 2x security (fail-open consent, blanket permissions), 1x WaitForIdle hang |
| 🟡 Warning | 5 | Always-on content logging, swallowed errors, cleanup ordering, singleton lifecycle, config fallback |
| 🔵 Suggestion | 1 | Ticker/canvas race condition |

Cross-Model Consensus (flagged by BOTH models independently)

These are the highest-confidence findings:

  • Nil deref on agentResult.Content in error middleware (3 locations)
  • Original error silently swallowed when user declines fix
  • Blanket SDK permission approval bypasses safety boundary
  • Consent check fails open on error

Dead Config Keys (flagged by GPT-5.4)

ai.agent.mode in config_options.yaml is never read (factory hard-codes "interactive"), and ai.agent.copilot.logLevel is never read (container constructs NewCopilotClientManager(nil) which defaults to "debug"). Users can set these keys, but they have no runtime effect. Either wire them into construction or remove from config_options.yaml until they work.

@wbreza wbreza changed the title Phase 1: Add Copilot SDK foundation alongside existing langchaingo agent Migrate to Copilot SDK for azd agent implementation Mar 11, 2026
@wbreza wbreza force-pushed the copilot-sdk-phase1 branch from c961cd9 to 5e41265 Compare March 11, 2026 19:35
@vhvb1989 (Member) commented

Heads up: orphaned ghCopilot build tag infrastructure

This PR removes the ghCopilot build-gated registration files (github_copilot_registration.go and github_copilot_registration_stub.go), but several related pieces are still in place and are now effectively dead code:

Still present (untouched by this PR):

  • cli/azd/pkg/llm/github_copilot.go — Still has //go:build ghCopilot, still imports langchaingo, still defines the full GitHubCopilotModelProvider with device code auth flow and the clientID/copilotIntegrationID ldflags. Since the registration call site is deleted, this provider is compiled but never wired into IoC.
  • cli/azd/ci-build.ps1 — Still passes -tags ghCopilot and the ldflags (-X ...clientID, -X ...copilotIntegrationID) when GitHubCopilotClientId is provided.
  • CI pipeline YAML files: build-cli.yml, cross-build-cli.yml, and build-and-test.yml still pass ghCopilotClientId / ghCopilotIntegrationId parameters through.

The build won't break (the old code compiles but is just unused), but it's worth tracking cleanup of these artifacts — either in this PR or as a follow-up:

  1. Remove pkg/llm/github_copilot.go (and its langchaingo dependency)
  2. Remove the ghCopilot tag logic from ci-build.ps1
  3. Clean up the ghCopilotClientId/ghCopilotIntegrationId params from CI YAML files

Just flagging since the list is fresh — totally fine if this is planned for a subsequent iteration.

@vhvb1989 (Member) left a comment

No blockers from my side. The migration to the Copilot SDK looks solid overall. Left a couple of comments for consideration:

  1. Silent plugin install: ensurePlugins() installs the Azure plugin system-wide without user awareness. Worth surfacing a message or prompting.
  2. Old config migration: Users with ai.agent.model.type set to "github-copilot" will get a hard failure. A small auto-migration or friendly error would help early adopters.
  3. Orphaned ghCopilot build tag: github_copilot.go, ci-build.ps1, and CI YAML params still reference the old build tag. Dead code that can be cleaned up in a follow-up.

None of these are blocking — just flagging for author consideration.

wbreza and others added 26 commits March 14, 2026 01:37
On azd init with agent mode, checks for previous sessions in the
current directory via client.ListSessions(). If found, prompts user
to resume a previous session or start fresh.

Resume uses client.ResumeSession() which restores full conversation
history with the same MCP servers, skills, permissions, and hooks.

Changes:
- CopilotAgentFactory: add ListSessions() and Resume() methods
- init.go: add session picker before agent creation, add
  CopilotAgentFactory to initAction struct

Spec at docs/specs/copilot-agent-ux/session-resume.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Show numbered choices in session picker (DisplayNumbers: true)
- Convert timestamps to local time (Today 3:04 PM, Yesterday, Jan 2)
- Truncate labels to ~120 chars total
- Shorter prompt: 'Previous sessions found:'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summaries from sessions can contain newlines and markdown. Use
strings.Fields() to collapse all whitespace into single spaces,
then truncate to fit within 120 chars total.
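
A minimal sketch of that normalization, assuming a hypothetical summarize helper and a "..." suffix on truncation (the suffix and the rune-based counting are assumptions, not confirmed by the commit):

```go
package main

import (
	"fmt"
	"strings"
)

// summarize collapses all runs of whitespace (including newlines and tabs)
// into single spaces, then truncates to at most max runes.
func summarize(s string, max int) string {
	if max < 0 {
		max = 0
	}
	s = strings.Join(strings.Fields(s), " ")
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	if max <= 3 {
		return string(r[:max])
	}
	// Reserve three runes for the ellipsis so the result fits in max.
	return string(r[:max-3]) + "..."
}

func main() {
	fmt.Println(summarize("Fix  the\n\n**build**\tnow", 120))
	fmt.Println(summarize("abcdefghij", 8))
}
```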

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added DisplayNumbers and EnableFiltering to reasoning effort, model
selection, and session picker prompts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fetch models via ListModels() instead of hardcoding. Each option shows:
  'Claude Sonnet 4.5 (high) (1x)'
  — name, default reasoning effort, billing multiplier

Also:
- Remove trailing ':' from prompt messages (UX components add them)
- Add ListModels() to CopilotAgentFactory

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Accumulate usage from assistant.usage and session.usage_info events:
input/output tokens, cost multiplier, premium requests, API duration,
and model used.

Display at session end:
  Session usage:
  • Model:            claude-sonnet-4.5
  • Input tokens:     45.2K
  • Output tokens:    12.8K
  • Total tokens:     58.0K
  • Cost:             1.0x premium
  • Premium requests: 15
  • API duration:     2m 34s

Token counts formatted as K/M for readability.
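
The K/M formatting could look roughly like this. The helper name formatCount is hypothetical and the one-decimal rounding is inferred from the sample output above; the real formatTokenCount may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatCount renders a token count with a K or M suffix and one decimal.
// Counts below 1000 are printed as-is.
func formatCount(n int) string {
	switch {
	case n < 1_000:
		return strconv.Itoa(n)
	case n < 1_000_000:
		return fmt.Sprintf("%.1fK", float64(n)/1_000)
	default:
		return fmt.Sprintf("%.1fM", float64(n)/1_000_000)
	}
}

func main() {
	for _, n := range []int{999, 45200, 12800, 58000, 1500000} {
		fmt.Printf("%d -> %s\n", n, formatCount(n))
	}
}
```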

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major refactor: CopilotAgent is now a self-contained agent that
encapsulates initialization, session management, display, and usage.
CopilotAgentFactory creates agents with dependencies wired via IoC.

New API:
  agent, _ := factory.Create(ctx, agent.WithMode("interactive"))
  initResult, _ := agent.Initialize(ctx)
  selected, _ := agent.SelectSession(ctx)
  result, _ := agent.SendMessage(ctx, prompt, agent.WithSessionID(...))
  // result.Content, result.SessionID, result.Usage
  agent.Stop()

New types (types.go):
  AgentResult{Content, SessionID, Usage}
  InitResult{Model, ReasoningEffort, IsFirstRun}
  UsageMetrics with Format() method
  AgentOption: WithModel, WithReasoningEffort, WithMode, WithDebug
  SendOption: WithSessionID
  InitOption: WithForcePrompt

Agent methods:
  Initialize() — config prompts (first run), plugin install, client start
  SelectSession() — UX picker for session resume
  ListSessions() — raw session listing
  SendMessage() / SendMessageWithRetry() — returns AgentResult
  Stop() — cleanup

Deleted (old langchaingo agent):
  agent.go, agent_factory.go, conversational_agent.go, prompts/

Simplified:
  init.go — ~40 lines instead of ~150
  container.go — removed old AgentFactory, ModelFactory registrations
  error.go — updated to use CopilotAgentFactory

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each prompt component (Select, Prompt) now adds a blank line after
its Ask() call. Callers manage spacing before prompts. Removed the
leading blank line from SelectSession (was doubling up with the
trailing blank from the previous prompt).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New test files:
- types_test.go: UsageMetrics.Format(), TotalTokens(), formatTokenCount,
  stripMarkdown, formatSessionTime (40 test cases)
- display_test.go: extractToolInputSummary, extractIntentFromArgs,
  toRelativePath, GetUsageMetrics accumulation

All pure functions tested without SDK mocking.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename Cost to BillingRate (per-request multiplier, not cumulative)
- Show premium requests from SDK (omit if not reported)
- Handle session.shutdown for TotalPremiumRequests
- Remove manual API call counter
- Fix all lint issues (errorlint, gosec, staticcheck, unused)
- Fix formatting, remove unused code (github_copilot_registration files,
  discoverInstalledPluginDirs)
- All CI checks pass: gofmt, golangci-lint, tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix config_options.yaml: tools.available/excluded type 'object' -> 'array'
- Add missing config docs: ai.agent.skills.directories, ai.agent.skills.disabled
- Log warning on userConfigManager.Load() failure instead of silently swallowing
- Simplify redundant MCP server unmarshaling (removed type probe, single path)

Other review items (r3, r6, r8, r9) referenced old code that was deleted
in the agent consolidation commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- Fix nil pointer dereferences in error.go when agentResult is nil
- Fix session.error not unblocking WaitForIdle (signal idleCh on error)
- Fix consent check fails-open: deny on error instead of allow

Security:
- PreToolUse consent check now denies on error with logged reason
- OnPermissionRequest remains approve-all (CLI-level coarse permissions,
  fine-grained control via PreToolUse hooks)
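
The fail-closed behavior can be sketched as below. The decision type, the decide function, and the check callback are hypothetical stand-ins for the PreToolUse hook and consent manager; only the deny-on-error rule and the logged reason come from this commit.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errRulesUnavailable simulates a failure while loading consent rules.
var errRulesUnavailable = errors.New("consent rules unavailable")

// decision is the outcome handed back to the tool-use hook.
type decision struct {
	allow  bool
	reason string
}

// decide runs the consent check and denies whenever the check itself fails,
// so an error can never be mistaken for approval.
func decide(tool string, check func(tool string) (bool, error)) decision {
	allowed, err := check(tool)
	if err != nil {
		log.Printf("[consent] check failed for %s: %v", tool, err)
		return decision{allow: false, reason: "consent check failed: " + err.Error()}
	}
	if !allowed {
		return decision{allow: false, reason: "denied by consent rules"}
	}
	return decision{allow: true}
}

func main() {
	failing := func(string) (bool, error) { return false, errRulesUnavailable }
	fmt.Printf("%+v\n", decide("shell", failing))
}
```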

Code quality:
- Deterministic cleanup order: changed from map to ordered slice with
  reverse teardown (session events -> file logger -> client)
- Log warning on config load failure
- Simplified MCP server unmarshaling
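
The deterministic cleanup order can be illustrated with an ordered slice torn down in reverse. The cleanupStack type is hypothetical; the registration order (client, file logger, session events) is inferred from the teardown order stated above.

```go
package main

import "fmt"

// cleanupStack keeps cleanup steps in registration order. A slice is used
// instead of a map because Go map iteration order is not deterministic.
type cleanupStack struct {
	names []string
	fns   []func()
}

// add registers a named cleanup step.
func (s *cleanupStack) add(name string, fn func()) {
	s.names = append(s.names, name)
	s.fns = append(s.fns, fn)
}

// run executes the steps in reverse registration order and returns the
// names in the order they ran.
func (s *cleanupStack) run() []string {
	order := make([]string, 0, len(s.fns))
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
		order = append(order, s.names[i])
	}
	return order
}

func main() {
	var s cleanupStack
	s.add("client", func() {})
	s.add("file logger", func() {})
	s.add("session events", func() {})
	fmt.Println(s.run()) // [session events file logger client]
}
```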

Config:
- Added skills.directories and skills.disabled to config_options.yaml
- Fixed tools.available/excluded type from object to array

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AgentMode:
- New AgentMode type with constants: AgentModeInteractive, AgentModeAutopilot,
  AgentModePlan. WithMode() now takes AgentMode instead of string.

Removed logging (redundant with Copilot CLI logs at ~/.copilot/logs/):
- thought_logger.go — old langchaingo callback handler
- file_logger.go — old langchaingo callback handler
- chained_handler.go — old langchaingo callback handler
- session_event_handler.go — SessionEventLogger, SessionFileLogger,
  CompositeEventHandler all unused after AgentDisplay consolidation
- session_event_handler_test.go

Kept: logging/util.go with TruncateString (used by display.go)

Net: 902 lines deleted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the user declines the agent fix, the code returned 'err' which
was nil (consent check succeeded), silently swallowing the original
command failure. Now returns originalError to preserve the error.
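
A simplified reconstruction of that bug and its fix, with hypothetical function and parameter names (the real middleware in error.go has more steps):

```go
package main

import (
	"errors"
	"fmt"
)

// errProvision stands in for the original command failure.
var errProvision = errors.New("provision failed")

// handleFailure asks the user whether the agent should attempt a fix.
// When the user declines, the original error must be returned.
func handleFailure(originalErr error, askToFix func() (bool, error), fix func() error) error {
	approved, err := askToFix()
	if err != nil {
		return err
	}
	if !approved {
		// The bug returned err here, which is nil at this point because the
		// prompt itself succeeded, so the command appeared to succeed.
		return originalErr
	}
	return fix()
}

func main() {
	decline := func() (bool, error) { return false, nil }
	noFix := func() error { return nil }
	fmt.Println(handleFailure(errProvision, decline, noFix)) // provision failed
}
```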

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Deleted (all replaced by Copilot CLI built-in tools):
- tools/dev/ — command executor (shell tool)
- tools/io/ — 12 file/directory tools + tests
- tools/common/ — AnnotatedTool interface, ToLangChainTools, ToolLoader
- tools/loader.go — composite tool loader
- tools/mcp/tool_adapter.go — MCP-to-langchaingo adapter
- tools/mcp/sampling_handler.go — MCP sampling handler
- tools/mcp/elicitation_handler.go — MCP elicitation handler
- tools/mcp/loader.go — MCP tool loader
- consent/consent_wrapper_tool.go — langchaingo tool wrapper

Cleaned:
- Removed WrapTool/WrapTools from ConsentManager interface and impl
- Removed common package import from consent
- Kept tools/mcp/embed.go with McpJson embed (still used by factory)

Net: 7,025 lines deleted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updated test expectations after MCP tool migration to Copilot SDK skills.
Tests now verify error_troubleshooting, provision_common_error, and
validate_azure_yaml (the 3 remaining MCP tools).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- intPtr() -> new() for pointer creation
- strings.Split -> strings.SplitSeq for range iteration
- strings.HasPrefix+TrimPrefix -> strings.CutPrefix
- floatPtr/strPtr helpers replaced with new() in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…error middleware

- Migrate all config keys from ai.agent.* to copilot.* namespace
- Move copilot_client.go and session_config.go to internal/agent/copilot/
- Delete entire pkg/llm/ package (azure_openai, ollama, github_copilot, model_factory, manager)
- Move consent commands from azd mcp consent to azd copilot consent
- Streamline error middleware: single consent prompt + agent-driven troubleshooting
- Troubleshooting prompts in embedded Go text templates
- AgentDisplay: render AssistantMessage in real-time, red x for failed tools
- Remove Content from AgentResult, delete dead feedback package
- Adopt SDK bundler for CLI binary embedding, remove npm path scanning
- Clean up CI pipelines: remove ghCopilot build tag and ldflags
- Add WithSystemMessage AgentOption
- Add composable config key constants with ConfigRoot prefix
- Remove langchaingo dependency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add CopilotCLI managed tool (internal/agent/copilot/cli.go) following Bicep pattern
- Download platform-specific CLI from npm registry on first use
- Cache at ~/.azd/bin/copilot-cli-{version}, override via AZD_COPILOT_CLI_PATH
- Implement tools.ExternalTool interface (Name, InstallUrl, CheckInstalled)
- Integrate with CopilotClientManager (resolves CLI at Start time)
- Remove SDK bundler (zcopilot_* files, go tool bundler CI step, tool dep)
- Binary size reduced ~106MB (no longer embedded)
- Fix cspell: add agentcopilot to word list, reword comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix WaitForIdle hang when SessionIdle fires before AssistantMessage
- Fix ticker vs Pause() TOCTOU race causing spinner to render over consent prompts
- Fix consent grant errors silently denying tool execution
- Add error logging for consent rule load/save failures
- Improve tool completion display with contextual verbs and diff stats
- Add tree-style sub-detail for shell commands and MCP tool args
- Add colored diff stats (green +N / red -N) for edit/create tools
- Show plugin version in skill invocation display
- Normalize whitespace spacing via printSeparated for all section transitions
- Show error messages on tool call failures
- Add comprehensive unit tests for display helpers and consent persistence
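
The ticker vs Pause() fix listed above can be modeled roughly as follows. The display type and its fields are illustrative; the PR description says only that a renderGuard sync.RWMutex lets Pause() write-lock to block ticker renders and wait for any in-flight render.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// display is a toy model of a spinner whose ticker calls render().
type display struct {
	renderGuard sync.RWMutex
	paused      bool
	frames      atomic.Int64
}

// render holds the read lock for the whole check-then-draw sequence, so
// Pause cannot slip in between the paused check and the draw.
func (d *display) render() {
	d.renderGuard.RLock()
	defer d.renderGuard.RUnlock()
	if d.paused {
		return
	}
	d.frames.Add(1) // stands in for drawing a spinner frame
}

// Pause takes the write lock, which waits for any in-flight render to
// finish and blocks new renders until the flag is set.
func (d *display) Pause() {
	d.renderGuard.Lock()
	defer d.renderGuard.Unlock()
	d.paused = true
}

func main() {
	d := &display{}
	d.render()
	d.Pause()
	d.render() // no frame is drawn once paused
	fmt.Println(d.frames.Load()) // 1
}
```

After Pause() returns, no render can be mid-draw, which is what keeps the spinner from painting over a consent prompt.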

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza force-pushed the copilot-sdk-phase1 branch from 4e90ac2 to e500827 Compare March 14, 2026 08:39
wbreza and others added 3 commits March 14, 2026 01:47
Includes copilot CLI plugin management (ListPlugins, InstallPlugin),
session time formatting, consent command migration, and azdcontext updates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On WSL with Windows filesystem mounts (/mnt/c/...), fsnotify's
filepath.Walk + inotify watch setup can hang or be extremely slow.
Moving NewWatcher() to a background goroutine prevents it from
blocking SendMessage. The watcher results are still collected at
cleanup via mutex-protected access.
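
A rough sketch of that pattern, without the fsnotify dependency: the slow setup runs in a goroutine and results are read under a mutex. The changeWatcher type, its methods, and the ready channel are hypothetical; the setup callback stands in for the fsnotify watcher construction and directory walk.

```go
package main

import (
	"fmt"
	"sync"
)

// changeWatcher collects changed paths. Setup runs in the background so
// the caller is never blocked by a slow filesystem walk.
type changeWatcher struct {
	mu      sync.Mutex
	changed []string
	err     error
	ready   chan struct{} // closed when setup finishes (used here for the demo)
}

// startWatcher returns immediately and runs setup in a goroutine.
func startWatcher(setup func() error) *changeWatcher {
	w := &changeWatcher{ready: make(chan struct{})}
	go func() {
		defer close(w.ready)
		if err := setup(); err != nil {
			w.mu.Lock()
			w.err = err
			w.mu.Unlock()
		}
	}()
	return w
}

// record notes a changed path; safe to call from any goroutine.
func (w *changeWatcher) record(path string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.changed = append(w.changed, path)
}

// collect returns a copy of what has been recorded so far, without
// waiting for setup to complete.
func (w *changeWatcher) collect() ([]string, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]string(nil), w.changed...), w.err
}

func main() {
	w := startWatcher(func() error { return nil })
	<-w.ready
	w.record("infra/main.bicep")
	files, err := w.collect()
	fmt.Println(files, err)
}
```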

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…E test

- Check GitHub Copilot auth status before session creation
- Prompt to sign in via copilot login (OAuth device flow) if not authenticated
- Add CopilotCLI.Login() wrapper for interactive copilot login command
- Fix spinner showing 'Running Ran tool' — use tool name for spinner, verb for completion
- Fix E2E test: add OnPermissionRequest: ApproveAll to session config
- Add ErrToolExecutionSkipped to excluded errors list in error mapping test
- Add unit tests for Login success and error cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Phase 1: Add Copilot SDK Foundation (new code alongside existing)

8 participants