feat: add /diagnose skill — deep diagnostic root cause analysis#935

Open
milstan wants to merge 12 commits into garrytan:main from milstan:milstan/diagnose-skill

Conversation


@milstan milstan commented Apr 9, 2026

Summary

  • New /diagnose skill for deep evidence-based root cause analysis
  • Complements /investigate (debug-and-fix) — /diagnose proves root cause with evidence chains, no code changes
  • Anti-convergence guardrails prevent premature conclusions (5 Deadly Sins, mandatory environment verification, workflow map before hypotheses)
  • Learns and reuses workflow maps + environment knowledge across sessions via gstack learnings system
  • Mandatory output blocks: Evidence Gates, Hypothesis Table, Completeness Check
  • Adaptive turn budgets: saves turns on cached phases, spends them on exhaustive Phase 3-4 analysis
  • Suggests next gstack skills based on diagnosis outcome (plan-eng-review, plan-ceo-review, review, ship)

What /diagnose does differently from /investigate

| Aspect | /investigate | /diagnose |
| --- | --- | --- |
| Goal | Find and fix the bug | Prove root cause with evidence chains |
| Code changes | Yes (Edit/Write) | No (read-only) |
| Hypothesis testing | 1-2, test most likely first | 3+ mandatory, test easiest to disprove first |
| Evidence gates | None | 3 mandatory gates with printed checklists |
| Completeness check | None | Phase 4: ≥2 alternative causes investigated even at 10/10 confidence |
| Learnings | None | Saves/loads workflow maps, env quirks, system boundaries |
| Best for | Single-system bugs with an obvious fix path | Multi-system issues, recurrent bugs, risky fixes needing certainty |

Files

  • diagnose/SKILL.md.tmpl — template (source of truth, ~1050 lines)
  • diagnose/SKILL.md — generated for Claude host (~1790 lines)
  • test/skill-e2e-diagnose.test.ts — 2 gate-tier E2E tests
  • test/helpers/touchfiles.ts — touchfile + tier entries for diagnose

Testing

  • bun test — passes (tier 1, free)
  • bun run gen:skill-docs --host all — all 8 hosts generate cleanly
  • bun run skill:check — ✅ for diagnose (27 browse commands validated)
  • EVALS=1 bun test test/skill-e2e-diagnose.test.ts — 2/2 pass, ~$0.30
    • diagnose-discovery: verifies Phase 0 environment detection
    • diagnose-no-edit: guardrail ensuring Edit/Write tools are never used
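The core assertion behind the diagnose-no-edit guardrail can be sketched as a check over the tool calls the skill made. This is an illustrative sketch only — the `ToolCall` shape and helper names are assumptions, not the actual code in `test/skill-e2e-diagnose.test.ts`:

```typescript
// Hypothetical sketch of the diagnose-no-edit guardrail check: scan a
// recorded list of tool calls and report any mutating tools that were used.
type ToolCall = { tool: string; input?: unknown };

const FORBIDDEN_TOOLS = ["Edit", "Write"]; // /diagnose must stay read-only

function violatesReadOnly(calls: ToolCall[]): string[] {
  // Return the names of any forbidden tools that were invoked.
  return calls.map((c) => c.tool).filter((t) => FORBIDDEN_TOOLS.includes(t));
}

// Example transcript: only read-only tools were used, so no violations.
const transcript: ToolCall[] = [
  { tool: "Read", input: { file_path: "src/app.ts" } },
  { tool: "Grep", input: { pattern: "env-profile" } },
];
console.log(violatesReadOnly(transcript)); // []
```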

Test plan

  • bun test passes
  • bun run gen:skill-docs --host all generates cleanly
  • bun run skill:check shows ✅ for diagnose
  • E2E: diagnose-discovery PASS
  • E2E: diagnose-no-edit PASS (Edit/Write never used)

🤖 Generated with Claude Code

Milan and others added 12 commits April 6, 2026 15:42
Read-only evidence-gathering complement to /investigate. Overcomes the
model's bias towards action by enforcing evidence gates at each phase.
Produces a diagnostic report with certainty scores — no code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two gate-tier tests:
- diagnose-discovery: verifies Phase 0 environment detection
- diagnose-no-edit: guardrail ensuring Edit/Write tools are never used

Both pass: 2/2, $0.29 total, 84s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns, turn budget

Three root causes for env-profile learnings never being saved:

1. Bare `gstack-learnings-log` / `gstack-learnings-search` without full path —
   binary not on PATH in all environments. Fixed: use ~/.claude/skills/gstack/bin/

2. Phase 0j used angle-bracket template placeholders (<FULL INVENTORY...>) that
   the model treated as examples rather than fill-in-the-blank instructions.
   Fixed: explicit YOUR_ACTUAL_INVENTORY_HERE with format example and rules.

3. Model burned all turns retrying Phase 0-pre learnings search (empty output
   from gstack-learnings-search was ambiguous). Fixed: use Grep tool instead
   of Bash, single call with explicit "do not retry" instruction.

Also: added Phase 0 turn budget (≤5 tool calls) and $B quoting fix (line 232).
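Fix 1 amounts to resolving the learnings binaries to their installed location rather than trusting PATH. A minimal sketch, assuming a helper like this (the function name is hypothetical; the install path is the one named in the commit):

```typescript
// Sketch of fix #1: build the full path to a gstack learnings binary under
// ~/.claude/skills/gstack/bin/, instead of invoking it bare and hoping it
// is on PATH (it is not in all environments).
import { homedir } from "node:os";
import { join } from "node:path";

function gstackBin(name: string): string {
  // e.g. gstackBin("gstack-learnings-search")
  //   -> /home/user/.claude/skills/gstack/bin/gstack-learnings-search
  return join(homedir(), ".claude", "skills", "gstack", "bin", name);
}
```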

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ow map, narrative bias

Analyzed a real diagnostic session that wasted 500+ lines querying the wrong
database, skipped the workflow map, jumped between hypotheses without testing,
and declared confidence 8/10 without reproducing the issue.

New guardrails:
- "5 Deadly Sins" section: wrong database, skipping workflow map, narrative
  bias, premature confidence, sequential hypothesis testing
- Phase 0-env: mandatory environment verification before ANY database query
  (print host/db, verify it matches the reported environment)
- Phase 1f: "MANDATORY BEFORE ANY HYPOTHESIS" with explicit warning about
  the garrytan#1 failure mode (skipping the map → anchoring on first suspicious thing)
- Evidence Gate 1: now a printable checklist that must include workflow map
  completion; "print it with answers" instruction
- Anti-narrative rule in Phase 2: catch "so it must be..." reasoning
- Anti-premature-convergence rules: max confidence 7 without reproduction,
  "what ELSE could explain this?" prompt after every suspicious finding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tested on real issue #3449. Without budgets, model burned 162 tool calls
exploring code without building the workflow map or printing Evidence Gate.
With budgets: env check printed immediately, workflow map built with file
refs, 4 hypotheses with testability ratings, all in 113 tool calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sions

New learnings the skill saves after each session:
- Workflow maps (architecture type, key: workflow-FLOW_NAME) — the most
  expensive artifact to build (10-15 tool calls). Compact arrow notation
  with file:line references. Future sessions reuse instead of re-tracing.
- Environment quirks (operational type, key: env-*) — database host
  mappings, staging/prod gotchas. Prevents the wrong-database trap.

New learnings the skill consumes at start:
- Phase 0-pre now loads ALL learnings (limit 20) via gstack-learnings-search,
  not just env-profile. Explicitly looks for workflow-*, env-*, and pitfall
  learnings relevant to the current issue.
- Phase 1f checks for cached workflow maps BEFORE building from scratch.
  If a matching workflow-* learning exists, starts from it and spot-checks
  2-3 file:line refs instead of re-reading all the code.
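The Phase 1f reuse step above can be sketched as picking a few references from the cached map to verify. The `WorkflowMap` shape and ref format are assumptions for illustration; the skill itself expresses this in prose:

```typescript
// Hypothetical sketch of cached-workflow-map reuse: instead of re-tracing
// the whole flow, spot-check the first few file:line refs from the saved
// map; if any is stale, fall back to rebuilding from scratch.
type WorkflowMap = { key: string; refs: string[] }; // refs like "src/checkout.ts:88"

function spotCheckRefs(map: WorkflowMap, count = 3): string[] {
  return map.refs.slice(0, count);
}

const cached: WorkflowMap = {
  key: "workflow-checkout",
  refs: ["src/cart.ts:12", "src/checkout.ts:88", "src/webhook.ts:40", "src/db.ts:7"],
};
console.log(spotCheckRefs(cached));
// ["src/cart.ts:12", "src/checkout.ts:88", "src/webhook.ts:40"]
```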

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tfalls

Phase 0-pre now runs two targeted searches instead of one unfiltered:
- --type architecture (limit 15): workflow maps, system boundaries
- --type operational (limit 10): env-profile, db host mappings, env quirks

This ensures workflow maps and environment knowledge aren't crowded out
by root-cause pitfalls that go stale after fixes.

End-of-session learnings reordered by durability:
1. Workflow maps (always save — most expensive to rebuild)
2. Environment quirks (db host traps, staging/prod differences)
3. Cross-system boundary patterns
4. Environment profile updates

Removed automatic root-cause and dead-end logging — these go stale after
fixes. Only log pitfalls that represent recurring structural patterns,
not one-off bug findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Adaptive tool budgets: when cached learnings exist, Phase 0 drops to
   ≤3 calls and Phase 1 to ≤15 calls. Saved budget (~12 calls) carries
   forward to Phase 3 for deeper hypothesis testing.

2. Stale env-profile detection: after loading cached env-profile, run a
   quick smoke test (check if key env vars still exist, deps file still
   present). If the smoke test reveals mismatches (new tools appeared,
   old tools vanished), re-run full detection (0a-0g) and update the
   profile. Prevents blindly trusting a stale cache.
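The budget arithmetic in change 1 can be sketched as follows. The uncached Phase 1 baseline (25 calls) is an assumption chosen so the carry-forward matches the ~12 calls stated above; the PR only gives the cached figures:

```typescript
// Sketch (hypothetical names) of the adaptive tool budgets: with cached
// learnings, Phase 0 drops from 5 to 3 calls and Phase 1 from an assumed
// 25 down to 15; every call saved is carried forward to Phase 3.
type Budgets = { phase0: number; phase1: number; phase3Bonus: number };

function toolBudgets(hasCachedLearnings: boolean): Budgets {
  const phase0 = hasCachedLearnings ? 3 : 5;
  const phase1 = hasCachedLearnings ? 15 : 25;
  // Calls saved relative to the uncached baseline flow into Phase 3.
  const phase3Bonus = (5 - phase0) + (25 - phase1);
  return { phase0, phase1, phase3Bonus };
}

console.log(toolBudgets(true).phase3Bonus); // 12
```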

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…roughness

Two changes:

1. Issue-aware learnings loading: Phase 0-pre now extracts keywords from the
   issue and uses --query to load RELEVANT learnings first, then broader.
   Prevents irrelevant workflow maps from crowding out useful ones as learnings
   accumulate over 10-20 runs/day. Added hygiene section: stable key names
   for natural dedup, >20 entries triggers prune suggestion.

2. Uncapped Phase 3-4 thoroughness: Phases 0-2 have budgets (save turns).
   Phase 3 has NO budget cap ("use as many tool calls as needed to reach
   confidence 9-10"). Phase 4 has explicit "do NOT skip" language. Budget
   preamble rewritten: "the goal is NOT speed — it's exhaustive understanding."
   Every turn saved early is explicitly redirected to deeper investigation.
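The keyword extraction in change 1 could look like the sketch below. This is illustrative only — the skill drives the model to extract keywords in natural language, not via code, and the stopword list is an assumption:

```typescript
// Hypothetical sketch of issue-aware keyword extraction for the
// relevance-first learnings --query.
const STOPWORDS = new Set(["the", "a", "an", "is", "in", "on", "not", "when", "and", "to"]);

function issueKeywords(issueTitle: string, max = 5): string[] {
  return issueTitle
    .toLowerCase()
    .split(/[^a-z0-9_-]+/)
    .filter((w) => w.length > 2 && !STOPWORDS.has(w))
    .slice(0, max);
}

console.log(issueKeywords("Checkout webhook not firing in staging"));
// ["checkout", "webhook", "firing", "staging"]
```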

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The skill template said "must have" but the model skipped the artifacts
anyway. Root cause: instructions were phrased as checklists to verify
mentally, not as mandatory output blocks the model must print.

Three fixes:

1. Evidence Gate 1, Hypothesis Table, and Hypothesis Results are now
   MANDATORY OUTPUT BLOCKS with "if this block does not appear in your
   output, you have violated the skill protocol" language. Each has a
   fill-in-the-blank format the model must complete.

2. Phase 4 (Exhaustive Analysis) now has its own mandatory output block:
   COMPLETENESS CHECK requires investigating ≥2 alternative explanations
   for the symptom even after confirming root cause at 10/10. The block
   must list each alternative, what was checked, and the result.

3. Diagnostic report template now includes a COMPLETENESS section that
   requires Phase 4 findings — alternative causes investigated and
   contributing factors verified. Can't write the report without doing
   the completeness work.
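A mandatory-output-block check of the kind the E2E tests could apply is sketched below; the block headings are taken from the PR text, but the check itself is a hypothetical illustration:

```typescript
// Sketch: verify that every mandatory output block heading appears in the
// model's transcript; return the ones that are missing.
const MANDATORY_BLOCKS = [
  "EVIDENCE GATE 1",
  "HYPOTHESIS TABLE",
  "COMPLETENESS CHECK",
];

function missingBlocks(transcript: string): string[] {
  const upper = transcript.toUpperCase();
  return MANDATORY_BLOCKS.filter((b) => !upper.includes(b));
}

const output = "… Evidence Gate 1: [x] workflow map complete …";
console.log(missingBlocks(output));
// ["HYPOTHESIS TABLE", "COMPLETENESS CHECK"]
```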

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Report now includes a NEXT STEPS section that recommends the logical
follow-up skill:
- ROOT_CAUSE + simple fix → /investigate
- ROOT_CAUSE + complex fix → plan + /plan-eng-review
- ROOT_CAUSE + scope question → plan + /plan-ceo-review
- PROBABLE_CAUSE → what data would upgrade confidence, or /qa
- INSUFFICIENT_EVIDENCE → /investigate with specific instructions
- Security implications → /cso
- Multi-system fix → /review when PR is ready

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…stigate

Replace /investigate suggestions with actionable next steps:
- Simple fix → implement it, then /review + /ship
- Complex fix → plan + /plan-eng-review
- Scope question → plan + /plan-ceo-review
- Fix PR ready → /review + /ship
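The routing above can be sketched as a lookup table. The keys are hypothetical; the skill expresses this mapping as prose in the report template:

```typescript
// Sketch of the NEXT STEPS routing from diagnosis outcome to suggested
// follow-up, mirroring the bullet list above.
const NEXT_STEPS: Record<string, string> = {
  "simple-fix": "implement it, then /review + /ship",
  "complex-fix": "plan + /plan-eng-review",
  "scope-question": "plan + /plan-ceo-review",
  "fix-pr-ready": "/review + /ship",
};

console.log(NEXT_STEPS["complex-fix"]); // "plan + /plan-eng-review"
```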

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

milstan commented Apr 9, 2026

Hey @garrytan — I made this skill and have been using it for a while to go beyond /investigate on some of our more complex debugging (which often requires setting the bias towards action aside in order to understand first). I thought it might be useful to other people as well, so here it is.

