feat: add /diagnose skill — deep diagnostic root cause analysis#935

Open
milstan wants to merge 12 commits into garrytan:main from milstan:milstan/diagnose-skill

Conversation


@milstan milstan commented Apr 9, 2026

Summary

  • New /diagnose skill for deep evidence-based root cause analysis
  • Complements /investigate (debug-and-fix) — /diagnose proves root cause with evidence chains, no code changes
  • Anti-convergence guardrails prevent premature conclusions (5 Deadly Sins, mandatory environment verification, workflow map before hypotheses)
  • Learns and reuses workflow maps + environment knowledge across sessions via gstack learnings system
  • Mandatory output blocks: Evidence Gates, Hypothesis Table, Completeness Check
  • Adaptive turn budgets: saves turns on cached phases, spends them on exhaustive Phase 3-4 analysis
  • Suggests next gstack skills based on diagnosis outcome (plan-eng-review, plan-ceo-review, review, ship)

What /diagnose does differently from /investigate

| Aspect | /investigate | /diagnose |
| --- | --- | --- |
| Goal | Find and fix the bug | Prove root cause with evidence chains |
| Code changes | Yes (Edit/Write) | No (read-only) |
| Hypothesis testing | 1-2, test most likely first | 3+ mandatory, test easiest to disprove first |
| Evidence gates | None | 3 mandatory gates with printed checklists |
| Completeness check | None | Phase 4: ≥2 alternative causes investigated even at 10/10 confidence |
| Learnings | None | Saves/loads workflow maps, env quirks, system boundaries |
| Best for | Single-system bugs with an obvious fix path | Multi-system issues, recurrent bugs, risky fixes needing certainty |

Files

  • diagnose/SKILL.md.tmpl — template (source of truth, ~1050 lines)
  • diagnose/SKILL.md — generated for Claude host (~1790 lines)
  • test/skill-e2e-diagnose.test.ts — 2 gate-tier E2E tests
  • test/helpers/touchfiles.ts — touchfile + tier entries for diagnose

Testing

  • bun test — passes (tier 1, free)
  • bun run gen:skill-docs --host all — all 8 hosts generate cleanly
  • bun run skill:check — ✅ for diagnose (27 browse commands validated)
  • EVALS=1 bun test test/skill-e2e-diagnose.test.ts — 2/2 pass, ~$0.30
    • diagnose-discovery: verifies Phase 0 environment detection
    • diagnose-no-edit: guardrail ensuring Edit/Write tools are never used
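The core assertion behind the diagnose-no-edit guardrail can be sketched as a check over the tool calls the skill made. This is an illustrative sketch only — the `ToolCall` shape and helper names are assumptions, not the actual code in `test/skill-e2e-diagnose.test.ts`:

```typescript
// Hypothetical sketch of the diagnose-no-edit guardrail check: scan a
// recorded list of tool calls and report any mutating tools that were used.
type ToolCall = { tool: string; input?: unknown };

const FORBIDDEN_TOOLS = ["Edit", "Write"]; // /diagnose must stay read-only

function violatesReadOnly(calls: ToolCall[]): string[] {
  // Return the names of any forbidden tools that were invoked.
  return calls.map((c) => c.tool).filter((t) => FORBIDDEN_TOOLS.includes(t));
}

// Example transcript: only read-only tools were used, so no violations.
const transcript: ToolCall[] = [
  { tool: "Read", input: { file_path: "src/app.ts" } },
  { tool: "Grep", input: { pattern: "env-profile" } },
];
console.log(violatesReadOnly(transcript)); // []
```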

Test plan

  • bun test passes
  • bun run gen:skill-docs --host all generates cleanly
  • bun run skill:check shows ✅ for diagnose
  • E2E: diagnose-discovery PASS
  • E2E: diagnose-no-edit PASS (Edit/Write never used)

🤖 Generated with Claude Code

Milan and others added 12 commits April 6, 2026 15:42
Read-only evidence-gathering complement to /investigate. Overcomes the
model's bias towards action by enforcing evidence gates at each phase.
Produces a diagnostic report with certainty scores — no code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two gate-tier tests:
- diagnose-discovery: verifies Phase 0 environment detection
- diagnose-no-edit: guardrail ensuring Edit/Write tools are never used

Both pass: 2/2, $0.29 total, 84s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns, turn budget

Three root causes for env-profile learnings never being saved:

1. Bare `gstack-learnings-log` / `gstack-learnings-search` without full path —
   binary not on PATH in all environments. Fixed: use ~/.claude/skills/gstack/bin/

2. Phase 0j used angle-bracket template placeholders (<FULL INVENTORY...>) that
   the model treated as examples rather than fill-in-the-blank instructions.
   Fixed: explicit YOUR_ACTUAL_INVENTORY_HERE with format example and rules.

3. Model burned all turns retrying Phase 0-pre learnings search (empty output
   from gstack-learnings-search was ambiguous). Fixed: use Grep tool instead
   of Bash, single call with explicit "do not retry" instruction.

Also: added Phase 0 turn budget (≤5 tool calls) and $B quoting fix (line 232).
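Fix 1 amounts to resolving the learnings binaries to their installed location rather than trusting PATH. A minimal sketch, assuming a helper like this (the function name is hypothetical; the install path is the one named in the commit):

```typescript
// Sketch of fix #1: build the full path to a gstack learnings binary under
// ~/.claude/skills/gstack/bin/, instead of invoking it bare and hoping it
// is on PATH (it is not in all environments).
import { homedir } from "node:os";
import { join } from "node:path";

function gstackBin(name: string): string {
  // e.g. gstackBin("gstack-learnings-search")
  //   -> /home/user/.claude/skills/gstack/bin/gstack-learnings-search
  return join(homedir(), ".claude", "skills", "gstack", "bin", name);
}
```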

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ow map, narrative bias

Analyzed a real diagnostic session that wasted 500+ lines querying the wrong
database, skipped the workflow map, jumped between hypotheses without testing,
and declared confidence 8/10 without reproducing the issue.

New guardrails:
- "5 Deadly Sins" section: wrong database, skipping workflow map, narrative
  bias, premature confidence, sequential hypothesis testing
- Phase 0-env: mandatory environment verification before ANY database query
  (print host/db, verify it matches the reported environment)
- Phase 1f: "MANDATORY BEFORE ANY HYPOTHESIS" with explicit warning about
  the garrytan#1 failure mode (skipping the map → anchoring on first suspicious thing)
- Evidence Gate 1: now a printable checklist that must include workflow map
  completion; "print it with answers" instruction
- Anti-narrative rule in Phase 2: catch "so it must be..." reasoning
- Anti-premature-convergence rules: max confidence 7 without reproduction,
  "what ELSE could explain this?" prompt after every suspicious finding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tested on real issue #3449. Without budgets, model burned 162 tool calls
exploring code without building the workflow map or printing Evidence Gate.
With budgets: env check printed immediately, workflow map built with file
refs, 4 hypotheses with testability ratings, all in 113 tool calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sions

New learnings the skill saves after each session:
- Workflow maps (architecture type, key: workflow-FLOW_NAME) — the most
  expensive artifact to build (10-15 tool calls). Compact arrow notation
  with file:line references. Future sessions reuse instead of re-tracing.
- Environment quirks (operational type, key: env-*) — database host
  mappings, staging/prod gotchas. Prevents the wrong-database trap.

New learnings the skill consumes at start:
- Phase 0-pre now loads ALL learnings (limit 20) via gstack-learnings-search,
  not just env-profile. Explicitly looks for workflow-*, env-*, and pitfall
  learnings relevant to the current issue.
- Phase 1f checks for cached workflow maps BEFORE building from scratch.
  If a matching workflow-* learning exists, starts from it and spot-checks
  2-3 file:line refs instead of re-reading all the code.
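The Phase 1f reuse step above can be sketched as picking a few references from the cached map to verify. The `WorkflowMap` shape and ref format are assumptions for illustration; the skill itself expresses this in prose:

```typescript
// Hypothetical sketch of cached-workflow-map reuse: instead of re-tracing
// the whole flow, spot-check the first few file:line refs from the saved
// map; if any is stale, fall back to rebuilding from scratch.
type WorkflowMap = { key: string; refs: string[] }; // refs like "src/checkout.ts:88"

function spotCheckRefs(map: WorkflowMap, count = 3): string[] {
  return map.refs.slice(0, count);
}

const cached: WorkflowMap = {
  key: "workflow-checkout",
  refs: ["src/cart.ts:12", "src/checkout.ts:88", "src/webhook.ts:40", "src/db.ts:7"],
};
console.log(spotCheckRefs(cached));
// ["src/cart.ts:12", "src/checkout.ts:88", "src/webhook.ts:40"]
```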

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tfalls

Phase 0-pre now runs two targeted searches instead of one unfiltered:
- --type architecture (limit 15): workflow maps, system boundaries
- --type operational (limit 10): env-profile, db host mappings, env quirks

This ensures workflow maps and environment knowledge aren't crowded out
by root-cause pitfalls that go stale after fixes.

End-of-session learnings reordered by durability:
1. Workflow maps (always save — most expensive to rebuild)
2. Environment quirks (db host traps, staging/prod differences)
3. Cross-system boundary patterns
4. Environment profile updates

Removed automatic root-cause and dead-end logging — these go stale after
fixes. Only log pitfalls that represent recurring structural patterns,
not one-off bug findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Adaptive tool budgets: when cached learnings exist, Phase 0 drops to
   ≤3 calls and Phase 1 to ≤15 calls. Saved budget (~12 calls) carries
   forward to Phase 3 for deeper hypothesis testing.

2. Stale env-profile detection: after loading cached env-profile, run a
   quick smoke test (check if key env vars still exist, deps file still
   present). If the smoke test reveals mismatches (new tools appeared,
   old tools vanished), re-run full detection (0a-0g) and update the
   profile. Prevents blindly trusting a stale cache.
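The budget arithmetic in change 1 can be sketched as follows. The uncached Phase 1 baseline (25 calls) is an assumption chosen so the carry-forward matches the ~12 calls stated above; the PR only gives the cached figures:

```typescript
// Sketch (hypothetical names) of the adaptive tool budgets: with cached
// learnings, Phase 0 drops from 5 to 3 calls and Phase 1 from an assumed
// 25 down to 15; every call saved is carried forward to Phase 3.
type Budgets = { phase0: number; phase1: number; phase3Bonus: number };

function toolBudgets(hasCachedLearnings: boolean): Budgets {
  const phase0 = hasCachedLearnings ? 3 : 5;
  const phase1 = hasCachedLearnings ? 15 : 25;
  // Calls saved relative to the uncached baseline flow into Phase 3.
  const phase3Bonus = (5 - phase0) + (25 - phase1);
  return { phase0, phase1, phase3Bonus };
}

console.log(toolBudgets(true).phase3Bonus); // 12
```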

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…roughness

Two changes:

1. Issue-aware learnings loading: Phase 0-pre now extracts keywords from the
   issue and uses --query to load RELEVANT learnings first, then broader.
   Prevents irrelevant workflow maps from crowding out useful ones as learnings
   accumulate over 10-20 runs/day. Added hygiene section: stable key names
   for natural dedup, >20 entries triggers prune suggestion.

2. Uncapped Phase 3-4 thoroughness: Phases 0-2 have budgets (save turns).
   Phase 3 has NO budget cap ("use as many tool calls as needed to reach
   confidence 9-10"). Phase 4 has explicit "do NOT skip" language. Budget
   preamble rewritten: "the goal is NOT speed — it's exhaustive understanding."
   Every turn saved early is explicitly redirected to deeper investigation.
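The keyword extraction in change 1 could look like the sketch below. This is illustrative only — the skill drives the model to extract keywords in natural language, not via code, and the stopword list is an assumption:

```typescript
// Hypothetical sketch of issue-aware keyword extraction for the
// relevance-first learnings --query.
const STOPWORDS = new Set(["the", "a", "an", "is", "in", "on", "not", "when", "and", "to"]);

function issueKeywords(issueTitle: string, max = 5): string[] {
  return issueTitle
    .toLowerCase()
    .split(/[^a-z0-9_-]+/)
    .filter((w) => w.length > 2 && !STOPWORDS.has(w))
    .slice(0, max);
}

console.log(issueKeywords("Checkout webhook not firing in staging"));
// ["checkout", "webhook", "firing", "staging"]
```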

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The skill template said "must have" but the model skipped the artifacts
anyway. Root cause: instructions were phrased as checklists to verify
mentally, not as mandatory output blocks the model must print.

Three fixes:

1. Evidence Gate 1, Hypothesis Table, and Hypothesis Results are now
   MANDATORY OUTPUT BLOCKS with "if this block does not appear in your
   output, you have violated the skill protocol" language. Each has a
   fill-in-the-blank format the model must complete.

2. Phase 4 (Exhaustive Analysis) now has its own mandatory output block:
   COMPLETENESS CHECK requires investigating ≥2 alternative explanations
   for the symptom even after confirming root cause at 10/10. The block
   must list each alternative, what was checked, and the result.

3. Diagnostic report template now includes a COMPLETENESS section that
   requires Phase 4 findings — alternative causes investigated and
   contributing factors verified. Can't write the report without doing
   the completeness work.
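A mandatory-output-block check of the kind the E2E tests could apply is sketched below; the block headings are taken from the PR text, but the check itself is a hypothetical illustration:

```typescript
// Sketch: verify that every mandatory output block heading appears in the
// model's transcript; return the ones that are missing.
const MANDATORY_BLOCKS = [
  "EVIDENCE GATE 1",
  "HYPOTHESIS TABLE",
  "COMPLETENESS CHECK",
];

function missingBlocks(transcript: string): string[] {
  const upper = transcript.toUpperCase();
  return MANDATORY_BLOCKS.filter((b) => !upper.includes(b));
}

const output = "… Evidence Gate 1: [x] workflow map complete …";
console.log(missingBlocks(output));
// ["HYPOTHESIS TABLE", "COMPLETENESS CHECK"]
```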

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Report now includes a NEXT STEPS section that recommends the logical
follow-up skill:
- ROOT_CAUSE + simple fix → /investigate
- ROOT_CAUSE + complex fix → plan + /plan-eng-review
- ROOT_CAUSE + scope question → plan + /plan-ceo-review
- PROBABLE_CAUSE → what data would upgrade confidence, or /qa
- INSUFFICIENT_EVIDENCE → /investigate with specific instructions
- Security implications → /cso
- Multi-system fix → /review when PR is ready

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…stigate

Replace /investigate suggestions with actionable next steps:
- Simple fix → implement it, then /review + /ship
- Complex fix → plan + /plan-eng-review
- Scope question → plan + /plan-ceo-review
- Fix PR ready → /review + /ship
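The routing above can be sketched as a lookup table. The keys are hypothetical; the skill expresses this mapping as prose in the report template:

```typescript
// Sketch of the NEXT STEPS routing from diagnosis outcome to suggested
// follow-up, mirroring the bullet list above.
const NEXT_STEPS: Record<string, string> = {
  "simple-fix": "implement it, then /review + /ship",
  "complex-fix": "plan + /plan-eng-review",
  "scope-question": "plan + /plan-ceo-review",
  "fix-pr-ready": "/review + /ship",
};

console.log(NEXT_STEPS["complex-fix"]); // "plan + /plan-eng-review"
```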

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

milstan commented Apr 9, 2026

Hey @garrytan — I made this skill and have been using it for a while to go beyond /investigate on some of our more complex debugging (which often requires setting the bias towards action aside in order to understand first). I thought it might be useful to other people as well, so here it is.

