chore(dx): add error-triage Claude skill for #console-alerts #3002

Closed
baktun14 wants to merge 3 commits into main from chore/dx-error-triage-skill

Conversation

Contributor

@baktun14 baktun14 commented Mar 27, 2026

Why

The team needs a repeatable way to triage #console-alerts — scanning Slack, investigating errors via Sentry and Grafana, deduplicating against Linear, and filing well-structured issues. This skill automates that workflow.

What

Adds .claude/skills/error-triage/ with a comprehensive triage workflow:

  • Slack integration: Reads #console-alerts, checks reactions/threads for existing context
  • Sentry investigation: Deep-dives into Sentry issues (scope, events, tag values, AI analysis, code correlation)
  • Grafana investigation: Queries Loki logs for empty-message alerts, filters nested JSON for real errors
  • Linear deduplication: Searches existing issues before proposing new ones
  • Issue creation: Follows project conventions (team, project, labels, description format)
  • Thread replies: Posts consolidated investigation findings back to Slack with dedup logic to avoid flooding threads
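
The thread-reply dedup in the last bullet can be pictured with a small sketch. It is not the skill's actual logic: the `ThreadReply` type, the `should_post_findings` helper, and the idea of detecting earlier findings by a reply prefix are all assumptions made for illustration (the prefix itself is borrowed from the reply format quoted later in the review).

```python
from dataclasses import dataclass

# Assumption: earlier findings posted by the skill can be recognized by this
# prefix. The real skill may use a different marker or a different check.
INVESTIGATED_PREFIX = "✅ Investigated"


@dataclass
class ThreadReply:
    """Hypothetical, minimal view of one Slack thread reply."""
    author: str
    text: str


def should_post_findings(replies: list[ThreadReply], bot_user: str) -> bool:
    """Return True only if bot_user has not already posted findings in this thread."""
    return not any(
        reply.author == bot_user and reply.text.startswith(INVESTIGATED_PREFIX)
        for reply in replies
    )
```

With a check like this, re-running the triage over the same alert would leave the thread untouched instead of appending a second findings message.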

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive error-triage workflow for investigating Slack alerts and Grafana/Sentry incidents, covering alert reading, grouping/prioritization, investigation steps, deduplication checks, issue-drafting templates, and required reply formats for marking alerts attended.
  • Tests
    • Added evaluation cases to validate expected triage behaviors, investigation steps, severity assessment, deduplication, and issue-proposal formatting.

Adds a skill that scans #console-alerts, investigates Sentry and Grafana
errors in depth, deduplicates against Linear, and proposes well-structured
bug issues. Includes dedup logic to avoid flooding Slack threads.
Contributor

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

Adds a new "error-triage" Claude skill: documentation defining a Slack-driven alert triage workflow (handles empty Grafana messages via Loki/Prometheus queries, Sentry investigations, grouping/prioritization, Linear dedupe/issue drafting) plus three eval cases validating expected behaviors.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Skill documentation: .claude/skills/error-triage/SKILL.md | New, detailed workflow spec covering triggers, reading #console-alerts messages/threads, special Grafana handling (extract service context, query Loki/Prometheus), Sentry MCP investigation steps, grouping/prioritization, redaction rules, Slack reply formats, and strict Linear issue drafting process gated by approval. |
| Evaluations: .claude/skills/error-triage/evals/evals.json | New eval config with 3 cases asserting end-to-end behavior: read Slack alerts/threads, inspect reactions, prioritize Sentry/Grafana 5xx, perform MCP investigations, deduplicate against Linear, and produce structured issue proposals and a final triage summary. |

Sequence Diagram(s)

mermaid
sequenceDiagram
    participant Slack as Slack (#console-alerts)
    participant Skill as Error-Triage Skill
    participant Grafana as Grafana / Loki / Prometheus
    participant Sentry as Sentry MCP
    participant Linear as Linear
    Slack->>Skill: Deliver alert message + thread (Grafana may be empty)
    Skill->>Slack: Read thread context & reactions
    alt Grafana alert (empty body)
        Skill->>Grafana: Extract service context, query logs/metrics
        Grafana-->>Skill: Return logs/metrics
    else Sentry alert
        Skill->>Sentry: Query unresolved issues, events, scope tags
        Sentry-->>Skill: Return events & stack traces
    end
    Skill->>Skill: Group/prioritize alerts, correlate traces to repo
    Skill->>Linear: Search for existing issues (dedupe)
    Linear-->>Skill: Return matches
    Skill->>Human: Present proposed issues for approval
    alt Approved
        Skill->>Linear: Create issues (title/labels/description)
        Linear-->>Skill: Return issue URLs
    end
    Skill->>Slack: Post consolidated thread replies and final triage table
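
The dedupe, approval, and create steps in the diagram can be sketched as a single function. This is only an illustration: `file_approved_issues` and its callable parameters are hypothetical, and exact-title matching stands in for whatever search the skill really runs against Linear.

```python
from typing import Callable


def file_approved_issues(
    proposals: list[str],
    existing_titles: set[str],
    approve: Callable[[str], bool],
    create: Callable[[str], str],
) -> list[str]:
    """Create an issue for each proposal that is new and approved; return the URLs."""
    urls = []
    for title in proposals:
        if title in existing_titles:
            # Dedupe step: an issue with this title already exists.
            continue
        if approve(title):
            # Approval gate: nothing is created without an explicit yes.
            urls.append(create(title))
    return urls
```

Passing `approve` and `create` in as callables keeps the human gate and the Linear call out of the control flow, so the ordering (dedupe first, then approval, then creation) is easy to check in isolation.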

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble alerts from Slack tonight,

I chase empty Grafana into light,
I follow Sentry's tangled traces,
I tidy threads and mark the places,
Carrots, code, and tickets — all aligned. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'chore(dx): add error-triage Claude skill for #console-alerts' is specific and clearly summarizes the main change: adding a new Claude skill for error triage. |
| Description check | ✅ Passed | The PR description includes both required sections (Why and What) with clear explanations of the motivation and the specific features added, aligning well with the template requirements. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@github-actions github-actions bot left a comment

Auto-approved: chore that does not touch source code files.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
.claude/skills/error-triage/SKILL.md (1)

62-62: Optional wording polish for repeated sentence starts.

Line 62 has three consecutive sentences beginning with “If…”. Consider slight rewording for readability; no behavior impact.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/error-triage/SKILL.md at line 62, Line 62 repeats three
sentences that all start with "If…", hurting flow; rewrite them to vary sentence
openings while preserving meaning — e.g., keep the first as-is, change the
second to "When an alert has eyes, note who is investigating," and the third to
"Record thread replies that include a Linear issue link (e.g., `CON-XXX`) so you
can skip duplicate checks." Update the sentence phrasing in the paragraph
containing those sentences to use this or a similar variation to improve
readability without changing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/error-triage/SKILL.md:
- Around line 16-19: Update the four fenced code blocks that lack language tags
by adding an appropriate language identifier (e.g., text or bash) to each
opening backtick fence; specifically, add a language tag to the blocks beginning
with "Use ToolSearch to find: slack_read_channel", "Use ToolSearch to find
relevant Grafana tools:", "Use ToolSearch to find Sentry tools:", and the block
starting "✅ Investigated via Sentry:" so markdownlint MD040 stops failing—open
each triple-backtick fence for those blocks and append "text" (or a more
specific tag like "bash" if commands are present).
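
A quick way to locate the fences MD040 complains about is to scan for opening fences with an empty info string. The helper below is a rough sketch, not markdownlint's implementation; `untagged_fences` is a hypothetical name, and it deliberately ignores edge cases such as fences nested inside list items or block quotes.

```python
import re

# An opening or closing fence: up to three spaces of indent, then 3+ backticks
# or tildes, then an optional info string.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    missing = []
    open_fence = None  # (fence character, fence length) while inside a block
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if match is None:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            open_fence = (marker[0], len(marker))
            if not info:
                missing.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            # A closing fence uses the same character, is at least as long,
            # and has no info string.
            open_fence = None
    return missing
```

Running it over SKILL.md would list the line numbers of fences that still need `text` (or a more specific tag) appended.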


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e3895bb-a411-4b3f-ac64-a66a50534e2b

📥 Commits

Reviewing files that changed from the base of the PR and between b0ca8c7 and 8bc3e0c.

📒 Files selected for processing (2)
  • .claude/skills/error-triage/SKILL.md
  • .claude/skills/error-triage/evals/evals.json

Comment thread .claude/skills/error-triage/SKILL.md Outdated
Contributor

@github-actions github-actions bot left a comment

Auto-approved: chore that does not touch source code files.

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.claude/skills/error-triage/SKILL.md (1)

62-62: Consider varying sentence structure for readability.

Three consecutive sentences begin with "If", which slightly impacts flow. While this is a minor style consideration, you could rephrase for variety if desired.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/error-triage/SKILL.md at line 62, Rewrite the three
consecutive sentences that begin with "If" to improve flow: keep the same facts
but vary sentence openings in the line that mentions reactions and thread
replies (the sentence containing the example token `CON-XXX`), e.g. turn the
first into a declarative statement ("A checkmark reaction indicates it has been
handled."), change the second to start with the reaction description rather than
"If" ("An eyes reaction means someone is investigating — note who."), and
rephrase the third to lead with the action ("Record any Linear issue link (e.g.,
`CON-XXX`) found in a thread reply so you can skip duplicate checks."); ensure
meaning and examples remain unchanged.
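
The three facts that sentence records (handled, who is investigating, linked Linear issue) map onto a small lookup. The sketch below is an assumption-heavy illustration: `existing_context` is a hypothetical helper, the reaction names `white_check_mark` and `eyes` are assumed to be the Slack emoji names in use, and the `CON-<digits>` pattern is inferred from the `CON-XXX` placeholder.

```python
import re

# Assumption: Linear keys for this team look like CON-123.
LINEAR_KEY = re.compile(r"\bCON-\d+\b")


def existing_context(reactions: dict[str, list[str]], thread_replies: list[str]) -> dict:
    """Summarize what reactions and thread replies already say about an alert.

    reactions maps a reaction name to the users who added it.
    """
    linear_issues = sorted(
        {key for reply in thread_replies for key in LINEAR_KEY.findall(reply)}
    )
    return {
        "handled": "white_check_mark" in reactions,  # checkmark: already handled
        "investigators": reactions.get("eyes", []),  # eyes: note who is looking
        "linear_issues": linear_issues,              # linked issue: skip duplicate checks
    }
```

For example, an alert with an eyes reaction from one user and a reply mentioning `CON-101` would yield that user under `investigators` and `["CON-101"]` under `linear_issues`.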

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59cc8aeb-4d8c-4945-8380-dfe67b767c73

📥 Commits

Reviewing files that changed from the base of the PR and between 8bc3e0c and 856bc5e.

📒 Files selected for processing (1)
  • .claude/skills/error-triage/SKILL.md

Add language tags to fenced code blocks (markdownlint MD040) and vary
sentence structure at line 62 for readability.
Contributor

@github-actions github-actions bot left a comment

Auto-approved: chore that does not touch source code files.

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
.claude/skills/error-triage/SKILL.md (2)

64-67: Clarify precedence between “skip addressed alerts” and “attend every alert.”

These two rules can be read as conflicting. Add one explicit rule for checkmarked alerts (e.g., “record in summary only, no new thread reply unless new evidence appears”).

Also applies to: 228-233

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/error-triage/SKILL.md around lines 64 - 67, Clarify the
precedence conflict between the "Skip:" rule (the bullet list that starts with
"Skip: - Deployment approval requests..." including the checkmark rule) and the
"attend every alert" policy by adding an explicit sentence that checkmarked
alerts are recorded in the summary only and should not receive a new thread
reply unless new evidence appears; update the same wording where the "attend
every alert" directive appears (referenced as "attend every alert") to point to
this exception so readers know the checkmark rule takes precedence for
already-addressed alerts.
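
The proposed precedence can be written down as a two-input decision. This is a sketch of the reviewer's suggested rule only, with hypothetical names (`Action`, `next_action`); how "new evidence" is detected is left open.

```python
from enum import Enum


class Action(Enum):
    SUMMARY_ONLY = "summary_only"        # list in the triage summary, no thread reply
    REPLY_IN_THREAD = "reply_in_thread"  # attend the alert with a thread reply


def next_action(has_checkmark: bool, has_new_evidence: bool) -> Action:
    """Checkmarked alerts are summary-only unless new evidence has appeared."""
    if has_checkmark and not has_new_evidence:
        return Action.SUMMARY_ONLY
    return Action.REPLY_IN_THREAD
```

Every alert still ends up somewhere (summary or thread), which is how "attend every alert" and "skip addressed alerts" can coexist without contradiction.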

99-99: Avoid hard-coding a single Loki datasource UID.

A fixed UID is brittle across Grafana environments and can break triage when datasources are recreated/renamed. Prefer “discover by name, then fall back to UID if confirmed.”

💡 Suggested doc tweak
-When querying Loki, use the `beenf7rks2e4gd` datasource UID. Filter by `service_name` label and use text filters like `|= '"level":"error"'` for nested JSON. The `detected_level` label is unreliable for error filtering.
+When querying Loki, first identify the correct Loki datasource for the environment (prefer lookup by name), then use its UID for queries. If the workspace is known to use `beenf7rks2e4gd`, use it as a fallback. Filter by `service_name` label and use text filters like `|= '"level":"error"'` for nested JSON. The `detected_level` label is unreliable for error filtering.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/error-triage/SKILL.md at line 99, The doc currently
hard-codes the Loki datasource UID "beenf7rks2e4gd" which is brittle; update the
guidance and any example queries to first resolve the Grafana datasource by its
human-readable name (e.g., "Loki" or the expected datasource name) and only fall
back to using a UID when the name lookup fails and the UID is confirmed; keep
the recommendations to filter by the service_name label and use text filters
like |= '"level":"error"' for nested JSON and explicitly note that
detected_level is unreliable for error filtering so callers should prefer
name-resolution then optional UID fallback.
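
A name-first lookup with a confirmed-UID fallback might look like the sketch below. It assumes the datasource list is a list of dicts with `name`, `type`, and `uid` keys, roughly the shape returned by Grafana's datasource listing; `resolve_loki_uid`, `error_query`, and the `console-api` service name in the usage note are hypothetical. The query builder uses LogQL's backtick-quoted string form for the same `"level":"error"` text filter, so the inner double quotes need no escaping.

```python
from typing import Optional


def resolve_loki_uid(
    datasources: list[dict],
    name: str,
    fallback_uid: Optional[str] = None,
) -> str:
    """Prefer a Loki datasource matched by name; use the UID only if it exists."""
    for ds in datasources:
        if ds.get("type") == "loki" and ds.get("name") == name:
            return ds["uid"]
    # Fallback: accept the known UID only when it is confirmed in the list.
    if fallback_uid is not None and any(ds.get("uid") == fallback_uid for ds in datasources):
        return fallback_uid
    raise LookupError(f"no Loki datasource named {name!r} and no confirmed fallback UID")


def error_query(service_name: str) -> str:
    """Build a LogQL query that text-filters nested JSON for error-level lines."""
    return f'{{service_name="{service_name}"}} |= `"level":"error"`'
```

For instance, `error_query("console-api")` selects the `service_name="console-api"` stream and keeps only lines containing `"level":"error"`, without relying on the `detected_level` label.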

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc21f3a8-679e-4a78-a7bf-2df42b6335da

📥 Commits

Reviewing files that changed from the base of the PR and between 856bc5e and b46407f.

📒 Files selected for processing (1)
  • .claude/skills/error-triage/SKILL.md
