Usage Examples

Real-world examples of using the DevOps Execution Engine.


Example 1: Fix CrashLoopBackOff Pods

Scenario

API service pods are in CrashLoopBackOff due to OOMKilled errors.

Conversation

You: The API pods keep crashing

Clawd: Let me diagnose...

[Checks pods, logs, events]

🔍 Diagnosis Complete:
- api-service pods in CrashLoopBackOff
- Container exits with code 137 (OOMKilled)
- Memory limit: 256Mi
- Peak usage: 280Mi

Root cause: Memory limit too low

Would you like me to generate a fix plan?

You: yes

Clawd: 📋 EXECUTION PLAN: plan-20260126-002

Title: Fix CrashLoopBackOff - Increase Memory Limit
Risk: MEDIUM
Time: ~5min

Steps:
1. Patch deployment: memory 256Mi → 512Mi
2. Wait for rollout (5min timeout)
3. Verify all pods running
4. Check for new OOMKilled events

Rollback: kubectl rollout undo deployment/api-service

Approve? (yes/no)

You: yes

Clawd: ✅ Executing plan-20260126-002...

Step 1/4: Patching deployment... ✓
Step 2/4: Waiting for rollout... ✓ (2m 15s)
Step 3/4: Verifying pods... ✓
Step 4/4: Checking events... ✓ (no OOMKills)

✅ Complete! All pods running, no crashes.
Logged to: memory/actions-log.jsonl

Monitoring for 5 minutes to confirm stability...

Result

  • Pods running stable
  • No more OOMKilled errors
  • Execution plan saved for future reference
  • Full audit trail maintained
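
For reference, the fix plan above maps roughly onto kubectl commands like these. This is a sketch: the deployment name, label, and container index are assumed from the example, and the `run` wrapper echoes each command so it stays a dry run until you swap in real execution.

```shell
#!/bin/sh
# Dry-run sketch of the memory-limit fix; replace echo with real
# execution once the commands have been reviewed.
run() { echo "+ $*"; }

# Step 1: raise the memory limit on the (assumed) first container
run kubectl patch deployment api-service --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'
# Step 2: wait for the rollout, matching the plan's 5-minute timeout
run kubectl rollout status deployment/api-service --timeout=5m
# Steps 3-4: verify pods and look for fresh OOMKilled events
run kubectl get pods -l app=api-service
run kubectl get events --sort-by=.lastTimestamp
```

The JSON-patch path targets the pod template, so changing it triggers a normal rolling update, which is what makes `kubectl rollout undo` a valid rollback.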

Example 2: Scale for Traffic Spike

Scenario

Marketing campaign starting, need to scale up capacity.

Conversation

You: We're launching a campaign in 10 minutes. Scale the API to handle 3x traffic.

Clawd: Let me check current capacity...

Current state:
- api-service: 3 replicas
- CPU usage: 45%
- Memory usage: 60%

For 3x traffic, recommend: 9 replicas

📋 EXECUTION PLAN: plan-20260126-003

Title: Scale api-service for campaign (3 → 9 replicas)
Risk: MEDIUM
Time: ~2min

Steps:
1. Scale deployment to 9 replicas
2. Wait for new pods (2min timeout)
3. Verify all 9 ready
4. Check load balancer registration

Rollback: kubectl scale deployment api-service --replicas=3

Approve?

You: approved

Clawd: ✅ Executing...

Step 1/4: Scaling to 9 replicas... ✓
Step 2/4: Waiting for rollout... ✓ (1m 42s)
Step 3/4: Verifying 9 ready... ✓
Step 4/4: Checking LB... ✓ (all registered)

✅ Complete! 9 replicas running and serving traffic.

Cluster capacity:
- Available CPU: 67%
- Available Memory: 58%

You're ready for the campaign. I'll monitor and alert if we approach limits.

Post-Campaign

You: Campaign is over. Scale back down.

Clawd: Current state:
- 9 replicas running
- CPU usage: 25%
- Memory usage: 40%

Recommend: 3 replicas (original capacity)

📋 EXECUTION PLAN: plan-20260126-004

Title: Scale down after campaign (9 → 3 replicas)
Risk: LOW
Time: ~1min

[similar approval workflow]

✅ Scaled down to 3 replicas. Saved $180/month in compute costs.
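
Under the hood, both scaling plans reduce to `kubectl scale`. A minimal sketch with the example's numbers (deployment name assumed; the kubectl lines are left commented so nothing runs blindly):

```shell
#!/bin/sh
# Compute the replica target for an expected traffic multiple.
current=3
factor=3
target=$((current * factor))
echo "scale api-service: ${current} -> ${target} replicas"
# kubectl scale deployment api-service --replicas="$target"
# kubectl rollout status deployment/api-service --timeout=2m
# Post-campaign rollback: kubectl scale deployment api-service --replicas=3
```

Scaling linearly with expected traffic is a starting heuristic, not a guarantee; the monitoring step afterwards is what confirms headroom.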

Example 3: Incident Response (SEV1)

Scenario

Production API completely down, customers affected.

Conversation

You: SEV1 - API is completely down!

Clawd: 🚨 SEV1 Acknowledged. Starting incident response...

TRIAGE:
- Status: All API pods failing health checks
- Impact: 100% of API traffic failing
- Started: 2 minutes ago

DIAGNOSIS:
1. Checking pod status... CrashLoopBackOff
2. Checking logs... Connection refused to database
3. Checking database... Database pod is down!
4. Checking events... Database pod OOMKilled 3 minutes ago

ROOT CAUSE:
- Database pod OOMKilled (memory spike during backup)
- API can't connect, fails health checks
- Load balancer removed all API pods from pool

MITIGATION PLAN:

📋 EMERGENCY EXECUTION PLAN: plan-20260126-SEV1-001

Title: Restore database and API service
Risk: HIGH (but required for SEV1)
Time: ~3min

Steps:
1. Restart database pod
2. Wait for database ready (30s)
3. Restart API pods to re-establish connections
4. Verify health checks passing
5. Confirm traffic flowing

Rollback: Not applicable (service already down)

This is a SEV1. Execute immediately? (yes/no)

You: yes execute now

Clawd: ✅ EXECUTING EMERGENCY PLAN...

Step 1/5: Restarting database pod... ✓
Step 2/5: Waiting for database... ✓ (database ready)
Step 3/5: Restarting API pods... ✓
Step 4/5: Checking health... ✓ (all healthy)
Step 5/5: Verifying traffic... ✓ (requests flowing)

🎉 SERVICE RESTORED

Incident timeline:
- Incident start: 13:45:00 UTC
- Root cause identified: 13:47:15 UTC
- Mitigation approved: 13:47:45 UTC
- Service restored: 13:50:30 UTC
- Total downtime: 5 minutes 30 seconds

POST-INCIDENT ACTIONS RECOMMENDED:
1. Increase database memory limit
2. Adjust backup schedule to off-peak
3. Add database memory alerts
4. Review API connection retry logic

Would you like me to generate plans for these improvements?

Follow-up

You: Yes, generate the improvement plans

Clawd: Creating 4 execution plans for post-incident improvements...

✅ Created:
- plan-20260126-005: Increase database memory
- plan-20260126-006: Reschedule backups to 2 AM
- plan-20260126-007: Add CloudWatch memory alarms
- plan-20260126-008: Update API connection pool config

Review and approve each when ready. Priority order suggested: 007, 005, 006, 008
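
The emergency plan's five steps correspond roughly to the commands below. Names are assumed (the database deployment is called `postgres` here purely for illustration), and `run` echoes instead of executing:

```shell
#!/bin/sh
# Dry-run sketch of the SEV1 mitigation; swap echo for real execution.
run() { echo "+ $*"; }

# Steps 1-2: restart the database and wait for readiness
run kubectl rollout restart deployment/postgres
run kubectl wait --for=condition=ready pod -l app=postgres --timeout=30s
# Step 3: restart the API pods so they re-establish connections
run kubectl rollout restart deployment/api-service
# Steps 4-5: confirm health, then check that endpoints are serving
run kubectl wait --for=condition=available deployment/api-service --timeout=2m
run kubectl get endpoints api-service
```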

Example 4: Cost Optimization Audit

Scenario

Monthly AWS bill review, looking for savings.

Conversation

You: Analyze our AWS costs and find savings

Clawd: Running cost optimization analysis...

📊 AWS COST ANALYSIS - January 2026

Total spend: $12,450
vs. last month: +$1,200 (+10.7%)

TOP EXPENSES:
1. EC2 instances: $6,800 (54.6%)
2. RDS databases: $3,200 (25.7%)
3. EBS volumes: $1,100 (8.8%)
4. Data transfer: $850 (6.8%)
5. Other: $500 (4.0%)

OPTIMIZATION OPPORTUNITIES:

🟢 Quick Wins ($950/month):
1. 5 idle EC2 instances (stopped >14 days)
   - i-abc123 (t3.large): $150/mo
   - i-def456 (t3.large): $150/mo
   - [3 more...]
   
2. 8 unattached EBS volumes
   - vol-xyz789 (500GB): $40/mo
   - [7 more...]
   Total: $200/mo

🟡 Medium Wins ($800/month):
1. Oversized RDS instance
   - db-prod-analytics (db.r5.2xlarge → db.r5.xlarge)
   - CPU avg 12%; downsizing saves $400/mo

2. Stale snapshots (15 older than 90 days)
   - Total: $400/mo

🔴 Long-term ($1500/month):
1. Reserved Instance opportunities
   - 6 on-demand instances eligible
   - Potential 1-year RI savings: $1500/mo

TOTAL POTENTIAL SAVINGS: $3,250/month ($39,000/year)

Would you like me to generate cleanup plans?

You: Yes, start with the quick wins

Clawd: 📋 EXECUTION PLAN: plan-20260126-009

Title: Cost Optimization - Remove Idle Resources
Risk: MEDIUM
Time: ~10min

Pre-flight checks:
1. Verify instances truly idle (>14 days stopped)
2. Check for any active EBS mounts
3. Confirm no automation dependencies

Steps:
1. Terminate 5 idle EC2 instances
2. Delete 8 unattached EBS volumes
3. Verify no errors
4. Update inventory

Estimated savings: $950/month

⚠️ WARNING: This will PERMANENTLY DELETE resources.
Instances stopped: [list]
Last activity: [timestamps]

Confirm deletion? Type "CONFIRMED" to proceed

You: show me the instance details first

Clawd: IDLE INSTANCE DETAILS:

i-abc123 (t3.large, us-east-1a)
- Launched: 2025-11-03
- Last stop: 2025-12-15 (41 days ago)
- Name: test-server-old
- Tags: env=testing, owner=dev-team
- Cost: $150.72/month

[... details for 4 more instances ...]

These appear to be abandoned test instances. Safe to terminate?
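
The pre-flight checks in this plan can be reproduced with read-only AWS CLI queries like the ones below (region and credentials assumed to be configured; `run` echoes so nothing is actually called):

```shell
#!/bin/sh
# Dry-run sketch of the idle-resource pre-flight checks.
run() { echo "+ $*"; }

# Stopped instances, with type and the reason/time of the last transition
run aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType,StateTransitionReason]' \
  --output table
# Unattached ("available") EBS volumes
run aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' \
  --output table
```

Both calls are describe-only, so they are safe to run as-is before approving any terminate/delete plan.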

Example 5: Deployment with Canary

Scenario

Deploy new version with gradual rollout.

Conversation

You: Deploy api v2.5.0 to production using canary strategy

Clawd: Setting up canary deployment for api v2.5.0...

CURRENT STATE:
- Version: v2.4.2
- Replicas: 6
- Traffic: 100% to v2.4.2

CANARY STRATEGY:
1. Deploy 1 replica v2.5.0 (15% traffic)
2. Monitor for 10 minutes
3. If healthy: Scale to 3 replicas (50% traffic)
4. Monitor for 10 minutes
5. If healthy: Complete rollout to 6 replicas (100% traffic)

📋 EXECUTION PLAN: plan-20260126-010

Title: Canary deployment api v2.5.0
Risk: HIGH (production deployment)
Time: ~25min (includes monitoring)

[detailed steps...]

Rollback: At any stage, revert to v2.4.2

This is a multi-stage plan. Approve entire workflow?

You: approved, but pause between stages for my review

Clawd: ✅ Executing with manual gates...

STAGE 1: Deploy canary (1 replica)
Step 1/3: Create canary deployment... ✓
Step 2/3: Configure traffic split (85%/15%)... ✓
Step 3/3: Verify canary healthy... ✓

Canary monitoring (10min):
- Error rate: 0.02% (normal)
- Latency p95: 145ms (same as v2.4.2)
- No errors in logs

🟢 Canary looks good. Proceed to 50% traffic? (yes/no/rollback)

You: yes

Clawd: STAGE 2: Scale to 50% traffic
[... continues ...]

STAGE 3: Complete rollout
[... continues ...]

✅ DEPLOYMENT COMPLETE

Final state:
- Version: v2.5.0
- Replicas: 6/6 healthy
- Traffic: 100% to v2.5.0
- Error rate: 0.02% (unchanged)
- Latency: 142ms (improved!)

Total deployment time: 28 minutes
Zero-downtime deployment successful.
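
With plain Kubernetes (no service mesh), a Service load-balances across all pods matching its selector, so a canary's traffic share is roughly its replica fraction; exact weights like the 85/15 split above need something like a mesh or a weighted ingress. A sketch using the example's numbers (deployment names and manifest are assumptions):

```shell
#!/bin/sh
# Approximate traffic share from replica counts (integer math).
stable=6
canary=1
share=$((100 * canary / (canary + stable)))
echo "canary receives ~${share}% of traffic"
# kubectl apply -f api-canary.yaml                # hypothetical v2.5.0 manifest
# kubectl scale deployment api-canary --replicas="$canary"
# Rollback at any stage: kubectl scale deployment api-canary --replicas=0
```

Keeping the canary in a separate deployment is what makes the rollback a one-line scale-to-zero rather than a rollout undo.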

Common Patterns

Pattern: Check Before Act

Always diagnose first:

"Check cluster health"
"Debug the api-service pods"
"Analyze recent deployments"

Then act based on findings.
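
Those natural-language requests typically translate into read-only kubectl calls, for example (label selector assumed; echoed as a dry run):

```shell
#!/bin/sh
# Read-only diagnosis commands -- safe to run before any write action.
run() { echo "+ $*"; }
run kubectl get pods -l app=api-service -o wide
run kubectl logs -l app=api-service --tail=100
run kubectl get events --sort-by=.lastTimestamp
run kubectl top pods -l app=api-service
```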

Pattern: Explain Before Approve

If unsure about a plan:

You: explain step 3
You: what are the risks?
You: show me the rollback procedure

Pattern: Test in Staging

You: Generate a plan to upgrade api-service
Clawd: [shows plan]
You: Apply this to staging first
Clawd: [executes in staging]
You: Looks good, now production

Tips

  1. Start small - Begin with read-only operations
  2. Review plans - Don't just approve blindly
  3. Ask questions - "explain step 2", "what could go wrong?"
  4. Test rollback - Verify rollback procedures work
  5. Monitor logs - Check memory/actions-log.jsonl regularly
  6. Build library - Save successful plans as templates

More examples in the examples/ directory