Real-world examples of using the DevOps Execution Engine.
API service pods are in CrashLoopBackOff due to OOMKilled errors.
You: The API pods keep crashing
Clawd: Let me diagnose...
[Checks pods, logs, events]
🔍 Diagnosis Complete:
- api-service pods in CrashLoopBackOff
- Container exits with code 137 (OOMKilled)
- Memory limit: 256Mi
- Peak usage: 280Mi
Root cause: Memory limit too low
Would you like me to generate a fix plan?
You: yes
Clawd: 📋 EXECUTION PLAN: plan-20260126-002
Title: Fix CrashLoopBackOff - Increase Memory Limit
Risk: MEDIUM
Time: ~5min
Steps:
1. Patch deployment: memory 256Mi → 512Mi
2. Wait for rollout (5min timeout)
3. Verify all pods running
4. Check for new OOMKilled events
Rollback: kubectl rollout undo deployment/api-service
Approve? (yes/no)
You: yes
Clawd: ✅ Executing plan-20260126-002...
Step 1/4: Patching deployment... ✓
Step 2/4: Waiting for rollout... ✓ (2m 15s)
Step 3/4: Verifying pods... ✓
Step 4/4: Checking events... ✓ (no OOMKills)
✅ Complete! All pods running, no crashes.
Logged to: memory/actions-log.jsonl
Monitoring for 5 minutes to confirm stability...
- Pods running stable
- No more OOMKilled errors
- Execution plan saved for future reference
- Full audit trail maintained
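The exit code 137 in the diagnosis above is not arbitrary: container exit codes above 128 encode 128 plus the number of the signal that killed the process, and 137 = 128 + 9 (SIGKILL), which is what the kernel OOM killer sends. A minimal Python sketch of the decoding (not part of the engine, just the logic behind the diagnosis):

```python
import signal

def decode_exit_code(code: int) -> str:
    """Decode a container exit code; codes above 128 mean 128 + signal number."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited normally with status {code}"

# 137 = 128 + 9 (SIGKILL), the signal delivered by the kernel OOM killer.
print(decode_exit_code(137))  # killed by SIGKILL

# And the diagnosis itself: a 280Mi peak cannot fit under a 256Mi limit.
limit_mi, peak_mi = 256, 280
print(peak_mi > limit_mi)  # True -> the OOM killer fires
```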
Marketing campaign starting, need to scale up capacity.
You: We're launching a campaign in 10 minutes. Scale the API to handle 3x traffic.
Clawd: Let me check current capacity...
Current state:
- api-service: 3 replicas
- CPU usage: 45%
- Memory usage: 60%
For 3x traffic, recommend: 9 replicas
📋 EXECUTION PLAN: plan-20260126-003
Title: Scale api-service for campaign (3 → 9 replicas)
Risk: MEDIUM
Time: ~2min
Steps:
1. Scale deployment to 9 replicas
2. Wait for new pods (2min timeout)
3. Verify all 9 ready
4. Check load balancer registration
Rollback: kubectl scale deployment api-service --replicas=3
Approve?
You: approved
Clawd: ✅ Executing...
Step 1/4: Scaling to 9 replicas... ✓
Step 2/4: Waiting for rollout... ✓ (1m 42s)
Step 3/4: Verifying 9 ready... ✓
Step 4/4: Checking LB... ✓ (all registered)
✅ Complete! 9 replicas running and serving traffic.
Cluster capacity:
- Available CPU: 67%
- Available Memory: 58%
You're ready for the campaign. I'll monitor and alert if we approach limits.
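The replica recommendation above is proportional scaling, rounded up: 3 replicas at 1x traffic means 9 replicas at 3x, which also keeps per-pod CPU at its current level. A small sketch of that arithmetic (illustrative only; the engine's actual sizing logic may weigh more signals):

```python
import math

def replicas_for_traffic(current_replicas: int, traffic_multiplier: float) -> int:
    """Scale replica count proportionally with expected traffic, rounding up."""
    return math.ceil(current_replicas * traffic_multiplier)

def projected_cpu_pct(current_cpu_pct: float, traffic_multiplier: float,
                      old_replicas: int, new_replicas: int) -> float:
    """Per-pod CPU after scaling, assuming load spreads evenly across pods."""
    return current_cpu_pct * traffic_multiplier * old_replicas / new_replicas

print(replicas_for_traffic(3, 3))            # 9
print(projected_cpu_pct(45, 3, 3, 9))        # 45.0 -> same headroom as before
```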
You: Campaign is over. Scale back down.
Clawd: Current state:
- 9 replicas running
- CPU usage: 25%
- Memory usage: 40%
Recommend: 3 replicas (original capacity)
📋 EXECUTION PLAN: plan-20260126-004
Title: Scale down after campaign (9 → 3 replicas)
Risk: LOW
Time: ~1min
[similar approval workflow]
✅ Scaled down to 3 replicas. Saved $180/month in compute costs.
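The quoted $180/month figure is consistent with a simple per-replica cost model. Assuming, purely for illustration, that each replica costs about $30/month in compute (the document does not state the per-replica rate):

```python
def monthly_savings(replicas_removed: int, cost_per_replica: float) -> float:
    """Savings from scaling down, given an assumed per-replica monthly cost."""
    return replicas_removed * cost_per_replica

# Scaling 9 -> 3 removes 6 replicas; at an assumed ~$30/replica/month:
print(monthly_savings(9 - 3, 30.0))  # 180.0
```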
Production API completely down, customers affected.
You: SEV1 - API is completely down!
Clawd: 🚨 SEV1 Acknowledged. Starting incident response...
TRIAGE:
- Status: All API pods failing health checks
- Impact: 100% of API traffic failing
- Started: 2 minutes ago
DIAGNOSIS:
1. Checking pod status... CrashLoopBackOff
2. Checking logs... Connection refused to database
3. Checking database... Database pod is down!
4. Checking events... Database pod OOMKilled 3 minutes ago
ROOT CAUSE:
- Database pod OOMKilled (memory spike during backup)
- API can't connect, fails health checks
- Load balancer removed all API pods from pool
MITIGATION PLAN:
📋 EMERGENCY EXECUTION PLAN: plan-20260126-SEV1-001
Title: Restore database and API service
Risk: HIGH (but required for SEV1)
Time: ~3min
Steps:
1. Restart database pod
2. Wait for database ready (30s)
3. Restart API pods to re-establish connections
4. Verify health checks passing
5. Confirm traffic flowing
Rollback: Not applicable (service already down)
This is a SEV1. Execute immediately? (yes/no)
You: yes execute now
Clawd: ✅ EXECUTING EMERGENCY PLAN...
Step 1/5: Restarting database pod... ✓
Step 2/5: Waiting for database... ✓ (database ready)
Step 3/5: Restarting API pods... ✓
Step 4/5: Checking health... ✓ (all healthy)
Step 5/5: Verifying traffic... ✓ (requests flowing)
🎉 SERVICE RESTORED
Incident timeline:
- Incident start: 13:45:00 UTC
- Root cause identified: 13:47:15 UTC
- Mitigation approved: 13:47:45 UTC
- Service restored: 13:50:30 UTC
- Total downtime: 5 minutes 30 seconds
POST-INCIDENT ACTIONS RECOMMENDED:
1. Increase database memory limit
2. Adjust backup schedule to off-peak
3. Add database memory alerts
4. Review API connection retry logic
Would you like me to generate plans for these improvements?
You: Yes, generate the improvement plans
Clawd: Creating 4 execution plans for post-incident improvements...
✅ Created:
- plan-20260126-005: Increase database memory
- plan-20260126-006: Reschedule backups to 2 AM
- plan-20260126-007: Add CloudWatch memory alarms
- plan-20260126-008: Update API connection pool config
Review and approve each when ready. Priority order suggested: 007, 005, 006, 008
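The downtime figure in the incident timeline above is straightforward timestamp arithmetic; a sketch that recomputes it (same-day UTC timestamps assumed):

```python
from datetime import datetime

FMT = "%H:%M:%S"

def downtime(start: str, restored: str) -> str:
    """Return 'Xm Ys' between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(restored, FMT) - datetime.strptime(start, FMT)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"

# Incident start 13:45:00, service restored 13:50:30:
print(downtime("13:45:00", "13:50:30"))  # 5m 30s
```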
Monthly AWS bill review, looking for savings.
You: Analyze our AWS costs and find savings
Clawd: Running cost optimization analysis...
📊 AWS COST ANALYSIS - January 2026
Total spend: $12,450
vs. last month: +$1,200 (+10.7%)
TOP EXPENSES:
1. EC2 instances: $6,800 (54.6%)
2. RDS databases: $3,200 (25.7%)
3. EBS volumes: $1,100 (8.8%)
4. Data transfer: $850 (6.8%)
5. Other: $500 (4.0%)
OPTIMIZATION OPPORTUNITIES:
🟢 Quick Wins ($950/month):
1. 5 idle EC2 instances (stopped >14 days)
- i-abc123 (t3.large): $150/mo
- i-def456 (t3.large): $150/mo
- [3 more...]
2. 8 unattached EBS volumes
- vol-xyz789 (500GB): $40/mo
- [7 more...]
Volume total: $200/mo
🟡 Medium Wins ($800/month):
1. Oversized RDS instances
- db-prod-analytics (db.r5.2xlarge → db.r5.xlarge)
- CPU avg 12%; downsizing saves $400/mo
2. Old snapshots (>90 days, 15 snapshots)
- Total: $400/mo
🔴 Long-term ($1500/month):
1. Reserved Instance opportunities
- 6 on-demand instances eligible
- Potential 1-year RI savings: $1500/mo
TOTAL POTENTIAL SAVINGS: $3,250/month ($39,000/year)
Would you like me to generate cleanup plans?
You: Yes, start with the quick wins
Clawd: 📋 EXECUTION PLAN: plan-20260126-009
Title: Cost Optimization - Remove Idle Resources
Risk: MEDIUM
Time: ~10min
Pre-flight checks:
1. Verify instances truly idle (>14 days stopped)
2. Check for any active EBS mounts
3. Confirm no automation dependencies
Steps:
1. Terminate 5 idle EC2 instances
2. Delete 8 unattached EBS volumes
3. Verify no errors
4. Update inventory
Estimated savings: $950/month
⚠️ WARNING: This will PERMANENTLY DELETE resources.
Instances stopped: [list]
Last activity: [timestamps]
Confirm deletion? Type "CONFIRMED" to proceed
You: show me the instance details first
Clawd: IDLE INSTANCE DETAILS:
i-abc123 (t3.large, us-east-1a)
- Launched: 2025-11-03
- Last stop: 2025-12-15 (42 days ago)
- Name: test-server-old
- Tags: env=testing, owner=dev-team
- Cost: $150.72/month
[... details for 4 more instances ...]
These appear to be abandoned test instances. Safe to terminate?
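The savings figures in the analysis add up; a quick check of the arithmetic behind the totals quoted above:

```python
# Quick wins: 5 idle instances at ~$150/mo each, plus $200/mo of unattached volumes.
quick_wins = 5 * 150 + 200        # 950
# Medium wins: RDS downsize plus old snapshots.
medium_wins = 400 + 400           # 800
# Long-term: reserved-instance savings.
long_term = 1500

monthly_total = quick_wins + medium_wins + long_term
print(monthly_total, monthly_total * 12)  # 3250 39000
```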
Deploy new version with gradual rollout.
You: Deploy api v2.5.0 to production using canary strategy
Clawd: Setting up canary deployment for api v2.5.0...
CURRENT STATE:
- Version: v2.4.2
- Replicas: 6
- Traffic: 100% to v2.4.2
CANARY STRATEGY:
1. Deploy 1 replica v2.5.0 (15% traffic)
2. Monitor for 10 minutes
3. If healthy: Scale to 3 replicas (50% traffic)
4. Monitor for 10 minutes
5. If healthy: Complete rollout to 6 replicas (100% traffic)
📋 EXECUTION PLAN: plan-20260126-010
Title: Canary deployment api v2.5.0
Risk: HIGH (production deployment)
Time: ~25min (includes monitoring)
[detailed steps...]
Rollback: At any stage, revert to v2.4.2
This is a multi-stage plan. Approve entire workflow?
You: approved, but pause between stages for my review
Clawd: ✅ Executing with manual gates...
STAGE 1: Deploy canary (1 replica)
Step 1/3: Create canary deployment... ✓
Step 2/3: Configure traffic split (85%/15%)... ✓
Step 3/3: Verify canary healthy... ✓
Canary monitoring (10min):
- Error rate: 0.02% (normal)
- Latency p95: 145ms (same as v2.4.2)
- No errors in logs
🟢 Canary looks good. Proceed to 50% traffic? (yes/no/rollback)
You: yes
Clawd: STAGE 2: Scale to 50% traffic
[... continues ...]
STAGE 3: Complete rollout
[... continues ...]
✅ DEPLOYMENT COMPLETE
Final state:
- Version: v2.5.0
- Replicas: 6/6 healthy
- Traffic: 100% to v2.5.0
- Error rate: 0.02% (unchanged)
- Latency: 142ms (improved!)
Total deployment time: 28 minutes
Zero-downtime deployment successful.
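The canary progression above can be expressed as a small stage table that a rollout loop walks through, advancing only while the health gate passes. The 1% error-rate threshold below is an illustrative assumption, not a documented engine default:

```python
# (canary_replicas, canary_traffic_percent) per stage, as in the plan above.
STAGES = [(1, 15), (3, 50), (6, 100)]
ERROR_RATE_GATE = 1.0  # percent; illustrative threshold only

def next_stage(stage_index: int, observed_error_rate: float):
    """Advance to the next stage if the gate passes, else signal a rollback."""
    if observed_error_rate > ERROR_RATE_GATE:
        return "rollback"
    if stage_index + 1 < len(STAGES):
        return STAGES[stage_index + 1]
    return "complete"

print(next_stage(0, 0.02))  # (3, 50)   -> proceed to 50% traffic
print(next_stage(2, 0.02))  # complete  -> rollout finished
print(next_stage(0, 5.0))   # rollback  -> gate failed
```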
Always diagnose first:
"Check cluster health"
"Debug the api-service pods"
"Analyze recent deployments"
Then act based on findings.
If unsure about a plan:
You: explain step 3
You: what are the risks?
You: show me the rollback procedure
You: Generate a plan to upgrade api-service
Clawd: [shows plan]
You: Apply this to staging first
Clawd: [executes in staging]
You: Looks good, now production
- Start small - Begin with read-only operations
- Review plans - Don't just approve blindly
- Ask questions - "explain step 2", "what could go wrong?"
- Test rollback - Verify rollback procedures work
- Monitor logs - Check memory/actions-log.jsonl regularly
- Build library - Save successful plans as templates
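The "Monitor logs" tip refers to memory/actions-log.jsonl. Assuming, for illustration, that each line is a JSON object with at least plan_id and status fields (the exact schema is not specified here), a minimal reader looks like:

```python
import json
from pathlib import Path

def read_action_log(path: str):
    """Yield one parsed entry per JSONL line, skipping blank lines."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            yield json.loads(line)

# Example with the assumed entry shape (field names are hypothetical):
Path("actions-log.jsonl").write_text(
    '{"plan_id": "plan-20260126-002", "status": "complete"}\n'
)
for entry in read_action_log("actions-log.jsonl"):
    print(entry["plan_id"], entry["status"])  # plan-20260126-002 complete
```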
More examples are available in the examples/ directory.