Summary
Add chaos engineering capabilities to test the robustness and fault tolerance of the distributed dataset conversion system. This will help ensure the system can handle real-world failures gracefully.
Problem
Distributed systems encounter various failure modes in production:
- Worker pod crashes (OOM, node failure, spot termination)
- Network partitions between workers and TiKV
- Slow workers (stragglers) affecting overall throughput
- Storage failures (S3/OSS connectivity issues)
- Concurrent merge lock contention
Currently, we have no automated way to verify the system's resilience to these failures.
Solution
Implement a chaos engineering framework in two phases:
Phase 1: Unit-Level Fault Injection
Add a fault injection module for testing specific failure scenarios:
crates/roboflow-distributed/src/chaos/
├── mod.rs # Chaos orchestration
├── fault_injector.rs # Configurable fault injection
├── chaos_config.rs # Chaos test configuration
└── scenarios.rs # Predefined failure scenarios
Key scenarios to test:
| Scenario | What it tests | Injection point |
|---|---|---|
| Worker crash mid-job | Checkpoint recovery | During convert() |
| Network partition | TiKV reconnection, queue rebuild | Before TiKV operations |
| Slow worker (straggler) | Zombie reaping, heartbeat timeout | Add delays to heartbeat |
| Merge lock race | Optimistic locking correctness | During try_claim_merge() |
| Storage failure | Retry logic, graceful degradation | During S3/OSS operations |
Example API:
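/// Injects configured failures and delays at named operations.
pub struct FaultInjector {
    config: ChaosConfig,
}

impl FaultInjector {
    /// Returns an injected error if the configuration selects `operation` for failure.
    pub fn maybe_fail(&self, operation: &str) -> Result<(), ChaosError> {
        if self.config.should_fail(operation) {
            // Assumes `InjectedFailure` holds an owned `String`, so the borrowed
            // operation name is copied rather than stored by reference.
            return Err(ChaosError::InjectedFailure(operation.to_string()));
        }
        Ok(())
    }

    /// Sleeps for the configured delay, if any, before `operation` proceeds.
    pub async fn maybe_delay(&self, operation: &str) {
        if let Some(delay) = self.config.delay_for(operation) {
            tokio::time::sleep(delay).await;
        }
    }
}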
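The example API relies on `ChaosConfig::should_fail` and `ChaosConfig::delay_for`, which are not spelled out here. One possible shape, as a minimal sketch only: per-operation failure probabilities and fixed delays, driven by a seeded xorshift generator so that a failing run can be replayed. The field names, the builder methods (`with_failure_rate`, `with_delay`), and the seed parameter are all assumptions for illustration, not a settled design.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical configuration: per-operation failure probability and fixed delay.
pub struct ChaosConfig {
    failure_rates: HashMap<String, f64>,
    delays: HashMap<String, Duration>,
    /// State of a small deterministic PRNG so a run can be replayed from its seed.
    rng_state: AtomicU64,
}

impl ChaosConfig {
    pub fn new(seed: u64) -> Self {
        Self {
            failure_rates: HashMap::new(),
            delays: HashMap::new(),
            // xorshift must not start from zero, so a zero seed is bumped to 1.
            rng_state: AtomicU64::new(seed.max(1)),
        }
    }

    /// Sets the probability (clamped to 0.0..=1.0) that `operation` fails.
    pub fn with_failure_rate(mut self, operation: &str, rate: f64) -> Self {
        self.failure_rates
            .insert(operation.to_string(), rate.clamp(0.0, 1.0));
        self
    }

    /// Sets a fixed delay to apply before `operation`.
    pub fn with_delay(mut self, operation: &str, delay: Duration) -> Self {
        self.delays.insert(operation.to_string(), delay);
        self
    }

    /// Next value in [0, 1) from one xorshift64 step.
    /// The load/store pair is not one atomic update; that is acceptable for a
    /// sketch but would need tightening for multi-threaded reproducibility.
    fn next_unit(&self) -> f64 {
        let mut x = self.rng_state.load(Ordering::Relaxed);
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state.store(x, Ordering::Relaxed);
        (x >> 11) as f64 / (1u64 << 53) as f64
    }

    /// True if `operation` should fail on this call; unknown operations never fail.
    pub fn should_fail(&self, operation: &str) -> bool {
        match self.failure_rates.get(operation) {
            Some(&rate) => self.next_unit() < rate,
            None => false,
        }
    }

    /// Configured delay for `operation`, if any.
    pub fn delay_for(&self, operation: &str) -> Option<Duration> {
        self.delays.get(operation).copied()
    }
}
```

A test could then build its fault plan with something like `ChaosConfig::new(42).with_failure_rate("s3_put", 0.3)`, where `"s3_put"` is a made-up operation name. Logging the seed alongside a failing test would make the same fault sequence reproducible.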
Phase 2: Chaos Mesh Integration
Add Chaos Mesh manifests for E2E chaos testing in Kubernetes.
Example experiments:
- Pod kill (random worker termination)
- Network delay (simulate slow TiKV)
- Network partition (isolate workers from coordination)
- Memory stress (test OOM handling)
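As an illustration of the first experiment, a pod-kill manifest might look roughly like the fragment below. It uses the Chaos Mesh `PodChaos` resource; the resource name, the `roboflow` namespace, and the worker label are assumptions and would need to match the labels the Helm chart actually applies.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: worker-pod-kill          # hypothetical name
  namespace: roboflow            # assumption: namespace of the deployment
spec:
  action: pod-kill
  mode: one                      # kill one randomly selected matching pod
  selector:
    namespaces:
      - roboflow
    labelSelectors:
      app.kubernetes.io/component: worker   # assumption: label on worker pods
```

The other experiments would follow the same pattern with `NetworkChaos` (delay, partition) and `StressChaos` (memory) resources.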
Pros:
- Kubernetes-native, works with existing Helm deployment
- No code changes required for basic tests
- Rich fault types (network, IO, stress)
- Dashboard for experiment management
Tasks
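1. Create fault injection module
   - crates/roboflow-distributed/src/chaos/ module
   - FaultInjector with configurable failure rates
   - ChaosConfig for test scenarios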
2. Add chaos test cases
3. Create Chaos Mesh manifests
4. CI integration
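A chaos test case for the storage-failure scenario could be sketched as follows. It is self-contained and does not use the real crate: `FailFirstN`, `InjectedFailure`, and `upload_with_retry` are hypothetical stand-ins for the fault injector, its error type, and the S3/OSS upload path. The fault schedule is deterministic (fail the first N calls) so that the retry budget can be asserted exactly.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an injected storage fault.
#[derive(Debug, PartialEq)]
pub struct InjectedFailure;

/// Fails the first `failures` calls, then succeeds: a deterministic fault schedule.
pub struct FailFirstN {
    remaining: Cell<u32>,
}

impl FailFirstN {
    pub fn new(failures: u32) -> Self {
        Self { remaining: Cell::new(failures) }
    }

    /// Returns an error while scheduled failures remain, consuming one per call.
    pub fn maybe_fail(&self) -> Result<(), InjectedFailure> {
        let left = self.remaining.get();
        if left > 0 {
            self.remaining.set(left - 1);
            return Err(InjectedFailure);
        }
        Ok(())
    }
}

/// Hypothetical upload with a bounded retry loop.
/// Returns the number of attempts used, or the fault once the budget is spent.
pub fn upload_with_retry(
    faults: &FailFirstN,
    max_attempts: u32,
) -> Result<u32, InjectedFailure> {
    for attempt in 1..=max_attempts {
        // Injection point from the scenario table: during S3/OSS operations.
        if faults.maybe_fail().is_ok() {
            return Ok(attempt);
        }
    }
    Err(InjectedFailure)
}

#[cfg(test)]
mod chaos_tests {
    use super::*;

    #[test]
    fn recovers_when_failures_fit_the_retry_budget() {
        assert_eq!(upload_with_retry(&FailFirstN::new(2), 3), Ok(3));
    }

    #[test]
    fn surfaces_the_error_when_the_budget_is_exhausted() {
        assert_eq!(
            upload_with_retry(&FailFirstN::new(3), 3),
            Err(InjectedFailure)
        );
    }
}
```

The two tests bracket the retry budget: one fault fewer than the budget must recover, and faults equal to the budget must surface the error rather than hang or silently succeed.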
Files to Create
crates/roboflow-distributed/src/chaos/mod.rs
crates/roboflow-distributed/src/chaos/fault_injector.rs
crates/roboflow-distributed/src/chaos/chaos_config.rs
crates/roboflow-distributed/src/chaos/scenarios.rs
tests/chaos_tests.rs
deploy/chaos-mesh/ (manifests)
Files to Modify
crates/roboflow-distributed/src/lib.rs
crates/roboflow-distributed/Cargo.toml
Cargo.toml (add chaos test feature)
Feature Flag
[features]
default = []
chaos-testing = [] # Enables fault injection in tests
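One way the flag could gate injection points, sketched under assumptions: the hook compiles to a no-op unless `chaos-testing` is enabled, so production builds carry no fault-injection logic. The names `chaos_point` and `claim_merge` and the `CHAOS_FAIL_OP` environment variable are hypothetical; a real hook would presumably consult the FaultInjector instead.

```rust
/// Hypothetical injection hook, compiled only with `--features chaos-testing`.
/// Fails when the `CHAOS_FAIL_OP` environment variable (an assumption made for
/// this sketch) names the current operation.
#[cfg(feature = "chaos-testing")]
pub fn chaos_point(operation: &str) -> Result<(), String> {
    if matches!(std::env::var("CHAOS_FAIL_OP"), Ok(op) if op == operation) {
        return Err(format!("injected failure at {operation}"));
    }
    Ok(())
}

/// Without the feature, the hook always succeeds and can be inlined away.
#[cfg(not(feature = "chaos-testing"))]
#[inline(always)]
pub fn chaos_point(_operation: &str) -> Result<(), String> {
    Ok(())
}

/// Hypothetical call site: the hook runs before the real work would start.
pub fn claim_merge() -> Result<&'static str, String> {
    chaos_point("try_claim_merge")?;
    Ok("claimed")
}
```

Call sites stay identical in both builds, which keeps the injection points visible in the code while leaving the default build unaffected.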
Acceptance Criteria
Related
References