Ok Resilience

CLI tool for testing Gateway resilience

Overview

This tool helps validate the resilience of API gateways (such as the Istio Ingress Gateway) by simulating failure scenarios and observing how the gateway handles them. It focuses on two main types of failure:

Simulating TCP Failures (e.g., connection failures, resets)
Simulating HTTP Failures (e.g., gateway failures, upstream timeouts, retryable 5XX errors for GET requests)

The tool then asserts the gateway's behavior under these failure conditions by checking metrics from Prometheus.

The tool supports the following resilience tests:

Gateway Retries on Upstream 5xx Failures
Gateway Retries on Upstream TCP Resets
Outlier Detection and Ejection (Circuit Breaking)
Validate Gateway Per Request Timeouts
Failover Testing for Multi-Cluster Gateways

Prerequisites

Ensure the following tools are installed on your system:

Homebrew (for macOS/Linux users)
Go (version 1.18 or later)
kubectl
istioctl
kind
cloud-provider-kind

Install Prerequisites

For macOS/Linux users, you can install the tools using Homebrew:

brew install go
brew install kubectl
brew install istioctl
brew install kind
brew install cloud-provider-kind

For other platforms, refer to the official installation guides for each tool.

Setting Up the Environment

Httpbin in Kind Cluster

To set up the httpbin service in a local Kubernetes cluster using Kind, follow these steps:

Create a Kind cluster if you don't have one already:

./scripts/setup-kind.sh

Setup Istio, Kiali, and Prometheus:

# Install Istio using istioctl, then Kiali and Prometheus.
./scripts/install-istio.sh

Deploy httpbin application:

./scripts/deploy-httpbin.sh

Generate some traffic to httpbin:

export INGRESS_IP=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
for i in {1..50}; do echo -e "\nRequest ==> [$i]"; curl -s -D - -o /dev/null -H "Host: httpbin.local" http://$INGRESS_IP/get; sleep 1; done

# Or use the traffic generation script
./scripts/generate-traffic.sh            # defaults to status endpoint
./scripts/generate-traffic.sh header     # hit /get to echo headers
./scripts/generate-traffic.sh delay      # hit /delay/5 to simulate slow upstreams

# Override the request volume (set on the same line so the script sees the variable)
REQUEST_COUNT=10 ./scripts/generate-traffic.sh status

Setup

Install dependencies:

go mod tidy

Build the CLI tool:

go build -o okresilience ./cmd/okresilience

Build the resilience executor:

go build -o resilience-executor ./cmd/test-runner

Run

Validate upstream 5xx failures with retries

With this test you can validate that the gateway retries requests when upstream services return 5xx errors.

./okresilience upstream5xxFailures \
    --prometheus-url=http://prometheus.local \
    --service-endpoint=http://httpbin.local/status/500 \
    --namespace=demo \
    --virtual-service=httpbin-vs \
    --num-requests=1 \
    --response-code=500 \
    --app=httpbin

Validate upstream TCP resets

With this test you can validate that the gateway retries requests when TCP resets occur in upstream services.

./okresilience upstreamTcpReset \
    --prometheus-url=http://prometheus.local \
    --service-endpoint=http://tcp-reset-service.local/ \
    --namespace=tcp-ns \
    --virtual-service=tcp-reset-vs \
    --num-requests=1 \
    --response-code=503 \
    --app=tcp-reset-service

Validate Gateway timeout behavior

Use this to verify that ingress timeouts surface as upstream responseCode=0 while clients observe HTTP 504. The upstream diff should equal num-requests * (timeout / perTryTimeout) as configured in the VirtualService.

./okresilience gatewayTimeoutVerify \
    --prometheus-url=http://prometheus.local \
    --service-endpoint=http://httpbin.local/delay/3 \
    --namespace=demo \
    --virtual-service=httpbin-vs \
    --num-requests=1 \
    --app=httpbin

Expected:

Upstream metrics with responseCode=0 increase by num-requests * (timeout / perTryTimeout)
All downstream responses are HTTP 504 (Gateway Timeout)

Resilience Executor

The resilience executor parses and executes resilience test cases from a JSON configuration file. It runs multiple test scenarios sequentially, supports several assertion types, and validates metrics automatically with retry logic.

Building the Executor

go build -o resilience-executor ./cmd/test-runner

Test File Format

The executor uses a JSON file to define test cases. Each test case includes:

name: Test case identifier
description: Human-readable description of what the test validates
type: Test category (retry-5xx, retry-tcp-errors, timeout, ok-response, outlier-detection)
prometheus_url: Prometheus server endpoint
service_endpoint: Target service URL for traffic generation
namespace: Kubernetes namespace
virtual_service: Name of the VirtualService resource (optional if using destination_rule)
destination_rule: Name of the DestinationRule resource (optional if using virtual_service)
num_requests: Number of requests to generate
expected_response_codes: Array of expected HTTP status codes (validated during execution)
app: Application/destination name for metrics queries
assertions: Array of metric assertions to validate

Each assertion includes:

source: Assertion source; currently only "metric" is supported
metric_name: Name of the Prometheus metric (e.g., istio_requests_total, istio_tcp_connections_closed_total)
labels: Label selectors for the metric query
expected: Object with operator and value to validate against
- operator: Comparison operator (eq, gte, lte, gt, lt)
- value: Expected metric difference value
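
To make the role of metric_name and labels concrete, here is a minimal sketch of how they could be combined into a PromQL selector. The helper `buildSelector` is hypothetical and is not taken from the tool's source; the executor may construct its queries differently (for example by wrapping the selector in an aggregation).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildSelector renders a metric name and label map as a PromQL instant
// vector selector. Keys are sorted so the output is deterministic.
func buildSelector(metric string, labels map[string]string) string {
	if len(labels) == 0 {
		return metric
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// %q quotes the value; this is enough for simple label values,
		// unusual input may need PromQL-specific escaping.
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return metric + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	q := buildSelector("istio_requests_total", map[string]string{
		"destination_app": "httpbin",
		"response_code":   "500",
		"namespace":       "demo",
	})
	fmt.Println(q)
}
```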

Example Test Configuration

See resilience-tests.json for complete examples. Here's a sample:

{
    "name": "upstream5xxFailures",
    "description": "Test to verify gateway retries on upstream 5xx failures",
    "prometheus_url": "http://prometheus.local",
    "service_endpoint": "http://httpbin.local/status/500",
    "namespace": "demo",
    "virtual_service": "httpbin-vs",
    "num_requests": 1,
    "expected_response_codes": [500],
    "app": "httpbin",
    "type": "retry-5xx",
    "assertions": [
        {
            "source": "metric",
            "metric_name": "istio_requests_total",
            "labels": {
                "destination_app": "httpbin",
                "response_code": "500",
                "namespace": "demo"
            },
            "expected": {
                "operator": "eq",
                "value": 4
            }
        }
    ]
}
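
A test case like the one above maps naturally onto Go structs. The following is a sketch only: the type and field names are assumptions derived from the documented JSON keys, not the executor's actual types, and it decodes a single test case object rather than a whole test file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Expected mirrors the "expected" object of an assertion.
type Expected struct {
	Operator string  `json:"operator"`
	Value    float64 `json:"value"`
}

// Assertion mirrors one entry of the "assertions" array.
type Assertion struct {
	Source     string            `json:"source"`
	MetricName string            `json:"metric_name"`
	Labels     map[string]string `json:"labels"`
	Expected   Expected          `json:"expected"`
}

// TestCase mirrors the documented test case fields (hypothetical type).
type TestCase struct {
	Name                  string      `json:"name"`
	Description           string      `json:"description"`
	Type                  string      `json:"type"`
	PrometheusURL         string      `json:"prometheus_url"`
	ServiceEndpoint       string      `json:"service_endpoint"`
	Namespace             string      `json:"namespace"`
	VirtualService        string      `json:"virtual_service,omitempty"`
	DestinationRule       string      `json:"destination_rule,omitempty"`
	NumRequests           int         `json:"num_requests"`
	ExpectedResponseCodes []int       `json:"expected_response_codes"`
	App                   string      `json:"app"`
	Assertions            []Assertion `json:"assertions"`
}

// parseTestCase decodes a single test case object.
func parseTestCase(data []byte) (TestCase, error) {
	var tc TestCase
	if err := json.Unmarshal(data, &tc); err != nil {
		return TestCase{}, fmt.Errorf("decode test case: %w", err)
	}
	return tc, nil
}

func main() {
	sample := []byte(`{
		"name": "upstream5xxFailures",
		"type": "retry-5xx",
		"num_requests": 1,
		"expected_response_codes": [500],
		"assertions": [{
			"source": "metric",
			"metric_name": "istio_requests_total",
			"labels": {"response_code": "500"},
			"expected": {"operator": "eq", "value": 4}
		}]
	}`)
	tc, err := parseTestCase(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(tc.Name, tc.Assertions[0].Expected.Operator, tc.Assertions[0].Expected.Value)
}
```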

Running Tests

Using the Shell Script (Recommended)

The script automatically builds the binary before running tests:

./scripts/resilience-executor.sh

With options:

./scripts/resilience-executor.sh --test upstream5xxFailures --delay 3 --retries 10
./scripts/resilience-executor.sh -f custom-tests.json -t outlierDetectionVerify

Using the Binary Directly

Build first:

go build -o resilience-executor ./cmd/test-runner

Then run:

./resilience-executor --test-file=resilience-tests.json

Run a Specific Test

./scripts/resilience-executor.sh --test upstream5xxFailures

Or directly with the binary:

./resilience-executor --test-file=resilience-tests.json --test-name=upstream5xxFailures

Custom Retry Configuration

./scripts/resilience-executor.sh --retries 10 --delay 3

Or with the binary:

./resilience-executor \
    --test-file=resilience-tests.json \
    --max-retries=10 \
    --delay-seconds=3

Test Execution Flow

For each test case, the executor:

Validates Configuration: Ensures all required fields are present
Captures Baseline Metrics: Queries Prometheus for initial metric values
Generates Traffic: Sends the specified number of requests to the service
Validates Response Codes: Verifies actual response codes match expected values
Waits for Metrics Sync: Implements retry logic (default: 5 retries with 5-second delays)
Queries Metrics After: Fetches updated metric values from Prometheus
Validates Assertions: Checks if metric differences match expected values using comparison operators
Reports Results: Displays PASSED/FAILED status with detailed assertion breakdown

Assertion Operators

eq: Exact equality (actual == expected)
gte: Greater than or equal (actual >= expected)
lte: Less than or equal (actual <= expected)
gt: Greater than (actual > expected)
lt: Less than (actual < expected)
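
As an illustration, the operators listed above can be expressed as a small comparison function. This is a sketch; the name `compare` and the error on unknown operators are assumptions rather than the executor's confirmed behavior.

```go
package main

import "fmt"

// compare applies one of the documented operators to the observed
// metric delta (actual) and the configured value (expected).
func compare(op string, actual, expected float64) (bool, error) {
	switch op {
	case "eq":
		return actual == expected, nil
	case "gte":
		return actual >= expected, nil
	case "lte":
		return actual <= expected, nil
	case "gt":
		return actual > expected, nil
	case "lt":
		return actual < expected, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", op)
	}
}

func main() {
	ok, err := compare("eq", 4, 4)
	fmt.Println(ok, err)
}
```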

Output Features

The executor provides:

Welcome Banner: Clear test suite identification
Test Counter: Shows [N/M] for test progress
Real-time Logging: Detailed step-by-step execution logs
Response Code Validation: Visual indicators (✓/✗) for each request
Assertion Tracking: Per-assertion pass/fail status with metric deltas
Progress Indicators: Visual feedback during metric syncing
Summary Report:
- Test and assertion pass rates
- Progress bar showing completion percentage
- Color-coded final status
- Clear success/failure banners

Example output:

╔═══════════════════════════════════════════════════════════════╗
║           OkResilience - Gateway Resilience Validator         ║
╚═══════════════════════════════════════════════════════════════╝

[1/5] Executing: upstream5xxFailures (retry-5xx)
  ⏳ Capturing baseline metrics (1 assertion(s))...
    • Assertion 0: istio_requests_total = 38
  ✓ Baseline metrics captured
  🚀 Generating 1 request(s)...
    • Request 1: ✓ 500
  ✓ All response codes matched expectations
  ⏳ Validating metrics (max 5 retries, 5 second delays)...
    • Assertion 0: ✓ (diff=4, expected eq 4)
  ✓ Metrics validation passed on attempt 1

┌─────────────────────────────────────────────────────────────────
  ✓ PASSED | upstream5xxFailures
  Assertions: 1 passed, 0 failed (total: 1)
─────────────────────────────────────────────────────────────────

╔═══════════════════════════════════════════════════════════════╗
║                    📊 EXECUTION SUMMARY                       ║
║  ✓ All tests PASSED! Gateway resilience verified successfully. ║
╚═══════════════════════════════════════════════════════════════╝

Command-Line Flags

--test-file (default: "resilience-tests.json"): Path to the test cases JSON file
--max-retries (default: 5): Maximum number of retries for metric validation
--delay-seconds (default: 5): Delay in seconds between retry attempts
--test-name (default: ""): Optional filter to run a specific test by name
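
For readers extending the executor, the flags above could be declared with Go's standard flag package roughly as follows. This is a sketch under the assumption that the standard library is used; the `options` struct and `parseOptions` helper are hypothetical, and only the flag names and defaults come from the list above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the documented command-line settings.
type options struct {
	TestFile     string
	MaxRetries   int
	DelaySeconds int
	TestName     string
}

// parseOptions parses args using the documented flag names and defaults.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("resilience-executor", flag.ContinueOnError)
	fs.StringVar(&o.TestFile, "test-file", "resilience-tests.json", "Path to the test cases JSON file")
	fs.IntVar(&o.MaxRetries, "max-retries", 5, "Maximum number of retries for metric validation")
	fs.IntVar(&o.DelaySeconds, "delay-seconds", 5, "Delay in seconds between retry attempts")
	fs.StringVar(&o.TestName, "test-name", "", "Optional filter to run a specific test by name")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```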

Running Unit Tests

go test ./... -v

Test the resilience executor specifically:

go test ./cmd/test-runner -v

TCP Failure Simulation

Deploy the tcp-reset-service for testing TCP failures:

./scripts/deploy-tcp-reset.sh
