A production-ready High Availability RKE2 Kubernetes cluster deployed on AWS using Terraform Infrastructure-as-Code. This project implements industry best practices for deploying fault-tolerant Kubernetes clusters with multi-AZ distribution, automatic failover, and secure networking.
```
                            INTERNET
                                |
                    +-----------+-----------+
                    |    Internet Gateway   |
                    +-----------+-----------+
                                |
        +-----------------------+-----------------------+
        |                       |                       |
+-------+-------+       +-------+-------+       +-------+-------+
| Public Subnet |       | Public Subnet |       | Public Subnet |
|  10.0.1.0/24  |       |  10.0.2.0/24  |       |  10.0.3.0/24  |
|     AZ-a      |       |     AZ-b      |       |     AZ-c      |
+-------+-------+       +-------+-------+       +-------+-------+
        |                       |                       |
+-------+-------+       +-------+-------+       +-------+-------+
|    Control    |       |    Control    |       |    Control    |
|    Plane-1    |       |    Plane-2    |       |    Plane-3    |
|    (etcd)     |       |    (etcd)     |       |    (etcd)     |
+-------+-------+       +-------+-------+       +-------+-------+
        |                       |                       |
        +-----------+-----------+-----------+-----------+
                    |                       |
         +----------+----------+  +---------+-----------+
         |    Network Load     |  |    Cross-AZ etcd    |
         |    Balancer (NLB)   |  |    Replication      |
         |    :6443, :9345     |  +---------------------+
         +----------+----------+
                    |
        +-----------+-----------+
        |           |           |
    +---+---+   +---+---+   +---+---+
    |Worker |   |Worker |   |Worker |
    |Node-1 |   |Node-2 |   |Node-3 |
    +-------+   +-------+   +-------+
```
- High Availability: 3 Control Plane nodes with embedded etcd for quorum-based consensus
- Multi-AZ Deployment: Nodes distributed across 3 Availability Zones for fault tolerance
- Network Load Balancer: AWS NLB for API server high availability and automatic failover
- Cilium CNI: eBPF-based container networking for high performance and advanced features
- Security Hardened: Encrypted EBS volumes, restrictive security groups, tainted control plane nodes
- Production Ready: Proper health checks, retry logic, and graceful cluster initialization
- Modular Design: Clean Terraform module separation for maintainability
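Several of the features above (Cilium CNI, tainted control-plane nodes, the NLB endpoint in the certificate SANs) map onto a small RKE2 server config. A minimal sketch of what the generated `/etc/rancher/rke2/config.yaml` might look like on the first server — the real file is produced by the Terraform user-data, and the hostname and token here are illustrative placeholders:

```yaml
# /etc/rancher/rke2/config.yaml (first server) -- illustrative values only
token: <shared-cluster-token>            # marked sensitive in Terraform
tls-san:
  - rke2-api.example.com                 # hypothetical NLB DNS name
cni: cilium                              # eBPF-based networking
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"  # keep workloads off control plane
write-kubeconfig-mode: "0644"
```

Servers 2 and 3 would additionally set `server: https://<nlb-dns>:9345` to join through the registration endpoint.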
- Terraform >= 1.5.0
- AWS CLI configured with appropriate credentials
- SSH key pair for EC2 access
- AWS account with permissions for VPC, EC2, ELB, and IAM
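To sanity-check the Terraform >= 1.5.0 requirement before deploying, you can compare versions with `sort -V` (version sort). `have` is a stand-in value here; in practice you would read it from `terraform version`:

```shell
# Compare an installed Terraform version against the 1.5.0 minimum.
# sort -V sorts version strings numerically; if the minimum sorts first,
# the installed version satisfies it.
required="1.5.0"
have="1.7.5"   # stand-in; read from `terraform version` in practice
lowest="$(printf '%s\n' "$required" "$have" | sort -V | head -n1)"
if [ "$lowest" = "$required" ]; then
  echo "terraform $have satisfies >= $required"
else
  echo "terraform $have is too old" >&2
fi
```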
```bash
# Clone the repository
git clone https://github.com/deviant101/ha-rke2-kubernetes-cluster.git
cd ha-rke2-kubernetes-cluster/terraform

# Copy and configure variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Initialize Terraform
terraform init

# Preview changes
terraform plan

# Deploy the cluster
terraform apply

# Get kubeconfig (after deployment)
$(terraform output -raw kubeconfig_command)

# Verify cluster
export KUBECONFIG=./kubeconfig.yaml
kubectl get nodes
kubectl get pods -A
```

| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-east-1` | AWS region for deployment |
| `cluster_name` | `rke2-ha-cluster` | Name of the Kubernetes cluster |
| `control_plane_count` | `3` | Number of control plane nodes |
| `worker_count` | `3` | Number of worker nodes |
| `control_plane_instance_type` | `t3.medium` | EC2 instance type for control plane |
| `worker_instance_type` | `t3.medium` | EC2 instance type for workers |
| `rke2_version` | `v1.34.6+rke2r1` | RKE2 version to install |
| `vpc_cidr` | `10.0.0.0/16` | VPC CIDR block |
| `pod_cidr` | `10.42.0.0/16` | Kubernetes Pod CIDR |
| `service_cidr` | `10.43.0.0/16` | Kubernetes Service CIDR |
See terraform/variables.tf for all configuration options.
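The defaults in the table translate directly into a `terraform.tfvars`. A sketch — the values below simply mirror the defaults, so in practice you only need the lines you intend to change:

```hcl
# terraform.tfvars -- example overrides (values mirror the table defaults)
aws_region          = "us-east-1"
cluster_name        = "rke2-ha-cluster"
control_plane_count = 3
worker_count        = 3
rke2_version        = "v1.34.6+rke2r1"
vpc_cidr            = "10.0.0.0/16"
pod_cidr            = "10.42.0.0/16"   # must not overlap vpc_cidr
service_cidr        = "10.43.0.0/16"   # must not overlap pod_cidr
```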
| Document | Description |
|---|---|
| Architecture Guide | Detailed AWS infrastructure architecture and design decisions |
| HA RKE2 Guide | High Availability concepts and RKE2 specifics |
| Deployment Guide | Step-by-step deployment instructions |
| Flow Diagrams | Cluster initialization and join process flows |
| Troubleshooting | Common issues and solutions |
```
terraform/
├── main.tf                  # Root orchestration and provider config
├── variables.tf             # Input variable definitions
├── outputs.tf               # Output definitions
├── terraform.tfvars.example
└── modules/
    ├── vpc/                 # VPC, subnets, IGW, route tables
    ├── security-groups/     # Security group rules for CP and workers
    ├── nlb/                 # Network Load Balancer configuration
    ├── control-plane/       # Control plane EC2 instances
    └── workers/             # Worker node EC2 instances
```
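The module layout implies a root `main.tf` that wires the pieces together. A hedged sketch of that wiring — the module argument names here are assumptions for illustration; the actual interface is defined in `terraform/main.tf` and each module's `variables.tf`:

```hcl
# main.tf -- illustrative wiring; argument names are assumptions
module "vpc" {
  source   = "./modules/vpc"
  vpc_cidr = var.vpc_cidr
}

module "nlb" {
  source     = "./modules/nlb"
  subnet_ids = module.vpc.public_subnet_ids
}

module "control_plane" {
  source       = "./modules/control-plane"
  node_count   = var.control_plane_count
  subnet_ids   = module.vpc.public_subnet_ids
  nlb_dns_name = module.nlb.dns_name
}
```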
- 3 etcd members using RKE2's embedded etcd
- Raft consensus for leader election
- Survives single node failure (2/3 quorum maintained)
- Cross-AZ replication for datacenter fault tolerance
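The 2/3 quorum claim follows from Raft's majority rule: a cluster of n members needs floor(n/2) + 1 votes to commit, and therefore tolerates floor((n - 1)/2) member failures. The arithmetic as a quick shell sketch:

```shell
# Raft/etcd majority math: quorum size and tolerated failures for n members.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( ($1 - 1) / 2 )); }

echo "3 members: quorum $(quorum 3), tolerates $(tolerated 3) failure(s)"
echo "5 members: quorum $(quorum 5), tolerates $(tolerated 5) failure(s)"
```

For 3 members this gives a quorum of 2 and one tolerated failure, which is why losing one control plane node is harmless and losing two is not. Note that even-sized clusters buy nothing: 4 members still tolerate only 1 failure.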
- Network Load Balancer distributes API traffic across all control plane nodes
- Cross-zone load balancing enabled for even distribution
- Health checks ensure traffic only routes to healthy nodes
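In AWS terms, this behavior corresponds to TCP target groups on ports 6443 (Kubernetes API) and 9345 (RKE2 registration) with cross-zone load balancing enabled on the NLB. A sketch of the relevant resources — names and variable references are illustrative; the real definitions live in `modules/nlb/`:

```hcl
# modules/nlb -- illustrative sketch (port 9345 follows the same pattern)
resource "aws_lb" "api" {
  name                             = "rke2-api-nlb"
  load_balancer_type               = "network"
  enable_cross_zone_load_balancing = true
  subnets                          = var.subnet_ids
}

resource "aws_lb_target_group" "kube_api" {
  name     = "rke2-kube-api"
  port     = 6443
  protocol = "TCP"
  vpc_id   = var.vpc_id

  health_check {
    protocol = "TCP"   # traffic only routes to control planes passing this check
  }
}

resource "aws_lb_listener" "kube_api" {
  load_balancer_arn = aws_lb.api.arn
  port              = 6443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.kube_api.arn
  }
}
```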
| Scenario | Impact | Recovery |
|---|---|---|
| 1 Control Plane failure | Cluster fully operational | Automatic (quorum maintained) |
| 2 Control Plane failures | API unavailable (etcd quorum lost) | Manual intervention required |
| 1 Worker failure | Workloads rescheduled | Automatic (Kubernetes handles) |
| 1 AZ failure | 2 nodes continue operating | Automatic (cross-AZ design) |
- EBS volumes encrypted at rest
- Security groups restrict traffic to VPC CIDR
- RKE2 token marked as sensitive in Terraform
- Control plane nodes tainted to prevent workload scheduling
- Separate security groups for control plane and workers
- Restrict SSH access to specific IP ranges (not `0.0.0.0/0`)
- Consider VPN or bastion host for API access
- Enable AWS CloudTrail for audit logging
- Implement IAM roles for AWS API access from nodes
- Use private subnets with NAT Gateway
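The SSH recommendation above can be expressed as a security-group ingress rule scoped to an admin CIDR variable rather than `0.0.0.0/0`. A sketch — `var.admin_cidr` and `var.node_sg_id` are hypothetical names for illustration:

```hcl
# Restrict SSH to a known admin range; variable names are hypothetical
resource "aws_security_group_rule" "ssh_admin" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = [var.admin_cidr]   # e.g. your office /32, not 0.0.0.0/0
  security_group_id = var.node_sg_id
}
```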
After deployment, Terraform provides:
```bash
# Cluster endpoints
terraform output kubernetes_api_endpoint
terraform output rke2_registration_endpoint

# Node IPs
terraform output control_plane_public_ips
terraform output worker_public_ips

# SSH commands
terraform output ssh_control_plane_commands

# Kubeconfig retrieval
terraform output kubeconfig_command
```

To tear down the cluster:

```bash
# Destroy all resources
terraform destroy
```

Contributions are welcome! Please read our contributing guidelines and submit pull requests for any enhancements.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This is a reference implementation. Always review and adapt security configurations for your specific production requirements.