Architecture Overview

Charon is a production-ready Kubernetes infrastructure platform designed around category-based namespace isolation, secure VPN mesh networking, and centralized identity management.

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                     External Access Layer                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Internet ──▶ LoadBalancer ──▶ nginx-ingress ──▶ cert-manager        │
│               (Linode LKE)      (external)        (Let's Encrypt)    │
│                                                                      │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                    VPN Coordination Layer                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Headscale (core namespace)                                          │
│  ├─ Control server for Tailscale mesh VPN                            │
│  ├─ User and pre-auth key management                                 │
│  ├─ 100.64.0.0/10 address allocation                                 │
│  └─ External HTTPS endpoint for enrollment                           │
│                                                                      │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster (LKE)                           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  core Namespace                                               │   │
│  ├───────────────────────────────────────────────────────────────┤   │
│  │  ▪ Headscale (VPN coordination)                               │   │
│  │  ▪ FreeIPA (LDAP/Kerberos identity management)                │   │
│  │  ▪ Container Registry (future)                                │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  monitoring Namespace                                         │   │
│  ├───────────────────────────────────────────────────────────────┤   │
│  │  ▪ Prometheus (metrics, 35+ targets, 15d retention)           │   │
│  │  ▪ Thanos (long-term storage, optional)                       │   │
│  │  ▪ Grafana (dashboards, git-sync, Tempo correlations)         │   │
│  │  ▪ Loki (log aggregation, emptyDir)                           │   │
│  │  ▪ Tempo (distributed tracing, OTLP/gRPC)                     │   │
│  │  ▪ Promtail (log collection, DaemonSet)                       │   │
│  │  ▪ AlertManager (alert routing)                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  gitops Namespace                                             │   │
│  ├───────────────────────────────────────────────────────────────┤   │
│  │  ▪ Redmine (project management, PostgreSQL external)          │   │
│  │  ▪ GitLab (future)                                            │   │
│  │  ▪ ArgoCD (future)                                            │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  inference Namespace                                          │   │
│  ├───────────────────────────────────────────────────────────────┤   │
│  │  ▪ Open-WebUI (AI chat interface)                             │   │
│  │  ▪ Ollama (LLM inference)                                     │   │
│  │  ▪ vLLM (future, GPU workloads)                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │  infra Namespace                                              │   │
│  ├───────────────────────────────────────────────────────────────┤   │
│  │  ▪ NetBox (future, IPAM)                                      │   │
│  │  ▪ Vault (future, secret management)                          │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Core Components

Category-Based Namespaces

Services are organized into functional categories with strict RBAC boundaries:

Namespace  | Purpose                     | Services
core       | Infrastructure dependencies | Headscale, FreeIPA
monitoring | Observability stack         | Prometheus, Grafana, Loki, Thanos
gitops     | Development tooling         | Redmine, GitLab (future), ArgoCD (future)
inference  | AI/ML workloads             | Open-WebUI, Ollama, vLLM (future)
infra      | Operations tools            | NetBox (future), Vault (future)

Benefits:

  • Clear security and trust boundaries
  • Independent resource quotas and limits
  • Simplified RBAC management
  • Logical grouping of related services
  • Easy to add new categories

See Namespaces Configuration for Terraform variables.
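
As a rough illustration of the category model, the table above can be read as a lookup from service to namespace. This is a minimal sketch, not the project's Terraform: `NAMESPACES` and `namespace_for` are hypothetical names, and services marked "(future)" are left out.

```python
# Category map copied from the table above; "(future)" services are omitted.
NAMESPACES = {
    "core": ["Headscale", "FreeIPA"],
    "monitoring": ["Prometheus", "Grafana", "Loki", "Thanos"],
    "gitops": ["Redmine"],
    "inference": ["Open-WebUI", "Ollama"],
    "infra": [],
}


def namespace_for(service: str) -> str:
    """Return the category namespace that hosts a service (hypothetical helper)."""
    for namespace, services in NAMESPACES.items():
        if service in services:
            return namespace
    raise KeyError(f"{service!r} is not assigned to any namespace")


print(namespace_for("Grafana"))  # monitoring
```

Adding a new category in this model is one new dictionary key, which mirrors the "easy to add new categories" benefit listed above.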

VPN Mesh Networking

Headscale provides a self-hosted Tailscale control server for a secure mesh VPN:

  • Address Range: 100.64.0.0/10 (CGNAT space)
  • Protocol: WireGuard
  • Enrollment: Pre-auth keys via kubectl exec
  • Client: Official Tailscale client on all platforms
  • Access: All services VPN-only (no public exposure, apart from the Headscale enrollment endpoint)
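
To make the address range concrete, here is a small sketch that uses Python's standard `ipaddress` module to test whether an address falls inside 100.64.0.0/10. The helper name `is_vpn_address` is an assumption for illustration and is not part of Headscale or Tailscale.

```python
import ipaddress

# The CGNAT block that Headscale allocates from, as listed above.
VPN_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_vpn_address(address: str) -> bool:
    """Return True if the address lies inside the VPN range (hypothetical helper)."""
    return ipaddress.ip_address(address) in VPN_RANGE


print(VPN_RANGE.num_addresses)        # 4194304
print(is_vpn_address("100.64.0.1"))   # True
print(is_vpn_address("100.128.0.1"))  # False
```

A /10 prefix leaves 22 host bits, so the range spans 100.64.0.0 through 100.127.255.255.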

Architecture:

┌─────────────┐           ┌──────────────┐          ┌────────────┐
│   Laptop    │           │   Desktop    │          │ Kubernetes │
│ Tailscale   │◀─────────▶│  Tailscale   │◀────────▶│    Pods    │
│ 100.64.x.x  │           │  100.64.x.x  │          │ (sidecars) │
└─────────────┘           └──────────────┘          └────────────┘
       │                         │                         │
       └─────────────────────────┼─────────────────────────┘
                                 ▼
                         ┌───────────────┐
                         │   Headscale   │
                         │  (core ns)    │
                         │ 100.64.0.1    │
                         └───────────────┘

See Networking Architecture and VPN Enrollment Guide.

Identity Management

FreeIPA provides centralized authentication and user management:

  • Protocol: LDAPS (port 636)
  • Base DN: dc=example,dc=org (configured via Terraform)
  • Services: LDAP directory, Kerberos KDC, certificate authority
  • Integration: All services authenticate via FreeIPA
  • Web UI: https://ipa.example.com (VPN-only)
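
The base DN follows the usual convention of one `dc=` component per DNS label. The sketch below shows that mapping and the shape of an LDAPS URL. Both helper names are hypothetical, and using the web UI host as the LDAP host is an assumption, since the document does not state the LDAP hostname.

```python
def base_dn(domain: str) -> str:
    """Build a base DN from a DNS domain, one dc= component per label."""
    return ",".join(f"dc={label}" for label in domain.split("."))


def ldaps_url(host: str, port: int = 636) -> str:
    """Format an LDAPS URL; 636 is the LDAPS port listed above."""
    return f"ldaps://{host}:{port}"


print(base_dn("example.org"))        # dc=example,dc=org
print(ldaps_url("ipa.example.com"))  # ldaps://ipa.example.com:636
```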

LDAP Integration:

  • Redmine project management
  • Grafana dashboarding
  • Future services (GitLab, NetBox, Vault)

See FreeIPA Service and LDAP Integration Guide.

Multi-Container StatefulSet Pattern

All services use a standardized 3-container architecture (5 containers when the lifecycle containers are included):

Core Containers:

  1. nginx-tls - HTTPS termination (port 443), proxies to localhost
  2. Application - Main service on localhost (not exposed externally)
  3. Tailscale - VPN sidecar for mesh connectivity

Lifecycle Containers (when Tailscale enabled):

  4. lifecycle-cleanup (init) - Cleans up old Headscale nodes and DNS records before startup
  5. lifecycle-dns-create (sidecar) - Creates DNS record after Tailscale registers VPN IP
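
The container layout can be sketched as a function that returns the init and regular containers for one pod. This is only a model of the pattern described here, not the actual Terraform. The function name, the `lifecycle` flag, and the `tailscale` container name are assumptions. The other container names come from the lists above.

```python
def pod_containers(app_name: str, lifecycle: bool = True) -> dict:
    """Model the 3-container pod, plus the 2 lifecycle containers when enabled."""
    init_containers = []
    # Core containers: TLS proxy, the application itself, and the VPN sidecar.
    containers = ["nginx-tls", app_name, "tailscale"]
    if lifecycle:
        init_containers.append("lifecycle-cleanup")  # runs before startup
        containers.append("lifecycle-dns-create")    # runs alongside the app
    return {"init_containers": init_containers, "containers": containers}


layout = pod_containers("grafana")
print(len(layout["init_containers"]) + len(layout["containers"]))  # 5
```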

Benefits:

  • Security isolation (app never exposed directly)
  • Automatic cleanup of orphaned resources
  • Self-healing DNS on pod recreation
  • VPN-only access enforcement
  • Consistent deployment pattern

See StatefulSet Pattern Details.

Data Flow

Service Access Flow

User ──▶ VPN ──▶ DNS ──▶ nginx-tls ──▶ Application
         (WG)    (A)     (HTTPS)       (localhost)
  1. User connects to VPN (Tailscale client → Headscale)
  2. DNS resolves service hostname to pod VPN IP
  3. nginx-tls terminates TLS, validates VPN IP range
  4. nginx-tls proxies request to application on localhost
  5. Application processes request (may authenticate via FreeIPA LDAP)

Metrics Collection Flow

Pods ──▶ Prometheus ──▶ Thanos ──▶ Grafana
         (scrape)       (store)    (query)
                                     ▲
Pods ──▶ Promtail ──▶ Loki ──────────┘
         (push)        (index)
  1. Prometheus scrapes /metrics endpoints (ServiceMonitor auto-discovery)
  2. Thanos optionally archives and downsamples metrics for long-term storage
  3. Promtail DaemonSet collects logs from all pods
  4. Loki indexes logs with Kubernetes metadata labels
  5. Grafana queries Prometheus/Thanos for metrics and Loki for logs
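
Step 4 can be pictured as grouping log lines into streams keyed by their label set, since Loki indexes labels rather than the log text itself. The sketch below is a conceptual model only and does not use any Loki or Promtail API. `group_into_streams` and the sample labels are hypothetical.

```python
from collections import defaultdict


def group_into_streams(entries):
    """Group (labels, line) pairs into streams keyed by the sorted label set."""
    streams = defaultdict(list)
    for labels, line in entries:
        # Sorting makes the key independent of the order labels were written in.
        key = tuple(sorted(labels.items()))
        streams[key].append(line)
    return dict(streams)


entries = [
    ({"namespace": "monitoring", "pod": "grafana-0"}, "dashboard loaded"),
    ({"namespace": "core", "pod": "headscale-0"}, "node registered"),
    ({"pod": "grafana-0", "namespace": "monitoring"}, "query finished"),
]
streams = group_into_streams(entries)
print(len(streams))  # 2
```

Both Grafana lines land in the same stream because they carry the same labels, which is why a small, stable label set keeps the index compact.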

See Monitoring Guide.

DNS Management Flow

Terraform ──▶ Cloudflare API ──▶ DNS Records
              (create A record)
                    │
                    ▼
Init Container ──▶ Update IP ──▶ Service VPN IP
(lifecycle)        (on startup)
                    │
                    ▼
DNS Sidecar ────▶ Update IP ──▶ Final VPN IP
(lifecycle)       (after VPN up)
  1. Terraform creates DNS A record with fallback IP (node IP)
  2. Init container updates DNS to pod's VPN IP on startup
  3. DNS sidecar waits for Tailscale connection, updates to final VPN IP
  4. Cleanup on deletion: Init container removes old records before starting
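
The sequence above can be traced with a small in-memory simulation of how one A record changes over a pod's startup. It makes no Cloudflare API calls. The function name, hostname, and IP addresses are placeholders chosen for illustration, and combining the cleanup and startup update into a single init step is a simplifying assumption.

```python
def simulate_dns_flow(hostname, fallback_ip, startup_vpn_ip, final_vpn_ip):
    """Return the (actor, record value) history for one hostname."""
    records = {}
    history = []

    # Step 1: Terraform creates the A record with the fallback (node) IP.
    records[hostname] = fallback_ip
    history.append(("terraform", records[hostname]))

    # Steps 4 and 2: the init container removes the stale record,
    # then writes the VPN IP it knows at startup.
    records.pop(hostname, None)
    records[hostname] = startup_vpn_ip
    history.append(("init-container", records[hostname]))

    # Step 3: the DNS sidecar writes the final IP once Tailscale is connected.
    records[hostname] = final_vpn_ip
    history.append(("dns-sidecar", records[hostname]))
    return history


for actor, ip in simulate_dns_flow(
    "grafana.example.com", "192.0.2.10", "100.64.0.7", "100.64.0.7"
):
    print(actor, ip)
```

Because the record always has some value after step 1, nothing has to wait for the VPN to come up before DNS exists, which is the role the fallback IP plays.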

See DNS Management Guide.

Security Architecture

Network Security

  • VPN-only access: All services accessible only via Tailscale VPN (100.64.0.0/10)
  • nginx IP restrictions: allow 100.64.0.0/10; deny all;
  • No public exposure: Services not accessible from internet
  • TLS everywhere: cert-manager + Let's Encrypt for all HTTPS
  • External ingress: Only Headscale enrollment endpoint public
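
The `allow 100.64.0.0/10; deny all;` pair relies on nginx checking access rules in order and stopping at the first match. The sketch below imitates that evaluation in Python to show why the order matters. It is an illustration of the rule semantics, not nginx code, and `is_allowed` is a hypothetical name.

```python
import ipaddress

# Rules mirror the nginx directives quoted above, in the same order.
RULES = [("allow", "100.64.0.0/10"), ("deny", "all")]


def is_allowed(client_ip, rules=RULES):
    """Apply allow/deny rules in sequence; the first matching rule decides."""
    address = ipaddress.ip_address(client_ip)
    for action, target in rules:
        if target == "all" or address in ipaddress.ip_network(target):
            return action == "allow"
    return True  # with no matching rule, nginx lets the request through


print(is_allowed("100.100.5.9"))  # True
print(is_allowed("203.0.113.7"))  # False
```

Reversing the two rules would reject every client, because `deny all` would match before the `allow` line is reached.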

Authentication & Authorization

  • Centralized auth: FreeIPA LDAP for all service logins
  • LDAPS encryption: Port 636 for LDAP communication
  • RBAC boundaries: Kubernetes namespaces with strict RoleBindings
  • Cross-namespace RBAC: Explicit bindings for lifecycle scripts
  • Service accounts: Minimal permissions per service

Secrets Management

  • Kubernetes secrets: Passwords, tokens, keys
  • Terraform sensitive vars: Marked sensitive = true
  • Never committed: .env, *.tfvars, credentials gitignored
  • Vault (future): Centralized secret storage with auto-rotation

See Security Architecture for detailed security design.

Deployment Model

Infrastructure as Code

  • Tool: Terraform (HCL)
  • Pattern: Single terraform apply deploys everything
  • State: Local state file (S3/MinIO planned)
  • Variables: terraform.tfvars for configuration
  • Modules: None (flat structure for simplicity)

Dependency Management

  • Pattern: Explicit depends_on without [0] indexing
  • Count-based: All services toggleable via var.service_enabled
  • No circular deps: Fallback IPs prevent chicken-and-egg scenarios
  • Self-healing: Services recover automatically on failures
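
One way to reason about toggleable services with explicit dependencies is a topological sort over only the enabled services. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The toggles and the dependency edges are hypothetical examples, not the project's real dependency tree.

```python
from graphlib import TopologicalSorter

# Hypothetical toggles, playing the role of var.service_enabled flags.
ENABLED = {"headscale": True, "freeipa": True, "grafana": True, "redmine": False}

# Hypothetical edges: each service lists what it depends on.
DEPENDS_ON = {
    "headscale": [],
    "freeipa": ["headscale"],
    "grafana": ["freeipa"],
    "redmine": ["freeipa"],
}


def deploy_order(enabled, depends_on):
    """Return enabled services in an order that satisfies their dependencies."""
    graph = {
        service: [dep for dep in deps if enabled.get(dep, False)]
        for service, deps in depends_on.items()
        if enabled.get(service, False)
    }
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(graph).static_order())


print(deploy_order(ENABLED, DEPENDS_ON))  # ['headscale', 'freeipa', 'grafana']
```

Disabled services drop out of the graph entirely, so nothing else is ordered against a resource that will not exist.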

See Dependency Tree Management.

GitOps (Future)

  • ArgoCD planned: Automatic sync from Git
  • Current: Manual Terraform apply
  • Dashboard sync: Grafana uses git-sync sidecar pattern

Scalability Considerations

Current Limitations

  • Single-replica: Most services run 1 replica (StatefulSet pattern)
  • Local storage: PVCs on Linode block storage
  • Thanos storage: Filesystem-backed (not S3/object storage)
  • Loki retention: emptyDir volume (ephemeral; logs are lost when the pod is deleted)

Scaling Strategies

Vertical Scaling:

  • Adjust CPU/memory limits in Terraform variables
  • Increase PVC sizes (requires storage class support)

Horizontal Scaling (Future):

  • Convert StatefulSets to Deployments where applicable
  • Add LoadBalancer services for multi-replica
  • Implement object storage for Thanos/Loki

Node Scaling:

  • Linode LKE autoscaling enabled
  • Node affinity for CPU vs GPU workloads
  • Tolerations for specialized nodes

See Scaling Guide.
