A GPU-first user-space inference runtime that orchestrates multiple frontier LLMs on a single consumer machine.
BlueOS treats your GPU as the primary processor and your CPU as an I/O coprocessor. Four specialized models (1T, 685B, and 80B parameters; the coder and fixer stages share the 685B model) collaborate through shared natural-language memory on hardware with only 16GB VRAM — no cloud, no API keys, no subscription.
Rust orchestration (12K LOC) + C++/CUDA backend (5.7K LOC), targeting Linux userspace.
Commodity hardware is massively underutilized for LLM inference. A single RTX 4080 has ~700 GB/s of memory bandwidth, 9,728 CUDA cores, and tensor cores designed for matrix math — but generic OS abstractions treat the GPU as a peripheral device behind a slow PCIe bus. The result: most consumer hardware achieves 10-20% of its theoretical throughput during inference.
The insight: treat the GPU as the primary compute engine and the CPU as an I/O coprocessor, not the other way around. This is the same architecture that made the N64's RCP revolutionary — dedicate the fastest silicon to the hot path, and use everything else to keep it fed. BlueOS runs entirely in userspace (no kernel modules), using `SCHED_DEADLINE`, `mlock`, and CUDA streams to extract maximum performance without touching the kernel.
The second insight: one model doing everything is worse than four models each doing one thing. A 1T-parameter thinker reasons about the problem. A 685B coder writes the solution. An 80B reviewer finds the bugs. The coder fixes them. They communicate through a shared natural-language memory called the Thinking Buffer — each model reads enriched context from prior stages without knowing other models exist.
The full BlueOS stack, from hardware to API surface. Data flows bottom-up during boot, top-down during inference.
```mermaid
graph TB
subgraph API["Layer 4: Inference API"]
GEN["blue_generate()"]
STREAM["blue_generate_stream_poll()"]
CANCEL["blue_cancel()"]
KVAPI["blue_kv_save / blue_kv_load"]
end
subgraph CASCADE["Layer 3: Cascade Pipeline"]
TB["Thinking Buffer<br/>(named slots)"]
SPEC["Speculative Decoding<br/>(Qwen3-1.5B draft)"]
STAGES["4-Stage Cascade<br/>Think → Code → Review → Fix"]
ENTROPY["Entropy Monitor<br/>(skip / trim / budget)"]
end
subgraph MEM["Layer 2: Memory Subsystem"]
T0["Tier 0: VRAM<br/>16GB · 700 GB/s"]
T1["Tier 1: Pinned RAM<br/>48GB · 38 GB/s DDR5"]
T2["Tier 2: NVMe RAID-0<br/>4TB · 28 GB/s"]
DMA["DMA Engine<br/>(A-B double buffer)"]
UMH["Unified Memory Hub"]
end
subgraph SCHED["Layer 1: HCAL Scheduler"]
HCAL["Heterogeneous Compute<br/>Abstraction Layer"]
PCORES["P-cores 0-7<br/>(attention, orchestration)"]
ECORES["E-cores 8-23<br/>(data prep, tokenization)"]
GPU["GPU sm_89<br/>(FFN, MoE, dequant)"]
end
subgraph HAL["Layer 0: Hardware Abstraction"]
PCIE["PCIe Topology<br/>Gen4 x16 · 25 GB/s"]
GDS["GDS / pread() fallback"]
CPUDET["CPU Feature Detection<br/>(AVX2, FMA, AMX)"]
BLUEIR["Blue-IR 6-op ISA<br/>(sparsity fast-path)"]
end
subgraph HW["Hardware"]
HWCPU["i9-13900F<br/>8P + 16E cores"]
HWGPU["RTX 4080 16GB<br/>sm_89 · Ada Lovelace"]
HWRAM["64GB DDR5-5600"]
HWNVM["4× NVMe RAID-0"]
end
API --> CASCADE
CASCADE --> MEM
MEM --> SCHED
SCHED --> HAL
HAL --> HW
classDef compute fill:#2563eb,stroke:#1e40af,color:#fff
classDef memory fill:#059669,stroke:#047857,color:#fff
classDef io fill:#d97706,stroke:#b45309,color:#fff
classDef hw fill:#6b7280,stroke:#4b5563,color:#fff
class GEN,STREAM,CANCEL,KVAPI,TB,SPEC,STAGES,ENTROPY,HCAL,PCORES,ECORES,GPU compute
class T0,T1,T2,DMA,UMH memory
class PCIE,GDS,CPUDET,BLUEIR io
class HWCPU,HWGPU,HWRAM,HWNVM hw
```
Three-tier virtual VRAM makes 16GB act like 64GB+. A-B double-buffering hides PCIe latency: while the GPU computes on Buffer A, the next layer's weights stream into Buffer B via DMA. Compute time per layer (~10ms) exceeds DMA time (~8ms), so the GPU never stalls.
```mermaid
graph LR
subgraph TIER0["Tier 0 — GPU VRAM (16GB)"]
ACT["Activations<br/>KV Cache"]
BUF_A["Buffer A<br/>(computing)"]
BUF_B["Buffer B<br/>(DMA filling)"]
DRAFT["Draft Model<br/>(permanent)"]
end
subgraph TIER1["Tier 1 — Pinned Host RAM (48GB)"]
WARM["Warm Weights<br/>(queued models)"]
PREFETCH["Prefetched Experts<br/>(MoE co-activation)"]
POOL["MemAscend Pools<br/>(adaptive sizing)"]
end
subgraph TIER2["Tier 2 — NVMe RAID-0 (4TB)"]
GGUF["GGUF Model Files"]
KVSNAPSHOT["KV Cache Snapshots"]
COLD["Cold Weights"]
end
TIER2 -- "28 GB/s<br/>io_uring async" --> TIER1
TIER1 -- "25 GB/s<br/>PCIe DMA" --> TIER0
BUF_A <-.-> BUF_B
classDef hot fill:#dc2626,stroke:#b91c1c,color:#fff
classDef warm fill:#d97706,stroke:#b45309,color:#fff
classDef cold fill:#2563eb,stroke:#1e40af,color:#fff
class ACT,BUF_A,BUF_B,DRAFT hot
class WARM,PREFETCH,POOL warm
class GGUF,KVSNAPSHOT,COLD cold
```
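The never-stalls claim reduces to simple arithmetic: with double buffering, only the very first transfer is exposed, and every later one hides behind the previous layer's compute. A minimal sketch (layer count, timings, and the function name are illustrative assumptions):

```rust
/// Illustrative A-B double-buffer schedule: while the GPU computes on one
/// buffer, the next layer's weights stream into the other. Total exposed
/// wait = initial fill + per-layer overhang max(0, dma - compute).
fn pipeline_stall_ms(n_layers: u32, compute_ms: f64, dma_ms: f64) -> f64 {
    // The first layer's DMA cannot be hidden; every later transfer
    // overlaps with the previous layer's compute window.
    let hidden_transfers = n_layers.saturating_sub(1) as f64;
    dma_ms + hidden_transfers * (dma_ms - compute_ms).max(0.0)
}

fn main() {
    // Assumed numbers from the text: ~10 ms compute, ~8 ms DMA per layer.
    let stall = pipeline_stall_ms(61, 10.0, 8.0);
    // Only the initial fill is exposed; steady state never waits on PCIe.
    println!("total GPU wait: {stall} ms"); // 8 ms across the whole model
}
```

If DMA ever exceeded compute (say 5 ms compute vs. 8 ms DMA), the overhang would accumulate per layer, which is why the tier design keeps the per-layer transfer under the compute window.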
The central innovation. Four specialized models collaborate through shared natural-language memory. Each stage reads from previous slots, writes to its own, and the entropy monitor decides whether downstream stages can be skipped.
```mermaid
graph LR
TASK["Task Input"]
subgraph S1["Stage 1: THINKER"]
M1["Kimi K2<br/>(1T MoE, 62B active)"]
end
subgraph S2["Stage 2: CODER"]
M2["DeepSeek V3.2<br/>(685B MoE, 37B active)"]
end
subgraph S3["Stage 3: REVIEWER"]
M3["Dense 80B"]
end
subgraph S4["Stage 4: FIXER"]
M4["DeepSeek V3.2<br/>(685B MoE, 37B active)"]
end
OUTPUT["Final Output"]
TASK -- "task" --> S1
S1 -- "reasoning" --> S2
S2 -- "code_draft" --> S3
S3 -- "review" --> S4
S4 -- "final_code" --> OUTPUT
ENT{"Entropy<br/>Monitor"}
ENT -. "H < threshold<br/>→ skip stage" .-> S3
ENT -. "H < threshold<br/>→ skip stage" .-> S4
classDef think fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef code fill:#2563eb,stroke:#1e40af,color:#fff
classDef review fill:#059669,stroke:#047857,color:#fff
classDef entropy fill:#dc2626,stroke:#b91c1c,color:#fff
class M1 think
class M2,M4 code
class M3 review
class ENT entropy
```
Slot rules: Each model sees only the Thinking Buffer slots it's configured to read. No model knows about the others — they just see progressively enriched context. The entropy monitor tracks output entropy per-token: if a stage produces low-entropy output (high confidence), downstream stages that would add minimal value are skipped.
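The slot discipline and the skip rule can be sketched in a few lines of Rust. This is a minimal illustration, not the real `thinking_buffer.rs`: the slot names, the 0.5-bit threshold, and the struct shape are all assumptions.

```rust
use std::collections::HashMap;

/// Shared natural-language memory: each stage reads its configured slots
/// and writes exactly one, never learning which model filled the others.
#[derive(Default)]
struct ThinkingBuffer {
    slots: HashMap<String, String>,
}

impl ThinkingBuffer {
    /// Assemble context from only the slots this stage is allowed to read.
    fn context_for(&self, reads: &[&str]) -> String {
        reads
            .iter()
            .filter_map(|s| self.slots.get(*s))
            .cloned()
            .collect::<Vec<_>>()
            .join("\n\n")
    }

    fn write(&mut self, slot: &str, text: &str) {
        self.slots.insert(slot.to_string(), text.to_string());
    }
}

/// Mean Shannon entropy (bits) over per-token probability distributions.
fn mean_entropy(token_dists: &[Vec<f64>]) -> f64 {
    let h: f64 = token_dists
        .iter()
        .map(|p| -p.iter().filter(|&&x| x > 0.0).map(|x| x * x.log2()).sum::<f64>())
        .sum();
    h / token_dists.len() as f64
}

fn main() {
    let mut tb = ThinkingBuffer::default();
    tb.write("reasoning", "Use a two-pointer scan.");
    tb.write("code_draft", "fn solve() { /* ... */ }");
    // The reviewer sees enriched context, not the models behind it.
    let ctx = tb.context_for(&["reasoning", "code_draft"]);
    assert!(ctx.contains("two-pointer"));

    // Skip rule (threshold assumed): confidently peaked distributions mean
    // low entropy, so a downstream stage would add little value.
    let dists = vec![vec![0.97, 0.01, 0.01, 0.01]; 8];
    let skip_review = mean_entropy(&dists) < 0.5;
    println!("skip review stage: {skip_review}");
}
```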
Rust handles orchestration, scheduling, and memory management. C++/CUDA handles kernel launches and raw GPU compute. They meet at a strict C ABI boundary — no C++ types cross the FFI.
```mermaid
graph TB
subgraph RUST["Rust Side (blueos/)"]
RAPI["API Surface<br/>CascadeExecutor, ThinkingBuffer"]
RSCHED["HCAL Scheduler"]
RMEM["Memory Manager<br/>TV-VRAM, DMA, UMH"]
RSPEC["Speculative Engine"]
RFFI["common/ffi.rs<br/>extern C declarations<br/>#[repr(C)] structs"]
end
subgraph FFI["FFI Boundary — C ABI"]
HEADER["runtime.h<br/>Single source of truth"]
TYPES["Only: i32, u32, f32, *const u8,<br/>*mut opaque handles"]
end
subgraph CPP["C++ / CUDA Side (sovereign/blue/)"]
RUNTIME["BlueRuntime<br/>blue_pool_*, blue_generate"]
MPOOL["Model Pool<br/>weight switching"]
KERN["CUDA Kernels (sm_89)<br/>fused_dequant_gemv<br/>fused_rmsnorm<br/>sparse_ffn<br/>async_prefetch"]
LLAMA["llama.cpp / ggml<br/>tensor ops foundation"]
end
RUST --> FFI
FFI --> CPP
classDef rust fill:#d97706,stroke:#b45309,color:#fff
classDef ffi fill:#6b7280,stroke:#4b5563,color:#fff
classDef cpp fill:#2563eb,stroke:#1e40af,color:#fff
class RAPI,RSCHED,RMEM,RSPEC,RFFI rust
class HEADER,TYPES ffi
class RUNTIME,MPOOL,KERN,LLAMA cpp
```
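The boundary contract looks roughly like this on the Rust side. Field names, the signature, and the opaque-handle pattern are illustrative assumptions; the real surface is defined in `runtime.h` and mirrored in `common/ffi.rs`.

```rust
use std::os::raw::{c_float, c_int};

/// Opaque handle: Rust never learns the C++ object's layout. A zero-sized
/// private field makes the type unconstructible from Rust.
#[repr(C)]
pub struct BlueRuntime {
    _private: [u8; 0],
}

/// Only C-compatible scalars cross the boundary; #[repr(C)] pins the
/// field order and padding to what a C compiler would produce.
#[repr(C)]
pub struct GenParams {
    pub max_tokens: c_int,
    pub temperature: c_float,
    pub draft_k: c_int,
}

// Declaration-only sketch of the extern surface. It is never called here,
// so nothing needs to link; real declarations mirror runtime.h one-to-one.
extern "C" {
    fn blue_generate(
        rt: *mut BlueRuntime,
        prompt: *const u8,
        len: u32,
        params: *const GenParams,
    ) -> c_int;
}

fn main() {
    let p = GenParams { max_tokens: 512, temperature: 0.7, draft_k: 4 };
    println!("params: {} tokens, K={}", p.max_tokens, p.draft_k);
}
```

Because both sides agree on the `repr(C)` layout, neither needs the other's headers beyond `runtime.h`, and a mock implementation can satisfy the same symbols for GPU-free testing.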
What happens on a single blue_generate() call, end to end.
```mermaid
graph TB
REQ["Inference Request"]
SCHED["HCAL: select execution target"]
CHECK{"Weights in<br/>VRAM?"}
HIT["Proceed to compute"]
MISS["Tier 2→1→0 DMA pipeline<br/>(A-B double buffer)"]
DRAFT["Stage 1: Draft tokens<br/>(Qwen3-1.5B on E-cores)"]
VERIFY["Stage 2: Verify batch<br/>(target model, 1 forward pass)"]
ACCEPT{"Tokens<br/>accepted?"}
COMMIT["Commit to KV cache"]
REJECT["Rewind, reduce K"]
EXTEND["Extend generation"]
STREAM["Stream tokens to caller<br/>(ring buffer, per-token entropy)"]
DONE["Response complete"]
REQ --> SCHED
SCHED --> CHECK
CHECK -- "hit" --> HIT
CHECK -- "miss" --> MISS
MISS --> HIT
HIT --> DRAFT
DRAFT --> VERIFY
VERIFY --> ACCEPT
ACCEPT -- "yes" --> COMMIT
ACCEPT -- "no" --> REJECT
REJECT --> DRAFT
COMMIT --> EXTEND
EXTEND --> STREAM
STREAM --> DONE
classDef hot fill:#dc2626,stroke:#b91c1c,color:#fff
classDef compute fill:#2563eb,stroke:#1e40af,color:#fff
classDef mem fill:#059669,stroke:#047857,color:#fff
class DRAFT,VERIFY,EXTEND compute
class MISS,COMMIT mem
class STREAM,DONE hot
```
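The draft/verify/accept loop above can be sketched for the greedy case. Toy deterministic "models" stand in for real forward passes, and the adaptive-K rule is an assumption about the general shape of the policy, not the exact one in `speculative.rs`:

```rust
/// Toy target model: deterministic next-token function (greedy decoding).
fn target_next(prev: u32) -> u32 {
    (prev * 31 + 7) % 101
}

/// Toy draft model: agrees with the target except on multiples of 5.
fn draft_next(prev: u32) -> u32 {
    if prev % 5 == 0 { (prev + 1) % 101 } else { target_next(prev) }
}

/// Draft K tokens, verify them against the target "in one pass", commit the
/// accepted prefix plus the target's correction, then adapt K.
fn speculative_generate(seed: u32, len: usize) -> Vec<u32> {
    let mut out = vec![seed];
    let mut k = 4usize;
    while out.len() < len + 1 {
        // Draft phase (cheap model).
        let mut drafts = Vec::with_capacity(k);
        let mut prev = *out.last().unwrap();
        for _ in 0..k {
            prev = draft_next(prev);
            drafts.push(prev);
        }
        // Verify phase: the target recomputes the same positions in a batch.
        let mut accepted = 0;
        let mut prev = *out.last().unwrap();
        for &d in &drafts {
            let t = target_next(prev);
            if t == d && out.len() + accepted < len + 1 {
                accepted += 1;
                prev = t;
            } else {
                break;
            }
        }
        out.extend_from_slice(&drafts[..accepted]);
        if out.len() < len + 1 {
            // On rejection, the target's own token comes free — commit it.
            out.push(target_next(*out.last().unwrap()));
        }
        // Adaptive K: grow speculation when drafts land, shrink on misses.
        k = if accepted == drafts.len() { (k + 1).min(8) } else { (k / 2).max(1) };
    }
    out
}

fn main() {
    // Losslessness check: identical to pure target-model greedy decoding.
    let mut plain = vec![3u32];
    for _ in 0..32 {
        plain.push(target_next(*plain.last().unwrap()));
    }
    assert_eq!(speculative_generate(3, 32), plain);
    println!("speculative output matches the target exactly");
}
```

Every committed token equals `target_next` of its predecessor by construction, which is the greedy-case version of the "identical output distribution" guarantee.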
- **User-space over kernel module.** No `insmod`, no root required, no kernel version coupling. `SCHED_DEADLINE` + `mlock` + `io_uring` give us everything we need from userspace. The blast radius of a bug is one process, not a kernel panic.
- **Rust + C++/CUDA split at the FFI boundary.** Rust owns orchestration (scheduling, memory management, cascade logic) because those are complex state machines where memory safety matters. C++/CUDA owns GPU compute because CUDA's C++ API is the only way to access tensor cores, shared memory, and warp-level primitives. The FFI boundary is a strict C ABI — no C++ types, no Rust types, just `i32`, `f32`, and opaque pointers.
- **Tiered virtual VRAM over unified memory.** CUDA Unified Memory hides the memory hierarchy, which means you can't optimize for it. TV-VRAM makes the tiers explicit: Tier 0 (VRAM, 700 GB/s) for hot data, Tier 1 (pinned RAM, 38 GB/s) for warm weights, Tier 2 (NVMe, 28 GB/s) for cold storage. A-B double buffering ensures the GPU never stalls waiting for PCIe.
- **Speculative decoding cascade over standard autoregressive decoding.** A small draft model (Qwen3-1.5B, always resident in VRAM) generates K candidate tokens. The target model verifies them in a single forward pass. Mathematically lossless — identical output distribution. 2-4x throughput increase at zero quality cost. Adaptive K tracks the acceptance rate and adjusts speculation depth.
- **Four specialized models over one generalist.** A 1T thinker with full attention on reasoning outperforms a generalist splitting attention between reasoning, coding, reviewing, and fixing. Each stage builds on verified prior work in the Thinking Buffer rather than maintaining everything in one attention window.
- **PCIe topology detection at boot.** Not all NVMe drives have equal latency — CPU-direct drives bypass the PCH, saving ~2µs per I/O. BlueOS enumerates the PCIe topology at boot, identifies which drives are CPU-direct vs. PCH-routed, and enables GPUDirect Storage only where the topology actually supports it. If GDS isn't available, it falls back to `pread()` without crashing.
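The fallback pattern from the last point can be sketched generically. The trait, type names, and probe flag are illustrative assumptions; the real gating lives in `pcie_topology.rs` and `gds_bridge.cpp`.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // pread(2)-style positional reads

/// Unified read interface: GDS where the topology supports it, pread
/// otherwise. Callers never branch on which path is active.
trait WeightReader {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
}

/// Fallback path: ordinary positional reads through the page cache.
struct PreadReader(File);

impl WeightReader for PreadReader {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        self.0.read_at(buf, offset)
    }
}

/// `gds_supported` stands in for the real boot-time capability probe.
fn open_weights(path: &str, gds_supported: bool) -> io::Result<Box<dyn WeightReader>> {
    if gds_supported {
        // Real code would open a cuFile handle here; this sketch always
        // falls through to the pread path.
    }
    Ok(Box::new(PreadReader(File::open(path)?)))
}

fn main() -> io::Result<()> {
    std::fs::write("/tmp/blueos_demo.bin", b"GGUFdemo")?;
    let reader = open_weights("/tmp/blueos_demo.bin", false)?;
    let mut buf = [0u8; 4];
    reader.read_at(4, &mut buf)?;
    assert_eq!(&buf, b"demo"); // same bytes regardless of the I/O path
    Ok(())
}
```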
| Component | Spec | Role in BlueOS |
|---|---|---|
| CPU | i9-13900F (8P + 16E cores) | P-cores: attention, orchestration. E-cores: tokenization, data prep |
| GPU | RTX 4080 16GB (sm_89, Ada Lovelace) | Expert FFN, dequantization, MoE compute. 700 GB/s VRAM bandwidth |
| RAM | 64GB DDR5-5600 | Tier 1 warm weights, prefetched experts. 38 GB/s bandwidth |
| Storage | 4× NVMe RAID-0 | Tier 2 cold storage, GGUF files. 28 GB/s aggregate |
| PCIe | Gen 4 x16 | GPU ↔ RAM highway. 25 GB/s effective |
```
GPU VRAM internal: ~700 GB/s  (Tier 0 — activations live here)
DDR5 RAM:          ~38 GB/s   (Tier 1 — warm weights)
NVMe RAID-0:       ~28 GB/s   (Tier 2 — cold storage)
PCIe Gen4 x16:     ~25 GB/s   (GPU ↔ RAM transfer)
L2 cache (GPU):    ~3 TB/s    (on-chip, exploited by fused kernels)
L3 cache (CPU):    ~300 GB/s  (36MB shared, used by CPU attention)
```
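These numbers set the overlap budget. A back-of-envelope check, assuming an illustrative ~200 MB of quantized weights per layer (not a measured figure):

```rust
/// Transfer time in milliseconds for `bytes` at `gb_per_s` gigabytes/second.
fn transfer_ms(bytes: f64, gb_per_s: f64) -> f64 {
    bytes / (gb_per_s * 1e9) * 1e3
}

fn main() {
    let layer_bytes = 200e6; // assumed ~200 MB of weights per layer
    let dma = transfer_ms(layer_bytes, 25.0); // PCIe Gen4 x16 effective
    let nvme = transfer_ms(layer_bytes, 28.0); // RAID-0 aggregate
    // PCIe DMA (~8 ms) fits under the ~10 ms compute window, so the A-B
    // double buffer hides it; NVMe keeps Tier 1 fed slightly faster still.
    println!("PCIe DMA {dma:.1} ms, NVMe {nvme:.1} ms per layer");
    assert!(dma < 10.0);
}
```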
```
blueos/                          Rust workspace root
├── Cargo.toml                   Workspace configuration
├── common/                      Shared types: tensors, GPU, entropy, FFI declarations
│   └── src/
│       ├── entropy.rs           Shannon entropy, KL divergence, mutual information
│       ├── ffi.rs               All extern "C" FFI bindings (mirrors runtime.h)
│       ├── tensor.rs            Tensor primitives and quantization types
│       ├── gpu.rs               GPU device types, topology, compute scoring
│       ├── memory.rs            Memory region types, VRAM tier definitions
│       └── error.rs             BlueError types (thiserror)
├── boot/                        6-phase hardware init → runtime handoff
│   └── src/
│       ├── main.rs              Boot sequence entry point
│       ├── cpu_isolation.rs     Core pinning, IRQ steering, power lockdown
│       ├── hal.rs               Hardware abstraction layer
│       └── pcie_topology.rs     PCIe BDF enumeration, GDS capability gating
├── kernel/                      Resource management (not a real OS kernel)
│   └── src/
│       ├── memory/
│       │   ├── tv_vram.rs       Three-tier virtual VRAM manager
│       │   ├── dma_engine.rs    A-B double-buffered DMA transfers
│       │   ├── umh.rs           Unified Memory Hub (global address space)
│       │   ├── pinned_pool.rs   Pre-allocated pinned memory slabs
│       │   ├── expert_prefetch.rs  MoE expert prefetch prediction
│       │   ├── thrash_detector.rs  Detects tier-thrashing patterns
│       │   └── ...              Tier managers, victim selection, demand paging
│       ├── scheduler/
│       │   ├── hcal.rs          HCAL: heterogeneous compute scheduler
│       │   ├── batch.rs         Non-preemptive batch execution
│       │   └── sharding.rs      Proportional work distribution
│       └── arch/mod.rs          CPU feature detection (AVX2, AVX-512, AMX)
├── runtime/                     Multi-model inference orchestration
│   └── src/
│       ├── cascade.rs           Cascade pipeline executor (Think→Code→Review→Fix)
│       ├── thinking_buffer.rs   Shared natural-language memory (named slots)
│       ├── engine.rs            MultiModelEngine — FFI wrapper around BlueRuntime
│       ├── speculative.rs       Universal draft engine (adaptive K, KL tracking)
│       ├── kv_cache.rs          KV cache library and persistence
│       ├── pipelines.rs         Pre-built cascade templates
│       ├── wal.rs               Write-ahead log for Thinking Buffer durability
│       └── profiler.rs          Per-stage timing and telemetry
├── bluefs/                      Weight-optimized I/O
│   └── src/
│       ├── gguf_parser.rs       GGUF format parsing
│       ├── weight_store.rs      Weight streaming and caching
│       └── entropy_stream.rs    Entropy-aware I/O prioritization
└── blueir/                      Custom micro-ISA for sparsity fast-path
    └── src/
        ├── isa.rs               6-instruction ISA: LOAD, STORE, SPARSE, DEQUANT, FMAC, JUMP
        ├── nasm_lower.rs        Compiles Blue-IR to x86-64 via NASM
        └── kernels/dequant.asm  Hand-written dequantization kernel

sovereign/                       C++/CUDA inference backend
├── CMakeLists.txt               Build system (CUDA sm_89, llama.cpp integration)
├── Makefile                     Convenience build/serve/bench targets
├── llama.cpp                    Git submodule — ggml tensor ops foundation
└── blue/
    ├── runtime.h                PUBLIC C API — single source of truth for FFI
    ├── runtime.cpp              API implementation
    ├── model_pool.h/.cpp        Multi-model registry, weight switching, VRAM budgets
    ├── speculative.h/.cpp       Draft engine: adaptive K, verification, KL divergence
    ├── cpu_attention.h/.cpp     CPU-side attention using AVX2+FMA on P-cores
    ├── cpu_gpu_split.h/.cpp     Parallel CPU ∥ GPU execution coordinator
    ├── kernel_dispatch.h/.cpp   Custom CUDA kernel router
    ├── dma_bridge.h/.cpp        DMA transfer management
    ├── gds_bridge.cpp           GPUDirect Storage (NVMe→VRAM bypass)
    ├── streaming.cpp            Token streaming ring buffer
    ├── watchdog.cpp             Generation timeout watchdog
    ├── memory/
    │   ├── tv_vram.cpp          Three-tier memory (C++ side)
    │   └── pinned_pool.cpp      Pre-allocated pinned CUDA memory slabs
    ├── kernels/
    │   ├── fused_dequant_gemv.cu    INT4→FP16 during GEMV (fused)
    │   ├── fused_dequant_gemm_tc.cu Tensor core GEMM with inline dequant
    │   ├── fused_rmsnorm.cu         Fused RMSNorm + scale
    │   ├── sparse_ffn.cu            Skip inactive MoE experts
    │   ├── async_prefetch.cu        Background weight DMA
    │   ├── verify_crc.cu            CRC32 DMA integrity verification
    │   └── test_correctness.cu      Kernel correctness tests
    └── cli/main.cpp             CLI entry point

tools/
└── nanoquant/admm_ptq.py        ADMM-based post-training quantization

build.sh                         Unified build script (CMake → Cargo, mock/native modes)
```
- Rust toolchain (stable, 2021 edition) — `rustup`
- CMake 3.18+
- C++17 compiler (GCC 11+ or Clang 14+)
- CUDA Toolkit 12.0+ (optional — for GPU kernels, targets sm_89)
- NASM assembler (optional — for Blue-IR compiled kernels)
```shell
# Full build: C++ sovereign runtime → Rust workspace (release)
./build.sh

# Rust only (mock mode — no GPU required, all tests pass)
./build.sh rust-only

# C++ only (sovereign runtime + llama.cpp)
./build.sh cpp-only

# Debug build
./build.sh debug

# Run all tests
./build.sh test

# Clean everything
./build.sh clean
```

The build system auto-detects CUDA availability. If `nvcc` is not found, it builds without GPU kernel support. If `libblueruntime.a` is not found, the Rust workspace builds in mock mode — all orchestration logic is testable without a GPU.
- CMake compiles CUDA kernels (`.cu`) → links llama.cpp + BlueRuntime → produces `libblueruntime.a`
- Cargo detects `libblueruntime.a` → enables the `native` feature; if missing → builds in `mock` mode
Both modes compile and pass tests.
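The detection step could look roughly like the following `build.rs` fragment. The artifact path, cfg names, and link directive are assumptions based on the description above, not the project's actual build script.

```rust
use std::path::Path;

/// Decide the link mode from the presence of the static backend library.
fn link_mode(lib_present: bool) -> &'static str {
    if lib_present { "native" } else { "mock" }
}

fn main() {
    // Assumed location of the C++ artifact; the real layout may differ.
    let lib = Path::new("sovereign/build/libblueruntime.a");
    let mode = link_mode(lib.exists());
    if mode == "native" {
        // Hypothetical directives: link the backend and flag native support.
        println!("cargo:rustc-link-search=native=sovereign/build");
        println!("cargo:rustc-link-lib=static=blueruntime");
    }
    println!("cargo:warning=building in {mode} mode");
}
```

Either way the same Rust code compiles; the mock path stubs the FFI surface so orchestration tests run without a GPU.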
| Module | Status | Notes |
|---|---|---|
| Boot sequence (6-phase init) | ✅ Complete | PCIe topology, CPU isolation, power lockdown |
| TV-VRAM (3-tier memory) | ✅ Complete | A-B double buffering, tier management |
| DMA engine | ✅ Complete | Async transfers, CRC32 verification |
| Unified Memory Hub | ✅ Complete | Global address space across tiers |
| HCAL scheduler | ✅ Complete | Compute scoring, proportional sharding |
| Cascade pipeline | ✅ Complete | 4-stage Think→Code→Review→Fix |
| Thinking Buffer | ✅ Complete | Named slots, WAL, context assembly |
| Speculative decoding | ✅ Complete | Adaptive K, KL divergence tracking |
| Entropy monitor | ✅ Complete | Skip logic, budget trimming |
| KV cache persistence | ✅ Complete | Save/load across model switches |
| Blue-IR (6-op ISA) | ✅ Complete | NASM lowering to x86-64 |
| CUDA kernels (6 kernels) | ✅ Complete | fused_dequant_gemv, rmsnorm, sparse_ffn, etc. |
| C API (runtime.h) | ✅ Complete | Full FFI surface, streaming, hooks |
| Expert prefetch prediction | ✅ Complete | Co-activation matrix, MI scoring |
| Mock mode (GPU-free testing) | ✅ Complete | All Rust tests pass without C++ backend |
| GPUDirect Storage | 🚧 In Progress | GDS bridge exists, topology gating implemented |
| Preemptive GPU scheduling | 📋 Planned | blue_abort_stream() for entropy-triggered preemption |
| Continuous training context | 📋 Planned | DPO from cascade traces, LoRA fine-tune on idle |
| Graph-aware VRAM scheduler | 📋 Planned | Timeline-driven DMA pre-scheduling |
| Blue-IR expansion (~25 ops) | 📋 Planned | tinygrad-aligned universal compilation target |
These are architectural invariants, not guidelines. Every commit, every optimization, every design decision is measured against them.
1. Zero Degradation — No abstraction may reduce output quality below the raw model baseline. Speculative decoding is mathematically lossless. Entropy-based skipping only removes redundant stages. If you can't prove losslessness, it doesn't ship.
2. Specialization > Generalization — Four focused models each doing one thing well outperform one giant model doing everything. The Thinking Buffer Cascade is the central innovation. Each stage has full attention on ONE concern: reasoning, coding, reviewing, or fixing.
3. Hardware-First — Software is shaped by the silicon. Pin CPU cores. Steer interrupts. Manage VRAM tiers. Fuse CUDA kernels. Know your bandwidth numbers. Generic abstractions that hide hardware are the enemy. Measure the silicon, then write the code.
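The losslessness claim behind principle 1 has a standard proof sketch for speculative sampling: accept a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$, and on rejection resample from the normalized residual. The emitted marginal is then exactly the target distribution $p$:

```latex
P(\text{emit } x)
  = \underbrace{q(x)\,\min\!\left(1,\tfrac{p(x)}{q(x)}\right)}_{\text{accepted draft}}
  + \underbrace{\Big(1-\textstyle\sum_{y} \min\big(p(y),\,q(y)\big)\Big)}_{\text{rejection probability}}
    \cdot \frac{\max\big(0,\,p(x)-q(x)\big)}{\sum_{y}\max\big(0,\,p(y)-q(y)\big)}
  = \min\big(p(x),q(x)\big) + \max\big(0,\,p(x)-q(x)\big)
  = p(x)
```

The cancellation in the last step uses $\sum_y \max(0, p(y)-q(y)) = 1 - \sum_y \min(p(y), q(y))$, so the residual term's normalizer equals the rejection probability exactly.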
- **Blue-IR Expansion** — Expand from 6 ops to ~25 tinygrad-aligned ops. Blue-IR becomes a universal compilation target for compute hardware. New backends: PTX (sm_89), future RDNA/Metal.
- **Graph-Aware VRAM Manager** — Replace reactive LRU eviction with planned, timeline-driven scheduling. Pre-schedule all DMA transfers based on the deterministic cascade plan. Zero OOM surprises.
- **Preemptive GPU Scheduling** — When the entropy monitor detects degenerate output, kill the CUDA stream, flush partial KV to a checkpoint, and advance to the next cascade stage.
- **Continuous Training Context** — Capture cascade traces as DPO training data (`code_draft` = rejected, `final_code` = chosen). LoRA fine-tune the draft model during idle time. The system improves with use.
- **Scheduling Hint API** — Lock-free SPSC channel from the entropy monitor to HCAL. Hint types: `LOW_ENTROPY`, `HIGH_ENTROPY`, `EARLY_EXIT`, `PREFETCH_NOW`, `SKIP_STAGE`.
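The hint channel's shape can be sketched as follows. The enum mirrors the listed hint types; the payloads are illustrative assumptions, and `std::sync::mpsc` stands in for the planned lock-free SPSC ring so the sketch stays std-only.

```rust
use std::sync::mpsc;

/// Hint types named in the roadmap; payload fields are assumed.
#[derive(Debug, PartialEq)]
enum SchedHint {
    LowEntropy { stage: u8 },
    HighEntropy { stage: u8 },
    EarlyExit,
    PrefetchNow { expert_id: u32 },
    SkipStage { stage: u8 },
}

fn main() {
    // Entropy monitor (single producer) → HCAL (single consumer).
    let (tx, rx) = mpsc::channel();
    tx.send(SchedHint::LowEntropy { stage: 2 }).unwrap();
    tx.send(SchedHint::SkipStage { stage: 3 }).unwrap();

    // HCAL drains hints non-blockingly between scheduling decisions.
    while let Ok(hint) = rx.try_recv() {
        match hint {
            SchedHint::SkipStage { stage } => println!("skip stage {stage}"),
            other => println!("hint: {other:?}"),
        }
    }
}
```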