TomOstt/BlueOS
BlueOS

A GPU-first user-space inference runtime that orchestrates multiple frontier LLMs on a single consumer machine.

BlueOS treats your GPU as the primary processor and your CPU as an I/O coprocessor. Four specialized models (1T, 685B, and 80B parameters; the coder and fixer stages share the 685B model) collaborate through shared natural-language memory on hardware with only 16GB of VRAM — no cloud, no API keys, no subscription.

Rust orchestration (12K LOC) + C++/CUDA backend (5.7K LOC), targeting Linux userspace.


Why BlueOS Exists

Commodity hardware is massively underutilized for LLM inference. A single RTX 4080 has 700 GB/s memory bandwidth, 16,384 CUDA cores, and tensor cores designed for matrix math — but generic OS abstractions treat GPUs as a peripheral device behind a slow PCIe bus. The result: most consumer hardware achieves 10-20% of its theoretical throughput during inference.

The insight: treat the GPU as the primary compute engine and the CPU as an I/O coprocessor, not the other way around. This is the same architecture that made the N64 RCP revolutionary — dedicate the fastest silicon to the hot path, and use everything else to keep it fed. BlueOS runs entirely in userspace (no kernel modules), relying on SCHED_DEADLINE, mlock, and CUDA streams for the control a kernel module would otherwise provide.

The second insight: one model doing everything is worse than four models each doing one thing. A 1T-parameter thinker reasons about the problem. A 685B coder writes the solution. An 80B reviewer finds the bugs. The coder fixes them. They communicate through a shared natural-language memory called the Thinking Buffer — each model reads enriched context from prior stages without knowing other models exist.
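The slot mechanics can be sketched in a few lines of Rust. This is an illustrative model of the Thinking Buffer described above, not the actual BlueOS API: the type and method names are assumptions.

```rust
use std::collections::HashMap;

/// Minimal sketch of a named-slot thinking buffer (illustrative, not
/// the real `thinking_buffer.rs` API).
struct ThinkingBuffer {
    slots: HashMap<String, String>,
}

impl ThinkingBuffer {
    fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    /// Each cascade stage writes exactly one slot.
    fn write(&mut self, slot: &str, text: String) {
        self.slots.insert(slot.to_string(), text);
    }

    /// A stage assembles its context from the slots it is configured to
    /// read; it never sees which model produced them.
    fn assemble(&self, reads: &[&str]) -> String {
        reads
            .iter()
            .filter_map(|s| self.slots.get(*s).map(|v| format!("[{s}]\n{v}")))
            .collect::<Vec<_>>()
            .join("\n\n")
    }
}
```

The point of the sketch: the coder stage is configured to read `task` and `reasoning`, so it sees enriched context without any notion of the thinker model's existence.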


Architecture

System Layer Stack

The full BlueOS stack, from hardware to API surface. Data flows bottom-up during boot, top-down during inference.

graph TB
    subgraph API["Layer 4: Inference API"]
        GEN["blue_generate()"]
        STREAM["blue_generate_stream_poll()"]
        CANCEL["blue_cancel()"]
        KVAPI["blue_kv_save / blue_kv_load"]
    end

    subgraph CASCADE["Layer 3: Cascade Pipeline"]
        TB["Thinking Buffer<br/>(named slots)"]
        SPEC["Speculative Decoding<br/>(Qwen3-1.5B draft)"]
        STAGES["4-Stage Cascade<br/>Think → Code → Review → Fix"]
        ENTROPY["Entropy Monitor<br/>(skip / trim / budget)"]
    end

    subgraph MEM["Layer 2: Memory Subsystem"]
        T0["Tier 0: VRAM<br/>16GB · 700 GB/s"]
        T1["Tier 1: Pinned RAM<br/>48GB · 38 GB/s DDR5"]
        T2["Tier 2: NVMe RAID-0<br/>4TB · 28 GB/s"]
        DMA["DMA Engine<br/>(A-B double buffer)"]
        UMH["Unified Memory Hub"]
    end

    subgraph SCHED["Layer 1: HCAL Scheduler"]
        HCAL["Heterogeneous Compute<br/>Abstraction Layer"]
        PCORES["P-cores 0-7<br/>(attention, orchestration)"]
        ECORES["E-cores 8-23<br/>(data prep, tokenization)"]
        GPU["GPU sm_89<br/>(FFN, MoE, dequant)"]
    end

    subgraph HAL["Layer 0: Hardware Abstraction"]
        PCIE["PCIe Topology<br/>Gen4 x16 · 25 GB/s"]
        GDS["GDS / pread() fallback"]
        CPUDET["CPU Feature Detection<br/>(AVX2, FMA, AMX)"]
        BLUEIR["Blue-IR 6-op ISA<br/>(sparsity fast-path)"]
    end

    subgraph HW["Hardware"]
        HWCPU["i9-13900F<br/>8P + 16E cores"]
        HWGPU["RTX 4080 16GB<br/>sm_89 · Ada Lovelace"]
        HWRAM["64GB DDR5-5600"]
        HWNVM["4× NVMe RAID-0"]
    end

    API --> CASCADE
    CASCADE --> MEM
    MEM --> SCHED
    SCHED --> HAL
    HAL --> HW

    classDef compute fill:#2563eb,stroke:#1e40af,color:#fff
    classDef memory fill:#059669,stroke:#047857,color:#fff
    classDef io fill:#d97706,stroke:#b45309,color:#fff
    classDef hw fill:#6b7280,stroke:#4b5563,color:#fff

    class GEN,STREAM,CANCEL,KVAPI,TB,SPEC,STAGES,ENTROPY,HCAL,PCORES,ECORES,GPU compute
    class T0,T1,T2,DMA,UMH memory
    class PCIE,GDS,CPUDET,BLUEIR io
    class HWCPU,HWGPU,HWRAM,HWNVM hw

Memory Hierarchy (TV-VRAM)

Three-tier virtual VRAM makes 16GB act like 64GB+. A-B double-buffering hides PCIe latency: while the GPU computes on Buffer A, the next layer's weights stream into Buffer B via DMA. Compute time per layer (~10ms) exceeds DMA time (~8ms), so the GPU never stalls.

graph LR
    subgraph TIER0["Tier 0 — GPU VRAM (16GB)"]
        ACT["Activations<br/>KV Cache"]
        BUF_A["Buffer A<br/>(computing)"]
        BUF_B["Buffer B<br/>(DMA filling)"]
        DRAFT["Draft Model<br/>(permanent)"]
    end

    subgraph TIER1["Tier 1 — Pinned Host RAM (48GB)"]
        WARM["Warm Weights<br/>(queued models)"]
        PREFETCH["Prefetched Experts<br/>(MoE co-activation)"]
        POOL["MemAscend Pools<br/>(adaptive sizing)"]
    end

    subgraph TIER2["Tier 2 — NVMe RAID-0 (4TB)"]
        GGUF["GGUF Model Files"]
        KVSNAPSHOT["KV Cache Snapshots"]
        COLD["Cold Weights"]
    end

    TIER2 -- "28 GB/s<br/>io_uring async" --> TIER1
    TIER1 -- "25 GB/s<br/>PCIe DMA" --> TIER0
    BUF_A <-.-> BUF_B

    classDef hot fill:#dc2626,stroke:#b91c1c,color:#fff
    classDef warm fill:#d97706,stroke:#b45309,color:#fff
    classDef cold fill:#2563eb,stroke:#1e40af,color:#fff

    class ACT,BUF_A,BUF_B,DRAFT hot
    class WARM,PREFETCH,POOL warm
    class GGUF,KVSNAPSHOT,COLD cold
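Why the GPU never stalls falls out of simple pipeline arithmetic. A minimal model, using the ~10 ms compute and ~8 ms DMA figures quoted above (the pipeline itself is simplified; startup and tail effects beyond the first fill are ignored):

```rust
/// With A-B double buffering, DMA for layer i+1 overlaps compute for
/// layer i, so steady-state per-layer cost is max(compute, dma).
fn pipelined_ms(layers: u32, compute_ms: f64, dma_ms: f64) -> f64 {
    // The first layer must be DMA'd in before any compute can start.
    dma_ms + (layers as f64) * compute_ms.max(dma_ms)
}

/// Without overlap, every layer pays compute + dma serially.
fn serial_ms(layers: u32, compute_ms: f64, dma_ms: f64) -> f64 {
    (layers as f64) * (compute_ms + dma_ms)
}
```

With 80 layers at 10 ms compute and 8 ms DMA, the pipelined version is bounded by compute alone; the serial version pays the full 18 ms per layer.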

Thinking Buffer Cascade

The central innovation. Four specialized models collaborate through shared natural-language memory. Each stage reads from previous slots, writes to its own, and the entropy monitor decides whether downstream stages can be skipped.

graph LR
    TASK["Task Input"]

    subgraph S1["Stage 1: THINKER"]
        M1["Kimi K2<br/>(1T MoE, 62B active)"]
    end

    subgraph S2["Stage 2: CODER"]
        M2["DeepSeek V3.2<br/>(685B MoE, 37B active)"]
    end

    subgraph S3["Stage 3: REVIEWER"]
        M3["Dense 80B"]
    end

    subgraph S4["Stage 4: FIXER"]
        M4["DeepSeek V3.2<br/>(685B MoE, 37B active)"]
    end

    OUTPUT["Final Output"]

    TASK -- "task" --> S1
    S1 -- "reasoning" --> S2
    S2 -- "code_draft" --> S3
    S3 -- "review" --> S4
    S4 -- "final_code" --> OUTPUT

    ENT{"Entropy<br/>Monitor"}
    ENT -. "H < threshold<br/>→ skip stage" .-> S3
    ENT -. "H < threshold<br/>→ skip stage" .-> S4

    classDef think fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef code fill:#2563eb,stroke:#1e40af,color:#fff
    classDef review fill:#059669,stroke:#047857,color:#fff
    classDef entropy fill:#dc2626,stroke:#b91c1c,color:#fff

    class M1 think
    class M2,M4 code
    class M3 review
    class ENT entropy

Slot rules: Each model sees only the Thinking Buffer slots it's configured to read. No model knows about the others — they just see progressively enriched context. The entropy monitor tracks output entropy per-token: if a stage produces low-entropy output (high confidence), downstream stages that would add minimal value are skipped.
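The skip decision reduces to Shannon entropy over each token's output distribution. A sketch of that logic (the threshold value is illustrative, not the BlueOS default):

```rust
/// Shannon entropy (in bits) of one token's probability distribution.
fn shannon_entropy(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}

/// If a stage's mean per-token entropy is below the threshold (i.e. the
/// model was highly confident), downstream stages are skipped.
fn should_skip_downstream(per_token: &[Vec<f64>], threshold_bits: f64) -> bool {
    if per_token.is_empty() {
        return false;
    }
    let mean = per_token.iter().map(|p| shannon_entropy(p)).sum::<f64>()
        / per_token.len() as f64;
    mean < threshold_bits
}
```

A near-one-hot distribution like `[0.99, 0.01]` carries about 0.08 bits, while a uniform distribution over four tokens carries 2 bits, so confident output clears a 0.5-bit skip threshold easily.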

FFI Boundary

Rust handles orchestration, scheduling, and memory management. C++/CUDA handles kernel launches and raw GPU compute. They meet at a strict C ABI boundary — no C++ types cross the FFI.

graph TB
    subgraph RUST["Rust Side (blueos/)"]
        RAPI["API Surface<br/>CascadeExecutor, ThinkingBuffer"]
        RSCHED["HCAL Scheduler"]
        RMEM["Memory Manager<br/>TV-VRAM, DMA, UMH"]
        RSPEC["Speculative Engine"]
        RFFI["common/ffi.rs<br/>extern C declarations<br/>#[repr(C)] structs"]
    end

    subgraph FFI["FFI Boundary — C ABI"]
        HEADER["runtime.h<br/>Single source of truth"]
        TYPES["Only: i32, u32, f32, *const u8,<br/>*mut opaque handles"]
    end

    subgraph CPP["C++ / CUDA Side (sovereign/blue/)"]
        RUNTIME["BlueRuntime<br/>blue_pool_*, blue_generate"]
        MPOOL["Model Pool<br/>weight switching"]
        KERN["CUDA Kernels (sm_89)<br/>fused_dequant_gemv<br/>fused_rmsnorm<br/>sparse_ffn<br/>async_prefetch"]
        LLAMA["llama.cpp / ggml<br/>tensor ops foundation"]
    end

    RUST --> FFI
    FFI --> CPP

    classDef rust fill:#d97706,stroke:#b45309,color:#fff
    classDef ffi fill:#6b7280,stroke:#4b5563,color:#fff
    classDef cpp fill:#2563eb,stroke:#1e40af,color:#fff

    class RAPI,RSCHED,RMEM,RSPEC,RFFI rust
    class HEADER,TYPES ffi
    class RUNTIME,MPOOL,KERN,LLAMA cpp
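On the Rust side, the boundary pattern looks roughly like this. The signatures and struct fields are hypothetical stand-ins for the real `runtime.h` surface; what matters is the shape: `#[repr(C)]` scalars and zero-sized opaque handles only.

```rust
use std::os::raw::{c_char, c_int};

/// Hypothetical parameter struct in the style of the FFI described
/// above: only C-compatible scalars, fixed layout.
#[repr(C)]
#[allow(dead_code)]
struct BlueGenParams {
    max_tokens: c_int,
    temperature: f32,
    top_p: f32,
}

/// Opaque handle: Rust never sees the C++ type behind the pointer.
#[repr(C)]
struct BlueRuntimeHandle {
    _private: [u8; 0],
}

extern "C" {
    // Illustrative declaration mirroring the runtime.h style; the real
    // function list lives in common/ffi.rs.
    #[allow(dead_code)]
    fn blue_generate(
        rt: *mut BlueRuntimeHandle,
        prompt: *const c_char,
        params: *const BlueGenParams,
    ) -> c_int;
}
```

Because the struct is `#[repr(C)]`, its size and alignment are fixed by the C ABI rather than left to the Rust compiler's field reordering.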

Inference Data Flow

What happens on a single blue_generate() call, end to end.

graph TB
    REQ["Inference Request"]
    SCHED["HCAL: select execution target"]
    CHECK{"Weights in<br/>VRAM?"}
    HIT["Proceed to compute"]
    MISS["Tier 2→1→0 DMA pipeline<br/>(A-B double buffer)"]
    DRAFT["Stage 1: Draft tokens<br/>(Qwen3-1.5B on E-cores)"]
    VERIFY["Stage 2: Verify batch<br/>(target model, 1 forward pass)"]
    ACCEPT{"Tokens<br/>accepted?"}
    COMMIT["Commit to KV cache"]
    REJECT["Rewind, reduce K"]
    EXTEND["Extend generation"]
    STREAM["Stream tokens to caller<br/>(ring buffer, per-token entropy)"]
    DONE["Response complete"]

    REQ --> SCHED
    SCHED --> CHECK
    CHECK -- "hit" --> HIT
    CHECK -- "miss" --> MISS
    MISS --> HIT
    HIT --> DRAFT
    DRAFT --> VERIFY
    VERIFY --> ACCEPT
    ACCEPT -- "yes" --> COMMIT
    ACCEPT -- "no" --> REJECT
    REJECT --> DRAFT
    COMMIT --> EXTEND
    EXTEND --> STREAM
    STREAM --> DONE

    classDef hot fill:#dc2626,stroke:#b91c1c,color:#fff
    classDef compute fill:#2563eb,stroke:#1e40af,color:#fff
    classDef mem fill:#059669,stroke:#047857,color:#fff

    class DRAFT,VERIFY,EXTEND compute
    class MISS,COMMIT mem
    class STREAM,DONE hot
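The draft/verify/accept core of the flow above can be sketched as follows. This is the greedy variant for clarity (token IDs must match exactly); the real engine verifies against the target model's full distribution, as noted under Key Design Decisions.

```rust
/// One draft/verify round. `drafted` holds the K tokens the draft model
/// proposed; `target_preds` holds the target model's prediction at each
/// of those positions, computed in a single batched forward pass.
/// The longest agreeing prefix is committed, plus the target's own token
/// at the first disagreement, so every round commits at least one token.
/// Returns (committed tokens, number of drafted tokens accepted).
fn verify_round(drafted: &[u32], target_preds: &[u32]) -> (Vec<u32>, usize) {
    let mut committed = Vec::new();
    for (i, (&d, &t)) in drafted.iter().zip(target_preds).enumerate() {
        if d == t {
            committed.push(d);
        } else {
            committed.push(t); // target's correction replaces the draft
            return (committed, i); // rewind: only i drafted tokens accepted
        }
    }
    let n = drafted.len();
    (committed, n) // full acceptance
}
```

On full acceptance the engine extends generation; on partial acceptance it rewinds and (per the diagram) reduces K for the next round.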

Key Design Decisions

  • User-space over kernel module. No insmod, no root required, no kernel version coupling. SCHED_DEADLINE + mlock + io_uring give us everything we need from userspace. The blast radius of a bug is one process, not a kernel panic.

  • Rust + C++/CUDA split at FFI boundary. Rust owns orchestration (scheduling, memory management, cascade logic) because those are complex state machines where memory safety matters. C++/CUDA owns GPU compute because CUDA's C++ API is the only way to access tensor cores, shared memory, and warp-level primitives. The FFI boundary is a strict C ABI — no C++ types, no Rust types, just i32, f32, and opaque pointers.

  • Tiered virtual VRAM over unified memory. CUDA Unified Memory hides the memory hierarchy, which means you can't optimize for it. TV-VRAM makes the tiers explicit: Tier 0 (VRAM, 700 GB/s) for hot data, Tier 1 (pinned RAM, 38 GB/s) for warm weights, Tier 2 (NVMe, 28 GB/s) for cold storage. A-B double buffering ensures the GPU never stalls waiting for PCIe.

  • Speculative decoding cascade over standard autoregressive. A small draft model (Qwen3-1.5B, always resident in VRAM) generates K candidate tokens. The target model verifies them in a single forward pass. Mathematically lossless — identical output distribution. 2-4x throughput increase at zero quality cost. Adaptive K tracks acceptance rate and adjusts speculation depth.

  • Four specialized models over one generalist. A 1T thinker with full attention on reasoning outperforms a generalist splitting attention between reasoning, coding, reviewing, and fixing. Each stage builds on verified prior work in the Thinking Buffer rather than maintaining everything in one attention window.

  • PCIe topology detection at boot. Not all NVMe drives have equal latency — CPU-direct drives bypass the PCH, saving ~2µs per I/O. BlueOS enumerates PCIe topology at boot, identifies which drives are CPU-direct vs. PCH-routed, and enables GPUDirect Storage only where the topology actually supports it. If GDS isn't available, it falls back to pread() without crashing.
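The adaptive-K behavior mentioned above can be sketched as a small feedback controller. The EMA constant, thresholds, and bounds here are illustrative choices, not the BlueOS defaults:

```rust
/// Sketch of an adaptive speculation-depth controller: track the draft
/// model's acceptance rate and deepen or shrink K accordingly.
struct AdaptiveK {
    k: usize,
    accept_rate_ema: f64,
}

impl AdaptiveK {
    fn new(k: usize) -> Self {
        Self { k, accept_rate_ema: 1.0 }
    }

    /// Fold in the fraction of drafted tokens accepted this round;
    /// returns the K to use for the next round.
    fn update(&mut self, accepted: usize, drafted: usize) -> usize {
        let rate = accepted as f64 / drafted.max(1) as f64;
        self.accept_rate_ema = 0.9 * self.accept_rate_ema + 0.1 * rate;
        if self.accept_rate_ema > 0.8 {
            self.k = (self.k + 1).min(16); // draft is trusted: speculate deeper
        } else if self.accept_rate_ema < 0.4 {
            self.k = (self.k / 2).max(1); // draft keeps missing: back off
        }
        self.k
    }
}
```

The EMA keeps the controller from overreacting to a single bad round while still collapsing K quickly under sustained rejection.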


Hardware Target

| Component | Spec | Role in BlueOS |
| --- | --- | --- |
| CPU | i9-13900F (8P + 16E cores) | P-cores: attention, orchestration. E-cores: tokenization, data prep |
| GPU | RTX 4080 16GB (sm_89, Ada Lovelace) | Expert FFN, dequantization, MoE compute. 700 GB/s VRAM bandwidth |
| RAM | 64GB DDR5-5600 | Tier 1 warm weights, prefetched experts. 38 GB/s bandwidth |
| Storage | 4× NVMe RAID-0 | Tier 2 cold storage, GGUF files. 28 GB/s aggregate |
| PCIe | Gen 4 x16 | GPU ↔ RAM highway. 25 GB/s effective |

Bandwidth Cheat Sheet

GPU VRAM internal:    ~700 GB/s    (Tier 0 — activations live here)
DDR5 RAM:             ~38  GB/s    (Tier 1 — warm weights)
NVMe RAID-0:          ~28  GB/s    (Tier 2 — cold storage)
PCIe Gen4 x16:        ~25  GB/s    (GPU ↔ RAM transfer)
L2 cache (GPU):       ~3   TB/s    (on-chip, exploited by fused kernels)
L3 cache (CPU):       ~300 GB/s    (36MB shared, used by CPU attention)
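These numbers can be sanity-checked against the ~8 ms per-layer DMA figure quoted in the memory hierarchy section, assuming roughly 200 MB of weights per streamed layer (an illustrative figure; the README does not state a per-layer size):

```rust
/// Back-of-envelope transfer time in milliseconds for moving `bytes`
/// over a link with `gb_per_s` bandwidth (decimal GB, as in the sheet).
fn transfer_ms(bytes: f64, gb_per_s: f64) -> f64 {
    bytes / (gb_per_s * 1e9) * 1e3
}
```

200 MB over the 25 GB/s PCIe link takes 8 ms, matching the double-buffering math; the same layer read from VRAM at 700 GB/s takes well under a millisecond, which is why activations must live in Tier 0.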

Project Structure

blueos/                          Rust workspace root
├── Cargo.toml                   Workspace configuration
├── common/                      Shared types: tensors, GPU, entropy, FFI declarations
│   └── src/
│       ├── entropy.rs           Shannon entropy, KL divergence, mutual information
│       ├── ffi.rs               All extern "C" FFI bindings (mirrors runtime.h)
│       ├── tensor.rs            Tensor primitives and quantization types
│       ├── gpu.rs               GPU device types, topology, compute scoring
│       ├── memory.rs            Memory region types, VRAM tier definitions
│       └── error.rs             BlueError types (thiserror)
├── boot/                        6-phase hardware init → runtime handoff
│   └── src/
│       ├── main.rs              Boot sequence entry point
│       ├── cpu_isolation.rs     Core pinning, IRQ steering, power lockdown
│       ├── hal.rs               Hardware abstraction layer
│       └── pcie_topology.rs     PCIe BDF enumeration, GDS capability gating
├── kernel/                      Resource management (not a real OS kernel)
│   └── src/
│       ├── memory/
│       │   ├── tv_vram.rs       Three-tier virtual VRAM manager
│       │   ├── dma_engine.rs    A-B double-buffered DMA transfers
│       │   ├── umh.rs           Unified Memory Hub (global address space)
│       │   ├── pinned_pool.rs   Pre-allocated pinned memory slabs
│       │   ├── expert_prefetch.rs  MoE expert prefetch prediction
│       │   ├── thrash_detector.rs  Detects tier-thrashing patterns
│       │   └── ...              Tier managers, victim selection, demand paging
│       ├── scheduler/
│       │   ├── hcal.rs          HCAL: heterogeneous compute scheduler
│       │   ├── batch.rs         Non-preemptive batch execution
│       │   └── sharding.rs      Proportional work distribution
│       └── arch/mod.rs          CPU feature detection (AVX2, AVX-512, AMX)
├── runtime/                     Multi-model inference orchestration
│   └── src/
│       ├── cascade.rs           Cascade pipeline executor (Think→Code→Review→Fix)
│       ├── thinking_buffer.rs   Shared natural-language memory (named slots)
│       ├── engine.rs            MultiModelEngine — FFI wrapper around BlueRuntime
│       ├── speculative.rs       Universal draft engine (adaptive K, KL tracking)
│       ├── kv_cache.rs          KV cache library and persistence
│       ├── pipelines.rs         Pre-built cascade templates
│       ├── wal.rs               Write-ahead log for Thinking Buffer durability
│       └── profiler.rs          Per-stage timing and telemetry
├── bluefs/                      Weight-optimized I/O
│   └── src/
│       ├── gguf_parser.rs       GGUF format parsing
│       ├── weight_store.rs      Weight streaming and caching
│       └── entropy_stream.rs    Entropy-aware I/O prioritization
└── blueir/                      Custom micro-ISA for sparsity fast-path
    └── src/
        ├── isa.rs               6-instruction ISA: LOAD, STORE, SPARSE, DEQUANT, FMAC, JUMP
        ├── nasm_lower.rs        Compiles Blue-IR to x86-64 via NASM
        └── kernels/dequant.asm  Hand-written dequantization kernel

sovereign/                       C++/CUDA inference backend
├── CMakeLists.txt               Build system (CUDA sm_89, llama.cpp integration)
├── Makefile                     Convenience build/serve/bench targets
├── llama.cpp                    Git submodule — ggml tensor ops foundation
└── blue/
    ├── runtime.h                PUBLIC C API — single source of truth for FFI
    ├── runtime.cpp              API implementation
    ├── model_pool.h/.cpp        Multi-model registry, weight switching, VRAM budgets
    ├── speculative.h/.cpp       Draft engine: adaptive K, verification, KL divergence
    ├── cpu_attention.h/.cpp     CPU-side attention using AVX2+FMA on P-cores
    ├── cpu_gpu_split.h/.cpp     Parallel CPU ∥ GPU execution coordinator
    ├── kernel_dispatch.h/.cpp   Custom CUDA kernel router
    ├── dma_bridge.h/.cpp        DMA transfer management
    ├── gds_bridge.cpp           GPUDirect Storage (NVMe→VRAM bypass)
    ├── streaming.cpp            Token streaming ring buffer
    ├── watchdog.cpp             Generation timeout watchdog
    ├── memory/
    │   ├── tv_vram.cpp          Three-tier memory (C++ side)
    │   └── pinned_pool.cpp      Pre-allocated pinned CUDA memory slabs
    ├── kernels/
    │   ├── fused_dequant_gemv.cu    INT4→FP16 during GEMV (fused)
    │   ├── fused_dequant_gemm_tc.cu Tensor core GEMM with inline dequant
    │   ├── fused_rmsnorm.cu         Fused RMSNorm + scale
    │   ├── sparse_ffn.cu            Skip inactive MoE experts
    │   ├── async_prefetch.cu        Background weight DMA
    │   ├── verify_crc.cu            CRC32 DMA integrity verification
    │   └── test_correctness.cu      Kernel correctness tests
    └── cli/main.cpp             CLI entry point

tools/
└── nanoquant/admm_ptq.py        ADMM-based post-training quantization

build.sh                         Unified build script (CMake → Cargo, mock/native modes)

Building

Prerequisites

  • Rust toolchain (stable, 2021 edition) — rustup
  • CMake 3.18+
  • C++17 compiler (GCC 11+ or Clang 14+)
  • CUDA Toolkit 12.0+ (optional — for GPU kernels, targets sm_89)
  • NASM assembler (optional — for Blue-IR compiled kernels)

Build Commands

# Full build: C++ sovereign runtime → Rust workspace (release)
./build.sh

# Rust only (mock mode — no GPU required, all tests pass)
./build.sh rust-only

# C++ only (sovereign runtime + llama.cpp)
./build.sh cpp-only

# Debug build
./build.sh debug

# Run all tests
./build.sh test

# Clean everything
./build.sh clean

The build system auto-detects CUDA availability. If nvcc is not found, it builds without GPU kernel support. If libblueruntime.a is not found, the Rust workspace builds in mock mode — all orchestration logic is testable without a GPU.

Two-Phase Build

  1. CMake compiles CUDA kernels (.cu) → links llama.cpp + BlueRuntime → produces libblueruntime.a
  2. Cargo detects libblueruntime.a → enables native feature; if missing → builds in mock mode

Both modes compile and pass tests.
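The detection step in phase 2 is the usual Cargo build-script pattern. This is a hedged sketch of how such a probe typically looks; the actual paths, feature name, and logic in the workspace's build scripts may differ:

```rust
use std::path::Path;

/// Sketch of a build.rs-style probe: emit link directives and enable the
/// native cfg only when the C++ static library is actually present.
fn cargo_directives(lib_dir: &Path) -> Vec<String> {
    if lib_dir.join("libblueruntime.a").exists() {
        vec![
            format!("cargo:rustc-link-search=native={}", lib_dir.display()),
            "cargo:rustc-link-lib=static=blueruntime".to_string(),
            "cargo:rustc-cfg=feature=\"native\"".to_string(),
        ]
    } else {
        vec!["cargo:warning=libblueruntime.a not found; building in mock mode"
            .to_string()]
    }
}
```

Because the fallback is a warning rather than an error, a GPU-less CI machine still builds and tests the full Rust workspace.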


Current Status

| Module | Status | Notes |
| --- | --- | --- |
| Boot sequence (6-phase init) | ✅ Complete | PCIe topology, CPU isolation, power lockdown |
| TV-VRAM (3-tier memory) | ✅ Complete | A-B double buffering, tier management |
| DMA engine | ✅ Complete | Async transfers, CRC32 verification |
| Unified Memory Hub | ✅ Complete | Global address space across tiers |
| HCAL scheduler | ✅ Complete | Compute scoring, proportional sharding |
| Cascade pipeline | ✅ Complete | 4-stage Think→Code→Review→Fix |
| Thinking Buffer | ✅ Complete | Named slots, WAL, context assembly |
| Speculative decoding | ✅ Complete | Adaptive K, KL divergence tracking |
| Entropy monitor | ✅ Complete | Skip logic, budget trimming |
| KV cache persistence | ✅ Complete | Save/load across model switches |
| Blue-IR (6-op ISA) | ✅ Complete | NASM lowering to x86-64 |
| CUDA kernels (6 kernels) | ✅ Complete | fused_dequant_gemv, rmsnorm, sparse_ffn, etc. |
| C API (runtime.h) | ✅ Complete | Full FFI surface, streaming, hooks |
| Expert prefetch prediction | ✅ Complete | Co-activation matrix, MI scoring |
| Mock mode (GPU-free testing) | ✅ Complete | All Rust tests pass without C++ backend |
| GPUDirect Storage | 🚧 In Progress | GDS bridge exists, topology gating implemented |
| Preemptive GPU scheduling | 📋 Planned | blue_abort_stream() for entropy-triggered preemption |
| Continuous training context | 📋 Planned | DPO from cascade traces, LoRA fine-tune on idle |
| Graph-aware VRAM scheduler | 📋 Planned | Timeline-driven DMA pre-scheduling |
| Blue-IR expansion (~25 ops) | 📋 Planned | tinygrad-aligned universal compilation target |

Three Laws of BlueOS

These are architectural invariants, not guidelines. Every commit, every optimization, every design decision is measured against them.

1. Zero Degradation — No abstraction may reduce output quality below the raw model baseline. Speculative decoding is mathematically lossless. Entropy-based skipping only removes redundant stages. If you can't prove losslessness, it doesn't ship.

2. Specialization > Generalization — Four focused models each doing one thing well outperform one giant model doing everything. The Thinking Buffer Cascade is the central innovation. Each stage has full attention on ONE concern: reasoning, coding, reviewing, or fixing.

3. Hardware-First — Software is shaped by the silicon. Pin CPU cores. Steer interrupts. Manage VRAM tiers. Fuse CUDA kernels. Know your bandwidth numbers. Generic abstractions that hide hardware are the enemy. Measure the silicon, then write the code.
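The losslessness claim behind Law 1 is the standard speculative-sampling guarantee (the derivation below is the textbook one; the README asserts BlueOS's verifier satisfies it). A draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)); on rejection, the target resamples from the normalized residual max(0, p - q). The marginal over both paths is exactly the target distribution p:

```latex
\Pr[X = x]
  = \underbrace{q(x)\,\min\!\Bigl(1,\tfrac{p(x)}{q(x)}\Bigr)}_{\text{draft accepted}}
  + \underbrace{(1-\alpha)\,\frac{\max\bigl(0,\,p(x)-q(x)\bigr)}{1-\alpha}}_{\text{resampled on reject}}
  = \min\bigl(p(x),q(x)\bigr) + \max\bigl(0,\,p(x)-q(x)\bigr)
  = p(x),
```

where \(\alpha = \sum_x \min(p(x), q(x))\) is the overall acceptance probability. Speed comes only from how many tokens each verification pass commits, never from changing the output distribution.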


Roadmap

  1. Blue-IR Expansion — Expand from 6 ops to ~25 tinygrad-aligned ops. Blue-IR becomes a universal compilation target for compute hardware. New backends: PTX (sm_89), future RDNA/Metal.

  2. Graph-Aware VRAM Manager — Replace reactive LRU eviction with planned timeline-driven scheduling. Pre-schedule all DMA transfers based on the deterministic cascade plan. Zero OOM surprises.

  3. Preemptive GPU Scheduling — When the entropy monitor detects degenerate output, kill the CUDA stream, flush partial KV to checkpoint, and advance to the next cascade stage.

  4. Continuous Training Context — Capture cascade traces as DPO training data (code_draft = rejected, final_code = chosen). LoRA fine-tune the draft model during idle time. The system improves with use.

  5. Scheduling Hint API — Lock-free SPSC channel from entropy monitor to HCAL. Hint types: LOW_ENTROPY, HIGH_ENTROPY, EARLY_EXIT, PREFETCH_NOW, SKIP_STAGE.
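Since this item is still planned, here is only a shape sketch: the hint enum from the roadmap plus a non-blocking drain on the scheduler side. Rust's `std::sync::mpsc` stands in for the planned lock-free SPSC ring, and the function names are hypothetical.

```rust
use std::sync::mpsc;

/// Hint types listed in the roadmap item above.
#[derive(Debug, PartialEq)]
enum SchedHint {
    LowEntropy,
    HighEntropy,
    EarlyExit,
    PrefetchNow,
    SkipStage,
}

/// Entropy monitor holds the sender; HCAL holds the receiver.
fn hint_channel() -> (mpsc::Sender<SchedHint>, mpsc::Receiver<SchedHint>) {
    mpsc::channel()
}

/// HCAL side: drain all pending hints without ever blocking the
/// scheduling loop.
fn drain(rx: &mpsc::Receiver<SchedHint>) -> Vec<SchedHint> {
    rx.try_iter().collect()
}
```

The design point is the non-blocking drain: the scheduler polls hints once per scheduling tick rather than waking on every message.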


License

MIT
