Adaptive Test-time Learning and Autonomous Specialization
A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen Qwen3-14B model on a single consumer GPU — up from 36-41% in V2 — through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure — structured generation, energy-based verification, self-verified repair — and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted — no data leaves the machine, no API keys required, no usage metering. One GPU, one box.
V3.0.1 ships ATLAS as an interactive coding assistant powered by a local 9B model that you can download and use today. The 9B model (Qwen3.5-9B) has not yet been formally benchmarked under the V3 pipeline — that is V3.1 work — but the V3 pipeline architecture is identical to what scored 74.6% on Qwen3-14B, and the 9B model's published baselines suggest it should score similarly or higher. Type atlas in any project directory and start building.
I'm a business student at Virginia Tech. My background is in marketing, not computer science. I'm a hobbyist who got curious about what's possible when you stop assuming only the biggest players can build meaningful things.
My twin sister was born with Loeys-Dietz syndrome. When we were five, doctors told my parents she would never walk. A year later, she walked into that same doctor's office. She remembered looking back at him and seeing tears in his eyes. She passed away last year on March 29th. But that memory stayed with me. The people who tell you what's impossible are usually just describing the limits of their own experience. Sometimes all it takes is a single moment to realize the barrier was never technical — it was assumption.
ATLAS isn't the destination. It's proof of what we can build.
Prerequisites: NVIDIA GPU (16GB+ VRAM) with proprietary drivers, Docker (with nvidia-container-toolkit) or Podman, Python 3.10+, pip, wget.
# 1. Clone
git clone https://github.com/itigges22/ATLAS.git
cd ATLAS
# 2. Download model weights (~7GB)
mkdir -p models
wget https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q6_K.gguf \
-O models/Qwen3.5-9B-Q6_K.gguf
# 3. Install the ATLAS CLI
pip install -e .
# 4. Configure environment
cp .env.example .env
# Defaults work if your model is in ./models/ — edit .env only if you changed the path
# 5. Start all services (requires NVIDIA GPU — model loading takes ~2 minutes)
docker compose up -d # or: podman-compose up -d
# 6. Verify everything is healthy (wait for all services to show "healthy")
docker compose ps
# 7. Start coding
atlas
Step 5 builds container images on first run, which can take several minutes. Subsequent starts are fast. Step 6 should show all 5 services (llama-server, geometric-lens, v3-service, sandbox, atlas-proxy) as healthy before proceeding.
See docs/SETUP.md for detailed setup (Docker, bare-metal, K3s).
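If you want to script the wait in step 6 instead of re-running `docker compose ps` by hand, a small polling helper works. This is a convenience sketch, not part of the ATLAS repo, and it assumes Docker Compose v2, whose `docker compose ps --format json` prints one JSON object per line with a "Health" field; older Compose versions format this output differently.

```python
"""Wait for every ATLAS service to report healthy before running `atlas`."""
import json
import subprocess
import time


def all_healthy(json_lines):
    """True when every listed service reports Health == "healthy"."""
    services = [json.loads(line) for line in json_lines if line.strip()]
    return bool(services) and all(s.get("Health") == "healthy" for s in services)


def wait_for_services(timeout_s=300, poll_s=10):
    """Poll `docker compose ps` until all services are healthy or we time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(
            ["docker", "compose", "ps", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        if all_healthy(out.splitlines()):
            return True
        time.sleep(poll_s)
    return False

# Usage: call wait_for_services() from the repo root, then run `atlas`.
```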
Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)
| Benchmark | Score | Tasks | Method |
|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair, V3 Score |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning, V2 Score |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding, V2 Score |
*pass@1-v(k=3) = one solution submitted per task, generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not directly comparable to standard pass@1. See the methodology notes below.
Important: Only LiveCodeBench was run on V3 infrastructure. The GPQA Diamond and SciCode scores are from V2, which was not optimized for those benchmarks, and the scores reflect that. The CLI currently runs Qwen3.5-9B (V3.0.1); formal benchmarks on the 9B model have not yet been run and are planned for V3.1.
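To make the metric concrete, here is a minimal sketch of how a pass@1-v(k) score could be tallied. All five callables are placeholders for the real pipeline stages (PlanSearch generation, Lens selection, PR-CoT repair, self-generated tests, and the benchmark's hidden tests); only the accounting is the point.

```python
def pass_at_1_v(tasks, generate, select, repair, self_check, hidden_check,
                k=3, repair_rounds=2):
    """Tally pass@1-v(k): one final submission per task, built from
    best-of-k selection plus self-verified repair on failure."""
    solved = 0
    for task in tasks:
        submission = select(generate(task, k))    # best-of-k, one pick
        for _ in range(repair_rounds):
            if self_check(task, submission):      # self-generated tests only
                break
            submission = repair(task, submission)  # revise the one submission
        # The hidden test suite is consulted exactly once, to score the
        # single final submission, never during repair.
        solved += bool(hidden_check(task, submission))
    return solved / len(tasks)
```

Whatever the candidate count or repair budget, exactly one answer per task is ever graded, which is what distinguishes this from pass@k.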
V3 ablation breakdown (Qwen3-14B)
| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | — |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |
Phase 3 uses self-generated test cases for internal verification — the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md
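The self-verification loop can be sketched as follows. The model writes assert-style tests from the problem statement, the candidate runs against those tests in isolation, and repair iterates until they pass or the budget runs out. This is an illustrative stand-in, not the ATLAS implementation: the real pipeline executes in a sandbox container rather than a bare subprocess, and `propose_fix` stands in for a PR-CoT repair prompt.

```python
import subprocess
import sys
import tempfile
import textwrap


def run_candidate(program, test_code, timeout_s=10):
    """Execute a candidate program plus self-generated asserts in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + textwrap.dedent(test_code))
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def self_verified_repair(candidate, self_tests, propose_fix, max_rounds=3):
    """Repair until the candidate passes its own tests or the budget runs out.

    `self_tests` are model-written asserts derived from the problem statement;
    the benchmark's hidden tests are never consulted here.
    """
    for _ in range(max_rounds):
        if run_candidate(candidate, self_tests):
            return candidate
        candidate = propose_fix(candidate, self_tests)
    return candidate
```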
Raw ablation data: v3_ablation_results/ | Full traces: HuggingFace
| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot (low cost due to aggressive pricing strategy) |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3, Qwen3-14B (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |
DeepSeek's cost is lower than ATLAS despite being an API because DeepSeek operates at subsidized pricing — their per-token costs are significantly below market rate as a growth strategy. ATLAS's cost is pure electricity (~$0.12/kWh × 165W GPU × 1h 55m for 599 tasks). ATLAS trades latency for privacy — no data leaves the machine.
Methodology notes & sources
ATLAS scores come from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model, i.e. "pass@1-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems. These are not the same task set, so this is not a controlled head-to-head. API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS trades latency for cost and privacy: the pipeline takes longer per task than a single API call, but runs on electricity alone and keeps all data on the machine.
Sources: Artificial Analysis LCB Leaderboard | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace)
CLI Reliability (Qwen3.5-9B, V3.0.1)
The interactive CLI has been validated across 8 difficulty levels × 3 iterations:
| Test | Description | Pass Rate |
|---|---|---|
| L1 | Conversational response | 100% |
| L2 | Create snake game (curses) | 100% |
| L3 | Fix broken collision detection | 100% |
| L4 | Add persistent high scores | 100% |
| L5 | Create multi-file Next.js project | 100% |
| L6 | Add JWT auth to existing project | 67% |
| L7 | Delete files from project | 100% |
| L8 | Lint and fix TypeScript errors | 100% |
| Overall | — | 95.8% |
5-language integration: Python, Rust, Go, C, Shell — all pass (compile + run).
Note from Isaac: I am very skeptical that it can accomplish all of this at 100%. V3.1 will include a more robust set of reliability testing.
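The overall figure follows directly from the run counts: eight levels at three iterations each is 24 runs, and the single L6 miss (2 of 3 passing) leaves 23 of 24. A quick check:

```python
# 8 levels x 3 iterations = 24 runs; only L6 dropped one run (2/3 ~ 67%).
runs_per_level = 3
pass_rates = [1, 1, 1, 1, 1, 2/3, 1, 1]  # L1..L8 from the table above
passed = sum(rate * runs_per_level for rate in pass_rates)  # 23 runs
overall = passed / (len(pass_rates) * runs_per_level)
print(f"{overall:.1%}")  # 95.8%
```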
Full training data and benchmark traces: ATLAS Dataset on HuggingFace
flowchart LR
Probe["Probe"] --> GL1["C(x)/G(x) Score"] --> SB1["Sandbox"] --> Pass1{"Pass?"}
Pass1 -->|"Yes"| Done["Write Winner"]
Pass1 -->|"No"| PS["PlanSearch"] --> DS["DivSampling"] --> BF["Budget Forcing"] --> GL2["Score + Test K"] --> Pass2{"Any pass?"}
Pass2 -->|"Yes"| Select["Best-of-K\nC(x)/G(x) select"] --> Done
Pass2 -->|"No"| PR["PR-CoT Repair"] --> RL["Refinement Loop"] --> DC["Derivation Chains"] --> Done
style Probe fill:#1a3a5c,color:#fff
style GL1 fill:#2d5016,color:#fff
style SB1 fill:#2d5016,color:#fff
style PS fill:#1a3a5c,color:#fff
style DS fill:#1a3a5c,color:#fff
style BF fill:#1a3a5c,color:#fff
style GL2 fill:#2d5016,color:#fff
style Select fill:#2d5016,color:#fff
style PR fill:#5c3a1a,color:#fff
style RL fill:#5c3a1a,color:#fff
style DC fill:#5c3a1a,color:#fff
style Done fill:#333,color:#fff
The model writes code. The infrastructure makes it reliable.
The ATLAS CLI wraps this pipeline in a tool-call agent loop. The model emits structured JSON tool calls (write_file, edit_file, run_command, etc.) with grammar enforcement guaranteeing 100% valid output. Feature files with complex logic (T2) automatically route through the V3 pipeline for diverse candidate generation, build verification, and energy-based selection. Config files and boilerplate (T1) skip the pipeline for instant writes.
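The shape of that agent loop can be sketched in a few lines: the model emits one JSON tool call, the host dispatches it to a handler, and the result is fed back into the conversation. The handlers below are illustrative only; the actual CLI's tool schema, sandboxing, and error handling differ, and in ATLAS the JSON validity is enforced by a decode-time grammar rather than checked after the fact.

```python
import json
import pathlib
import subprocess

# Illustrative handlers for two of the tool names mentioned above.
def write_file(path, content):
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def run_command(command):
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=60)
    return proc.stdout + proc.stderr

TOOLS = {"write_file": write_file, "run_command": run_command}

def dispatch(tool_call_json):
    """Parse one tool call and execute it.

    With grammar enforcement at decode time, json.loads can never fail on
    model output; the except clause only guards hand-written input.
    """
    try:
        call = json.loads(tool_call_json)
        return TOOLS[call["tool"]](**call["args"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"tool error: {exc}"
```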
Full architecture: docs/ARCHITECTURE.md
| Resource | Minimum | Tested |
|---|---|---|
| GPU VRAM | 16 GB | RTX 5060 Ti 16 GB |
| System RAM | 14 GB | 16 GB |
| Disk | 20 GB free | For model weights + containers |
| Python | 3.10+ | 3.11 |
| OS | Linux (RHEL, Ubuntu, Arch) | RHEL 9 |
Known limitations, actively being addressed in V3.1:
- 9B model not yet formally benchmarked. The 74.6% result was achieved on Qwen3-14B. The CLI runs Qwen3.5-9B with the same V3 pipeline — formal LCB benchmarks on the 9B model are V3.1 work.
- GPQA and SciCode scores are from V2. V3 phases were not designed specifically for any single benchmark — they are general-purpose code generation improvements. GPQA (47.0%) and SciCode (14.7%) were tested on V2 infrastructure only. Cross-benchmark evaluation is a V3.1 priority.
- L6 reliability at 67%. Adding features to existing projects fails ~1/3 of the time — the 9B model sometimes over-explores instead of writing code. Exploration budget and context injection mitigate but don't fully solve this.
- Inference speed. Grammar-constrained output runs at ~51 tok/s on llama-server. Fox (with PagedAttention and prefix caching) achieves only 14 tok/s with grammar due to Tokio async overhead. C-side sampler chain fix planned for V3.1.
| Document | Description |
|---|---|
| SETUP.md | Installation — Docker, bare-metal, K3s |
| CLI.md | CLI usage, streaming output, getting best results |
| TROUBLESHOOTING.md | Common issues and solutions |
| ARCHITECTURE.md | Two-layer architecture, component design |
| CONFIGURATION.md | All environment variables and config |
| API.md | HTTP API endpoints and formats |
| MAP.md | Visual guide to every file in the repo |
| V3_ABLATION_STUDY.md | Ablation methodology and results |
| CHANGELOG.md | Release history |
Historical documentation
| Document | Description |
|---|---|
| V2_5_ABLATION_STUDY.md | V2.5 Geometric Lens ablation |
| V2_TO_V2_5_MIGRATION.md | V2 to V2.5 migration |
For a complete guide to every directory and file, see docs/MAP.md.
V3.0 — Complete (2026-03-05). 74.6% LCB pass@1-v(k=3) on frozen Qwen3-14B. Full ablation report.
V3.0.1 — Complete (2026-04-05). Interactive CLI with tool-call agent loop, Docker Compose deployment, V3 pipeline integration, 95.8% reliability. This is the current release.
V3.1 — In Progress.
- Benchmarks (not yet run): LiveCodeBench v5 on Qwen3.5-9B with CLI pipeline, GPQA Diamond, SciCode, AA-LCR, AA-Omniscience, Humanity's Last Exam, CritPt
- CLI reliability testing: Expand 8-level test to 10 iterations, target L6 ≥ 90%
- Fox optimization: C-side sampler chain for grammar speed (14→50 tok/s target)
- Geometric Lens: improve the Lens datasets using V3.1 full-suite benchmark data
- ROCm support: AMD GPU inference via llama.cpp ROCm backend (expanding beyond NVIDIA-only)
- Target: 80-90% LCB pass@1-v(k=3)
Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). Commercial licensing available — contact the copyright holder for details.
See CONTRIBUTING.md.

