A Docker-based template for running llama.cpp llama-server in router mode, with support for NVIDIA CUDA GPUs and AMD Vulkan (AMD integrated and discrete GPUs). Optimized for high-performance inference with 24GB or more of VRAM using TurboQuant technology.
TurboQuant is advanced KV cache quantization technology developed primarily by GitHub contributors TheTom and SpiritBuun, enabling efficient long-context inference with minimal quality loss.
- GPU-Accelerated Inference: Full support for NVIDIA CUDA (RTX 3090, RTX 4090, etc.) and AMD Vulkan (Radeon 780M, Strix Halo configurations, and discrete AMD GPUs)
- Router Mode: Efficiently manage and load multiple models, serving one at a time with full context window
- Advanced Quantization: Built with turboquant support for optimized KV cache handling
- Long Context: Up to 262,144 token context window with adaptive KV cache quantization
- Flexible Configuration: Simple INI-based model and server configuration
- Docker Containerization: Easy deployment and reproducibility
- NVIDIA: NVIDIA RTX 3090 (24GB), the primary target platform
- AMD: AMD Radeon 780M, Strix Halo configurations, and other AMD integrated/discrete GPUs with sufficient VRAM
Note: AMD configurations have been tested by the author but have not yet been integrated into this repository; they will be added after further TurboQuant testing on the AMD platform.
This repository requires a turboquant-enabled build of llama.cpp. You have two options:
Option 1 (TheTom's fork):

```
cd llama-build
./rc.fetch-TheTom
./rc.build-TheTom
```

Option 2 (buun's fork):

```
cd llama-build
./rc.fetch-buun
./rc.build-buun
```

Both forks support turboquant and work well with the configurations provided. The author is currently using buun's variant, which has integrated much of TheTom's work. Both are excellent choices and provide significantly better quality and performance for KV cache quantization.
- Docker: Install Docker
- NVIDIA Container Runtime: Required for CUDA support
- Install NVIDIA Container Runtime
- Verify with: `docker run --rm --runtime=nvidia nvidia/cuda:13.1.0-base nvidia-smi`
Place GGUF model files in the ./models directory. Models are automatically downloaded by the server based on the configuration in config.ini using Hugging Face Hub integration.
Pro Tip (faster Hugging Face downloads): You can pre-download models much faster with `huggingface_hub` + `hf_transfer` (often ~10x faster) and write directly into `./models`, which is where the llama.cpp container already expects them.

```
pipx install --force "huggingface_hub[hf_transfer]"
hf auth login   # Enter your Hugging Face API key (login via web)

# Fast-download Qwen3.6-35B-MoE and Gemma-4-31B
HF_HUB_CACHE=$PWD/models HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
HF_HUB_CACHE=$PWD/models HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/gemma-4-31B-it-GGUF gemma-4-31B-it-Q4_K_S.gguf
```
After cloning and building llama.cpp, the Docker image will be available as llama.cpp:server-cuda-turbo.
To rebuild or update:
```
cd llama-build
./rc.build-TheTom   # or ./rc.build-buun
```

Edit `config.ini` to specify:
- Models to load (name, Hugging Face repository, quantization)
- Which model should load automatically at startup (`load-on-startup`)
- Server settings (port, context size, KV cache strategy)
- Performance tuning (GPU layers, parallel slots, etc.)
Example configuration (already provided):
```ini
[*]
host = 0.0.0.0
port = 8080
ctx-size = 262144
cache-type-k = q8_0    # K-cache: full 8-bit quantization
cache-type-v = turbo4  # V-cache: aggressive turbo4 quantization
flash-attn = true
fit = on
fit-target = 256
n-gpu-layers = 999
models-max = 1
parallel = 1
```

```
# Start the server in the background
./rc.start

# View logs
docker compose logs -f

# Stop the server
./rc.stop
```

The server will be available at http://localhost:8080.
Once the service is running, you can open the built-in llama.cpp web UI at http://<server-hostname>:8080 to quickly test prompts in a browser (localhost if you're running it on the same machine).
No model is auto-loaded at startup right now, so the first model you select in the web UI will be the one that gets loaded.
- Health Check: `curl http://localhost:8080/health`
- Load Model: `POST /model/load` with model name
- Inference: `POST /completion` with prompt and parameters
- Slots API: `GET /slots` to inspect active inference slots
- Full Documentation: See llama.cpp API documentation
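As a quick sketch of the router-mode flow these endpoints enable (the model name below is just one of the presets described later; substitute your own), individual examples appear further down:

```
# Check that the server is up
curl http://localhost:8080/health

# Ask the router to load a preset defined in config.ini
curl -X POST http://localhost:8080/model/load \
  -H "Content-Type: application/json" \
  -d '{"name": "Qwen3.6-35B-MoE"}'

# Run a completion against the loaded model
curl http://localhost:8080/completion \
  -d '{"prompt": "Hello from router mode", "n_predict": 32}'
```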
- `compose.yml`: No functional changes from the original template. It still runs `llama.cpp:server-cuda-turbo`, exposes `8080:8080`, mounts `./models` and `./config.ini`, and starts with `--models-preset /config.ini`.
- `config.ini`:
  - Added/tuned GPU-offload settings: `flash-attn = true`, `fit-target = 256`, and `n-gpu-layers` increased from `99` to `999`.
  - Disabled auto-loading on startup by commenting out `load-on-startup`, so the first model selected in the web UI/API is loaded.
  - MoE preset quantization updated to `Q4_K_M` (`Qwen3.5-35B-MoE` changed from `...Q4_K_XL.gguf` to `...Q4_K_M.gguf`), and a new `[Qwen3.6-35B-MoE]` preset was added.
  - Added a `[Gemma-4-31B-IT]` preset (`unsloth/gemma-4-31B-it-GGUF`) with `Q4_K_S`, a model-level `turbo4`/`turbo4` cache override, and a reduced `ctx-size = 224000` for 24GB VRAM.
  - CPU thread tuning guidance changed to prefer leaving `threads`/`threads-batch` commented out on asymmetric-core systems.
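Since the compose file itself is not reproduced here, a rough sketch of a `compose.yml` with the properties listed above might look like the following; this is an assumption-laden illustration, not a copy of the file shipped in the repository (the container-side mount path for `./models` and the exact GPU wiring are guesses), so treat the actual file as authoritative:

```yaml
services:
  llama-server:
    image: llama.cpp:server-cuda-turbo
    runtime: nvidia               # requires the NVIDIA Container Runtime
    environment:
      - CUDA_VISIBLE_DEVICES=0    # select which GPU(s) the server may use
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models          # assumed container-side path
      - ./config.ini:/config.ini
    command: --models-preset /config.ini
```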
Based on testing with a 24GB NVIDIA RTX 3090, presets are provided for Qwen3.5-27b (dense), Qwen3.5-35b-MoE, Qwen3.6-35b-MoE, and Gemma-4-31B-IT.
No model is currently configured to auto-load on startup. This means the first model you pick in the llama.cpp web UI (or via POST /model/load) is what will load.
To change the default model, uncomment this one line in the model section you want:

```ini
# load-on-startup = true
```

Keep `load-on-startup = true` enabled for only one model section, then restart the container (`./rc.stop && ./rc.start`).
Setting a default startup model is especially useful with agent workflows (for example Hermes, OpenClaw, or OpenCode), so your preferred model is already loaded and ready to serve requests immediately.
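For example, to have the Qwen3.6 MoE preset load automatically (the section name matches the preset mentioned above; the rest of that section stays as it is), a minimal sketch:

```ini
[Qwen3.6-35B-MoE]
# ...existing hf-repo / hf-file / quantization settings stay unchanged...
load-on-startup = true   # enable in exactly one model section
```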
Best Quality + Performance (Default):
- K-cache: `q8_0` (full 8-bit precision)
- V-cache: `turbo4` (4-bit quantization)
- Quantization: Q4_K_M (better perplexity than Q4_K_XL, which also fits?)
- Context: Up to 256K tokens
Balanced Configurations to Experiment With:
- `turbo4`/`turbo4`: More aggressive quantization, increased throughput
- `turbo3`/`turbo3`: Even more aggressive, lower quality
- `q8_0`/`turbo3`: Full K-cache precision, aggressive V-cache
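To try one of these combinations, change the two cache keys in the `[*]` section of `config.ini` (or override them in a single model section, as the Gemma preset does). A minimal sketch:

```ini
[*]
cache-type-k = turbo4   # more aggressive than the default q8_0 K-cache
cache-type-v = turbo4
```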
Adjust based on your specific use case:
- Prioritizing output quality: Use `q8_0` for K-cache
- Maximizing throughput: Use turbo quantization for both caches
- Limited VRAM: Reduce context size or use more aggressive quantization
| Parameter | Purpose |
|---|---|
| `ctx-size` | Maximum context window size (default: 262144) |
| `cache-type-k` | K-cache quantization: `q8_0`, `q4_0`, `turbo4`, etc. |
| `cache-type-v` | V-cache quantization: `turbo4`, `turbo3`, etc. |
| `kv-unified` | Use shared memory pool for KV cache |
| `flash-attn` | Enable Flash Attention for faster attention kernels on supported builds |
| `fit-target` | VRAM usage target percentage used by `fit = on` |
| `n-gpu-layers` | Number of model layers to offload to GPU (999 = effectively all layers) |
| `models-max` | Maximum simultaneous models (1 = prevent OOM) |
| `parallel` | Number of concurrent inference slots |
| `mlock` | Lock model in RAM to prevent swapping |
| `mmap` | Memory-map model file for faster loading |
| `sleep-idle-seconds` | Prevent automatic model unloading (-1 = never unload) |
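As a hedged illustration of how some of the less-used parameters above could be combined in the global `[*]` section (values are illustrative, not tuned recommendations):

```ini
[*]
ctx-size = 262144
kv-unified = true         # shared memory pool for the KV cache
flash-attn = true
mlock = true              # lock the model in RAM to avoid swapping
mmap = true               # memory-map the GGUF for faster loading
sleep-idle-seconds = -1   # never unload the model when idle
```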
```
.
├── README.md                  # This file
├── compose.yml                # Docker Compose configuration
├── config.ini                 # Server & model configuration
├── rc.start                   # Start the server
├── rc.stop                    # Stop the server
├── models/                    # Model storage directory
│   └── .gitkeep
├── llama-build/               # Build scripts and Dockerfile
│   ├── rc.fetch-TheTom        # Clone TheTom's llama.cpp fork
│   ├── rc.build-TheTom        # Build Docker image from TheTom's fork
│   ├── rc.fetch-buun          # Clone buun's llama.cpp fork
│   ├── rc.build-buun          # Build Docker image from buun's fork
│   ├── cuda-turbo.Dockerfile  # Multi-stage Dockerfile for CUDA builds
│   ├── llama-cpp-turboquant/  # (Created by rc.fetch-TheTom)
│   └── buun-llama-cpp/        # (Created by rc.fetch-buun)
└── .gitignore
```
```
./rc.start
curl http://localhost:8080/health   # Wait for healthy status
```

The server will load the model specified by `load-on-startup = true` in `config.ini` (if you have enabled it for one of the presets).
For a quick smoke test from a browser, open http://<server-hostname>:8080 (or http://localhost:8080 locally) to use the llama.cpp web portal.
```
curl http://localhost:8080/completion \
  -d '{
    "prompt": "Once upon a time",
    "n_predict": 100,
    "temperature": 0.7
  }'
```

```
curl -X POST http://localhost:8080/model/load \
  -H "Content-Type: application/json" \
  -d '{"name": "Qwen3.5-35B-MoE"}'
```

```
curl http://localhost:8080/slots
```
- Verify NVIDIA Container Runtime is installed: `docker run --rm --runtime=nvidia nvidia/cuda:13.1.0-base nvidia-smi`
- Check that `CUDA_VISIBLE_DEVICES` in `compose.yml` matches your available GPUs
- Ensure NVIDIA drivers are up-to-date on your host system
- Reduce `ctx-size` in `config.ini`
- Use more aggressive V-cache quantization (`turbo3` instead of `turbo4`)
- Set `models-max = 1` to prevent multiple models in memory
- Reduce `parallel` to decrease concurrent inference slots
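As an illustrative lower-VRAM variant of the default `[*]` section (example values, not tuned recommendations):

```ini
[*]
ctx-size = 131072       # reduced from the default 262144
cache-type-k = q8_0
cache-type-v = turbo3   # more aggressive than the default turbo4
models-max = 1
parallel = 1
```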
- Verify the Hugging Face repository and file names in `config.ini`
- Ensure sufficient disk space for model downloads
- Check internet connectivity and Hugging Face Hub availability
- Reduce `threads` and `threads-batch` if CPU bottlenecked
- Increase `batch-size` and `ubatch-size` for higher throughput
- Disable `mlock` if experiencing swap issues (not recommended)
To customize the build for a specific CUDA architecture:
```
cd llama-build
docker build \
  -f cuda-turbo.Dockerfile \
  --target server \
  --build-arg CUDA_DOCKER_ARCH=86 \
  -t llama.cpp:server-cuda-turbo-custom \
  ./llama-cpp-turboquant
```

Edit `config.ini` and add a new section:
```ini
[MyModel-Name]
hf-repo = username/model-repo-name
hf-file = model-filename.gguf
load-on-startup = false
ctx-size = 131072
```

- Expand AMD Vulkan documentation and configuration examples
- Test and provide optimized configurations for AMD Radeon 780M and Strix Halo
- Add Vulkan-specific build scripts and Dockerfile variants
- Performance benchmarking suite for different quantization modes
- Extended examples with various inference patterns (streaming, multi-turn chat, etc.)
Note: AMD configurations have been tested by the author but will be documented more thoroughly in future versions.
Contributions are welcome! Please submit issues and pull requests for bugs, improvements, or additional configurations.
This repository template is provided as-is. Refer to llama.cpp's original licensing for the core application.
- llama.cpp GitHub
- llama.cpp Server Documentation
- TheTom's llama-cpp-turboquant
- buun's llama-cpp
- NVIDIA Container Runtime
Author: pdscomp
Last Updated: April 2026