
feat: native multi-backend inference support (Metal/ROCm/CUDA/CPU) + GitHub Actions build pipeline#1

Draft
Copilot wants to merge 3 commits into main from copilot/add-metal-support-without-cuda


Conversation


Copilot AI commented Apr 7, 2026

Summary

Removes hard NVIDIA/CUDA dependencies and adds native multi-backend support. All builds are now owned by GitHub Actions.

Motivation

  • The ATLAS patch to llama.cpp (params_dft.embedding = false) existed to support speculative decoding + embeddings in the same server, but speculative decoding is not supported for Qwen3.5 anyway — the patch served a dormant component.
  • The only hard NVIDIA dependencies in the repo were three lines in the Dockerfiles: the base image, -DGGML_CUDA=ON, and -DCMAKE_CUDA_ARCHITECTURES. Everything else (proxy, sandbox, Lens scorer) is plain HTTP with no GPU awareness.

Changes

Inference Dockerfiles (inference/Dockerfile, inference/Dockerfile.v31)

  • Removed the ATLAS llama.cpp patch (sed -i params_dft.embedding = false)
  • Added GGML_BACKEND build arg (cuda / rocm / cpu)
  • Added CUDA_ARCHITECTURES build arg (defaults cover Ada/Hopper/Blackwell; can be narrowed to a single architecture per GPU)
  • Switched the CUDA base image from rockylinux9 to ubuntu22.04 for consistency across backends
  • Metal explicitly excluded from the Dockerfile with a note — it requires native macOS toolchain (see GH Actions below)
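
Roughly, the backend switch inside the Dockerfile looks like the following sketch. The stage layout and flag plumbing are illustrative, not copied from the repo, and the ROCm flag name (GGML_HIP) is an assumption about the llama.cpp build option in use:

```dockerfile
# Illustrative sketch, not the repo's exact Dockerfile
FROM ubuntu:22.04 AS build
ARG GGML_BACKEND=cuda
ARG CUDA_ARCHITECTURES="89;90"

# Select llama.cpp CMake flags from the build arg
RUN if [ "$GGML_BACKEND" = "cuda" ]; then \
      cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$CUDA_ARCHITECTURES"; \
    elif [ "$GGML_BACKEND" = "rocm" ]; then \
      cmake -B build -DGGML_HIP=ON; \
    else \
      cmake -B build; \
    fi \
 && cmake --build build -j
```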

GitHub Actions (.github/workflows/build-inference.yml) — new

| Job | Runner | Output |
| --- | --- | --- |
| build-linux (cuda) | ubuntu-latest | Docker image → GHCR …/llama-server:*-cuda |
| build-linux (rocm) | ubuntu-latest | Docker image → GHCR …/llama-server:*-rocm |
| build-linux (cpu) | ubuntu-latest | Docker image → GHCR …/llama-server:*-cpu |
| build-metal | macos-latest | Native arm64 binary (artifact + release asset) |

Metal must run on macos-latest because Metal.framework and MetalPerformanceShaders are macOS-only and are unavailable inside any Linux Docker container.
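
A minimal sketch of the matrix this table implies (job structure, step names, and the tagging scheme here are assumptions; the actual workflow differs in detail):

```yaml
# Hypothetical fragment of .github/workflows/build-inference.yml
jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        backend: [cuda, rocm, cpu]
    steps:
      - uses: actions/checkout@v4
      - name: Build image for one backend
        run: |
          docker build \
            --build-arg GGML_BACKEND=${{ matrix.backend }} \
            -f inference/Dockerfile \
            -t ghcr.io/${{ github.repository }}/llama-server:${{ github.sha }}-${{ matrix.backend }} .

  build-metal:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build native arm64 binary
        run: cmake -B build -DGGML_METAL=ON && cmake --build build -j3
```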

docker-compose

  • Replaced hard /dev/nvidia* device mounts + NVIDIA_VISIBLE_DEVICES env vars with the standard deploy.resources.reservations.devices spec (works with the nvidia container runtime)
  • Added GGML_BACKEND and CUDA_ARCHITECTURES build args
  • Added docker-compose.rocm.yml override for AMD ROCm (/dev/kfd, /dev/dri, group_add: [video, render])
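
As a sketch, the NVIDIA reservation follows the standard Compose device spec; service name and defaults here are illustrative, not repo-exact:

```yaml
# Illustrative docker-compose fragment
services:
  llama-server:
    build:
      args:
        GGML_BACKEND: ${GGML_BACKEND:-cuda}
        CUDA_ARCHITECTURES: ${CUDA_ARCHITECTURES:-89;90}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Because the reservation lives under deploy.resources rather than hard device mounts, the same service definition runs unchanged on hosts without the NVIDIA runtime when the override/backend selects ROCm or CPU.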

Entrypoint scripts

  • entrypoint-v3.1-9b.sh, entrypoint-mtp.sh, entrypoint-v3-specdec.sh: CUDA-specific env vars (GGML_CUDA_NO_PINNED, CUDA_DEVICE_MAX_CONNECTIONS, CUDA_MODULE_LOADING) now guarded behind command -v nvidia-smi so they are silently skipped on ROCm/CPU
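
The guard pattern is a one-line capability check. A hedged sketch (the specific values assigned here are illustrative; the real entrypoint scripts may choose different settings):

```shell
# Only export CUDA tuning vars when the NVIDIA driver stack is present;
# on ROCm/CPU hosts this branch is silently skipped.
if command -v nvidia-smi >/dev/null 2>&1; then
  export GGML_CUDA_NO_PINNED=1
  export CUDA_DEVICE_MAX_CONNECTIONS=1
  export CUDA_MODULE_LOADING=LAZY
fi
```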

Benchmark (benchmark/analysis/hardware_info.py)

  • get_gpu_info() now tries nvidia-smi → rocm-smi → system_profiler in order
  • get_cuda_version() now also detects ROCm version (hipconfig) and Metal/macOS version
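
The detection order can be expressed as a shell sketch (the actual implementation is Python in benchmark/analysis/hardware_info.py; the query flags shown are one plausible choice, not the repo's exact invocation):

```shell
# Probe vendor tools in priority order: NVIDIA, then AMD, then macOS.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_info=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)
elif command -v rocm-smi >/dev/null 2>&1; then
  gpu_info=$(rocm-smi --showproductname 2>/dev/null)
elif command -v system_profiler >/dev/null 2>&1; then
  gpu_info=$(system_profiler SPDisplaysDataType 2>/dev/null)
else
  gpu_info="unknown"
fi
echo "$gpu_info"
```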

Usage

```shell
# NVIDIA (default)
docker compose up --build

# AMD ROCm
GGML_BACKEND=rocm docker compose -f docker-compose.yml -f docker-compose.rocm.yml up --build

# CPU-only
GGML_BACKEND=cpu docker compose up --build

# Apple Metal — use the GH Actions artifact, or build natively:
git clone https://github.com/ggml-org/llama.cpp /tmp/llama.cpp
cd /tmp/llama.cpp && cmake -B build -DGGML_METAL=ON -DBUILD_SHARED_LIBS=OFF && cmake --build build -j"$(sysctl -n hw.logicalcpu)"
```

Removed CPU backend configuration from build-inference.yml.