
feat: native multi-backend inference support (Metal/ROCm/CUDA/CPU) + GitHub Actions build pipeline#1

Draft
Copilot wants to merge 3 commits into main from copilot/add-metal-support-without-cuda


Conversation


Copilot AI commented Apr 7, 2026

Summary

Removes hard NVIDIA/CUDA dependencies and adds native multi-backend support. All builds are now owned by GitHub Actions.

Motivation

  • The ATLAS patch to llama.cpp (params_dft.embedding = false) existed to support speculative decoding + embeddings in the same server, but speculative decoding is not supported for Qwen3.5 anyway — the patch served a dormant component.
  • The only hard NVIDIA dependencies in the repo were three lines in the Dockerfiles: the base image, -DGGML_CUDA=ON, and -DCMAKE_CUDA_ARCHITECTURES. Everything else (proxy, sandbox, Lens scorer) is plain HTTP with no GPU awareness.

Changes

Inference Dockerfiles (inference/Dockerfile, inference/Dockerfile.v31)

  • Removed the ATLAS llama.cpp patch (sed -i params_dft.embedding = false)
  • Added GGML_BACKEND build arg (cuda / rocm / cpu)
  • Added CUDA_ARCHITECTURES build arg (defaults cover Ada/Hopper/Blackwell; can be narrowed to a single architecture per GPU)
  • Switched the CUDA base image from rockylinux9 to ubuntu22.04 for consistency across backends
  • Metal explicitly excluded from the Dockerfile with a note — it requires native macOS toolchain (see GH Actions below)
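
Roughly, the backend switch inside the Dockerfile looks like the following sketch. The stage layout and flag plumbing are illustrative, not copied from the repo, and the ROCm flag name (GGML_HIP) is an assumption about the llama.cpp build option in use:

```dockerfile
# Illustrative sketch, not the repo's exact Dockerfile
FROM ubuntu:22.04 AS build
ARG GGML_BACKEND=cuda
ARG CUDA_ARCHITECTURES="89;90"

# Select llama.cpp CMake flags from the build arg
RUN if [ "$GGML_BACKEND" = "cuda" ]; then \
      cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$CUDA_ARCHITECTURES"; \
    elif [ "$GGML_BACKEND" = "rocm" ]; then \
      cmake -B build -DGGML_HIP=ON; \
    else \
      cmake -B build; \
    fi \
 && cmake --build build -j
```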

GitHub Actions (.github/workflows/build-inference.yml) — new

| Job | Runner | Output |
| --- | --- | --- |
| build-linux (cuda) | ubuntu-latest | Docker image → GHCR …/llama-server:*-cuda |
| build-linux (rocm) | ubuntu-latest | Docker image → GHCR …/llama-server:*-rocm |
| build-linux (cpu) | ubuntu-latest | Docker image → GHCR …/llama-server:*-cpu |
| build-metal | macos-latest | Native arm64 binary (artifact + release asset) |

Metal must run on macos-latest because Metal.framework and MetalPerformanceShaders are macOS-only and are unavailable inside any Linux Docker container.
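
A minimal sketch of the matrix this table implies (job structure, step names, and the tagging scheme here are assumptions; the actual workflow differs in detail):

```yaml
# Hypothetical fragment of .github/workflows/build-inference.yml
jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        backend: [cuda, rocm, cpu]
    steps:
      - uses: actions/checkout@v4
      - name: Build image for one backend
        run: |
          docker build \
            --build-arg GGML_BACKEND=${{ matrix.backend }} \
            -f inference/Dockerfile \
            -t ghcr.io/${{ github.repository }}/llama-server:${{ github.sha }}-${{ matrix.backend }} .

  build-metal:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build native arm64 binary
        run: cmake -B build -DGGML_METAL=ON && cmake --build build -j3
```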

docker-compose

  • Replaced hard /dev/nvidia* device mounts + NVIDIA_VISIBLE_DEVICES env vars with the standard deploy.resources.reservations.devices spec (works with the nvidia container runtime)
  • Added GGML_BACKEND and CUDA_ARCHITECTURES build args
  • Added docker-compose.rocm.yml override for AMD ROCm (/dev/kfd, /dev/dri, group_add: [video, render])
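
As a sketch, the NVIDIA reservation follows the standard Compose device spec; service name and defaults here are illustrative, not repo-exact:

```yaml
# Illustrative docker-compose fragment
services:
  llama-server:
    build:
      args:
        GGML_BACKEND: ${GGML_BACKEND:-cuda}
        CUDA_ARCHITECTURES: ${CUDA_ARCHITECTURES:-89;90}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Because the reservation lives under deploy.resources rather than hard device mounts, the same service definition runs unchanged on hosts without the NVIDIA runtime when the override/backend selects ROCm or CPU.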

Entrypoint scripts

  • entrypoint-v3.1-9b.sh, entrypoint-mtp.sh, entrypoint-v3-specdec.sh: CUDA-specific env vars (GGML_CUDA_NO_PINNED, CUDA_DEVICE_MAX_CONNECTIONS, CUDA_MODULE_LOADING) now guarded behind command -v nvidia-smi so they are silently skipped on ROCm/CPU
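
The guard pattern is a one-line capability check. A hedged sketch (the specific values assigned here are illustrative; the real entrypoint scripts may choose different settings):

```shell
# Only export CUDA tuning vars when the NVIDIA driver stack is present;
# on ROCm/CPU hosts this branch is silently skipped.
if command -v nvidia-smi >/dev/null 2>&1; then
  export GGML_CUDA_NO_PINNED=1
  export CUDA_DEVICE_MAX_CONNECTIONS=1
  export CUDA_MODULE_LOADING=LAZY
fi
```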

Benchmark (benchmark/analysis/hardware_info.py)

  • get_gpu_info() now tries nvidia-smi → rocm-smi → system_profiler in order
  • get_cuda_version() now also detects ROCm version (hipconfig) and Metal/macOS version
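
The detection order can be expressed as a shell sketch (the actual implementation is Python in benchmark/analysis/hardware_info.py; the query flags shown are one plausible choice, not the repo's exact invocation):

```shell
# Probe vendor tools in priority order: NVIDIA, then AMD, then macOS.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_info=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)
elif command -v rocm-smi >/dev/null 2>&1; then
  gpu_info=$(rocm-smi --showproductname 2>/dev/null)
elif command -v system_profiler >/dev/null 2>&1; then
  gpu_info=$(system_profiler SPDisplaysDataType 2>/dev/null)
else
  gpu_info="unknown"
fi
echo "$gpu_info"
```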

Usage

```shell
# NVIDIA (default)
docker compose up --build

# AMD ROCm
GGML_BACKEND=rocm docker compose -f docker-compose.yml -f docker-compose.rocm.yml up --build

# CPU-only
GGML_BACKEND=cpu docker compose up --build

# Apple Metal — use the GH Actions artifact, or build natively:
git clone https://github.com/ggml-org/llama.cpp /tmp/llama.cpp
cd /tmp/llama.cpp && cmake -B build -DGGML_METAL=ON -DBUILD_SHARED_LIBS=OFF && cmake --build build -j"$(sysctl -n hw.logicalcpu)"
```

Removed CPU backend configuration from build-inference.yml.