
Conversation

@doringeman
Contributor

This is needed because some people may not want vllm-metal installed by default, especially Docker Desktop users.
Eventually we'll switch to an approach where all backends are self-contained and independently installable.

To test:

make -C cmd/cli install
rm -rf ~/.docker/model-runner/vllm-metal/ # make sure vllm-metal is not installed
MODEL_RUNNER_PORT=8080 make run
$ MODEL_RUNNER_HOST=http://localhost:8080 docker model run huggingface.co/mlx-community/llama-3.2-1b-instruct-4bit hi
Failed to generate a response: error response: status=412 body=backend "vllm-metal" is not installed; run: docker model install-runner --backend vllm-metal

$ MODEL_RUNNER_HOST=http://localhost:8080 docker model install-runner --backend vllm-metal
Installing vllm-metal backend...
vllm-metal backend installed successfully

$ MODEL_RUNNER_HOST=http://localhost:8080 docker model run huggingface.co/mlx-community/llama-3.2-1b-instruct-4bit hi
Hi there! How can I help you today?

$ curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "hf.co/mlx-community/llama-3.2-1b-instruct-4bit",
      "messages": [{"role": "user", "content": "hi"}],
      "max_tokens": 512
    }'
{"id":"chatcmpl-06e20393","object":"chat.completion","created":1770745785,"model":"hf.co/mlx-community/llama-3.2-1b-instruct-4bit","choices":[{"index":0,"message":{"role":"assistant","content":"Hi there! How can I help you today?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":3,"completion_tokens":8,"total_tokens":11}}

@gemini-code-assist
Contributor

Summary of Changes

Hello @doringeman, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the backend installation process to make the vllm-metal backend opt-in rather than a default installation. This improves the user experience by allowing users to install only the backends they need, reducing initial setup overhead and potential conflicts, especially for Docker Desktop environments. The changes involve introducing new API endpoints and updating the internal scheduling and installation mechanisms to support this deferred installation model.

Highlights

  • Opt-in VLLM-Metal Backend Installation: The vllm-metal backend is no longer installed by default and now requires explicit installation via docker model install-runner --backend vllm-metal. This change addresses concerns for users who may not need it, particularly Docker Desktop users, and moves towards a more modular backend installation approach.
  • New Backend Installation API Endpoint: A new HTTP POST endpoint, /inference/install-backend, has been introduced to programmatically trigger the on-demand installation of deferred backends like vllm-metal (a client-side sketch follows this list).
  • Enhanced Backend Management Logic: The core scheduling and installation logic has been updated to support deferred backend installations, including checks for already installed backends and appropriate error handling when a deferred backend is requested but not yet installed.
  • Updated CLI and Documentation: The command-line interface (docker model install-runner, reinstall-runner, start-runner) and associated documentation have been updated to include vllm-metal as a selectable backend option.
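For reviewers who want to poke at the new endpoint outside the CLI, here is a minimal, hypothetical client sketch. The route comes from the summary above; the JSON body shape ({"backend": "..."}), the port, and the helper name installBackend are assumptions rather than details taken from the PR.

```go
// Hypothetical standalone client for the new endpoint. The JSON body shape
// ({"backend": "..."}) is an assumption; only the route is taken from the PR.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func installBackend(host, backend string) error {
	body, err := json.Marshal(map[string]string{"backend": backend}) // assumed body shape
	if err != nil {
		return err
	}
	resp, err := http.Post(host+"/inference/install-backend", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("install failed: status=%d body=%s", resp.StatusCode, msg)
	}
	return nil
}

func main() {
	if err := installBackend("http://localhost:8080", "vllm-metal"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("vllm-metal backend installed successfully")
}
```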


Changelog
  • cmd/cli/commands/install-runner.go
    • Imported the vllmmetal package to support its backend functionality.
    • Updated the backendUsage string to include vllm-metal as a valid option for backend specification.
    • Added conditional logic to runInstallOrStart to handle vllm-metal as a deferred backend, triggering its installation via desktopClient.InstallBackend.
    • Expanded the validBackends list to include vllmmetal.Name for validation.
  • cmd/cli/desktop/desktop.go
    • Added a new InstallBackend method to the Client struct, which sends an HTTP POST request to the model runner's /inference/install-backend endpoint to initiate backend installation.
  • cmd/cli/docs/reference/docker_model_install-runner.yaml
    • Updated the description for the --backend option to include vllm-metal as a valid choice.
  • cmd/cli/docs/reference/docker_model_reinstall-runner.yaml
    • Updated the description for the --backend option to include vllm-metal as a valid choice.
  • cmd/cli/docs/reference/docker_model_start-runner.yaml
    • Updated the description for the --backend option to include vllm-metal as a valid choice.
  • cmd/cli/docs/reference/model_install-runner.md
    • Updated the documentation table for the --backend flag to list vllm-metal as an available option.
  • cmd/cli/docs/reference/model_reinstall-runner.md
    • Updated the documentation table for the --backend flag to list vllm-metal as an available option.
  • cmd/cli/docs/reference/model_start-runner.md
    • Updated the documentation table for the --backend flag to list vllm-metal as an available option.
  • main.go
    • Introduced a deferredBackends slice to track backends that require explicit installation.
    • Conditionally added vllmmetal.Name to deferredBackends if the vllmMetalBackend is available.
    • Modified the NewScheduler constructor call to pass the deferredBackends list.
  • pkg/inference/scheduling/http_handler.go
    • Added a new POST route for /inference/install-backend to handle on-demand backend installations.
    • Inserted a debug log message (ALOHA) within handleOpenAIInference for installation errors.
    • Added error handling for errBackendNotInstalled, returning a 412 Precondition Failed status with instructions to install the backend.
    • Implemented the InstallBackend handler to parse the request body and trigger backend installation via the scheduler.
  • pkg/inference/scheduling/installer.go
    • Imported the sync package for mutex functionality.
    • Defined a new error errBackendNotInstalled to indicate a deferred backend has not been installed.
    • Added deferredBackends map and mu mutex to the installer struct to manage deferred installations.
    • Updated newInstaller to accept a list of deferredBackends and initialize the deferredBackends map.
    • Modified the run method to skip automatic installation for deferred backends and instead check for existing installations on disk.
    • Updated the wait method to immediately return errBackendNotInstalled for deferred backends that are not yet installed, without blocking (see the sketch after this changelog).
    • Added installBackend method to trigger on-demand installation of a deferred backend, handling idempotency and retries.
    • Added isInstalled method to non-blockingly check if a backend has completed installation.
  • pkg/inference/scheduling/scheduler.go
    • Added a deferredBackends field to the Scheduler struct.
    • Modified the NewScheduler constructor to accept and pass deferredBackends to the newInstaller.
    • Updated selectBackendForModel to check if vllm-metal is actually installed using s.installer.isInstalled before preferring it for safetensors models, logging a message if it's available but not installed.
    • Updated ResetInstaller to pass the deferredBackends list when creating a new installer.
    • Added an InstallBackend method to the Scheduler to expose the installer's on-demand installation functionality.
  • pkg/inference/scheduling/scheduler_test.go
    • Updated the NewScheduler call in TestCors to include the new deferredBackends parameter (nil in this test case).
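To make the deferred-installation flow concrete, here is a rough, self-contained sketch of the gating described for the installer's wait method and the 412 mapping in the HTTP handler. The field and function names are inferred from the changelog and the test output above, not copied from the PR.

```go
// Simplified model of the deferred-backend flow; names and structure are
// inferred from the changelog, not taken from the actual implementation.
package scheduling

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

var errBackendNotInstalled = errors.New("backend is not installed")

type installer struct {
	deferredBackends map[string]bool
	installedOnDisk  map[string]bool // stand-in for the real status tracking
}

// wait normally blocks until a backend finishes installing; for a deferred
// backend that has not been installed it fails fast instead of blocking.
func (i *installer) wait(ctx context.Context, name string) error {
	if i.deferredBackends[name] && !i.installedOnDisk[name] {
		return errBackendNotInstalled
	}
	// ... existing blocking wait on the installation status channels ...
	return nil
}

// writeWaitError shows how the HTTP handler can translate the fail-fast error
// into the 412 response seen in the PR description's test output.
func writeWaitError(w http.ResponseWriter, backend string, err error) {
	if errors.Is(err, errBackendNotInstalled) {
		http.Error(w, fmt.Sprintf(
			"backend %q is not installed; run: docker model install-runner --backend %s",
			backend, backend), http.StatusPreconditionFailed)
		return
	}
	http.Error(w, err.Error(), http.StatusInternalServerError)
}
```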
Activity
  • The pull request introduces new functionality and modifications across multiple core components.
  • The author provided detailed testing steps and examples demonstrating the new opt-in installation flow for the vllm-metal backend.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

The pull request introduces an opt-in installation mechanism for the vllm-metal backend, addressing the need for users, especially Docker Desktop users, to avoid default installations. This change involves modifying several Go files to include vllm-metal in backend lists, adding a new InstallBackend function in the desktop client, and updating the scheduler to handle deferred backend installations. Documentation files for install-runner, reinstall-runner, and start-runner commands have also been updated to reflect the inclusion of vllm-metal as a backend option. The changes improve flexibility for users by allowing them to explicitly install vllm-metal when needed, while maintaining the existing backend management for others.

Contributor

@sourcery-ai (bot) left a comment


Hey - I've found 4 issues and left some high-level feedback:

  • In handleOpenAIInference there's a leftover Warnln("ALOHA", err) debug log that should be removed or made more descriptive before merge.
  • In installer.run, when a deferred backend has on-disk data but Install returns an error, the error is silently ignored and the status channels remain open; consider recording the failure on status.failed so callers of wait get a concrete error instead of treating the backend as never-installed.
  • The new /install-backend endpoint currently wraps any non-ErrBackendNotFound error as a 500 with the raw error string; consider distinguishing transient install failures (e.g., network issues) with a 5xx status but a more user-facing message and optionally exposing the underlying error via logs instead of the HTTP response.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `handleOpenAIInference` there's a leftover `Warnln("ALOHA", err)` debug log that should be removed or made more descriptive before merge.
- In `installer.run`, when a deferred backend has on-disk data but `Install` returns an error, the error is silently ignored and the status channels remain open; consider recording the failure on `status.failed` so callers of `wait` get a concrete error instead of treating the backend as never-installed.
- The new `/install-backend` endpoint currently wraps any non-`ErrBackendNotFound` error as a 500 with the raw error string; consider distinguishing transient install failures (e.g., network issues) with a 5xx status but a more user-facing message and optionally exposing the underlying error via logs instead of the HTTP response.

## Individual Comments

### Comment 1
<location> `pkg/inference/scheduling/scheduler.go:114-117` </location>
<code_context>
 	}

 	if config.GetFormat() == types.FormatSafetensors {
-		// Prefer vllm-metal for safetensors models on macOS (most feature-rich for Metal)
+		// Prefer vllm-metal for safetensors models on macOS (most feature-rich for Metal),
+		// but only if it has been installed.
 		if vllmMetalBackend, ok := s.backends[vllmmetal.Name]; ok && vllmMetalBackend != nil {
+			if s.installer.isInstalled(vllmmetal.Name) {
+				return vllmMetalBackend
</code_context>

<issue_to_address>
**issue (bug_risk):** vllm-metal branch returns the backend even when it is not installed, contradicting the comment and log message.

If `isInstalled` returns false, the code still returns `vllmMetalBackend` right after logging the warning, so requests are routed to an uninstalled backend, contradicting the comment and log message and likely failing later.

Instead, only return vllm-metal when `isInstalled` is true and otherwise fall through to the existing MLX/other backend selection, e.g.:

```go
if vllmMetalBackend, ok := s.backends[vllmmetal.Name]; ok && vllmMetalBackend != nil {
    if s.installer.isInstalled(vllmmetal.Name) {
        return vllmMetalBackend
    }
    s.log.Infof("vllm-metal backend is available but not installed. To install, run: docker model install-runner --backend %s", vllmmetal.Name)
    // fall through to MLX/other fallback
}
```
</issue_to_address>

### Comment 2
<location> `pkg/inference/scheduling/installer.go:105-110` </location>
<code_context>
 	// ubiquitous backend and mlx as a relatively lightweight backend (on macOS
 	// only), this granularity is probably less of a concern.
 	for name, backend := range i.backends {
+		// For deferred backends, check if they are already installed on disk
+		// from a previous session. Only call Install() (which verifies the
+		// existing installation) when files are present, to avoid triggering
+		// a download.
+		if i.deferredBackends[name] {
+			status := i.statuses[name]
+			if diskUsage, err := backend.GetDiskUsage(); err == nil && diskUsage > 0 {
</code_context>

<issue_to_address>
**issue (bug_risk):** Deferred-backend Install errors during run() are silently ignored and never surface via wait().

In the deferred-backend branch, when `GetDiskUsage` > 0 you call `Install`, but if `Install` returns an error you just `continue`, leaving both `installed` and `failed` channels open. As a result:

* `wait()` always returns `errBackendNotInstalled`, hiding the real install error.
* There’s no logging or propagation of the failure.

Please handle this like the normal install path: set `status.err`, close `status.failed`, and (optionally) log the error so `wait()` callers see the actual failure and state is consistent with other install flows. A rough sketch follows this comment.
</issue_to_address>
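A rough sketch of the propagation being asked for, using simplified stand-in types rather than the PR's actual structs:

```go
// Sketch only: on a failed verification of an existing on-disk installation,
// record the error and close failed instead of leaving both channels open.
package scheduling

import "log"

type installStatus struct {
	installed chan struct{} // closed once installation completes
	failed    chan struct{} // closed if installation fails
	err       error
}

// verifyDeferredInstall mirrors the normal install path: a failure sets
// status.err and closes failed, so wait() surfaces the real error instead of
// treating the backend as never installed.
func verifyDeferredInstall(name string, status *installStatus, install func() error) {
	if err := install(); err != nil {
		status.err = err
		close(status.failed)
		log.Printf("verifying existing %s installation failed: %v", name, err)
		return
	}
	close(status.installed)
}
```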

### Comment 3
<location> `pkg/inference/scheduling/installer.go:237-207` </location>
<code_context>
+
+// isInstalled returns true if the given backend has completed installation.
+// It is non-blocking.
+func (i *installer) isInstalled(name string) bool {
+	status, ok := i.statuses[name]
+	if !ok {
+		return false
+	}
+	select {
+	case <-status.installed:
+		return true
</code_context>

<issue_to_address>
**issue (bug_risk):** isInstalled reads statuses without synchronization while installBackend can mutate the map entry, which is a potential data race.

`installBackend` replaces entries in `i.statuses` under `i.mu`, but `isInstalled` reads both `i.statuses` and `status.installed` without any locking. This creates a data race between these methods when used concurrently.

Consider either:
- Taking `i.mu` in `isInstalled` while accessing `statuses`, or
- Reusing the existing `installStatus` instance instead of replacing it, or
- Protecting `statuses` with a dedicated sync primitive (e.g., `sync.RWMutex`).

Any of these would make the concurrent interaction between these methods safe; a minimal sketch of the first option follows this comment.
</issue_to_address>
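A sketch of the first option (taking i.mu in isInstalled), with stand-in types based on the fields described in this review rather than the PR's actual code:

```go
// Sketch only: guard the statuses map with the installer's existing mutex so
// isInstalled and installBackend (which replaces map entries) don't race.
package scheduling

import "sync"

type installStatus struct {
	installed chan struct{} // closed once installation completes
}

type installer struct {
	mu       sync.Mutex
	statuses map[string]*installStatus
}

// isInstalled takes the same lock that installBackend holds when it replaces
// map entries, so the map read and the entry it returns stay consistent.
func (i *installer) isInstalled(name string) bool {
	i.mu.Lock()
	status, ok := i.statuses[name]
	i.mu.Unlock()
	if !ok {
		return false
	}
	select {
	case <-status.installed:
		return true
	default:
		return false
	}
}
```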

### Comment 4
<location> `pkg/inference/scheduling/http_handler.go:204` </location>
<code_context>
 	// don't allow any requests to be scheduled for a backend until it has
 	// completed installation.
 	if err := h.scheduler.installer.wait(r.Context(), backend.Name()); err != nil {
+		h.scheduler.log.Warnln("ALOHA", err)
+
</code_context>

<issue_to_address>
**suggestion:** Leftover debug log message (“ALOHA”) is noisy and not actionable in production.

The `Warnln("ALOHA", err)` call appears to be temporary debug logging and emits an opaque message on every backend readiness error. Please remove it or replace it with a structured, descriptive log (e.g., including backend name and context) that follows existing logging conventions.

Suggested implementation:

```golang
	if err := h.scheduler.installer.wait(r.Context(), backend.Name()); err != nil {
		h.scheduler.log.
			WithError(err).
			WithField("backend", backend.Name()).
			Warn("failed waiting for backend installation to complete")

		if errors.Is(err, ErrBackendNotFound) {

```

If `h.scheduler.log` does not yet support `WithError` / `WithField` (e.g., if it's not a logrus-style logger), you should:
1. Replace the logging call with whatever structured logging helpers are used elsewhere in this file or package (for example, `h.scheduler.log.Warnf("failed waiting for backend %s installation to complete: %v", backend.Name(), err)`).
2. Keep the message similarly descriptive and include the backend name and the error.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
…allBackend handler

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
}

// Backends whose installation is deferred until explicitly requested.
var deferredBackends []string
Contributor


I like this deferredBackends idea 👏

m["GET "+inference.InferencePrefix+"/v1/models"] = h.handleModels
m["GET "+inference.InferencePrefix+"/v1/models/{name...}"] = h.handleModels

m["POST "+inference.InferencePrefix+"/install-backend"] = h.InstallBackend
Contributor


Should we use only a POST request to /engines to make the API more RESTful, or keep verbs in the path like we do for actions such as ps and df?

@ericcurtin
Contributor

ericcurtin commented Feb 10, 2026

I'd rather we auto-pulled it on-demand tbh; I want the huggingface help, for example, to be only this:

docker model run huggingface.co/mlx-community/llama-3.2-1b-instruct-4bit

If a user wants to Ctrl-C, that's fine; we should auto-pull and install like a "docker run". If we have to change the huggingface help from a single "docker run" to multiple DMR commands, we have failed.
