[R3] Enable R3 with new inference #1428

Open — hao-aaron wants to merge 10 commits into NovaSky-AI:main from hao-aaron:r3-new-inference

Conversation

@hao-aaron (Contributor) commented Apr 2, 2026

SumanthRH and others added 9 commits March 19, 2026 17:45
Commit: …nference codepath

Adds a custom `/skyrl/v1/generate` endpoint to `VLLMServerActor` that
calls the vLLM engine directly and returns `routed_experts` alongside
token output. The standard `/inference/v1/generate` endpoint's
`GenerateResponseChoice` does not include `routed_experts` (only available
on the Python `CompletionOutput` object), so a custom endpoint is required.

Changes:
- `vllm_server_actor.py`: Add `/skyrl/v1/generate` endpoint with correct
  logprobs serialisation (placeholder `-9999.0` for missing entries,
  matching vLLM's `ChatCompletionLogProb` default) and `routed_experts`
  extraction. Raises `NotImplementedError` if LoRA is enabled.
- `remote_inference_client.py`: Switch `_generate_single` to
  `/skyrl/v1/generate`; extract and propagate `routed_experts` through
  to `InferenceEngineOutput.rollout_expert_indices`.
- `inference_servers/utils.py`: Pass `enable_return_routed_experts` to
  vLLM CLI args so the engine computes routed experts.
- `train/utils/utils.py`: Gate the `mp` backend assertion for R3 behind
  `if not _SKYRL_USE_NEW_INFERENCE` (new path uses ray backend); remove
  the `ValueError` blocking R3 on the new inference path; add startup
  validation that LoRA + R3 cannot be combined on the new path.
- `main_base.py`, `tests/gpu/utils.py`: Pass `enable_return_routed_experts`
  when constructing `RemoteInferenceClient`.
- `test_remote_inference_client.py`: Update mock endpoint to
  `/skyrl/v1/generate` returning a single choice.
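The logprobs serialisation described in the first bullet can be sketched in plain Python. This is a hedged illustration, not the actual `vllm_server_actor.py` code: the helper name and input shape are hypothetical, and only the `-9999.0` placeholder value comes from the PR description.

```python
# Sketch of the logprob serialisation described above. -9999.0 matches
# vLLM's ChatCompletionLogProb default per the PR description; the
# function and its input shape are illustrative, not the actual code.
PLACEHOLDER_LOGPROB = -9999.0

def serialize_logprobs(token_ids, logprobs_per_token):
    """Return one float logprob per generated token.

    logprobs_per_token is a list of dicts mapping token_id -> logprob,
    or None where the engine produced no logprobs for that position.
    """
    out = []
    for token_id, lp_dict in zip(token_ids, logprobs_per_token):
        if lp_dict is None or token_id not in lp_dict:
            out.append(PLACEHOLDER_LOGPROB)  # missing entry -> placeholder
        else:
            out.append(lp_dict[token_id])
    return out
```

The placeholder keeps the serialized list the same length as the token list even when the engine omits entries, so the client can zip them back together position by position.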

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit: …pported

R3 requires the mp backend to avoid hangs, but mp is not yet supported
on the new inference path (tracked in NovaSky-AI#1309). Restore the ValueError
blocking R3 on new inference, and un-gate the mp assertion so it applies
to both old and new inference paths consistently.

The infrastructure changes (/skyrl/v1/generate endpoint, RemoteInferenceClient
propagation) remain as pre-work for when mp support lands.
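The restored gating described above could look roughly like this. The env-var name, the mp requirement, and the ValueError come from this PR's commit messages; the function itself and its signature are illustrative, not the actual `train/utils/utils.py` code.

```python
import os

# Hedged sketch of the restored R3 startup validation described above.
# The _SKYRL_USE_NEW_INFERENCE flag and mp-backend requirement come from
# this PR; the function is hypothetical, not the actual train/utils code.
def validate_r3_config(use_r3: bool, backend: str) -> None:
    if not use_r3:
        return
    if os.environ.get("_SKYRL_USE_NEW_INFERENCE") == "1":
        # mp is not yet supported on the new inference path (#1309),
        # so R3 stays blocked there until support lands.
        raise ValueError("R3 is not yet supported on the new inference path")
    # The un-gated assertion now applies to both inference paths.
    assert backend == "mp", "R3 requires the mp backend to avoid hangs"
```

Failing at startup, rather than partway through a rollout, is the point: an unsupported R3 configuration surfaces immediately instead of as a hang.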

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Made-with: Cursor

# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/remote_inference_client.py
#	skyrl/train/entrypoints/main_base.py
#	tests/backends/skyrl_train/gpu/utils.py
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron hao-aaron marked this pull request as ready for review April 2, 2026 22:46
devin-ai-integration[bot]

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@devin-ai-integration[bot] (Contributor) left a comment


Devin Review found 2 new potential issues.

View 6 additional findings in Devin Review.


Diff under review:

  }
- @app.post("/inference/v1/generate")
+ @app.post("/skyrl/v1/generate")

🔴 Mock test server removed /inference/v1/generate endpoint, breaking sample() tests

The mock server endpoint was renamed from /inference/v1/generate to /skyrl/v1/generate, but the sample() method in RemoteInferenceClient still uses /inference/v1/generate (skyrl/backends/skyrl_train/inference_servers/remote_inference_client.py:442). The test_sample and test_sample_n2 tests call client.sample(), which will hit the now-nonexistent /inference/v1/generate on the mock server, causing a 404 error. The mock server needs both endpoints: /skyrl/v1/generate for the _generate_single() path and /inference/v1/generate for the sample() path.

Suggested change:

- @app.post("/skyrl/v1/generate")
+ @app.post("/skyrl/v1/generate")
+ @app.post("/inference/v1/generate")
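The suggestion amounts to stacking both route decorators on one handler so the mock serves both paths. A minimal stand-in sketch (a toy router in place of FastAPI, with an illustrative payload) shows the registration pattern:

```python
# Toy stand-in for FastAPI route registration, illustrating how stacked
# @app.post decorators bind one handler to two paths in the mock server.
class MockRouter:
    def __init__(self):
        self.routes = {}

    def post(self, path):
        def register(handler):
            self.routes[path] = handler
            return handler  # returning the handler lets decorators stack
        return register

app = MockRouter()

@app.post("/skyrl/v1/generate")
@app.post("/inference/v1/generate")
def mock_generate():
    # Illustrative payload; the real mock returns a full choice object.
    return {"choices": [{"token_ids": [], "routed_experts": None}]}
```

Because the inner decorator returns the handler unchanged, the outer decorator registers the very same function, so `_generate_single()` and `sample()` hit identical mock behaviour under their respective paths.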


Code under review:

async def _skyrl_generate(request: Request):
    """SkyRL generate endpoint that returns routed_experts alongside token output."""
    if getattr(cli_args, "enable_lora", False):
        raise NotImplementedError("/skyrl/v1/generate does not support LoRA.")

🔴 NotImplementedError instead of HTTPException causes 30 silent retries and lost error message

The _skyrl_generate endpoint raises a bare NotImplementedError for the LoRA guard clause. FastAPI converts unhandled exceptions to a plain-text HTTP 500 ("Internal Server Error"). The client's _post method (remote_inference_client.py:227-253) then fails to parse the plain-text body as JSON; since 500 is outside the 4xx range the error is treated as transient and retried 30 times (~30 seconds wasted), ultimately surfacing a confusing JSON-decode error instead of the real "LoRA not supported" message. The correct pattern is already used two lines below at line 369 (raise HTTPException(status_code=500, ...)), so this is an inconsistency within the same function.

Suggested change:

- raise NotImplementedError("/skyrl/v1/generate does not support LoRA.")
+ raise HTTPException(status_code=400, detail="/skyrl/v1/generate does not support LoRA.")
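The retry behaviour the finding describes can be illustrated with a small status-code classifier. The 4xx fail-fast rule and the 30-retry budget follow the finding; the function itself is hypothetical, not the actual `_post` implementation.

```python
# Hedged sketch of the client retry policy described in the finding:
# 4xx responses surface immediately with the server's detail message,
# while other errors (e.g. the plain-text 500 produced by a bare
# NotImplementedError) are treated as transient and retried.
MAX_RETRIES = 30  # ~30 wasted retries when the server 500s, per the finding

def should_retry(status_code: int, attempt: int) -> bool:
    if 400 <= status_code < 500:
        return False  # permanent client error: fail fast
    return attempt < MAX_RETRIES  # transient: keep retrying
```

Under this policy, `HTTPException(status_code=400, ...)` fails on the first attempt with a readable detail string, whereas the bare `NotImplementedError` burns the full retry budget before surfacing a JSON-decode error.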

@SumanthRH (Member) commented:

For reference: We ran the Moonlight-16B script with the old and the new inference and got matching curves:

https://api.wandb.ai/links/sky-posttraining-uc-berkeley/lwaaqy73

[Screenshot, 2026-04-03: matching training curves for old vs. new inference]

@SumanthRH (Member) left a comment:

Let's address the issue with the sample API

Diff under review:

  }
- @app.post("/inference/v1/generate")
+ @app.post("/skyrl/v1/generate")

Comment on lines +10 to +11
export _SKYRL_USE_NEW_INFERENCE=1


Revert?

Suggested change:

- export _SKYRL_USE_NEW_INFERENCE=1
