[R3] Enable R3 with new inference #1428
…nference codepath

Adds a custom `/skyrl/v1/generate` endpoint to `VLLMServerActor` that calls the vLLM engine directly and returns `routed_experts` alongside token output. The standard `/inference/v1/generate` endpoint's `GenerateResponseChoice` does not include `routed_experts` (it is only available on the Python `CompletionOutput` object), so a custom endpoint is required.

Changes:
- `vllm_server_actor.py`: Add the `/skyrl/v1/generate` endpoint with correct logprobs serialisation (placeholder `-9999.0` for missing entries, matching vLLM's `ChatCompletionLogProb` default) and `routed_experts` extraction. Raises `NotImplementedError` if LoRA is enabled.
- `remote_inference_client.py`: Switch `_generate_single` to `/skyrl/v1/generate`; extract and propagate `routed_experts` through to `InferenceEngineOutput.rollout_expert_indices`.
- `inference_servers/utils.py`: Pass `enable_return_routed_experts` to the vLLM CLI args so the engine computes routed experts.
- `train/utils/utils.py`: Gate the `mp` backend assertion for R3 behind `if not _SKYRL_USE_NEW_INFERENCE` (the new path uses the ray backend); remove the `ValueError` blocking R3 on the new inference path; add startup validation that LoRA + R3 cannot be combined on the new path.
- `main_base.py`, `tests/gpu/utils.py`: Pass `enable_return_routed_experts` when constructing `RemoteInferenceClient`.
- `test_remote_inference_client.py`: Update the mock endpoint to `/skyrl/v1/generate` returning a single choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
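The logprobs serialisation described above can be sketched roughly as follows. This is a minimal sketch, not the actual `vllm_server_actor.py` code: `serialize_logprobs` and the exact input shape are assumptions, based on vLLM storing per-token logprobs as optional dicts keyed by token id.

```python
# Placeholder for missing logprob entries, matching vLLM's
# ChatCompletionLogProb default mentioned in the PR description.
PLACEHOLDER_LOGPROB = -9999.0

def serialize_logprobs(token_ids, logprob_dicts):
    """Flatten per-token logprob dicts into a JSON-serialisable list.

    `logprob_dicts` is assumed to be a list parallel to `token_ids`,
    where each entry is either None or a dict mapping token id -> logprob.
    """
    out = []
    for tok, lp in zip(token_ids, logprob_dicts):
        if lp is None or tok not in lp:
            out.append(PLACEHOLDER_LOGPROB)  # missing entry -> placeholder
        else:
            out.append(lp[tok])
    return out
```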
…pported

R3 requires the mp backend to avoid hangs, but mp is not yet supported on the new inference path (tracked in NovaSky-AI#1309). Restore the `ValueError` blocking R3 on new inference, and un-gate the mp assertion so that it applies to both the old and new inference paths consistently. The infrastructure changes (the `/skyrl/v1/generate` endpoint, `RemoteInferenceClient` propagation) remain as pre-work for when mp support lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Made-with: Cursor

# Conflicts:
#	skyrl/backends/skyrl_train/inference_servers/remote_inference_client.py
#	skyrl/train/entrypoints/main_base.py
#	tests/backends/skyrl_train/gpu/utils.py
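The restored guard described in the commit message could look roughly like the sketch below. `validate_r3_config` and its parameters are hypothetical names for illustration, not the actual `train/utils/utils.py` code.

```python
def validate_r3_config(enable_r3: bool, backend: str, use_new_inference: bool) -> None:
    """Startup validation sketch for R3 (names are assumptions)."""
    if not enable_r3:
        return
    # Restored ValueError: R3 stays blocked on the new inference path
    # until mp backend support lands there (tracked in NovaSky-AI#1309).
    if use_new_inference:
        raise ValueError("R3 is not supported on the new inference path yet.")
    # Un-gated assertion: R3 requires the mp backend on every path.
    assert backend == "mp", "R3 requires the mp backend to avoid hangs."
```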
```diff
 }

-@app.post("/inference/v1/generate")
+@app.post("/skyrl/v1/generate")
```
🔴 Mock test server removed /inference/v1/generate endpoint, breaking sample() tests
The mock server endpoint was renamed from /inference/v1/generate to /skyrl/v1/generate, but the sample() method in RemoteInferenceClient still uses /inference/v1/generate (skyrl/backends/skyrl_train/inference_servers/remote_inference_client.py:442). The test_sample and test_sample_n2 tests call client.sample(), which will hit the now-nonexistent /inference/v1/generate on the mock server, causing a 404 error. The mock server needs both endpoints: /skyrl/v1/generate for the _generate_single() path and /inference/v1/generate for the sample() path.
Suggested change:
```diff
 @app.post("/skyrl/v1/generate")
+@app.post("/inference/v1/generate")
```
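A dependency-free sketch of why stacked route decorators fix this: a toy route table stands in for FastAPI's `@app.post` here, and all names are illustrative. Stacking both decorators registers one handler under both paths, so the `_generate_single()` and `sample()` codepaths each find their endpoint on the mock server.

```python
# Toy route table standing in for FastAPI's routing machinery.
ROUTES = {}

def post(path):
    """Register the decorated function under `path`, like @app.post."""
    def register(fn):
        ROUTES[path] = fn
        return fn
    return register

# Stacking both decorators registers the SAME handler under both paths.
@post("/skyrl/v1/generate")
@post("/inference/v1/generate")
def mock_generate():
    return {"choices": [{"token_ids": [1, 2, 3]}]}
```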
@hao-aaron this is a great catch. Can you ensure that the sample method is also using our generate API endpoint?
```python
async def _skyrl_generate(request: Request):
    """SkyRL generate endpoint that returns routed_experts alongside token output."""
    if getattr(cli_args, "enable_lora", False):
        raise NotImplementedError("/skyrl/v1/generate does not support LoRA.")
```
🔴 NotImplementedError instead of HTTPException causes 30 silent retries and lost error message
The _skyrl_generate endpoint raises a bare NotImplementedError for the LoRA guard clause. FastAPI converts unhandled exceptions to a plain-text HTTP 500 ("Internal Server Error"). The client's _post method (remote_inference_client.py:227-253) then fails to parse the plain-text body as JSON; since 500 is outside the 4xx range the error is treated as transient and retried 30 times (~30 seconds wasted), ultimately surfacing a confusing JSON-decode error instead of the real "LoRA not supported" message. The correct pattern is already used two lines below at line 369 (raise HTTPException(status_code=500, ...)), so this is an inconsistency within the same function.
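The retry behaviour described above (4xx treated as permanent, everything else retried) can be sketched as follows; `should_retry` is a hypothetical name for illustration, not the actual `_post` implementation.

```python
def should_retry(status_code: int) -> bool:
    """Sketch of the client retry policy: 4xx responses are permanent
    errors (no retry); anything else, including 500, is retried -- which
    is why a bare NotImplementedError (rendered as a plain-text 500)
    gets retried repeatedly before surfacing a confusing error."""
    return not (400 <= status_code < 500)
```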
Suggested change:
```diff
-raise NotImplementedError("/skyrl/v1/generate does not support LoRA.")
+raise HTTPException(status_code=400, detail="/skyrl/v1/generate does not support LoRA.")
```
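The difference the suggestion makes can be sketched without the FastAPI dependency. The `HTTPException` below is a local stand-in for `fastapi.HTTPException`, which FastAPI serialises to a JSON `{"detail": ...}` body with the chosen status code, so the client sees the real message and does not retry.

```python
class HTTPException(Exception):
    """Local stand-in for fastapi.HTTPException (status + structured detail)."""
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def guard_lora(enable_lora: bool) -> None:
    # A structured 400 surfaces the real message immediately and is not
    # retried, unlike an opaque plain-text 500 from a bare exception.
    if enable_lora:
        raise HTTPException(status_code=400, detail="/skyrl/v1/generate does not support LoRA.")
```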
For reference: We ran the Moonlight-16B script with the old and the new inference and got matching curves: https://api.wandb.ai/links/sky-posttraining-uc-berkeley/lwaaqy73
SumanthRH left a comment
Let's address the issue with the sample API
```diff
+export _SKYRL_USE_NEW_INFERENCE=1
```

Revert?

Suggested change: remove the `export _SKYRL_USE_NEW_INFERENCE=1` line.


Uh oh!
There was an error while loading. Please reload this page.