(lib) Add support for vLLM-backed text embedder #1494
charlesbluca wants to merge 85 commits into NVIDIA:main from
Conversation
…embeddings-1 Made-with: Cursor # Conflicts: # nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
- Add BeirConfig.use_vllm (default False), forward as embed_use_vllm
- batch_pipeline: pass embed_use_vllm into BeirConfig for BEIR eval
- Extend test_evaluate_lancedb_beir to assert Retriever kwargs
Made-with: Cursor
The upstream refactor replaced ingest_modes/ with text_embed/operators.py. Add vLLM branching to _BatchEmbedGPUActor.__init__() so that embed_use_vllm=True in EmbedParams routes through create_local_embedder() with use_vllm=True, matching the interface used for recall queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
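A rough sketch of the branching this commit describes (the actor and factory names come from the commit text; the import path, config field access, and factory signature are assumptions, and later commits in this PR replace the `use_vllm` toggle with explicit backend strings):

```python
from nemo_retriever.model import create_local_embedder  # import path assumed


class _BatchEmbedGPUActor:
    def __init__(self, embed_params):
        # embed_params is an EmbedParams-like object carrying embed_use_vllm.
        use_vllm = bool(getattr(embed_params, "embed_use_vllm", False))
        # Route through the same local-embedder interface used for recall queries.
        self._embedder = create_local_embedder(use_vllm=use_vllm)
```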
Greptile Summary
This PR adds vLLM-backed local text embedding as an alternative to the existing HuggingFace path, introducing the following changes:
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/text_embed/vllm.py | New module for vLLM batched embedding inference — clean, well-structured, with proper batching, empty-slot handling, and multimodal support (see the sketch after this table). |
| nemo_retriever/src/nemo_retriever/text_embed/processor.py | Injects a local embedder into task_config; passes prefix="" kwarg to LlamaNemotronEmbed1BV2HFEmbedder.embed() which doesn't accept it — crashes the HF ingest backend. |
| nemo_retriever/src/nemo_retriever/model/__init__.py | Factory functions updated to route VL and non-VL models to vLLM or HF backends; create_local_query_embedder correctly defaults to "hf" for queries. |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py | vLLM-backed embedder with _finalize_vectors padding; still loads eagerly in __post_init__ (known issue from prior review rounds). |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py | New HF-only embedder with correct lazy loading via _ensure_loaded(); prefix logic handled internally without a prefix kwarg. |
| nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py | Adds LlamaNemotronEmbedVL1BV2VLLMEmbedder; existing HF VL embedder still loads model eagerly in __post_init__, violating model-lazy-loading. |
| nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py | Refactored to use create_local_embedder; device kwarg is not forwarded, silently ignoring user-configured device for the HF backend. |
| nemo_retriever/tests/test_vllm_embed.py | Thorough unit tests for embed_with_vllm_llm, embed_multimodal_with_vllm_llm, and the VL vLLM embedder; covers batching, None embedding slots, and L2 normalization. |
| nemo_retriever/tests/test_create_local_embedder.py | Factory routing tests cover all four embedder classes including the new VL vLLM class; fixture correctly stubs both LlamaNemotronEmbedVL1BV2Embedder and LlamaNemotronEmbedVL1BV2VLLMEmbedder. |
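As a rough illustration of the batching behaviour described for vllm.py above and exercised by test_vllm_embed.py, a helper along these lines might look as follows. The function name comes from the tests; vLLM's `LLM.embed()` call shape and the zero-fill choice for empty slots are assumptions, not necessarily the module's actual implementation.

```python
import numpy as np


def embed_with_vllm_llm(llm, texts, batch_size=64):
    """Batch-embed texts with a vLLM LLM and return L2-normalized vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        outputs = llm.embed(texts[start : start + batch_size])
        for out in outputs:  # one EmbeddingRequestOutput per prompt
            emb = out.outputs.embedding
            vectors.append(None if emb is None else np.asarray(emb, dtype=np.float32))
    if not vectors:
        return np.zeros((0, 0), dtype=np.float32)
    dim = next((v.shape[0] for v in vectors if v is not None), 0)
    # Empty slots become zero vectors so row order stays aligned with the inputs.
    stacked = np.stack([v if v is not None else np.zeros(dim, dtype=np.float32) for v in vectors])
    norms = np.linalg.norm(stacked, axis=1, keepdims=True)
    return stacked / np.clip(norms, 1e-12, None)
```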
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[embed request] --> B{has endpoint?}
    B -- yes --> C[Remote NIM HTTP]
    B -- no --> D[maybe_inject_local_hf_embedder]
    D --> E{local_ingest_backend}
    E -- vllm --> F[create_local_embedder backend=vllm]
    E -- hf --> G[create_local_embedder backend=hf]
    F --> H{is_vl_embed_model?}
    G --> I{is_vl_embed_model?}
    H -- yes --> J[LlamaNemotronEmbedVL1BV2VLLMEmbedder]
    H -- no --> K[LlamaNemotronEmbed1BV2Embedder]
    I -- yes --> L[LlamaNemotronEmbedVL1BV2Embedder]
    I -- no --> M[LlamaNemotronEmbed1BV2HFEmbedder]
    K --> N["_embed closure: embed(prefixed, prefix='')"]
    M --> O["_embed closure: embed(prefixed, prefix='') ❌ TypeError"]
    J --> P["_embed closure: embed(texts) ✓"]
    L --> Q["_embed closure: embed(texts) ✓"]
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/processor.py
Line: 124-129
Comment:
**`prefix=""` kwarg crashes HF non-VL ingest backend**
When `local_ingest_backend="hf"`, `create_local_embedder` returns a `LlamaNemotronEmbed1BV2HFEmbedder`. The `_embed` closure then calls `embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")`, but `LlamaNemotronEmbed1BV2HFEmbedder.embed()` is declared as `def embed(self, texts, *, batch_size=64)` with no `prefix` parameter. Python raises `TypeError: embed() got an unexpected keyword argument 'prefix'` immediately, crashing the ingest pipeline for any user who sets `local_ingest_backend="hf"`.
The `LlamaNemotronEmbed1BV2HFEmbedder.embed()` already prepends the `"passage: "` prefix internally, so the texts should be passed unprefixed and without `prefix=`:
```python
def _embed(texts):
    if ingest_backend == "hf":
        # HF embedder manages its own prefix internally
        vecs = embedder_instance.embed(texts, batch_size=local_batch_size)
    else:
        prefixed = [prefix + t for t in texts] if prefix else texts
        vecs = embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")
    return vecs.tolist()
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py
Line: 51-86
Comment:
**Eager model load in `__post_init__` violates `model-lazy-loading` rule**
`LlamaNemotronEmbedVL1BV2Embedder.__post_init__` calls `AutoModel.from_pretrained()` (plus a `.to(dev)` CUDA transfer) unconditionally at construction time. The `model-lazy-loading` rule requires models to load on first use, not at construction. Any code path that constructs this class — e.g., `create_local_embedder(..., backend="hf")` for a VL model inside `gpu_operator.py` — will immediately trigger a full HF download and GPU transfer before any embedding call is made.
Apply the same `_ensure_loaded()` pattern used by `LlamaNemotronEmbed1BV2HFEmbedder`: set `self._model = None` in `__post_init__`, and defer the `AutoModel.from_pretrained` call to a private `_ensure_loaded()` helper invoked in `embed()`, `embed_queries()`, `embed_images()`, and `embed_text_image()`.
**Rule Used:** Models must be loaded lazily (on first use or expl... ([source](https://app.greptile.com/review/custom-context?memory=model-lazy-loading))
How can I resolve this? If you propose a fix, please make it concise.
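A minimal sketch of the `_ensure_loaded()` pattern suggested above, using a simplified stand-in class (field names and defaults here are placeholders, not the repo's actual code):

```python
from dataclasses import dataclass, field

from transformers import AutoModel


@dataclass
class LazyVLEmbedder:  # simplified stand-in, not the repo's actual class
    model_name: str
    device: str = "cuda"
    _model: object = field(default=None, init=False, repr=False)

    def __post_init__(self):
        # No download or GPU transfer at construction time.
        self._model = None

    def _ensure_loaded(self):
        # Heavy work happens only on first use.
        if self._model is None:
            self._model = AutoModel.from_pretrained(self.model_name).to(self.device).eval()
        return self._model

    def embed(self, texts, *, batch_size=64):
        model = self._ensure_loaded()
        ...  # forward pass elided; every public entry point calls _ensure_loaded() first
```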
The torch 2.10 / vLLM 0.17 bump requires additional work not yet ready. Revert to the upstream baseline (vLLM 0.16.0, torch~=2.9.1, torchvision >=0.24,<0.25, flashinfer 0.6.3). The VLM embedder vLLM code path is preserved but now raises a clear RuntimeError at init time if the installed vLLM is < 0.17.0, since that version is required for VLM pooling support. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plumbs embed_use_vllm through to EmbedParams (ingest) and RecallConfig (query), matching the same flag added to the old inprocess/fused examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Text embedding via vLLM works on 0.16.0; only the VLM embedder requires 0.17.0 for pooling support. Move the version check accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
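A sketch of where such a guard might live (helper name and error wording are assumptions; the 0.17.0 threshold is from the commit):

```python
from packaging.version import Version

import vllm


def _require_vllm_vlm_pooling() -> None:
    """Fail fast if the installed vLLM is too old for VLM pooling support."""
    installed = Version(vllm.__version__)
    if installed < Version("0.17.0"):
        raise RuntimeError(
            f"VL embedding via vLLM requires vLLM >= 0.17.0 for pooling support; "
            f"found {installed}. Text-only embedding still works on 0.16.0."
        )
```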
Both nemotron_vlm_captioner and nemotron_parse_v1_2 translate device to CUDA_VISIBLE_DEVICES and call configure_global_hf_cache_base() before constructing LLM(). The vLLM path in both local embedders was silently ignoring these fields. Apply the same pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
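Roughly, the shared pattern looks like this (configure_global_hf_cache_base is the project helper named above; its import path and the remaining LLM() arguments are assumptions):

```python
import os

from vllm import LLM

from nemo_retriever.model import configure_global_hf_cache_base  # import path assumed


def _build_embedding_llm(model_name: str, device=None) -> LLM:
    # Pin the worker to the requested GPU before vLLM initializes CUDA.
    if device is not None:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    # Use the same shared HF cache location as the other vLLM-backed models.
    configure_global_hf_cache_base()
    return LLM(model=model_name)  # embedding-specific arguments elided
```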
…pattern Separate HF and vLLM embedder classes rather than toggling `use_vllm` on a single class. `create_local_embedder()` dispatches to the appropriate class. `_BatchEmbedGPUActor` simplified to a single `create_local_embedder()` call. VLM vLLM embedder retains the vLLM >= 0.17.0 runtime guard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ard HF_HUB_OFFLINE to Ray workers graph_pipeline.py was missing the --embed-local-ingest-backend option that run.py was already generating; wires the flag through to EmbedParams. graph_ingestor.py initialized Ray before executor.py could run, bypassing the HF_HUB_OFFLINE forwarding in executor.py; adds the env var directly to graph_ingestor's ray_env_vars dict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <charlesb@nvidia.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
HarnessConfig:
- reranker, reranker_model_name, rerank_modality, embed_local_ingest_backend, embed_local_query_backend fields + validation + env overrides
- fix preset/dataset merge order: dataset now wins over preset as intended
- load_nightly_config: expose sweep-level overrides dict
harness/run.py:
- _build_command: pass reranker flags and embed/query backend to graph_pipeline
- _run_single: include new fields in run metadata
- _wait_for_gpu_clear: poll nvidia-smi between runs to prevent vLLM CUDA poisoning
- execute_runs: global_overrides merged under per-run overrides; GPU drain before first run and between each subsequent run
recall/beir.py:
- BeirConfig.rerank_modality field plumbed through to Retriever
Sweep files (vllm_bo767_sweep + jp20_vl_sweep): 18-run design covering text model and VL model × text/text_image modality × HF/vLLM ingest × HF/vLLM query × reranker on/off. test_configs.yaml updated with all 36 dataset configs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation The flag does not exist in graph_pipeline.py — rerank modality inherits from embed_modality by default, which is the correct behavior. Passing an unrecognized flag caused all reranked harness runs to fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
huggingface_hub reads HF_HUB_OFFLINE as a module-level constant at import time. Ray workers import the module during startup before runtime_env env_vars are injected, so setting the flag only in runtime_env has no effect on workers. Setting os.environ before ray.init() ensures worker processes inherit the value at the OS level (fork/exec), suppressing model_info() network calls that cause 429 errors when the HF token is absent or rate-limited. Fixed in all four Ray initialization sites: harness/run.py, harness/runner.py, graph/executor.py, graph_ingestor.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
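The ordering that matters, in sketch form (the flag value "1" is assumed; the key point from the commit is that it must be in os.environ before ray.init()):

```python
import os

import ray

# huggingface_hub reads HF_HUB_OFFLINE once at import time, and Ray workers
# import it during startup before runtime_env env_vars are applied, so the
# flag must already be in the parent's OS environment when workers are spawned.
os.environ["HF_HUB_OFFLINE"] = "1"

ray.init()  # workers now inherit HF_HUB_OFFLINE at the OS level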
Removes reranker/backend config fields, _build_command wiring, _wait_for_gpu_clear, sweep YAML files, and test updates added in the previous three commits. These will land separately once the core vLLM embedder work is merged. Retains the os.environ["HF_HUB_OFFLINE"] fix in graph/executor.py and graph_ingestor.py, which are core pipeline files, not harness. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Remove jp20_vl_sweep.yaml and restore test_configs.yaml to upstream/main state. These harness configs will land in a follow-up PR alongside the harness feature work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
… default to HF Drop the implicit "auto" backend alias in favour of explicit "vllm" / "hf" strings throughout create_local_embedder, create_local_query_embedder, Retriever, and the CLI. Defaults are now vllm for ingest and hf for query, matching the recommended mixed-backend configuration. Add HF text-embed support to create_local_embedder (previously non-VL always used vLLM regardless of backend). This lets gpu_operator.py consolidate to a single create_local_embedder call instead of branching on create_local_query_embedder for the ingest-HF path, and collapses create_local_query_embedder to a thin validation wrapper that delegates entirely to create_local_embedder. Update test_create_local_embedder to cover all four class paths (text-vllm, text-hf, vl-hf, vl-vllm), explicit/default backend selection for both factory functions, invalid backend rejection, and guard the torch-dependent smoke test with pytest.importorskip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
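The resulting dispatch, sketched below (class names are the ones in this PR; the factory signature and the is_vl_embed_model helper are assumptions about the surrounding module, where the four embedder classes are already defined):

```python
def create_local_embedder(model_name: str, *, backend: str = "vllm", **kwargs):
    # Explicit backends only; the old "auto" alias is gone.
    if backend not in ("vllm", "hf"):
        raise ValueError(f"backend must be 'vllm' or 'hf', got {backend!r}")
    if is_vl_embed_model(model_name):
        cls = LlamaNemotronEmbedVL1BV2VLLMEmbedder if backend == "vllm" else LlamaNemotronEmbedVL1BV2Embedder
    else:
        cls = LlamaNemotronEmbed1BV2Embedder if backend == "vllm" else LlamaNemotronEmbed1BV2HFEmbedder
    return cls(model_name=model_name, **kwargs)


def create_local_query_embedder(model_name: str, *, backend: str = "hf", **kwargs):
    # Thin validation wrapper: queries default to HF, ingest defaults to vLLM.
    return create_local_embedder(model_name, backend=backend, **kwargs)
```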
…ear edge cases Address code-review feedback: lazy-load the HF embedder via _ensure_loaded() so GPU allocation is deferred to the first embed call, add _to_bool() to correctly handle string "false"/"0" from env/task-config dictionaries, and log rather than silently swallow the cuda.empty_cache() exception on error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
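A minimal sketch of such a helper (the exact set of accepted spellings is an assumption):

```python
def _to_bool(value, default: bool = False) -> bool:
    """Interpret env/task-config values like "false", "0", or True as a real bool."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() not in ("", "0", "false", "no", "off")
```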
RecallConfig.local_query_embed_backend defaults to "auto", but Retriever._get_local_embedder only accepts "hf" or "vllm", causing a ValueError at query time with the default config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ults vllm 'auto' was inconsistently defined (vLLM in the CLI, HF in Retriever) and added no real value. Replace with explicit defaults: query embed → 'hf', ingest embed → 'vllm' (already the default in pipeline/__main__.py). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove false claim that VL models always use HF regardless of local_query_embed_backend — the code does not enforce this. Fix stacklevel=2 → 4 in LlamaNemotronEmbed1BV2Embedder.__post_init__ so the DeprecationWarning points to the caller, not the dataclass __init__. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
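For context, a simplified illustration of the stacklevel fix (class body and warning text are placeholders; the point is that with the dataclass-generated __init__ and the intervening factory frame, stacklevel=4 lands the DeprecationWarning on the caller):

```python
import warnings
from dataclasses import dataclass


@dataclass
class LlamaNemotronEmbed1BV2Embedder:  # heavily simplified stand-in
    model_name: str = ""

    def __post_init__(self):
        warnings.warn(
            "placeholder deprecation message",
            DeprecationWarning,
            stacklevel=4,  # skip __post_init__, the generated __init__, and the factory frame
        )
```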
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tests Forward normalize and max_length from task kwargs in _BatchEmbedActor so user-configured values are not silently dropped. Add pytest.importorskip guard to TestCreateVllmLlm so it skips cleanly when vLLM is not installed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
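The skip guard mentioned here, in sketch form (the test method name is a placeholder; TestCreateVllmLlm is the class named in the commit):

```python
import pytest


class TestCreateVllmLlm:
    def test_create_vllm_llm_smoke(self):
        pytest.importorskip("vllm")  # skip cleanly when vLLM is not installed
        ...
```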
…LLM ingest / HF recall VL models bypass the local_ingest_backend setting and always load via LlamaNemotronEmbedVL1BV2Embedder (HF). Non-VL models keep their existing defaults: vLLM for ingest, HF for recall. Updated comments and CLI help text to reflect the new behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vllm The VL branch was unconditionally using LlamaNemotronEmbedVL1BV2Embedder (HF) regardless of the backend argument. Now it dispatches to the vllm class by default (matching non-VL behaviour) and falls back to the HF class only when backend="hf" is explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mbedder Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
maybe_inject_local_hf_embedder was hardcoding backend="vllm", ignoring EmbedParams.local_ingest_backend. Reads the value from task_config now, falling back to "vllm" if unset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
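In sketch form (the task_config key name is an assumption beyond what the commit states):

```python
def _resolve_local_ingest_backend(task_config: dict) -> str:
    # Honour EmbedParams.local_ingest_backend instead of hardcoding "vllm";
    # fall back to "vllm" when the key is unset or empty.
    return task_config.get("local_ingest_backend") or "vllm"
```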
… torch import Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code under review in nemo_retriever/src/nemo_retriever/text_embed/processor.py (lines 124-129):

```python
else:

    def _embed(texts):
        prefixed = [prefix + t for t in texts] if prefix else texts
        vecs = embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")
        return vecs.tolist()
```
Co-authored-by: polazilber <pzilberman@nvidia.com>
Co-authored-by: Yuval Shkolar <yshkolar@nvidia.com>
LlamaNemotronEmbed1BV2HFEmbedder.embed() has no prefix parameter and manages its own prefix internally; passing prefix="" raises TypeError for any user with local_ingest_backend="hf". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>