
(lib) Add support for vLLM-backed text embedder #1494

Open
charlesbluca wants to merge 85 commits into NVIDIA:main from charlesbluca:retriever-vllm-for-embeddings-1

Conversation

@charlesbluca
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@charlesbluca charlesbluca requested a review from a team as a code owner March 5, 2026 20:04
@charlesbluca charlesbluca requested a review from ChrisJar March 5, 2026 20:04
…embeddings-1

Made-with: Cursor

# Conflicts:
#	nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
- Add BeirConfig.use_vllm (default False), forward as embed_use_vllm
- batch_pipeline: pass embed_use_vllm into BeirConfig for BEIR eval
- Extend test_evaluate_lancedb_beir to assert Retriever kwargs

Made-with: Cursor
@charlesbluca charlesbluca changed the title from "(lib) Add support for vLLM-backed embedder" to "(lib) Add support for vLLM-backed text embedder" Mar 23, 2026
@charlesbluca charlesbluca requested a review from a team as a code owner March 24, 2026 19:40
charlesbluca and others added 3 commits April 7, 2026 18:21
The upstream refactor replaced ingest_modes/ with text_embed/operators.py.
Add vLLM branching to _BatchEmbedGPUActor.__init__() so that
embed_use_vllm=True in EmbedParams routes through create_local_embedder()
with use_vllm=True, matching the interface used for recall queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
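A minimal sketch of the routing this commit describes, keeping the real names `_BatchEmbedGPUActor`, `EmbedParams`, and `create_local_embedder` but with an assumed factory signature and a stubbed body:

```python
from dataclasses import dataclass

def create_local_embedder(model_name: str, *, use_vllm: bool = False):
    """Stand-in for the factory in nemo_retriever.model (signature assumed)."""
    backend = "vllm" if use_vllm else "hf"
    return f"<{backend} embedder for {model_name}>"

@dataclass
class EmbedParams:
    model_name: str = "llama-nemotron-embed-1b-v2"  # illustrative model id
    embed_use_vllm: bool = False

class _BatchEmbedGPUActor:
    def __init__(self, params: EmbedParams):
        # embed_use_vllm=True routes through the vLLM-backed local embedder,
        # matching the interface already used for recall queries.
        self._embedder = create_local_embedder(
            params.model_name, use_vllm=params.embed_use_vllm
        )
```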
@greptile-apps
Contributor

greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds vLLM-backed local text embedding as an alternative to the existing HuggingFace path, introducing LlamaNemotronEmbed1BV2Embedder (vLLM), LlamaNemotronEmbed1BV2HFEmbedder (lazy-loaded HF), and LlamaNemotronEmbedVL1BV2VLLMEmbedder (vLLM VL), along with factory routing in create_local_embedder and create_local_query_embedder.

  • P1 — processor.py HF backend crash: maybe_inject_local_hf_embedder builds a _embed closure that passes prefix="" to embedder_instance.embed(), but LlamaNemotronEmbed1BV2HFEmbedder.embed() has no prefix parameter — any user setting local_ingest_backend="hf" hits an immediate TypeError.
  • P1 — gpu_operator.py device silently dropped (noted in previous outside-diff comment, still present): create_local_embedder call omits the device kwarg, so user-configured device is ignored for the HF backend.

Confidence Score: 3/5

Not safe to merge as-is: the HF ingest backend crashes with a TypeError, and multiple lifecycle and device-placement issues remain unresolved.

Two P1 defects are present: processor.py crashes when local_ingest_backend="hf" due to an unexpected prefix="" kwarg, and gpu_operator.py silently drops the user-configured device for the HF path. Multiple P2 issues (eager loading in the HF VL embedder, missing unload methods) remain from earlier review rounds. Score is pulled below the P1 ceiling of 4 because the processor crash affects a documented configuration path.

Files needing attention: text_embed/processor.py (HF backend crash), text_embed/gpu_operator.py (missing device forwarding), model/local/llama_nemotron_embed_vl_1b_v2_embedder.py (eager HF model load).

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_retriever/src/nemo_retriever/text_embed/vllm.py` | New module for vLLM batched embedding inference — clean, well-structured, with proper batching, empty-slot handling, and multimodal support. |
| `nemo_retriever/src/nemo_retriever/text_embed/processor.py` | Injects a local embedder into task_config; passes `prefix=""` kwarg to `LlamaNemotronEmbed1BV2HFEmbedder.embed()` which doesn't accept it — crashes the HF ingest backend. |
| `nemo_retriever/src/nemo_retriever/model/__init__.py` | Factory functions updated to route VL and non-VL models to vLLM or HF backends; `create_local_query_embedder` correctly defaults to "hf" for queries. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py` | vLLM-backed embedder with `_finalize_vectors` padding; still loads eagerly in `__post_init__` (known issue from prior review rounds). |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_hf_embedder.py` | New HF-only embedder with correct lazy loading via `_ensure_loaded()`; prefix logic handled internally without a `prefix` kwarg. |
| `nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py` | Adds `LlamaNemotronEmbedVL1BV2VLLMEmbedder`; existing HF VL embedder still loads the model eagerly in `__post_init__`, violating model-lazy-loading. |
| `nemo_retriever/src/nemo_retriever/text_embed/gpu_operator.py` | Refactored to use `create_local_embedder`; the `device` kwarg is not forwarded, silently ignoring the user-configured device for the HF backend. |
| `nemo_retriever/tests/test_vllm_embed.py` | Thorough unit tests for `embed_with_vllm_llm`, `embed_multimodal_with_vllm_llm`, and the VL vLLM embedder; covers batching, None embedding slots, and L2 normalization. |
| `nemo_retriever/tests/test_create_local_embedder.py` | Factory routing tests cover all four embedder classes including the new VL vLLM class; the fixture correctly stubs both `LlamaNemotronEmbedVL1BV2Embedder` and `LlamaNemotronEmbedVL1BV2VLLMEmbedder`. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[embed request] --> B{has endpoint?}
    B -- yes --> C[Remote NIM HTTP]
    B -- no --> D[maybe_inject_local_hf_embedder]
    D --> E{local_ingest_backend}
    E -- vllm --> F[create_local_embedder backend=vllm]
    E -- hf --> G[create_local_embedder backend=hf]
    F --> H{is_vl_embed_model?}
    G --> I{is_vl_embed_model?}
    H -- yes --> J[LlamaNemotronEmbedVL1BV2VLLMEmbedder]
    H -- no --> K[LlamaNemotronEmbed1BV2Embedder]
    I -- yes --> L[LlamaNemotronEmbedVL1BV2Embedder]
    I -- no --> M[LlamaNemotronEmbed1BV2HFEmbedder]
    K --> N["_embed closure: embed(prefixed, prefix='')"]
    M --> O["_embed closure: embed(prefixed, prefix='') ❌ TypeError"]
    J --> P["_embed closure: embed(texts) ✓"]
    L --> Q["_embed closure: embed(texts) ✓"]
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/text_embed/processor.py
Line: 124-129

Comment:
**`prefix=""` kwarg crashes HF non-VL ingest backend**

When `local_ingest_backend="hf"`, `create_local_embedder` returns a `LlamaNemotronEmbed1BV2HFEmbedder`. The `_embed` closure then calls `embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")`, but `LlamaNemotronEmbed1BV2HFEmbedder.embed()` is declared as `def embed(self, texts, *, batch_size=64)` with no `prefix` parameter. Python raises `TypeError: embed() got an unexpected keyword argument 'prefix'` immediately, crashing the ingest pipeline for any user who sets `local_ingest_backend="hf"`.

The `LlamaNemotronEmbed1BV2HFEmbedder.embed()` already prepends the `"passage: "` prefix internally, so the texts should be passed unprefixed and without `prefix=`:

```python
def _embed(texts):
    if ingest_backend == "hf":
        # HF embedder manages its own prefix internally
        vecs = embedder_instance.embed(texts, batch_size=local_batch_size)
    else:
        prefixed = [prefix + t for t in texts] if prefix else texts
        vecs = embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")
    return vecs.tolist()
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_vl_1b_v2_embedder.py
Line: 51-86

Comment:
**Eager model load in `__post_init__` violates `model-lazy-loading` rule**

`LlamaNemotronEmbedVL1BV2Embedder.__post_init__` calls `AutoModel.from_pretrained()` (plus a `.to(dev)` CUDA transfer) unconditionally at construction time. The `model-lazy-loading` rule requires models to load on first use, not at construction. Any code path that constructs this class — e.g., `create_local_embedder(..., backend="hf")` for a VL model inside `gpu_operator.py` — will immediately trigger a full HF download and GPU transfer before any embedding call is made.

Apply the same `_ensure_loaded()` pattern used by `LlamaNemotronEmbed1BV2HFEmbedder`: set `self._model = None` in `__post_init__`, and defer the `AutoModel.from_pretrained` call to a private `_ensure_loaded()` helper invoked in `embed()`, `embed_queries()`, `embed_images()`, and `embed_text_image()`.

**Rule Used:** Models must be loaded lazily (on first use or expl... ([source](https://app.greptile.com/review/custom-context?memory=model-lazy-loading))

How can I resolve this? If you propose a fix, please make it concise.
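A minimal sketch of the `_ensure_loaded()` pattern the rule asks for, assuming a transformers-style `AutoModel`; field names and the class name besides `_ensure_loaded` are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class LazyHFEmbedder:
    model_name: str = "llama-nemotron-embed-1b-v2"  # illustrative model id
    device: str = "cuda"
    _model: Any = field(default=None, init=False, repr=False)

    def _ensure_loaded(self):
        # Defer the download and GPU transfer until the first embed call.
        if self._model is None:
            from transformers import AutoModel
            self._model = AutoModel.from_pretrained(self.model_name).to(self.device)
        return self._model

    def embed(self, texts, *, batch_size: int = 64):
        model = self._ensure_loaded()  # first use triggers the load
        ...  # batched forward pass elided
```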

Reviews (38): Last reviewed commit: "Merge branch 'main' into retriever-vllm-..."

Comment thread nemo_retriever/src/nemo_retriever/text_embed/processor.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py
Comment thread nemo_retriever/src/nemo_retriever/text_embed/vllm.py Outdated
Comment thread nemo_retriever/README.md Outdated
charlesbluca and others added 3 commits April 7, 2026 18:29
The torch 2.10 / vLLM 0.17 bump requires additional work not yet ready.
Revert to the upstream baseline (vLLM 0.16.0, torch~=2.9.1, torchvision
>=0.24,<0.25, flashinfer 0.6.3).

The VLM embedder vLLM code path is preserved but now raises a clear
RuntimeError at init time if the installed vLLM is < 0.17.0, since that
version is required for VLM pooling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
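A sketch of an init-time guard like the one described; the check uses the standard library plus `packaging`, but the message text and its placement in the VL embedder are assumptions:

```python
from importlib.metadata import version
from packaging.version import Version

def require_vllm(minimum: str = "0.17.0") -> None:
    installed = Version(version("vllm"))
    if installed < Version(minimum):
        raise RuntimeError(
            f"The VL embedder's vLLM path needs vllm>={minimum} "
            f"(pooling support); found {installed}."
        )
```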
Plumbs embed_use_vllm through to EmbedParams (ingest) and RecallConfig
(query), matching the same flag added to the old inprocess/fused examples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Text embedding via vLLM works on 0.16.0; only the VLM embedder requires
0.17.0 for pooling support. Move the version check accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nemotron_vlm_captioner and nemotron_parse_v1_2 translate device to
CUDA_VISIBLE_DEVICES and call configure_global_hf_cache_base() before
constructing LLM(). The vLLM path in both local embedders was silently
ignoring these fields. Apply the same pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
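A hedged sketch of that shared pattern; `configure_global_hf_cache_base` is the project helper named above (stubbed here), and the device-string parsing is an assumption:

```python
import os

def configure_global_hf_cache_base():
    """Stub for the project helper that pins the HF cache location."""

def build_vllm_llm(model_name: str, device: str = "cuda:0"):
    # Translate a torch-style device string into CUDA_VISIBLE_DEVICES so the
    # vLLM engine lands on the user-configured GPU, as the captioner and
    # parse operators already do.
    if device.startswith("cuda:"):
        os.environ["CUDA_VISIBLE_DEVICES"] = device.split(":", 1)[1]
    configure_global_hf_cache_base()  # must run before constructing LLM()
    from vllm import LLM  # engine kwargs beyond the model name omitted
    return LLM(model=model_name)
```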
…pattern

Separate HF and vLLM embedder classes rather than toggling `use_vllm`
on a single class. `create_local_embedder()` dispatches to the appropriate
class. `_BatchEmbedGPUActor` simplified to a single `create_local_embedder()`
call. VLM vLLM embedder retains the vLLM >= 0.17.0 runtime guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
charlesbluca and others added 7 commits April 22, 2026 14:47
…ard HF_HUB_OFFLINE to Ray workers

graph_pipeline.py was missing the --embed-local-ingest-backend option that
run.py was already generating; wires the flag through to EmbedParams.

graph_ingestor.py initialized Ray before executor.py could run, bypassing
the HF_HUB_OFFLINE forwarding in executor.py; adds the env var directly
to graph_ingestor's ray_env_vars dict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <charlesb@nvidia.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
HarnessConfig:
- reranker, reranker_model_name, rerank_modality, embed_local_ingest_backend,
  embed_local_query_backend fields + validation + env overrides
- fix preset/dataset merge order: dataset now wins over preset as intended
- load_nightly_config: expose sweep-level overrides dict

harness/run.py:
- _build_command: pass reranker flags and embed/query backend to graph_pipeline
- _run_single: include new fields in run metadata
- _wait_for_gpu_clear: poll nvidia-smi between runs to prevent vLLM CUDA poisoning
- execute_runs: global_overrides merged under per-run overrides; GPU drain before
  first run and between each subsequent run

recall/beir.py:
- BeirConfig.rerank_modality field plumbed through to Retriever

Sweep files (vllm_bo767_sweep + jp20_vl_sweep): 18-run design covering
text model and VL model × text/text_image modality × HF/vLLM ingest × HF/vLLM
query × reranker on/off.  test_configs.yaml updated with all 36 dataset configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
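A sketch of the `_wait_for_gpu_clear` polling described above (this harness work is reverted later in the PR to land separately); the threshold and timing values are illustrative:

```python
import subprocess
import time

def _wait_for_gpu_clear(threshold_mib: int = 1024, timeout_s: float = 300.0) -> None:
    """Poll nvidia-smi until every GPU's used memory drops below a threshold."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        used = [int(line) for line in out.splitlines() if line.strip()]
        if all(u < threshold_mib for u in used):
            return
        time.sleep(5.0)  # give the previous vLLM engine time to release VRAM
    raise TimeoutError("GPUs did not drain before the next run")
```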
…ation

The flag does not exist in graph_pipeline.py — rerank modality inherits
from embed_modality by default, which is the correct behavior. Passing
an unrecognized flag caused all reranked harness runs to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
huggingface_hub reads HF_HUB_OFFLINE as a module-level constant at
import time. Ray workers import the module during startup before
runtime_env env_vars are injected, so setting the flag only in
runtime_env has no effect on workers. Setting os.environ before
ray.init() ensures worker processes inherit the value at the OS level
(fork/exec), suppressing model_info() network calls that cause 429
errors when the HF token is absent or rate-limited.

Fixed in all four Ray initialization sites: harness/run.py,
harness/runner.py, graph/executor.py, graph_ingestor.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
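The fix reduces to an ordering rule: export the variable before `ray.init()`, not via `runtime_env`. A minimal sketch:

```python
import os
import ray

# huggingface_hub snapshots HF_HUB_OFFLINE into a module constant at import
# time, so the flag must be in the OS environment before Ray forks workers;
# runtime_env env_vars arrive too late for that import-time read.
os.environ["HF_HUB_OFFLINE"] = "1"
ray.init()
```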
Removes reranker/backend config fields, _build_command wiring,
_wait_for_gpu_clear, sweep YAML files, and test updates added in the
previous three commits. These will land separately once the core vLLM
embedder work is merged.

Retains the os.environ["HF_HUB_OFFLINE"] fix in graph/executor.py and
graph_ingestor.py, which are core pipeline files, not harness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Remove jp20_vl_sweep.yaml and restore test_configs.yaml to
upstream/main state. These harness configs will land in a
follow-up PR alongside the harness feature work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Comment thread nemo_retriever/src/nemo_retriever/text_embed/runtime.py Outdated
charlesbluca and others added 6 commits April 23, 2026 16:42
… default to HF

Drop the implicit "auto" backend alias in favour of explicit "vllm" / "hf"
strings throughout create_local_embedder, create_local_query_embedder,
Retriever, and the CLI.  Defaults are now vllm for ingest and hf for query,
matching the recommended mixed-backend configuration.

Add HF text-embed support to create_local_embedder (previously non-VL always
used vLLM regardless of backend).  This lets gpu_operator.py consolidate to a
single create_local_embedder call instead of branching on create_local_query_embedder
for the ingest-HF path, and collapses create_local_query_embedder to a thin
validation wrapper that delegates entirely to create_local_embedder.

Update test_create_local_embedder to cover all four class paths (text-vllm,
text-hf, vl-hf, vl-vllm), explicit/default backend selection for both
factory functions, invalid backend rejection, and guard the torch-dependent
smoke test with pytest.importorskip.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ear edge cases

Address code-review feedback: lazy-load the HF embedder via _ensure_loaded()
so GPU allocation is deferred to the first embed call, add _to_bool() to
correctly handle string "false"/"0" from env/task-config dictionaries, and
log rather than silently swallow the cuda.empty_cache() exception on error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
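A sketch of the `_to_bool()` helper described; the exact set of accepted spellings is an assumption:

```python
def _to_bool(value, default: bool = False) -> bool:
    """Coerce env/task-config values; string "false"/"0" must not be truthy."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() not in ("", "0", "false", "no")
```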
RecallConfig.local_query_embed_backend defaults to "auto", but
Retriever._get_local_embedder only accepts "hf" or "vllm", causing a
ValueError at query time with the default config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ults vllm

'auto' was inconsistently defined (vLLM in the CLI, HF in Retriever) and
added no real value. Replace with explicit defaults: query embed → 'hf',
ingest embed → 'vllm' (already the default in pipeline/__main__.py).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove false claim that VL models always use HF regardless of
local_query_embed_backend — the code does not enforce this.

Fix stacklevel=2 → 4 in LlamaNemotronEmbed1BV2Embedder.__post_init__ so
the DeprecationWarning points to the caller, not the dataclass __init__.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
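For the `stacklevel` fix: `warnings.warn` counts frames upward from the call site, and in a dataclass the user's constructor call sits several frames above `__post_init__` (the generated `__init__` plus intermediate frames), so the PR settles on 4 for this class. A sketch with an illustrative message:

```python
import warnings
from dataclasses import dataclass

@dataclass
class LlamaNemotronEmbed1BV2Embedder:
    model_name: str = "llama-nemotron-embed-1b-v2"  # illustrative default

    def __post_init__(self):
        # stacklevel=4 skips __post_init__ and the generated dataclass
        # __init__ so the warning points at the caller, not this file.
        warnings.warn(
            "deprecated; message text here is illustrative",
            DeprecationWarning,
            stacklevel=4,
        )
```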
Comment thread nemo_retriever/tests/test_beir_evaluation.py Outdated
charlesbluca and others added 4 commits April 24, 2026 15:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tests

Forward normalize and max_length from task kwargs in _BatchEmbedActor so
user-configured values are not silently dropped. Add pytest.importorskip
guard to TestCreateVllmLlm so it skips cleanly when vLLM is not installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
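The guard is the stock pytest idiom; a sketch as it might appear at the top of the test module:

```python
import pytest

# Skip these tests cleanly on machines without vLLM installed.
vllm = pytest.importorskip("vllm")
```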
…LLM ingest / HF recall

VL models bypass the local_ingest_backend setting and always load via
LlamaNemotronEmbedVL1BV2Embedder (HF). Non-VL models keep their existing
defaults: vLLM for ingest, HF for recall. Updated comments and CLI help
text to reflect the new behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nemo_retriever/src/nemo_retriever/model/__init__.py
charlesbluca and others added 2 commits April 24, 2026 19:52
…vllm

The VL branch was unconditionally using LlamaNemotronEmbedVL1BV2Embedder
(HF) regardless of the backend argument.  Now it dispatches to the vllm
class by default (matching non-VL behaviour) and falls back to the HF class
only when backend="hf" is explicit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
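A sketch of the resulting dispatch, using the four class names from the PR but with stub classes and an assumed `is_vl_embed_model` check:

```python
def is_vl_embed_model(name: str) -> bool:
    return "vl" in name.lower()  # stand-in for the project's real check

class LlamaNemotronEmbed1BV2Embedder: ...        # text, vLLM
class LlamaNemotronEmbed1BV2HFEmbedder: ...      # text, HF
class LlamaNemotronEmbedVL1BV2VLLMEmbedder: ...  # VL, vLLM
class LlamaNemotronEmbedVL1BV2Embedder: ...      # VL, HF

def create_local_embedder(model_name: str, *, backend: str = "vllm"):
    if backend not in ("vllm", "hf"):
        raise ValueError(f"unknown local embed backend: {backend!r}")
    if is_vl_embed_model(model_name):
        # VL now defaults to vLLM like non-VL; HF only when explicit.
        return (LlamaNemotronEmbedVL1BV2Embedder() if backend == "hf"
                else LlamaNemotronEmbedVL1BV2VLLMEmbedder())
    return (LlamaNemotronEmbed1BV2HFEmbedder() if backend == "hf"
            else LlamaNemotronEmbed1BV2Embedder())
```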
…mbedder

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nemo_retriever/src/nemo_retriever/text_embed/processor.py
charlesbluca and others added 3 commits April 24, 2026 20:22
maybe_inject_local_hf_embedder was hardcoding backend="vllm", ignoring
EmbedParams.local_ingest_backend. Reads the value from task_config now,
falling back to "vllm" if unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
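A sketch of the lookup (the helper name `_resolve_ingest_backend` is hypothetical; the real change happens inside `maybe_inject_local_hf_embedder`):

```python
def _resolve_ingest_backend(task_config: dict) -> str:
    # Honor EmbedParams.local_ingest_backend from the task config instead of
    # the previous hardcoded "vllm"; fall back to "vllm" when unset.
    return task_config.get("local_ingest_backend") or "vllm"
```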
… torch import

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +124 to +129
```python
else:

    def _embed(texts):
        prefixed = [prefix + t for t in texts] if prefix else texts
        vecs = embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")
        return vecs.tolist()
```
Contributor


**P1**: `prefix=""` kwarg crashes HF non-VL ingest backend

When local_ingest_backend="hf", create_local_embedder returns a LlamaNemotronEmbed1BV2HFEmbedder. The _embed closure then calls embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix=""), but LlamaNemotronEmbed1BV2HFEmbedder.embed() is declared as def embed(self, texts, *, batch_size=64) with no prefix parameter. Python raises TypeError: embed() got an unexpected keyword argument 'prefix' immediately, crashing the ingest pipeline for any user who sets local_ingest_backend="hf".

The LlamaNemotronEmbed1BV2HFEmbedder.embed() already prepends the "passage: " prefix internally, so the texts should be passed unprefixed and without prefix=:

```python
def _embed(texts):
    if ingest_backend == "hf":
        # HF embedder manages its own prefix internally
        vecs = embedder_instance.embed(texts, batch_size=local_batch_size)
    else:
        prefixed = [prefix + t for t in texts] if prefix else texts
        vecs = embedder_instance.embed(prefixed, batch_size=local_batch_size, prefix="")
    return vecs.tolist()
```

tomer-levin-nv and others added 3 commits April 27, 2026 14:17
Co-authored-by: polazilber <pzilberman@nvidia.com>
Co-authored-by: Yuval Shkolar <yshkolar@nvidia.com>
LlamaNemotronEmbed1BV2HFEmbedder.embed() has no prefix parameter and
manages its own prefix internally; passing prefix="" raises TypeError
for any user with local_ingest_backend="hf".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>