TL/NCCL: add user buffer registration via memmap#1260

Open
wfaderhold21 wants to merge 1 commit into openucx:master from wfaderhold21:topic/nccl-ub

Conversation

@wfaderhold21 (Collaborator) commented Jan 30, 2026

What

This PR implements NCCL User Buffer Registration (UBR) in TL/NCCL. It replaces the three stub functions (mem_map, mem_unmap, memh_pack) that previously returned UCC_ERR_NOT_SUPPORTED with an implementation that registers GPU memory buffers with NCCL communicators.

Why

NCCL UBR (available since NCCL 2.19.0) lets applications pre-register GPU buffers so NCCL can bypass its internal memory registration on each collective call. Without this, NCCL must re-pin and register the buffer every time it's used in a collective, which adds latency. This is particularly impactful for AI/ML workloads where the same buffers (e.g., model gradients) are used in AllReduce repeatedly across training steps. This PR wires UCC's existing mem_map API into NCCL's ncclCommRegister/ncclCommDeregister, enabling that optimization.

How

  • Lazy registration: mem_map only allocates a ucc_tl_nccl_memh_data_t tracking struct — no NCCL call
    yet. The actual ncclCommRegister call is deferred to the first collective operation that uses the
    buffer (ucc_tl_nccl_lazy_register_memh in tl_nccl_coll.c). This avoids registering with comms that
    never use the buffer.
  • Per-comm tracking: Each memh_data maintains dynamic arrays of (ncclComm_t, handle) pairs — one entry
    per NCCL communicator that has registered the buffer — growing as needed.
  • Thread safety: Spinlocks guard both the per-handle arrays and the context-level handle list. The
    NCCL calls (ncclCommRegister/ncclCommDeregister) are made outside spinlocks to avoid holding them
    during potentially slow CUDA operations.
  • Clean teardown: team_destroy calls ucc_tl_nccl_team_deregister_memhs to null out all slots for the
    dying comm before calling ncclCommDestroy, preventing use-after-free in a later mem_unmap. The reverse
    order (mem_unmap before team_destroy) is also handled cleanly.
  • Export/import support: memh_pack packs the buffer address and length for cross-process sharing.
    mem_map in import mode unpacks the local address from the blob.
  • Config toggle: A new UCC_TL_NCCL_ENABLE_UBR config option (try/yes/no) controls the feature, with
    version gating against NCCL 2.19.0 at compile time.

@wfaderhold21 (Collaborator, Author):

Tested allgather performance on GAIA with 8 nodes, 8 PPN:
(attached chart: allgather_ubr_bandwidth_clean)

greptile-apps Bot (Contributor) commented Jan 30, 2026

Greptile Summary

This PR implements NCCL User Buffer Registration (UBR) via lazy ncclCommRegister calls, replacing three stub functions with a full lifecycle: mem_map allocates a tracking struct, registration is deferred to first collective use, and mem_unmap/team_destroy handle cleanup in either order. Most previously-flagged concerns have been addressed: the CUDA event leak on error paths is fixed (ucc_tl_nccl_free_task is used correctly), the IMPORT mode now unpacks address/length from pack_buffer instead of relying on garbage fields, the memh_pack null-deref is fixed, and ncclCommDeregister is correctly called outside spinlocks.

Two remaining quality issues:

  • num_comms is not decremented when team_deregister_memhs nulls a slot; NULL holes accumulate and cause unnecessary array growth over repeated team create/destroy cycles.
  • The context cleanup function does not check or warn when memh_list is non-empty, silently leaking any m_data objects the caller forgot to unmap.

Confidence Score: 4/5

Safe to merge after addressing the num_comms accounting issue; remaining finding is P2 style/quality

The major correctness concerns raised in prior review rounds (CUDA event leak, IMPORT address reading garbage, null deref in memh_pack, spinlock held during NCCL calls) have all been addressed. Two P2 quality issues remain: unbounded array growth from un-decremented num_comms, and silent resource leak in context cleanup. Neither causes incorrect behavior in typical usage but could surface in long-running workloads or misuse scenarios.

src/components/tl/nccl/tl_nccl_team.c (num_comms accounting) and src/components/tl/nccl/tl_nccl_context.c (cleanup warning)

Important Files Changed

  • src/components/tl/nccl/tl_nccl_context.c: Implements mem_map/mem_unmap/memh_pack for UBR. IMPORT mode now correctly reads from pack_buffer; NULL-deref on memh_pack fixed. Region-end overflow check still missing.
  • src/components/tl/nccl/tl_nccl_coll.c: Lazy registration logic added; uses ucc_tl_nccl_free_task (handles CUDA event) on error paths correctly. NULL holes accumulate in registered_comms/nccl_handles arrays when slots are nulled without decrementing num_comms.
  • src/components/tl/nccl/tl_nccl_team.c: team_deregister_memhs correctly collects comm/handle pairs under locks then deregisters outside locks; good fix for the spinlock-held-during-NCCL-call issue.
  • src/components/tl/nccl/tl_nccl.h: Adds ucc_tl_nccl_memh_data_t struct and context fields for UBR; NCCL version macro added correctly.
  • src/components/tl/nccl/tl_nccl.c: Adds ENABLE_UBR config option with TERNARY type; formatting cleanup only.
  • src/core/ucc_context.c: Adds local_address/local_len fields to the import memh before calling TL mem_map, so TLs have access to the local VA without corrupting the remote address.
  • src/core/ucc_context.h: Adds local_address/local_len fields to ucc_mem_map_memh_t; local_memh allocated with calloc so new fields are zero-initialized on export.
  • test/gtest/core/test_mem_map.cc: New tests cover reversed unmap order, import address preservation, and repeated cycles; memory management looks correct.


@greptile-apps Bot left a comment

3 files reviewed, 1 comment

Comment thread src/components/tl/nccl/tl_nccl_coll.c Outdated
Comment on lines +229 to +244
if (!new_comms) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for registered comms array");
/* Buffer is registered but we can't track it - this is a problem */
return UCC_ERR_NO_MEMORY;
}
m_data->registered_comms = new_comms;

new_handles = (void **)ucc_realloc(
m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
if (!new_handles) {
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for NCCL handles array");
return UCC_ERR_NO_MEMORY;
resource leak: if ucc_realloc for new_comms succeeds but the second ucc_realloc for new_handles fails (line 240-244), the newly allocated new_comms is assigned to m_data->registered_comms (line 236) but the function returns UCC_ERR_NO_MEMORY without deregistering the NCCL buffer. This leaks the NCCL registration created at line 208.

Suggested change

    if (!new_comms) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for registered comms array");
        /* Buffer is registered but we can't track it - this is a problem */
        return UCC_ERR_NO_MEMORY;
    }
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        /* Failed to grow handles array - must deregister to avoid leak */
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
    }

@greptile-apps Bot left a comment

2 files reviewed, 3 comments

Comment thread src/components/tl/nccl/tl_nccl_coll.c Outdated
Comment on lines +344 to +347
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.src.info.buffer,
coll_args->args.src.info.count *
ucc_dt_size(coll_args->args.src.info.datatype),

buffer size calculation is incorrect for variable-size collectives (ALLTOALLV, ALLGATHERV, GATHERV, SCATTERV). For these collectives, coll_args->args.src uses the info_v union member with counts* array, not info.count scalar. Accessing .info.count reads the counts pointer as an integer, resulting in incorrect buffer size. Need to check coll_args->args.coll_type and use appropriate buffer size calculation for each collective type.

Suggested change

    // TODO: Need to calculate correct buffer size based on coll_type
    // For variable-size collectives (ALLTOALLV, ALLGATHERV, etc),
    // must sum counts array instead of using single count value
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),

Comment thread src/components/tl/nccl/tl_nccl_coll.c Outdated
Comment on lines +368 to +371
status = ucc_tl_nccl_lazy_register_memh(
coll_args->args.dst.info.buffer,
coll_args->args.dst.info.count *
ucc_dt_size(coll_args->args.dst.info.datatype),

same buffer size calculation bug for dst buffer - incorrect for variable-size collectives that use info_v.counts* array instead of info.count scalar

Comment on lines +167 to +168
buf_start = (uintptr_t)buffer;
buf_end = buf_start + length;

check for potential integer overflow when buf_start + length exceeds UINTPTR_MAX. Should validate that length <= UINTPTR_MAX - buf_start before addition to prevent wraparound

Suggested change

    buf_start = (uintptr_t)buffer;
    if (length > UINTPTR_MAX - buf_start) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    buf_end = buf_start + length;

@greptile-apps Bot left a comment

4 files reviewed, 3 comments

Comment on lines +433 to +438

/* If UBR is not available/disabled or no TL data, return empty pack */
if (!ctx->ubr_available || !tl_h || !tl_h->tl_data) {
tl_h->packed_size = 0;
*pack_buffer = NULL;
return UCC_OK;

Null deref on pack

In ucc_tl_nccl_memh_pack, the early-return condition includes !tl_h, but the block still does tl_h->packed_size = 0. If tl_h == NULL this will crash. This can trigger if the iface is ever invoked with a NULL tl_h (the function explicitly allows it in the condition), so the guard should avoid dereferencing tl_h in that path.

Comment on lines +290 to +339
ucc_status_t ucc_tl_nccl_mem_map(
const ucc_base_context_t *context, ucc_mem_map_mode_t mode,
ucc_mem_map_memh_t *memh, ucc_mem_map_tl_t *tl_h)
{
return UCC_ERR_NOT_SUPPORTED;
ucc_tl_nccl_context_t *ctx = ucc_derived_of(context, ucc_tl_nccl_context_t);
ucc_tl_nccl_memh_data_t *m_data;

/* Check if UBR is available and enabled */
if (!ctx->ubr_available) {
tl_debug(
ctx->super.super.lib, "NCCL UBR not available, skipping mem_map");
return UCC_ERR_NOT_SUPPORTED;
}

/* Support both EXPORT and IMPORT modes for global memh */
if (mode != UCC_MEM_MAP_MODE_EXPORT && mode != UCC_MEM_MAP_MODE_IMPORT) {
tl_debug(ctx->super.super.lib,
"NCCL UBR: unsupported mode %d", mode);
return UCC_ERR_NOT_SUPPORTED;
}

/* Reject zero-length buffers */
if (memh->len == 0) {
tl_debug(ctx->super.super.lib,
"NCCL UBR: zero-length buffer, skipping mem_map");
return UCC_ERR_NOT_SUPPORTED;
}

/* Allocate TL-specific memory handle data */
m_data = (ucc_tl_nccl_memh_data_t *)ucc_calloc(
1, sizeof(ucc_tl_nccl_memh_data_t), "tl_nccl_memh_data");
if (!m_data) {
tl_error(
ctx->super.super.lib, "failed to allocate TL memory handle data");
return UCC_ERR_NO_MEMORY;
}

/* Store buffer information - registration will happen lazily on first use */
m_data->address = memh->address;
m_data->length = memh->len;
m_data->registered_comms = NULL;
m_data->nccl_handles = NULL;
m_data->num_comms = 0;
m_data->max_comms = 0;

/* Set TL handle data */
tl_h->tl_data = m_data;
strncpy(tl_h->tl_name, "nccl", UCC_MEM_MAP_TL_NAME_LEN - 1);
tl_h->tl_name[UCC_MEM_MAP_TL_NAME_LEN - 1] = '\0';


IMPORT memh uses garbage

ucc_tl_nccl_mem_map claims to support UCC_MEM_MAP_MODE_IMPORT, but it ignores the packed payload in memh->pack_buffer and instead stores memh->address/memh->len into m_data. In the core import path, those fields are explicitly treated as “likely garbage” for imported handles (src/core/ucc_context.c:1212-1214), so the imported NCCL TL handle will contain invalid address/length and later lazy registration may attempt ncclCommRegister on bogus pointers.

Comment thread src/components/tl/nccl/tl_nccl_coll.c Outdated
Comment on lines +166 to +175
/* Verify that the entire buffer is within the registered memory region */
buf_start = (uintptr_t)buffer;
buf_end = buf_start + length;
region_start = (uintptr_t)m_data->address;
region_end = region_start + m_data->length;

if (length > (UINTPTR_MAX - buf_start)) {
tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
return UCC_ERR_INVALID_PARAM;
}

Overflow check too late

In ucc_tl_nccl_lazy_register_memh, buf_end is computed as buf_start + length before checking length <= UINTPTR_MAX - buf_start. If the addition overflows, buf_end is already wrapped/UB, and the subsequent bounds check can misbehave. The overflow guard needs to run before computing buf_end.

@greptile-apps Bot left a comment

4 files reviewed, 1 comment

Comment on lines +327 to +334
/* For NCCL UBR, we only store metadata (address/length) for lazy registration.
* When ncclCommRegister is called later, it stores this metadata locally.
* The NCCL communicator handles IPC handle exchange internally during collective
* operations (via point-to-point proxy calls), so we don't need special IMPORT
* handling. We can use memh->address/memh->len directly in both EXPORT and IMPORT
* modes - the address should be valid in the current process context. */
m_data->address = memh->address;
m_data->length = memh->len;

For IMPORT mode, verify whether using memh->address/memh->len directly is correct. The core code treats these as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), suggesting TLs should extract data from pack_buffer. If NCCL's IPC model truly makes local addresses valid for IMPORT mode (different from typical TL behavior), add documentation explaining this design choice. Otherwise, implement extraction from pack_buffer similar to the core's approach in ucc_context.c:1224-1238.

Comment on lines +357 to +364
if (ucc_unlikely(status != UCC_OK)) {
tl_error(
team->context->lib,
"NCCL UBR: lazy_register failed with status %d",
status);
ucc_mpool_put(task);
return status;
}

CUDA event leak on UBR registration failure

Both UBR error paths (src at line 357 and dst at line 395) call ucc_mpool_put(task) directly after a failed ucc_tl_nccl_lazy_register_memh. By this point in the function, task->completed may already hold a valid CUDA event (created at line 309 when sync_type == UCC_TL_NCCL_COMPLETION_SYNC_TYPE_EVENT). Putting the task back to the pool without destroying the event leaks it.

The fix is to destroy the event before returning, mirroring the cleanup in ucc_tl_nccl_free_task:

Suggested change

    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        if (task->completed) {
            ucc_ec_destroy_event(task->completed, UCC_EE_CUDA_STREAM);
        }
        ucc_mpool_put(task);
        return status;
    }

The same fix is required for the dst-buffer error path at line 394–397.

Comment on lines +174 to +175
region_start = (uintptr_t)m_data->address;
region_end = region_start + m_data->length;

Missing overflow check for region_end

An overflow guard was added for buf_start + length (line 166), but the symmetric computation region_start + m_data->length has no such check. If m_data->address + m_data->length wraps around UINTPTR_MAX, region_end becomes a small value. The subsequent bounds test buf_end > region_end would then be spuriously false for any legitimately-contained buffer, meaning a future out-of-bounds access would be missed. The same pattern used for the buffer should be applied here:

    if (m_data->length > (UINTPTR_MAX - (uintptr_t)m_data->address)) {
        tl_error(UCC_TL_TEAM_LIB(team),
                 "NCCL UBR: registered region size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    region_start = (uintptr_t)m_data->address;
    region_end   = region_start + m_data->length;

Comment on lines +81 to +95
ucc_spin_lock(&ctx->memh_lock);
ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
ucc_spin_lock(&m_data->lock);
for (i = 0; i < m_data->num_comms; i++) {
if (m_data->registered_comms[i] == team->nccl_comm) {
ncclCommDeregister(team->nccl_comm, m_data->nccl_handles[i]);
/* Null the slot so mem_unmap skips it */
m_data->registered_comms[i] = NULL;
m_data->nccl_handles[i] = NULL;
break;
}
}
ucc_spin_unlock(&m_data->lock);
}
ucc_spin_unlock(&ctx->memh_lock);

P1 Spinlock held during blocking NCCL deregistration calls

ctx->memh_lock is a spinlock, yet ncclCommDeregister (a potentially slow synchronous NCCL/CUDA call) is invoked on line 86 while both ctx->memh_lock and m_data->lock are still held. Any concurrent call to mem_map, mem_unmap, or lazy_register_memh that needs either of these locks will busy-spin for the entire duration of all deregistrations. In a multi-GPU setup with many registered buffers this can waste significant CPU cycles and may even deadlock CUDA if the same CPU thread is required for NCCL progress.

The fix is to collect the (comm, handle) pairs while holding the locks, then release both locks before calling ncclCommDeregister:

/* pending[] is a caller-provided array of (comm, handle) pairs; npending = 0 */
/* collect pairs under lock */
ucc_spin_lock(&ctx->memh_lock);
ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
    ucc_spin_lock(&m_data->lock);
    for (i = 0; i < m_data->num_comms; i++) {
        if (m_data->registered_comms[i] == team->nccl_comm) {
            /* save for deregistration outside locks */
            pending[npending].comm   = m_data->registered_comms[i];
            pending[npending].handle = m_data->nccl_handles[i];
            m_data->registered_comms[i] = NULL;
            m_data->nccl_handles[i]     = NULL;
            npending++;
            break;
        }
    }
    ucc_spin_unlock(&m_data->lock);
}
ucc_spin_unlock(&ctx->memh_lock);

/* deregister outside of any spinlock */
for (i = 0; i < npending; i++) {
    ncclCommDeregister(pending[i].comm, pending[i].handle);
}

Comment thread src/components/tl/nccl/tl_nccl_coll.c Outdated
Comment on lines +253 to +266
m_data->registered_comms = new_comms;

new_handles = (void **)ucc_realloc(
m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
if (!new_handles) {
ucc_spin_unlock(&m_data->lock);
tl_error(
UCC_TL_TEAM_LIB(team),
"failed to allocate memory for NCCL handles array");
ncclCommDeregister(team->nccl_comm, nccl_handle);
return UCC_ERR_NO_MEMORY;
}
m_data->nccl_handles = new_handles;
m_data->max_comms = new_max;

P1 m_data->max_comms not updated when nccl_handles realloc fails, leaving arrays inconsistent

When new_comms realloc succeeds (line 241-252), m_data->registered_comms is immediately updated to the larger buffer (line 253). If the subsequent new_handles realloc fails (line 255-264), the function returns UCC_ERR_NO_MEMORY without updating m_data->max_comms to new_max.

This leaves the struct in an inconsistent state:

  • m_data->registered_comms has capacity new_max
  • m_data->nccl_handles still has capacity old_max
  • m_data->max_comms still reports old_max

On the next successful registration, both arrays are grown from old_max to 2 * old_max again (which equals new_max again), so correctness is self-recovering. However, the window of inconsistency – between the failed call and the next successful call – means any code that assumes registered_comms and nccl_handles have the same allocated capacity as max_comms will be wrong, making the logic harder to reason about.

The safest fix is to update m_data->max_comms = new_max immediately after the successful new_comms realloc, so the state always reflects the actual registered_comms capacity, even if nccl_handles is still at the old size.

@wfaderhold21 (Collaborator, Author):

/build

@janjust janjust added the ai-review start ai-review label Apr 9, 2026
@wfaderhold21 (Collaborator, Author):

/build


2 participants