Skip to content

coll/datatype: Add a collective checker function to UCC API that can … #1241

Closed
QiaoK wants to merge 7 commits intoopenucx:masterfrom
QiaoK:ucc_dt_check
Closed

coll/datatype: Add a collective checker function to UCC API that can … #1241
QiaoK wants to merge 7 commits intoopenucx:masterfrom
QiaoK:ucc_dt_check

Conversation

@QiaoK
Copy link
Copy Markdown
Collaborator

@QiaoK QiaoK commented Dec 16, 2025

Introduce the following function.
ucc_status_t ucc_dt_check_rooted_collective(ucc_team_h team,
const ucc_coll_args_t *args,
int rank)

This function syncs all processes to check if the arguments for UCC can be processed. There are 2 checks using Allreduce Sum: uniformity of datatype, uniformity of memory type. The function returns UCC_OK if and only if when these three checks are passed. Otherwise it will return the corresponding errors.

OMPI will call this function before posting Gather or Scatter, so all process will either choose to progress with UCC route or fallback to other route. This allows OMPI to prevent some process going into UCC while others fallback, causing a hang.

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Dec 16, 2025

as discussed, let's rework this

@janjust janjust marked this pull request as draft December 16, 2025 18:32
@QiaoK QiaoK force-pushed the ucc_dt_check branch 7 times, most recently from ff0fc5b to 0ec4438 Compare December 17, 2025 18:33
@QiaoK QiaoK force-pushed the ucc_dt_check branch 15 times, most recently from 4b5930d to 2e824f4 Compare January 8, 2026 03:43
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_global_opts.c
@QiaoK QiaoK force-pushed the ucc_dt_check branch 5 times, most recently from d96ba2b to 8f8a470 Compare January 13, 2026 15:03
@janjust janjust requested a review from ikryukov March 18, 2026 16:29
Copy link
Copy Markdown
Collaborator

@wfaderhold21 wfaderhold21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor comments / questions.

Comment thread src/components/tl/ucp/tl_ucp_service_coll.c
Comment thread src/core/ucc_coll.c Outdated
Comment thread src/core/ucc_global_opts.h
Comment thread src/core/ucc_coll.c
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 1, 2026

/build

@janjust janjust changed the title coll/datatype: Add a collective checker function to UCC API that can … coll/datatype: Add a collective checker function to UCC API that can … Apr 6, 2026
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 6, 2026

/build

@QiaoK
Copy link
Copy Markdown
Collaborator Author

QiaoK commented Apr 6, 2026

/build

QiaoK added 7 commits April 7, 2026 15:26
This commit introduces the infrastructure for transparent datatype
validation in rooted collectives (GATHER, GATHERV, SCATTER, SCATTERV)
without implementing the actual validation logic.

Key components:

1. Configuration:
   - Added UCC_CHECK_ASYMMETRIC_DT config option (disabled by default)
   - Allows runtime control of validation feature

2. Data structures (ucc_schedule.h):
   - ucc_dt_check_state_t: State for validation process
     * check_req: Service allgather request handle
     * gathered_values: Buffer for collecting datatype info
     * local_values: Local datatype and memory type
     * subset: Team subset for service allgather
     * validated: Flag indicating if validation passed
     * actual_task: Pointer to actual collective task
   - Added dt_check field to ucc_coll_task_t

3. Schedule architecture (ucc_coll.c):
   - ucc_dt_check_create_schedule(): Creates validation schedule
   - Schedule with two wrapper tasks:
     1. Allgather wrapper: Will validate datatypes (bare-bones)
     2. Actual wrapper: Will conditionally execute the real collective
   - Dependency: allgather wrapper → actual wrapper
   - If validation fails, actual task is not posted

4. Stub functions (return UCC_OK, no actual validation yet):
   - ucc_dt_check_allgather_post()
   - ucc_dt_check_allgather_progress()
   - ucc_dt_check_allgather_finalize()
   - ucc_dt_check_actual_wrapper_post()
   - ucc_dt_check_actual_wrapper_progress()
   - ucc_dt_check_actual_wrapper_finalize()

This bare-bones structure allows the full validation implementation
to be added in a subsequent commit, making the changes easier to
review and test incrementally.
Implement the full datatype validation logic for rooted collectives
(GATHER, GATHERV, SCATTER, SCATTERV) using service allgather.

Implementation details:

1. Validation setup (ucc_dt_check_start_validation):
   - Checks if validation is needed (UCC_CHECK_ASYMMETRIC_DT enabled)
   - Extracts local datatype and memory type
   - Checks if local datatype is contiguous
   - Allocates gathered_values buffer for all ranks
   - Sets up subset for full team

2. Allgather execution (ucc_dt_check_allgather_post/progress):
   - Uses ucc_service_allgather to collect [datatype, mem_type] from all ranks
   - Each rank contributes 2 int64_t values
   - Progresses service collective through normal UCC progress queue
   - Properly finalizes service collective when complete

3. Validation (ucc_dt_validate_results):
   - Checks if any rank has non-contiguous datatype → UCC_ERR_NOT_SUPPORTED
   - Checks if all ranks have matching datatype and memory type
   - Returns UCC_ERR_INVALID_PARAM if mismatch detected

4. Actual task wrapper:
   - Only posts actual collective if validation succeeded
   - If validation failed, sets schedule status to error
   - Allows graceful fallback to OMPI layer

5. Schedule finalization:
   - Properly cleans up validation state (gathered_values, dt_check)
   - Finalize actual collective task
   - Free schedule structure

Key design decisions:
- Validation always completes (sets validated flag)
- Schedule continues to actual wrapper regardless
- Actual wrapper checks validated flag before posting real task
- Graceful error propagation to user via schedule status
Switch the transparent datatype validation mechanism from service
allgather to service allreduce for better scalability.

Key improvements:
- Constant message size: O(1) instead of O(N) where N = number of ranks
- Uses MIN allreduce with min/max trick for mismatch detection:
  * Send [dt, -dt, mem_type, -mem_type]
  * After MIN reduction: if min(dt) == -min(-dt), all ranks agree
  * Same principle for memory type validation
- No heap allocation: uses stack arrays instead of gathered_values buffer
- Same validation guarantees as allgather-based approach

Technical changes:
- ucc_dt_check_state_t: Changed from gathered_values[2*N] to
  reduced_values[4] and local_values[4]
- Renamed functions: allgather_post/progress/finalize -> allreduce_*
- Updated comments to reflect allreduce-based validation
- Removed manual progress call in allreduce_progress - service
  collective progresses itself through normal UCC progress queue

The validation still ensures all ranks have matching contiguous
datatypes before executing rooted collectives (GATHER, GATHERV,
SCATTER, SCATTERV).
Optimize the transparent datatype validation mechanism to use int16
instead of int64 for the allreduce operation, reducing message size
from 32 bytes to 8 bytes (75% reduction).

At the OMPI layer, only UCC predefined datatypes are supported for
collectives. Generic (user-defined) datatypes are not used, which
allows us to make assumptions about the value range:

- Predefined datatype maximum value: 136 (UCC_DT_FLOAT128_COMPLEX = 17 << 3)
- Memory type maximum value: 5 (UCC_MEMORY_TYPE_LAST)
- Both values fit comfortably in int16 range (-32768 to 32767)

The optimization changes:
- int64_t arrays → int16_t arrays in ucc_dt_check_state_t
- UCC_DT_INT64 → UCC_DT_INT16 in service allreduce call
- Update all casts from (int64_t) to (int16_t)
- Message size: 32 bytes → 8 bytes per validation operation
- Added UCC_DT_IS_PREDEFINED() check to explicitly reject generic
  datatypes and prevent int16 overflow from user arguments

Benefits:
- 75% bandwidth reduction for validation messages
- Same O(1) message size (doesn't scale with number of ranks)
- Same correctness guarantees (min/max trick still works)
- No additional latency
- Safe against overflow from generic datatype pointer values

Note: This optimization is valid only for predefined datatypes.
Generic datatypes use full 64-bit pointer values and would differ
across processes anyway, causing validation to correctly reject them.
Address review feedback on datatype validation implementation:

1. Fix v-type collective datatype access (lines 83-88)
   - GATHERV root: use dst.info_v instead of dst.info for IN_PLACE
   - SCATTERV root: use src.info_v instead of src.info for IN_PLACE
   - Correctly access variable-count buffer metadata

2. Optimize to in-place allreduce (lines 90-91)
   - Combine reduced_values[4] and local_values[4] into values[4]
   - Use in-place allreduce (same buffer for send/recv)
   - Reduces memory footprint of dt_check_state by 8 bytes
   - Update comments to reflect in-place operation
Move all datatype validation logic from ucc_coll.c to ucc_service_coll.c
to avoid increasing ucc_coll_task_t size.

Key changes:

1. Remove dt_check field from ucc_coll_task_t
   - Saves 8 bytes per task (pointer eliminated)
   - dt_check only exists when validation is needed

2. Create ucc_dt_check_schedule_t extending ucc_schedule_t
   - Embeds ucc_dt_check_state_t directly in schedule
   - Only allocated when validation is required
   - No overhead for tasks without validation

3. Add ucc_service_dt_check() API in ucc_service_coll.h
   - Single entry point for datatype validation
   - Creates validation schedule if needed
   - Returns original task if validation not needed
   - Encapsulates all validation complexity

4. Move all validation functions to ucc_service_coll.c (~480 lines)
   - ucc_dt_check_allreduce_post/progress/finalize
   - ucc_dt_check_actual_wrapper_post/progress/finalize
   - ucc_dt_check_schedule_finalize
   - ucc_service_dt_check (main entry point)
   - Helper macros to access dt_check from schedule

5. Update ucc_coll.c to use new API
   - Replace ucc_dt_check_start_validation + ucc_dt_check_create_schedule
   - Single call to ucc_service_dt_check()
   - Remove ~500 lines of validation code

Benefits:
- Reduced memory footprint: no dt_check pointer in every task
- Better code organization: validation logic in service_coll module
- Cleaner API: single function instead of two-step process
- No performance impact: same validation mechanism

Addresses PR review comment openucx#4.
- Set UCC_COLL_ARGS_FLAG_IN_PLACE when sbuf == rbuf in ucc_tl_ucp_service_allreduce
  to properly support in-place service allreduce (e.g. dt_check validation)
- When service allreduce fails in dt_check, set allreduce_wrapper->status = UCC_OK
  so the completion chain proceeds and prevents dt_check hang on service allreduce failure
- Add status_out parameter to ucc_service_dt_check to propagate actual error status
  instead of always reporting UCC_ERR_NO_MEMORY
- Remove unreachable "predefined but non-contiguous" else branch and unused is_contig
- Remove redundant actual_task from ucc_dt_check_schedule_t (keep in dt_check state only)
- Change root variable type from int to ucc_rank_t in ucc_service_dt_check
- Propagate UCC_COLL_TASK_FLAG_EXECUTOR from actual_task to dt_check schedule
- Call ucc_coll_task_destruct before mpool_put in error path
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 8, 2026

/build

2 similar comments
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 8, 2026

/build

@ikryukov
Copy link
Copy Markdown
Collaborator

ikryukov commented Apr 8, 2026

/build

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 8, 2026

a new identical PR is merged

@janjust janjust closed this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants