coll/datatype: Add a collective checker function to UCC API that can … #1241
Closed
QiaoK wants to merge 7 commits intoopenucx:masterfrom
Closed
coll/datatype: Add a collective checker function to UCC API that can … #1241QiaoK wants to merge 7 commits intoopenucx:masterfrom
QiaoK wants to merge 7 commits intoopenucx:masterfrom
Conversation
Collaborator
|
as discussed, let's rework this |
ff0fc5b to
0ec4438
Compare
4b5930d to
2e824f4
Compare
d96ba2b to
8f8a470
Compare
wfaderhold21
approved these changes
Mar 24, 2026
Collaborator
wfaderhold21
left a comment
There was a problem hiding this comment.
LGTM. Minor comments / questions.
0425b95 to
647684c
Compare
Collaborator
|
/build |
janjust
approved these changes
Apr 1, 2026
Collaborator
|
/build |
Collaborator
Author
|
/build |
This commit introduces the infrastructure for transparent datatype
validation in rooted collectives (GATHER, GATHERV, SCATTER, SCATTERV)
without implementing the actual validation logic.
Key components:
1. Configuration:
- Added UCC_CHECK_ASYMMETRIC_DT config option (disabled by default)
- Allows runtime control of validation feature
2. Data structures (ucc_schedule.h):
- ucc_dt_check_state_t: State for validation process
* check_req: Service allgather request handle
* gathered_values: Buffer for collecting datatype info
* local_values: Local datatype and memory type
* subset: Team subset for service allgather
* validated: Flag indicating if validation passed
* actual_task: Pointer to actual collective task
- Added dt_check field to ucc_coll_task_t
3. Schedule architecture (ucc_coll.c):
- ucc_dt_check_create_schedule(): Creates validation schedule
- Schedule with two wrapper tasks:
1. Allgather wrapper: Will validate datatypes (bare-bones)
2. Actual wrapper: Will conditionally execute the real collective
- Dependency: allgather wrapper → actual wrapper
- If validation fails, actual task is not posted
4. Stub functions (return UCC_OK, no actual validation yet):
- ucc_dt_check_allgather_post()
- ucc_dt_check_allgather_progress()
- ucc_dt_check_allgather_finalize()
- ucc_dt_check_actual_wrapper_post()
- ucc_dt_check_actual_wrapper_progress()
- ucc_dt_check_actual_wrapper_finalize()
This bare-bones structure allows the full validation implementation
to be added in a subsequent commit, making the changes easier to
review and test incrementally.
Implement the full datatype validation logic for rooted collectives (GATHER, GATHERV, SCATTER, SCATTERV) using service allgather. Implementation details: 1. Validation setup (ucc_dt_check_start_validation): - Checks if validation is needed (UCC_CHECK_ASYMMETRIC_DT enabled) - Extracts local datatype and memory type - Checks if local datatype is contiguous - Allocates gathered_values buffer for all ranks - Sets up subset for full team 2. Allgather execution (ucc_dt_check_allgather_post/progress): - Uses ucc_service_allgather to collect [datatype, mem_type] from all ranks - Each rank contributes 2 int64_t values - Progresses service collective through normal UCC progress queue - Properly finalizes service collective when complete 3. Validation (ucc_dt_validate_results): - Checks if any rank has non-contiguous datatype → UCC_ERR_NOT_SUPPORTED - Checks if all ranks have matching datatype and memory type - Returns UCC_ERR_INVALID_PARAM if mismatch detected 4. Actual task wrapper: - Only posts actual collective if validation succeeded - If validation failed, sets schedule status to error - Allows graceful fallback to OMPI layer 5. Schedule finalization: - Properly cleans up validation state (gathered_values, dt_check) - Finalize actual collective task - Free schedule structure Key design decisions: - Validation always completes (sets validated flag) - Schedule continues to actual wrapper regardless - Actual wrapper checks validated flag before posting real task - Graceful error propagation to user via schedule status
Switch the transparent datatype validation mechanism from service allgather to service allreduce for better scalability. Key improvements: - Constant message size: O(1) instead of O(N) where N = number of ranks - Uses MIN allreduce with min/max trick for mismatch detection: * Send [dt, -dt, mem_type, -mem_type] * After MIN reduction: if min(dt) == -min(-dt), all ranks agree * Same principle for memory type validation - No heap allocation: uses stack arrays instead of gathered_values buffer - Same validation guarantees as allgather-based approach Technical changes: - ucc_dt_check_state_t: Changed from gathered_values[2*N] to reduced_values[4] and local_values[4] - Renamed functions: allgather_post/progress/finalize -> allreduce_* - Updated comments to reflect allreduce-based validation - Removed manual progress call in allreduce_progress - service collective progresses itself through normal UCC progress queue The validation still ensures all ranks have matching contiguous datatypes before executing rooted collectives (GATHER, GATHERV, SCATTER, SCATTERV).
Optimize the transparent datatype validation mechanism to use int16 instead of int64 for the allreduce operation, reducing message size from 32 bytes to 8 bytes (75% reduction). At the OMPI layer, only UCC predefined datatypes are supported for collectives. Generic (user-defined) datatypes are not used, which allows us to make assumptions about the value range: - Predefined datatype maximum value: 136 (UCC_DT_FLOAT128_COMPLEX = 17 << 3) - Memory type maximum value: 5 (UCC_MEMORY_TYPE_LAST) - Both values fit comfortably in int16 range (-32768 to 32767) The optimization changes: - int64_t arrays → int16_t arrays in ucc_dt_check_state_t - UCC_DT_INT64 → UCC_DT_INT16 in service allreduce call - Update all casts from (int64_t) to (int16_t) - Message size: 32 bytes → 8 bytes per validation operation - Added UCC_DT_IS_PREDEFINED() check to explicitly reject generic datatypes and prevent int16 overflow from user arguments Benefits: - 75% bandwidth reduction for validation messages - Same O(1) message size (doesn't scale with number of ranks) - Same correctness guarantees (min/max trick still works) - No additional latency - Safe against overflow from generic datatype pointer values Note: This optimization is valid only for predefined datatypes. Generic datatypes use full 64-bit pointer values and would differ across processes anyway, causing validation to correctly reject them.
Address review feedback on datatype validation implementation: 1. Fix v-type collective datatype access (lines 83-88) - GATHERV root: use dst.info_v instead of dst.info for IN_PLACE - SCATTERV root: use src.info_v instead of src.info for IN_PLACE - Correctly access variable-count buffer metadata 2. Optimize to in-place allreduce (lines 90-91) - Combine reduced_values[4] and local_values[4] into values[4] - Use in-place allreduce (same buffer for send/recv) - Reduces memory footprint of dt_check_state by 8 bytes - Update comments to reflect in-place operation
Move all datatype validation logic from ucc_coll.c to ucc_service_coll.c to avoid increasing ucc_coll_task_t size. Key changes: 1. Remove dt_check field from ucc_coll_task_t - Saves 8 bytes per task (pointer eliminated) - dt_check only exists when validation is needed 2. Create ucc_dt_check_schedule_t extending ucc_schedule_t - Embeds ucc_dt_check_state_t directly in schedule - Only allocated when validation is required - No overhead for tasks without validation 3. Add ucc_service_dt_check() API in ucc_service_coll.h - Single entry point for datatype validation - Creates validation schedule if needed - Returns original task if validation not needed - Encapsulates all validation complexity 4. Move all validation functions to ucc_service_coll.c (~480 lines) - ucc_dt_check_allreduce_post/progress/finalize - ucc_dt_check_actual_wrapper_post/progress/finalize - ucc_dt_check_schedule_finalize - ucc_service_dt_check (main entry point) - Helper macros to access dt_check from schedule 5. Update ucc_coll.c to use new API - Replace ucc_dt_check_start_validation + ucc_dt_check_create_schedule - Single call to ucc_service_dt_check() - Remove ~500 lines of validation code Benefits: - Reduced memory footprint: no dt_check pointer in every task - Better code organization: validation logic in service_coll module - Cleaner API: single function instead of two-step process - No performance impact: same validation mechanism Addresses PR review comment openucx#4.
- Set UCC_COLL_ARGS_FLAG_IN_PLACE when sbuf == rbuf in ucc_tl_ucp_service_allreduce to properly support in-place service allreduce (e.g. dt_check validation) - When service allreduce fails in dt_check, set allreduce_wrapper->status = UCC_OK so the completion chain proceeds and prevents dt_check hang on service allreduce failure - Add status_out parameter to ucc_service_dt_check to propagate actual error status instead of always reporting UCC_ERR_NO_MEMORY - Remove unreachable "predefined but non-contiguous" else branch and unused is_contig - Remove redundant actual_task from ucc_dt_check_schedule_t (keep in dt_check state only) - Change root variable type from int to ucc_rank_t in ucc_service_dt_check - Propagate UCC_COLL_TASK_FLAG_EXECUTOR from actual_task to dt_check schedule - Call ucc_coll_task_destruct before mpool_put in error path
Collaborator
|
/build |
2 similar comments
Collaborator
|
/build |
Collaborator
|
/build |
Collaborator
|
a new identical PR is merged |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Introduce the following function.
ucc_status_t ucc_dt_check_rooted_collective(ucc_team_h team,
const ucc_coll_args_t *args,
int rank)
This function syncs all processes to check if the arguments for UCC can be processed. There are 2 checks using Allreduce Sum: uniformity of datatype, uniformity of memory type. The function returns UCC_OK if and only if when these three checks are passed. Otherwise it will return the corresponding errors.
OMPI will call this function before posting Gather or Scatter, so all process will either choose to progress with UCC route or fallback to other route. This allows OMPI to prevent some process going into UCC while others fallback, causing a hang.