
Forced CPU-GPU synchronization at the end of every training step resulting in underutilization of GPU #21487

@mojtababahrami

Description


I just realized that the feature implemented in PR #20825 causes a CPU-GPU synchronization at the end of each training step, which negatively affects GPU utilization. In particular, the torch.distributed.broadcast(sigterm_tensor, src=0) call in the _broadcast_sigterm_tensor method forces the CPU to wait for the GPU. This was not the case before the feature was added.
You can see the stall in the PyTorch profiler trace below:

[Screenshot: PyTorch profiler trace showing the per-step synchronization stall]

Originally posted by @mojtababahrami in #20825 (comment)
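
For context, here is a minimal sketch of the pattern I believe produces the stall. It is illustrative only, not Lightning's actual implementation; the function name and setup are assumptions. The point is that reading the result of the broadcast on the host blocks the CPU until all queued GPU work has finished:

```python
# Minimal sketch (assumed names, not Lightning's code): broadcasting a CUDA
# tensor and then reading its value on the host every step forces the CPU to
# wait for all previously queued GPU work before the next step can be launched.
import torch
import torch.distributed as dist

def check_sigterm(local_sigterm: bool, device: torch.device) -> bool:
    # One-element tensor on the GPU so the collective can run, e.g. with an
    # NCCL process group (assumes dist.init_process_group was already called).
    sigterm_tensor = torch.tensor([int(local_sigterm)], device=device)
    dist.broadcast(sigterm_tensor, src=0)
    # .item() copies device to host and blocks until the whole training step
    # plus the broadcast have finished on the GPU, which is the kind of sync
    # point visible in the profiler trace above.
    return bool(sigterm_tensor.item())
```

Whether the stall comes from the broadcast launch itself or from the host-side read of the result would need profiling to confirm; the sketch only shows where a per-step host read introduces a synchronization point.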
