I just realized that the feature implemented in PR #20825 introduces a CPU-GPU synchronization at the end of each training step, which hurts GPU utilization. Specifically, the `torch.distributed.broadcast(sigterm_tensor, src=0)` call in the `_broadcast_sigterm_tensor` method forces the CPU to wait for the GPU. This was not the case before the feature was added.
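To illustrate where the stall comes from, here is a minimal sketch (not Lightning's actual code) of the pattern: a flag tensor placed on the GPU so NCCL can broadcast it, then read back on the host. The function name, tensor creation, and the final `.item()` read are assumptions for illustration; the point is that the host-side read of the broadcast result blocks the CPU until the GPU catches up.

```python
import torch
import torch.distributed as dist


def broadcast_sigterm_flag(received_sigterm: bool, device: torch.device) -> bool:
    # Hypothetical sketch: the flag lives on the GPU so the NCCL backend can broadcast it.
    sigterm_tensor = torch.tensor(
        [1.0 if received_sigterm else 0.0], device=device
    )
    # The broadcast is enqueued on the GPU stream...
    dist.broadcast(sigterm_tensor, src=0)
    # ...but reading the result on the host blocks the CPU until the GPU has
    # finished the broadcast and all previously queued kernels. Doing this at
    # the end of every training step stalls the CPU and lowers GPU utilization.
    return bool(sigterm_tensor.item())
```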
You can see this in the PyTorch profiler trace below:
Originally posted by @mojtababahrami in #20825 (comment)