I just realized that the feature implemented in PR #20825 introduces a CPU-GPU synchronization at the end of each training step, which hurts GPU utilization. Specifically, the `torch.distributed.broadcast(sigterm_tensor, src=0)` call in the `_broadcast_sigterm_tensor` method forces the CPU to wait for the GPU. This was not the case before the feature was added.
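To illustrate where the stall comes from, here is a minimal sketch (not Lightning's actual code) of the pattern: a flag tensor placed on the GPU so NCCL can broadcast it, then read back on the host. The function name, tensor creation, and the final `.item()` read are assumptions for illustration; the point is that the host-side read of the broadcast result blocks the CPU until the GPU catches up.

```python
import torch
import torch.distributed as dist


def broadcast_sigterm_flag(received_sigterm: bool, device: torch.device) -> bool:
    # Hypothetical sketch: the flag lives on the GPU so the NCCL backend can broadcast it.
    sigterm_tensor = torch.tensor(
        [1.0 if received_sigterm else 0.0], device=device
    )
    # The broadcast is enqueued on the GPU stream...
    dist.broadcast(sigterm_tensor, src=0)
    # ...but reading the result on the host blocks the CPU until the GPU has
    # finished the broadcast and all previously queued kernels. Doing this at
    # the end of every training step stalls the CPU and lowers GPU utilization.
    return bool(sigterm_tensor.item())
```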
You can see this in the PyTorch profiler trace below:
Originally posted by @mojtababahrami in #20825 (comment)