NVLink seems to prevent PyTorch training loop from starting

mnazemi · June 15, 2022, 6:39pm

Thank you for your detailed explanation. I should have mentioned that I use AMD CPUs and I don’t see this issue when using two GPUs.

The problem was indeed resolved by setting iommu=soft.
Given your explanation, I found a similar solution that relied on setting pci=noats, but it didn’t work for me.
It appears that there is currently no way to keep the hardware IOMMU enabled to get better performance.
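For anyone who wants to confirm which of these boot parameters are actually active, here is a minimal sketch that parses a kernel command line. It assumes a Linux host where the command line is exposed at /proc/cmdline; the function names `parse_cmdline` and `iommu_settings` are made up for illustration and are not part of any library.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def iommu_settings(cmdline):
    """Return only the parameters discussed in this thread (plus amd_iommu)."""
    wanted = ("iommu", "amd_iommu", "pci")
    params = parse_cmdline(cmdline)
    return {name: value for name, value in params.items() if name in wanted}


if __name__ == "__main__":
    # Assumption: Linux exposes the boot command line here.
    path = Path("/proc/cmdline")
    if path.exists():
        print(iommu_settings(path.read_text()))
```

If the dictionary does not contain `iommu` with the value `soft` after a reboot, the bootloader change most likely did not take effect.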

Ironically, I don’t get a noticeable speedup when booting with iommu=soft and running with NCCL_DEBUG=INFO NCCL_ALGO=Ring NCCL_NET_GDR_LEVEL=4. Here is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1
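To make the matrix above easier to read, here is a rough sketch that parses the same text and lists which GPU pairs are connected by NVLink. The parser and its function names (`parse_topo`, `nvlink_pairs`, `link_counts`) are my own and assume the column layout shown above; other driver versions may print extra columns.

```python
from collections import Counter

# Output of `nvidia-smi topo -m` as posted above.
TOPO = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1
"""


def parse_topo(text):
    """Return {src_gpu: {dst_gpu: link_type}} from the topology matrix."""
    lines = [line for line in text.splitlines() if line.strip()]
    gpus = [name for name in lines[0].split() if name.startswith("GPU")]
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields[0].startswith("GPU"):
            continue  # skip legend or other non-GPU rows
        matrix[fields[0]] = dict(zip(gpus, fields[1:1 + len(gpus)]))
    return matrix


def nvlink_pairs(matrix):
    """List unordered GPU pairs whose link type starts with 'NV'."""
    pairs = set()
    for src, row in matrix.items():
        for dst, link in row.items():
            if link.startswith("NV"):
                pairs.add(tuple(sorted((src, dst))))
    return sorted(pairs)


def link_counts(matrix):
    """Count link types over unordered GPU pairs, ignoring the diagonal."""
    counts = Counter()
    for src, row in matrix.items():
        for dst, link in row.items():
            if src < dst:
                counts[link] += 1
    return counts


if __name__ == "__main__":
    topo = parse_topo(TOPO)
    print(nvlink_pairs(topo))
    print(dict(link_counts(topo)))
```

On this machine only four pairs (GPU0/1, GPU2/3, GPU4/5, GPU6/7) are bridged by NVLink; every other pair goes over PCIe within a NUMA node (NODE) or across NUMA nodes (SYS). One possible reading, which I have not verified, is that a ring spanning all eight GPUs has to cross those slower links anyway, which might explain why the speedup is not noticeable.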
