NVLink seems to prevent PyTorch training loop from starting

Thank you for your detailed explanation. I should have mentioned that I use AMD CPUs and that I don’t see this issue when using only two GPUs.

The problem was indeed resolved by setting iommu=soft.
Based on your explanation, I also found a similar workaround that sets pci=noats, but it didn’t work for me.
So it appears that there is currently no way to keep the hardware IOMMU enabled to get better performance.
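
In case it helps anyone double-check that the boot parameter actually took effect after a reboot, here is a minimal sketch that inspects a kernel command line string for the two workarounds mentioned above. The function names are my own (hypothetical helpers, not part of any library), and it assumes simple space-separated `name=value` tokens without quoted values.

```python
def parse_kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so values containing '=' stay intact.
        name, _, value = token.partition("=")
        params[name] = value  # if a name repeats, the last occurrence is kept
    return params


def iommu_workarounds(cmdline: str) -> dict:
    """Report whether iommu=soft and pci=noats appear on the command line."""
    params = parse_kernel_params(cmdline)
    return {
        "iommu=soft": params.get("iommu") == "soft",
        # pci= takes a comma-separated list of options, e.g. pci=noats,nomsi
        "pci=noats": "noats" in params.get("pci", "").split(","),
    }
```

On a Linux machine this can be fed the running kernel's parameters with `iommu_workarounds(open("/proc/cmdline").read())`.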

Ironically, I don’t get a noticeable speedup with iommu=soft combined with NCCL_DEBUG=INFO, NCCL_ALGO=Ring, and NCCL_NET_GDR_LEVEL=4. Here is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1
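
To make the matrix above easier to read, here is a rough sketch that parses `nvidia-smi topo -m` text and lists which GPU pairs are connected by NVLink. `parse_topo` and `nvlink_pairs` are hypothetical helper names of my own, and the parser assumes the plain layout shown above (GPU columns first, no terminal escape codes in the header), so it may need adjusting for other driver versions.

```python
def parse_topo(text: str) -> dict:
    """Return {(row_gpu, col_gpu): link_type} from `nvidia-smi topo -m` text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header tokens such as "CPU", "Affinity", "NUMA" are ignored.
    header = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    links = {}
    for ln in lines[1:]:
        cells = ln.split()
        if not cells[0].startswith("GPU"):
            continue  # skip legend lines or non-GPU rows
        row = cells[0]
        # The first len(header) cells after the row label are the GPU columns.
        for col, link in zip(header, cells[1:1 + len(header)]):
            if col != row:  # the diagonal is marked "X"
                links[(row, col)] = link
    return links


def nvlink_pairs(links: dict) -> list:
    """List each NVLink-connected pair once (link types like NV1, NV4, ...)."""
    return sorted(
        (a, b) for (a, b), link in links.items()
        if link.startswith("NV") and a < b
    )
```

The live output can be passed in with `parse_topo(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)`. For the matrix above, the NVLink pairs are GPU0-GPU1, GPU2-GPU3, GPU4-GPU5, and GPU6-GPU7; every other pair is NODE or SYS, meaning traffic goes over PCIe and, for SYS, across the link between the two NUMA nodes. That might be part of why I don’t see a speedup with all eight GPUs, but I haven’t confirmed it.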