Thank you for your detailed explanation. I should have mentioned that I use AMD CPUs and I don’t see this issue when using two GPUs.
The problem was indeed resolved by setting iommu=soft.
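In case it helps anyone reproducing this, here is a minimal sketch for confirming which IOMMU-related parameters the running kernel was actually booted with. It assumes a Linux host where /proc/cmdline is readable; the helper names `kernel_params` and `iommu_mode` are my own for illustration and not part of any tool.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def iommu_mode(cmdline: str) -> str:
    """Return the value of iommu=..., or '' if the parameter is absent."""
    return kernel_params(cmdline).get("iommu", "")


def pci_options(cmdline: str) -> list:
    """Return the comma-separated options given to pci=..., e.g. ['noats']."""
    value = kernel_params(cmdline).get("pci", "")
    return [opt for opt in value.split(",") if opt]


if __name__ == "__main__":
    path = Path("/proc/cmdline")  # only present on Linux
    if path.exists():
        cmdline = path.read_text()
        print("iommu:", iommu_mode(cmdline) or "(not set)")
        print("pci:", pci_options(cmdline) or "(not set)")
```

If a parameter is repeated on the command line, this sketch keeps only the last occurrence, which may not match how the kernel itself treats duplicates.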
Following your explanation, I also found a similar workaround that sets pci=noats, but it didn’t work for me.
It appears that there is currently no way to keep the hardware IOMMU enabled to get better performance.
Ironically, I don’t get a noticeable speedup with iommu=soft and NCCL_DEBUG=INFO NCCL_ALGO=Ring NCCL_NET_GDR_LEVEL=4 set.
Here is the output of nvidia-smi topo -m:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0    X       NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU1    NV4     X       NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU2    NODE    NODE    X       NV4     SYS     SYS     SYS     SYS     0-63,128-191    0
GPU3    NODE    NODE    NV4     X       SYS     SYS     SYS     SYS     0-63,128-191    0
GPU4    SYS     SYS     SYS     SYS     X       NV4     NODE    NODE    64-127,192-254  1
GPU5    SYS     SYS     SYS     SYS     NV4     X       NODE    NODE    64-127,192-254  1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE    X       NV4     64-127,192-254  1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4     X       64-127,192-254  1
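
To make the matrix above easier to read at a glance, here is a rough sketch that counts GPU pairs by link type. It assumes the GPU rows are pasted in without the header line and without any NIC columns; `TOPO`, `parse_topo`, and `count_links` are names I made up for this example. As I understand the nvidia-smi legend, NV4 is a bonded set of 4 NVLinks, NODE crosses PCIe host bridges within one NUMA node, and SYS crosses the interconnect between NUMA nodes.

```python
from collections import Counter
from itertools import combinations

# GPU rows copied from the `nvidia-smi topo -m` output above (header omitted).
TOPO = """\
GPU0 X NV4 NODE NODE SYS SYS SYS SYS 0-63,128-191 0
GPU1 NV4 X NODE NODE SYS SYS SYS SYS 0-63,128-191 0
GPU2 NODE NODE X NV4 SYS SYS SYS SYS 0-63,128-191 0
GPU3 NODE NODE NV4 X SYS SYS SYS SYS 0-63,128-191 0
GPU4 SYS SYS SYS SYS X NV4 NODE NODE 64-127,192-254 1
GPU5 SYS SYS SYS SYS NV4 X NODE NODE 64-127,192-254 1
GPU6 SYS SYS SYS SYS NODE NODE X NV4 64-127,192-254 1
GPU7 SYS SYS SYS SYS NODE NODE NV4 X 64-127,192-254 1
"""


def parse_topo(text):
    """Return (links, numa): links[a][b] is the link type, numa[a] the NUMA node."""
    rows = [line.split() for line in text.strip().splitlines()]
    names = [row[0] for row in rows]
    n = len(names)
    links = {row[0]: dict(zip(names, row[1:1 + n])) for row in rows}
    numa = {row[0]: int(row[-1]) for row in rows}  # last column is NUMA affinity
    return links, numa


def count_links(links):
    """Count each unordered GPU pair once, grouped by link type."""
    return Counter(links[a][b] for a, b in combinations(links, 2))


links, numa = parse_topo(TOPO)
counts = count_links(links)
for kind in sorted(counts):
    print(kind, counts[kind])
```

On this topology only four pairs (GPU0-1, GPU2-3, GPU4-5, GPU6-7) are connected by NVLink; every other pair goes over PCIe, and the sixteen SYS pairs additionally cross between the two NUMA nodes.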