Introduction
As of April 2024, Crusoe Cloud has deployed a new virtualization platform designed to significantly increase the stability and performance of GPU-enabled workloads. As part of this update, customers who created an SXM VM from an image published before the rollout must install a new NCCL topology file and update their NCCL configuration for distributed training to work as expected.
Prerequisites
This update applies only to the a100-80gb-sxm-ib.8x and h100-80gb-sxm-ib.8x Instance Types.
Solution
If you are using a ubuntu22.04-nvidia-sxm-docker or ubuntu20.04-nvidia-sxm-docker image dated earlier than 2024-05-29, you are likely using an outdated NCCL topology file and, consequently, distributed NCCL traffic will not run on the new hypervisor.
1. If you are on an A100-SXM node with one of the images above, replace the contents of your NCCL topology file (wherever you have set it) with the following:
<system version="1">
<cpu numaid="0" affinity="00000000,0000ffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
<pci busid="ffff:ff:01.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="16.0 GT/s PCIe" link_width="16"> <!-- Switch 0 begins -->
<pci busid="0002:00:01.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 0 -->
<pci busid="0002:00:02.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 1 -->
<pci busid="0002:00:09.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 0 -->
<pci busid="0002:00:0a.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 1 -->
</pci> <!-- Switch 0 ends -->
<pci busid="ffff:ff:02.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="16.0 GT/s PCIe" link_width="16"> <!-- Switch 1 begins -->
<pci busid="0002:00:03.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 2 -->
<pci busid="0002:00:04.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 3 -->
<pci busid="0002:00:0b.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 2 -->
<pci busid="0002:00:0c.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 3 -->
</pci> <!-- Switch 1 ends -->
</cpu>
<cpu numaid="1" affinity="ffffffff,ffff0000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
<pci busid="ffff:ff:03.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="16.0 GT/s PCIe" link_width="16"> <!-- Switch 2 begins -->
<pci busid="0003:00:01.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 4 -->
<pci busid="0003:00:02.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 5 -->
<pci busid="0003:00:09.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 4 -->
<pci busid="0003:00:0a.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 5 -->
</pci> <!-- Switch 2 ends -->
<pci busid="ffff:ff:04.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="16.0 GT/s PCIe" link_width="16"> <!-- Switch 3 begins -->
<pci busid="0003:00:03.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 6 -->
<pci busid="0003:00:04.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- GPU 7 -->
<pci busid="0003:00:0b.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 6 -->
<pci busid="0003:00:0c.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="16.0 GT/s PCIe" link_width="16"/> <!-- NIC 7 -->
</pci> <!-- Switch 3 ends -->
</cpu>
</system>
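As a quick sketch of how to apply this (assuming your topology file lives at /etc/nccl_topo.xml, a hypothetical path; substitute wherever NCCL_TOPO_FILE points on your VM):

sudo cp /etc/nccl_topo.xml /etc/nccl_topo.xml.bak   # back up the current file first
sudo nano /etc/nccl_topo.xml                        # paste the A100 XML above, then save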
2. If you are on an H100-SXM node, update the NCCL topology file to the following:
<system version="1">
<cpu numaid="0" affinity="0000,00000000,00000000,00ffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
<pci busid="ffff:ff:01.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="32.0 GT/s PCIe" link_width="16"> <!-- Switch 0 begins -->
<pci busid="0002:00:01.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 0 -->
<pci busid="0002:00:02.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 1 -->
<pci busid="0002:00:09.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 0 -->
<pci busid="0002:00:0a.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 1 -->
</pci> <!-- Switch 0 ends -->
<pci busid="ffff:ff:02.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="32.0 GT/s PCIe" link_width="16"> <!-- Switch 1 begins -->
<pci busid="0002:00:03.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 2 -->
<pci busid="0002:00:04.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 3 -->
<pci busid="0002:00:0b.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 2 -->
<pci busid="0002:00:0c.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 3 -->
</pci> <!-- Switch 1 ends -->
</cpu>
<cpu numaid="1" affinity="ffff,ffffffff,ffffffff,ff000000,00000000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
<pci busid="ffff:ff:03.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="32.0 GT/s PCIe" link_width="16"> <!-- Switch 2 begins -->
<pci busid="0003:00:01.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 4 -->
<pci busid="0003:00:02.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 5 -->
<pci busid="0003:00:09.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 4 -->
<pci busid="0003:00:0a.0" class="0x020000" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 5 -->
</pci> <!-- Switch 2 ends -->
<pci busid="ffff:ff:04.0" class="0x060400" vendor="0xffff" device="0xffff" subsystem_vendor="0xffff" subsystem_device="0xffff" link_speed="32.0 GT/s PCIe" link_width="16"> <!-- Switch 3 begins -->
<pci busid="0003:00:03.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 6 -->
<pci busid="0003:00:04.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- GPU 7 -->
<pci busid="0003:00:0b.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 6 -->
<pci busid="0003:00:0c.0" class="0x020000" vendor="0x15b3" device="0x101e" subsystem_vendor="0x15b3" subsystem_device="0x0073" link_speed="32.0 GT/s PCIe" link_width="16"/> <!-- NIC 7 -->
</pci> <!-- Switch 3 ends -->
</cpu>
</system>
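To confirm that NCCL is picking up the updated file, you can run any NCCL workload with debug logging enabled. A sketch, assuming the nccl-tests binaries are available on your VM (any NCCL job works the same way); NCCL reports the topology file it loads when NCCL_DEBUG=INFO is set:

# Look for a log line referencing NCCL_TOPO_FILE and your path
NCCL_DEBUG=INFO NCCL_TOPO_FILE=/etc/nccl_topo.xml ./all_reduce_perf -b 8 -e 128M -f 2 -g 8 2>&1 | grep -i NCCL_TOPO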
3. Disable the crusoe_nccl_topo service if your NCCL configuration is stored at a custom path.
- The default path for the NCCL configuration is /etc/nccl.conf.
- The crusoe_nccl_topo service automatically sets the NCCL_TOPO_FILE path in /etc/nccl.conf.
- If you performed the changes above on a custom topology file, disable the crusoe_nccl_topo service by running:
systemctl disable crusoe_nccl_topo.service
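For example (a sketch; the --now flag also stops the running service so the change takes effect immediately, and sudo may be required depending on your user):

sudo systemctl disable --now crusoe_nccl_topo.service
systemctl status crusoe_nccl_topo.service   # verify it is inactive and disabled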
4. NCCL Config Changes
In addition to updating the topology file with the XML above, you will also need to add the following NCCL settings to your environment. We recommend adding them to your NCCL configuration file (/etc/nccl.conf) for simplicity. If you are working with containers, these values will need to be set inside your containers as well.
NCCL_IB_MERGE_VFS=0
NCCL_IB_HCA=^mlx5_0:1
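For reference, a minimal /etc/nccl.conf could look like the following (the NCCL_TOPO_FILE path is illustrative; use the path from step 3). NCCL_IB_MERGE_VFS=0 stops NCCL from merging InfiniBand virtual functions, and the ^ prefix in NCCL_IB_HCA excludes the listed device and port (here mlx5_0, port 1) from the HCAs NCCL uses:

# /etc/nccl.conf (example only; adjust NCCL_TOPO_FILE to your actual path)
NCCL_TOPO_FILE=/etc/nccl_topo.xml
NCCL_IB_MERGE_VFS=0
NCCL_IB_HCA=^mlx5_0:1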
5. Container Changes
If you run your workloads in containers, set the same values in your container image, for example in your Dockerfile:
FROM ...
ENV NCCL_TOPO_FILE=/path/to/nccl_topo.xml
ENV NCCL_IB_MERGE_VFS=0
ENV NCCL_IB_HCA=^mlx5_0:1
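Alternatively, a sketch of setting the same values at run time without rebuilding the image (your-training-image is a placeholder, and this assumes the topology file is present in the container at the path shown):

docker run --gpus all \
  -e NCCL_TOPO_FILE=/path/to/nccl_topo.xml \
  -e NCCL_IB_MERGE_VFS=0 \
  -e NCCL_IB_HCA='^mlx5_0:1' \
  your-training-image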
If you have any questions regarding these changes and how they impact your workloads, please reach out to Support.