Introduction
NVIDIA’s Multi-Instance GPU (MIG) technology enables partitioning a single GPU into multiple smaller, isolated instances. This is especially beneficial in environments with multiple workloads, allowing efficient resource sharing and maximizing GPU utilization. Each MIG partition operates as an independent GPU, ensuring that workloads running on different instances do not interfere with one another.
In this guide, you will learn how to enable MIG on H100/A100 GPUs, partition them into multiple instances, and run Docker containers using these partitions. By the end of this guide, you'll be able to effectively leverage MIG technology for isolating workloads on H100/A100 GPUs.
Benefits of MIG
- Resource Sharing: MIG allows multiple users or workloads to share a single physical GPU, improving resource utilization and cost-efficiency.
- Complete Isolation: Each MIG partition is fully isolated, with dedicated resources including memory, processors, and cache. This ensures that workloads running on one instance won't impact those on another.
- Guaranteed Resources: MIG enables setting specific resource allocations for each partition, providing predictable performance for workloads, which is particularly important in multi-tenant environments.
Prerequisites
Before starting this process, ensure that you meet the following prerequisites:
- Compatible GPU: MIG is supported on select data center GPUs from the NVIDIA Ampere generation onwards (compute capability ≥ 8.0), such as the A100 and H100.
- NVIDIA Driver: Version 525 or higher should be installed.
- NVIDIA Container Toolkit: Required for running Docker containers with GPU support (a quick way to verify the driver and toolkit versions is shown after this list).
- Access to an Instance: SSH into a VM or instance with MIG-compatible hardware.
- Basic Knowledge: Familiarity with Docker commands and GPU management.
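If you want to confirm the driver version and that the NVIDIA Container Toolkit is present before proceeding, one quick check (assuming the nvidia-ctk CLI that ships with recent NVIDIA Container Toolkit releases is installed) is:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-ctk --version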
Step-by-Step Instructions
Step 1: Check GPU Status
First, verify that your GPUs are recognized by the system. Run the following command:
nvidia-smi
This command will display the status of your GPUs, including usage, memory, and other relevant information.
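If you prefer a compact view, nvidia-smi's query mode can print just the fields of interest, for example the index, name, memory, and current MIG mode of each GPU:
nvidia-smi --query-gpu=index,name,memory.total,mig.mode.current --format=csv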
Step 2: Enable MIG Mode on All GPUs
To enable MIG mode on your GPUs, run the following command. This will enable MIG on each GPU from 0 to 7:
for i in {0..7}; do sudo nvidia-smi -i $i -mig 1; done
After running this command, you should notice a change in the nvidia-smi output: the MIG mode column now shows Enabled for each GPU. If a GPU is in use, the mode change may remain pending until the GPU is reset or the system is rebooted.
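To confirm the new mode on every GPU at a glance, you can query just the MIG mode field; each line should report Enabled:
nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader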
Step 3: List Available MIG Profiles
Next, list the available MIG profiles, which represent predefined configurations for GPU partitions:
nvidia-smi mig -lgip
This command will display all available GPU instance profiles.
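The exact profile IDs and sizes depend on the GPU model and driver version; on an A100 80GB, for example, the list typically includes 19 (1g.10gb), 14 (2g.20gb), 9 (3g.40gb), 5 (4g.40gb), and 0 (7g.80gb), but treat the output of the command above as authoritative. You can also list the compute instance profiles available within each GPU instance profile:
nvidia-smi mig -lcip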
Step 4: Partition the GPUs into MIG Instances
To create GPU instances from profiles 9 and 19 on every GPU in MIG mode (the -C flag also creates the corresponding compute instances), run:
sudo nvidia-smi mig -cgi 9,19 -C
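If you only want to partition a specific GPU rather than all of them, add the -i flag to target it; for example, to apply the same two profiles to GPU 0 only:
sudo nvidia-smi mig -cgi 9,19 -C -i 0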
Step 5: Verify the MIG Configuration
After partitioning the GPUs, verify the configuration:
sudo nvidia-smi mig -lgi
Check that the output matches your expectations, including the number of partitions and the profile sizes.
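Because the -C flag also created a compute instance inside each GPU instance, you can list those separately as well:
sudo nvidia-smi mig -lci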
Step 6: Display the UUIDs of MIG Partitions
Each MIG instance has a unique identifier (UUID). Display the UUIDs:
nvidia-smi -L
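The MIG UUIDs appear in parentheses in the output. If you only need the UUIDs themselves (for example, to paste into the Docker command in the next step), the same grep used by the automation script later in this guide extracts them:
nvidia-smi -L | grep -oP 'MIG-[^\s)]*'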
Step 7: Create a Docker Container with MIG Partitions
Create a Docker container that uses specific MIG partitions. Replace the placeholder UUIDs with your own, and note the nested quoting around the device list, which Docker requires when the value contains commas:
docker run -d \
  --gpus '"device=<MIG-UUID-1>,<MIG-UUID-2>"' \
  nvidia/cuda:11.0.3-base-ubuntu20.04 \
  tail -f /dev/null
For example:
docker run -d \
  --gpus '"device=MIG-c9eac27d-60a0-5b42-b697-a1785b1acc17,MIG-8f8d5e1e-1a7c-5946-adf6-18e8cf3d5e9b,MIG-c7f033bb-4ad0-589f-9890-2ce057a7d0b9,MIG-99da62cf-9429-5b35-93c2-dda01d26a89b,MIG-5f6151da-6b55-5012-a9d3-e7665070988a,MIG-45425fa9-1dc2-585c-980f-73beee6f28ad,MIG-97e09940-221b-5249-b551-901592b57646,MIG-54e7b30e-9bfa-53fb-ad01-8d955820ac89,MIG-d25150d4-d329-5e31-8359-9e2a783ff796,MIG-317457c5-c4a5-51a7-b115-4d539cee38eb,MIG-37976542-affe-57c1-8eac-d305e08b76e4,MIG-e74fc7b9-f823-5644-9e76-c392f2e8a9e5,MIG-0ba25904-d04c-5bf5-9a1e-1d52148b9608,MIG-b29d0ef5-85f9-5379-95ae-995642e0c1ec,MIG-cff88ce6-003d-528b-a01a-55c3a8a9149a,MIG-2c9143a3-1e53-5fa7-80a9-647eed29c380"' \
  nvidia/cuda:11.0.3-base-ubuntu20.04 \
  tail -f /dev/null
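Alternatively, you can pass the MIG devices through the NVIDIA runtime with the NVIDIA_VISIBLE_DEVICES environment variable, which is the approach the automation script later in this guide uses; replace the placeholder UUIDs with your own:
sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="<MIG-UUID-1>,<MIG-UUID-2>" -d nvidia/cuda:11.0.3-base-ubuntu20.04 tail -f /dev/null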
Step 8: Exec into the Docker Container and Verify GPU Allocation
Enter the container:
sudo docker exec -it <container_id> bash
Inside the container, run:
nvidia-smi
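Only the MIG devices assigned to the container should be visible. To list just their names and UUIDs, you can also run:
nvidia-smi -L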
Common Issues and Resolutions
- Issue: MIG profiles not available after enabling MIG mode.
  Resolution: Ensure your NVIDIA drivers are up to date.
- Issue: Docker container exits immediately.
  Resolution: Use tail -f /dev/null to keep the container running.
- Issue: NVIDIA_VISIBLE_DEVICES is not working.
  Resolution: Verify the UUIDs and ensure correct syntax.
Automating with Terraform
You can automate the process using Terraform. Below is an example Terraform script:
terraform {
  required_providers {
    crusoe = {
      source = "registry.terraform.io/crusoecloud/crusoe"
    }
  }
}

locals {
  ssh_key = file("~/.ssh/id_ed25519.pub")
}

# Fetch IB networks (optional - you can remove if hardcoding the ID)
data "crusoe_ib_networks" "ib_networks" {}

resource "crusoe_ib_partition" "my_partition" {
  name          = "my-ib-partition-mig"
  ib_network_id = "36f543c2-0c3c-4c8d-b717-194b16e43fcd" # Replace with actual IB network ID
}

resource "crusoe_compute_instance" "a100_vm" {
  name     = "a100-vm"
  type     = "a100-80gb-sxm-ib.8x"
  location = "us-east1-a"
  image    = "ubuntu22.04-nvidia-sxm-docker:latest"
  ssh_key  = local.ssh_key

  host_channel_adapters = [{
    ib_partition_id = crusoe_ib_partition.my_partition.id
  }]

  network_interfaces = [{
    subnet = "37adc22e-7b4a-4445-ab7e-4c833fb3fa73" # Replace with the correct subnet ID
    public_ipv4 = {
      type = "static"
    }
  }]
}

resource "null_resource" "configure_mig_and_run_docker" {
  depends_on = [crusoe_compute_instance.a100_vm]

  connection {
    type        = "ssh"
    user        = "ubuntu"
    host        = crusoe_compute_instance.a100_vm.network_interfaces[0].public_ipv4.address
    private_key = file("~/.ssh/id_ed25519")
  }

  provisioner "file" {
    source      = "configure_mig_and_run_docker.sh"
    destination = "/tmp/configure_mig_and_run_docker.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "chmod +x /tmp/configure_mig_and_run_docker.sh",
      "/tmp/configure_mig_and_run_docker.sh"
    ]
  }
}
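Assuming the configuration above is saved as main.tf (a filename chosen here for illustration) next to configure_mig_and_run_docker.sh, and the crusoe provider is configured with valid credentials, the standard Terraform workflow applies:
terraform init
terraform plan
terraform apply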
Bash Script for MIG Configuration (configure_mig_and_run_docker.sh, referenced by the Terraform provisioner above)
#!/bin/bash

# Step 1: Enable MIG mode on all 8 GPUs
echo "Step 1: Enabling MIG mode..."
for i in {0..7}; do
  sudo nvidia-smi -i $i -mig 1
done
sleep 5

# Step 2: Delete existing instances (safe cleanup)
echo "Step 2: Deleting any existing MIG instances..."
for i in {0..7}; do
  sudo nvidia-smi mig -dci -i $i &>/dev/null || true
  sudo nvidia-smi mig -dgi -i $i &>/dev/null || true
done
sleep 5

# Step 3: Create 1 MIG instance per GPU using profile 9 (3g.40gb)
echo "Step 3: Creating 1 MIG instance per GPU..."
for i in {0..7}; do
  echo "Creating MIG instance on GPU $i"
  sudo nvidia-smi mig -cgi 9 -C -i $i
done
sleep 10

# Step 4: List MIG instances
echo "Listing all MIG instances..."
sudo nvidia-smi mig -lgi

# Step 5: Extract all MIG UUIDs
echo "Step 5: Extracting all MIG UUIDs..."
MIG_uuids=($(nvidia-smi -L | grep -oP 'MIG-[^\s)]*'))

# Ensure we have 8 UUIDs
if [ ${#MIG_uuids[@]} -ne 8 ]; then
  echo "Error: Expected 8 MIG UUIDs, found ${#MIG_uuids[@]}"
  exit 1
fi

# Prepare comma-separated list
MIG_UUID_STRING=$(IFS=,; echo "${MIG_uuids[*]}")
echo "All MIG UUIDs: $MIG_UUID_STRING"

# Step 6: Start Docker container using all 8 MIG UUIDs
echo "Step 6: Starting Docker container with all MIG instances..."
sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="$MIG_UUID_STRING" -d nvidia/cuda:11.0.3-base-ubuntu20.04 tail -f /dev/null
if [ $? -eq 0 ]; then
  echo "Docker container started successfully with all 8 MIG instances."
else
  echo "Error: Docker container failed to start."
fi
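After the script finishes, a quick sanity check is to confirm the container is running and that it sees all eight MIG devices; substitute the container ID reported by docker ps:
sudo docker ps
sudo docker exec -it <container_id> nvidia-smi -L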
Following these steps, you can enable NVIDIA MIG, partition your GPUs, and run isolated workloads in Docker containers, either manually or through the automated Terraform deployment above.