Introduction
You have created a Crusoe Managed Kubernetes (CMK) cluster with an AMD nodepool and want to use the AMD GPU Operator to expose the GPUs to Kubernetes.
Prerequisites
- Crusoe Managed Kubernetes (CMK) cluster
- AMD nodepool
- Helm
Step-by-Step Instructions
Step 1: Add cert-manager repo
helm repo add jetstack https://charts.jetstack.io --force-update && helm repo update
Step 2: Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
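Because cert-manager is installed ahead of the operator, it may be worth confirming that its deployments are available before moving on. The following is a minimal sketch, not part of the original procedure: the function name and the 300-second timeout are illustrative choices, and the namespace is the one passed to `helm install` in Step 2.

```shell
#!/usr/bin/env bash
# Sketch: block until every cert-manager Deployment reports Available.
# CERT_MANAGER_NS matches the --namespace used in Step 2; the 300s
# timeout is an arbitrary choice, not a documented requirement.
CERT_MANAGER_NS="cert-manager"

wait_for_cert_manager() {
  kubectl wait deployment --all \
    --namespace "$CERT_MANAGER_NS" \
    --for=condition=Available \
    --timeout=300s
}

# Only talk to a cluster when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
  wait_for_cert_manager || echo "cert-manager is not ready yet" >&2
fi
```

If the wait times out, `kubectl get pods -n cert-manager` usually shows which pod is still starting.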
Step 3: Add the ROCm repo
helm repo add rocm https://rocm.github.io/gpu-operator && helm repo update
Step 4: Install AMD GPU Operator
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --version v1.2.2
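After the install, a quick status check can confirm that the release deployed and that the operator pods are starting. This is a sketch under the assumption that the release name and namespace are exactly those used in Step 4; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: show the Helm release status and the pods in the operator namespace.
# Release name and namespace are taken from the Step 4 command.
OPERATOR_RELEASE="amd-gpu-operator"
OPERATOR_NS="kube-amd-gpu"

show_operator_status() {
  helm status "$OPERATOR_RELEASE" --namespace "$OPERATOR_NS"
  kubectl get pods --namespace "$OPERATOR_NS"
}

# Only run against a cluster when both CLIs are installed.
if command -v helm >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  show_operator_status || echo "operator status check failed" >&2
fi
```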
Step 5: Deploy the AMD metrics exporter configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: exporter-configmap
  namespace: kube-amd-gpu
data:
  config.json: |
    {
      "GPUConfig": {
        "Labels": [
          "GPU_UUID",
          "SERIAL_NUMBER",
          "GPU_ID",
          "POD",
          "NAMESPACE",
          "CONTAINER",
          "JOB_ID",
          "JOB_USER",
          "JOB_PARTITION",
          "CLUSTER_NAME",
          "CARD_SERIES",
          "CARD_MODEL",
          "CARD_VENDOR",
          "DRIVER_VERSION",
          "VBIOS_VERSION",
          "HOSTNAME"
        ]
      }
    }
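Step 5 gives the manifest but not the command to deploy it. One way to do so, sketched here, is to save the manifest to a file and apply it with `kubectl`, then read the stored JSON back to confirm it landed intact. The file name `exporter-configmap.yaml` and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: apply the Step 5 ConfigMap and print the stored config.json.
# CONFIGMAP_FILE is a hypothetical file name holding the manifest above.
CONFIGMAP_FILE="exporter-configmap.yaml"

apply_exporter_configmap() {
  kubectl apply -f "$CONFIGMAP_FILE" &&
    kubectl get configmap exporter-configmap \
      --namespace kube-amd-gpu \
      -o jsonpath='{.data.config\.json}'
}

# Only run when kubectl is installed and the manifest file exists.
if command -v kubectl >/dev/null 2>&1 && [ -f "$CONFIGMAP_FILE" ]; then
  apply_exporter_configmap || echo "failed to apply $CONFIGMAP_FILE" >&2
fi
```

The backslash in the JSONPath expression escapes the dot in the `config.json` key so that it is not read as a path separator.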
Step 6: Create a registry secret that the AMD GPU Operator uses to push and pull driver images
kubectl create secret docker-registry my-docker-secret -n kube-amd-gpu \
  --docker-username "$YOUR_USERNAME" \
  --docker-email "$YOUR_EMAIL" \
  --docker-password "$YOUR_PASSWORD"
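To confirm the secret was created in the right namespace and with the right type, its `type` field can be read back; secrets created with `kubectl create secret docker-registry` have the type `kubernetes.io/dockerconfigjson`. The sketch below assumes the secret name and namespace from the command above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the registry secret exists and has the
# docker-registry type. Name and namespace come from the Step 6 command.
SECRET_NAME="my-docker-secret"
SECRET_NS="kube-amd-gpu"

check_registry_secret() {
  secret_type="$(kubectl get secret "$SECRET_NAME" \
    --namespace "$SECRET_NS" \
    -o jsonpath='{.type}')" || return 1
  [ "$secret_type" = "kubernetes.io/dockerconfigjson" ]
}

# Only run against a cluster when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
  if check_registry_secret; then
    echo "registry secret looks correct"
  else
    echo "registry secret is missing or has an unexpected type" >&2
  fi
fi
```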
Step 7: Deploy the AMD DeviceConfig to kickstart the discovery & driver installation process.
NOTE: Edit the spec.driver.image path to match your Docker Hub username or your custom Docker repository URL.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  # the names for the device plugin, metrics exporter and node labeller pods will be prefixed with this name
  name: gpu-operator
  # it is highly recommended to use the namespace where AMD GPU Operator is running
  namespace: kube-amd-gpu
spec:
  driver:
    # set to true to deploy the out-of-tree driver with the specified ROCm version
    # set to false to directly use the inbox or pre-installed driver on worker nodes
    # NOTE: Must be set to true because CMK does not ship with drivers on worker nodes
    enable: true
    # set to true to add a blacklist for the amdgpu inbox driver kernel module, required for spec.driver.enable=true
    # set to false to remove the blacklist for the amdgpu inbox driver kernel module, required for spec.driver.enable=false
    # a reboot of the worker node is required to apply the updated blacklist
    # NOTE: Not sure if CMK worker nodes actually need a reboot since they don't load the AMD kernel driver by default anyway
    blacklist: true
    # Specify the out-of-tree driver version
    version: "6.4.1"
    # Specify the driver image here
    # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
    # e.g. docker.io/username/amdgpu-driver
    # NOTE: AMD GPU Operator uses this repo as a "cache" to build and upload driver images to.
    # NOTE: You DO NOT need to pre-build the driver image and push it to this repo!
    image: docker.io/your-username/your-amd-gpu-driver-repo
    # Specify the credential for your private registry if it requires a credential for pull/push access
    # you can create the docker-registry type secret by running a command like:
    # kubectl create secret docker-registry mysecret -n kube-amd-gpu --docker-username=xxx --docker-password=xxx
    # Make sure you created the secret within the namespace that the gpu operator controller is running in
    imageRegistrySecret:
      # NOTE: AMD GPU Operator uses this secret to push/pull from the driver image repo
      # NOTE: Must match the name of the secret created in Step 6
      name: my-docker-secret
  devicePlugin:
    # Specify the device plugin image
    # default value is rocm/k8s-device-plugin:latest
    devicePluginImage: rocm/k8s-device-plugin:latest
    # Specify the node labeller image
    # default value is rocm/k8s-device-plugin:labeller-latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    # Specify to enable/disable the node labeller
    # node labeller is required for adding / removing the blacklist config of the amdgpu kernel module
    # please set to true if you want to blacklist the inbox driver and use the out-of-tree driver
    enableNodeLabeller: true
  # Specify the metrics exporter config
  metricsExporter:
    # To enable/disable the metrics exporter, disabled by default
    enable: true
    # kubernetes service type for the metrics exporter, ClusterIP (default) or NodePort
    serviceType: "ClusterIP"
    # internal service port used for in-cluster and node access to pull metrics from the metrics-exporter (default 5000)
    port: 5000
    # Node port for the metrics exporter service, metrics endpoint $node-ip:$nodePort
    nodePort: 32500
    # exporter image
    image: "rocm/device-metrics-exporter:v1.2.1"
    # metrics export config in configmap
    config:
      name: exporter-configmap
  # Specify the node to be managed by this DeviceConfig Custom Resource
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
  testRunner:
    enable: true
    logsLocation:
      # mount path inside the test runner container for logs
      mountPath: "/var/log/amd-test-runner"
      # host path to be mounted into the test runner container for logs
      hostPath: "/var/log/amd-test-runner"
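Once the DeviceConfig is saved to a file, it can be applied and the result checked from the command line. The sketch below is one possible way to do that, not part of the original procedure: the file name `deviceconfig.yaml` and the function names are hypothetical, the node selector is the label used in the manifest, and `amd.com/gpu` is the resource name the AMD device plugin advertises. The GPU count may stay empty until the driver build and load have finished.

```shell
#!/usr/bin/env bash
# Sketch: apply the DeviceConfig, then list GPU nodes and how many
# amd.com/gpu devices each one advertises as allocatable.
# DEVICECONFIG_FILE is a hypothetical file name for the Step 7 manifest.
DEVICECONFIG_FILE="deviceconfig.yaml"
OPERATOR_NS="kube-amd-gpu"

apply_deviceconfig() {
  kubectl apply -f "$DEVICECONFIG_FILE" &&
    kubectl get deviceconfigs --namespace "$OPERATOR_NS"
}

list_gpu_nodes() {
  kubectl get nodes \
    --selector feature.node.kubernetes.io/amd-gpu=true \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.amd\.com/gpu'
}

# Only run when kubectl is installed and the manifest file exists.
if command -v kubectl >/dev/null 2>&1 && [ -f "$DEVICECONFIG_FILE" ]; then
  { apply_deviceconfig && list_gpu_nodes; } ||
    echo "DeviceConfig apply or node check failed" >&2
fi
```

Re-running `list_gpu_nodes` after a few minutes is a simple way to watch for the GPUs to appear on each node.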
Additional Resources