How-To Install AMD GPU Operator on a Crusoe Managed Kubernetes Cluster

Chinmay Baikar

Introduction

This guide assumes you have created a Crusoe Managed Kubernetes (CMK) cluster with an AMD nodepool and want to use the AMD GPU Operator to expose the GPUs to Kubernetes.

Prerequisites

  • Crusoe Managed Kubernetes (CMK) cluster
  • AMD nodepool
  • Helm
  • kubectl, configured to talk to the CMK cluster (used in Step 6)

Step-by-Step Instructions

  1. Step 1: Add cert-manager repo

    helm repo add jetstack https://charts.jetstack.io --force-update && helm repo update
  2. Step 2: Install cert-manager

    helm install cert-manager jetstack/cert-manager \
      --namespace cert-manager \
      --create-namespace \
      --version v1.15.1 \
      --set crds.enabled=true
  3. Step 3: Add ROCm repo

    helm repo add rocm https://rocm.github.io/gpu-operator && helm repo update
  4. Step 4: Install AMD GPU Operator

    helm install amd-gpu-operator rocm/gpu-operator-charts \
      --namespace kube-amd-gpu --create-namespace \
      --version v1.2.2
  5. Step 5: Deploy the AMD metrics exporter ConfigMap

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: exporter-configmap
      namespace: kube-amd-gpu
    data:
      config.json: |
        {
          "GPUConfig": {
            "Labels": [
              "GPU_UUID",
              "SERIAL_NUMBER",
              "GPU_ID",
              "POD",
              "NAMESPACE",
              "CONTAINER",
              "JOB_ID",
              "JOB_USER",
              "JOB_PARTITION",
              "CLUSTER_NAME",
              "CARD_SERIES",
              "CARD_MODEL",
              "CARD_VENDOR",
              "DRIVER_VERSION",
              "VBIOS_VERSION",
              "HOSTNAME"
            ]
          }
        }
  6. Step 6: Create a registry secret that the AMD GPU Operator uses to push and pull driver images. The command below targets Docker Hub; for another registry, also pass --docker-server.

    kubectl create secret docker-registry my-docker-secret -n kube-amd-gpu --docker-username="$YOUR_USERNAME" --docker-email="$YOUR_EMAIL" --docker-password="$YOUR_PASSWORD"
  7. Step 7: Deploy the AMD DeviceConfig to start GPU discovery and driver installation.

    NOTE: Edit spec.driver.image to point at a repository under your Docker Hub username or at your custom Docker repository URL.

    apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata:
      # the names for the device plugin, metrics exporter and node labeler pods will be prefixed with this name
      name: gpu-operator
      # it is highly recommended to use the namespace where AMD GPU Operator is running
      namespace: kube-amd-gpu
    spec:
      driver:
        # set to true to deploy the out-of-tree driver with the specified ROCm version
        # set to false to directly use inbox or pre-installed driver on worker nodes
        # NOTE: Must be set to true because CMK does not ship with drivers on worker nodes
        enable: true
    
        # set to true to add blacklist for the amdgpu inbox driver kernel module, required for spec.driver.enable=true
        # set to false to remove blacklist for the amdgpu inbox driver kernel module, required for spec.driver.enable=false
        # the reboot of worker node is required to apply the updated blacklist
        # NOTE: it is unclear whether CMK worker nodes actually need a reboot, since they do not load the AMD kernel driver by default
        blacklist: true
        
        # Specify the out-of-tree driver version
        version: "6.4.1"
    
        # Specify driver image here
        # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
        # e.g. docker.io/username/amdgpu-driver
        # NOTE: AMD GPU Operator uses this repo as a "cache" to build and upload driver images to.
        # NOTE: You DO NOT need to pre-build the driver image and push it to this repo!
        image: docker.io/your-username/your-amd-gpu-driver-repo
    
        # Specify the credentials for your private registry if it requires them for pull/push access
        # you can create the docker-registry type secret by running a command like:
        # kubectl create secret docker-registry mysecret -n kube-amd-gpu --docker-username=xxx --docker-password=xxx
        # Make sure you create the secret in the namespace where the gpu operator controller is running
        imageRegistrySecret:
          name: my-docker-secret  # NOTE: must match the secret created in Step 6; AMD GPU Operator uses it to push/pull from the driver image repo
    
      devicePlugin:
        # Specify the device plugin image
        # default value is rocm/k8s-device-plugin:latest
        devicePluginImage: rocm/k8s-device-plugin:latest
    
        # Specify the node labeller image
        # default value is rocm/k8s-device-plugin:labeller-latest
        nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    
        # Specify to enable/disable the node labeller
        # node labeller is required for adding / removing blacklist config of amdgpu kernel module
        # please set to true if you want to blacklist the inbox driver and use the out-of-tree driver
        enableNodeLabeller: true
      
      # Specify the metrics exporter config
      metricsExporter:
        # To enable/disable the metrics exporter, disabled by default
        enable: true
    
        # kubernetes service type for metrics exporter, ClusterIP (default) or NodePort
        serviceType: "ClusterIP"
    
        # internal service port used for in-cluster and node access to pull metrics from the metrics-exporter (default 5000)
        port: 5000
    
        # Node port for metrics exporter service, metrics endpoint $node-ip:$nodePort
        # NOTE: only takes effect when serviceType is "NodePort"
        nodePort: 32500
    
        # exporter image
        image: "rocm/device-metrics-exporter:v1.2.1"
    
        # metrics export config in configmap
        config:
          name: exporter-configmap
      
      
      # Specify the node to be managed by this DeviceConfig Custom Resource
      selector:
        feature.node.kubernetes.io/amd-gpu: "true"
    
      testRunner:
        enable: true
        logsLocation:
          mountPath: "/var/log/amd-test-runner" # mount path inside test runner container for logs
          hostPath: "/var/log/amd-test-runner" # host path to be mounted into test runner container for logs
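
The manifests in Step 5 and Step 7 still need to be applied to the cluster. A minimal sketch, assuming they were saved as exporter-configmap.yaml and deviceconfig.yaml (hypothetical file names, not part of this article), applies them in order and then checks that the operator pods are running and that nodes advertise the amd.com/gpu resource. The KUBECTL variable is also an illustrative convenience.

```shell
#!/usr/bin/env bash
# Sketch only: the manifest file names are hypothetical; use the names you saved them under.
# Override KUBECTL to point at a different binary or wrapper.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="kube-amd-gpu"

# Apply the ConfigMap (Step 5) before the DeviceConfig (Step 7) that references it.
apply_manifests() {
  "$KUBECTL" apply -f exporter-configmap.yaml &&
  "$KUBECTL" apply -f deviceconfig.yaml
}

# List the pods in the operator namespace, then show how many
# amd.com/gpu devices each node reports as allocatable.
check_rollout() {
  "$KUBECTL" get pods -n "$NAMESPACE" -o wide
  "$KUBECTL" get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.amd\.com/gpu'
}
```

Source the script, run apply_manifests, then run check_rollout once the driver installation has had time to finish. Until the device plugin is running on a node, its GPUS column is expected to show <none>.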

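To confirm that the GPUs are actually schedulable, one option is a short-lived pod that requests a single amd.com/gpu. The sketch below is an assumption-laden example, not part of the original procedure: the pod name, the rocm/pytorch image, and the amd-smi command are illustrative choices, so substitute any ROCm image you already use.

```shell
#!/usr/bin/env bash
# Sketch only: pod name, container image, and command are assumptions for illustration.
KUBECTL="${KUBECTL:-kubectl}"

# Print a Pod manifest that requests one AMD GPU through the amd.com/gpu resource.
gpu_test_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: amd-smi
      image: docker.io/rocm/pytorch:latest
      command: ["/bin/bash", "-c", "amd-smi list"]
      resources:
        limits:
          amd.com/gpu: 1
EOF
}

# Create the pod, wait for it to complete, then print its logs.
run_gpu_smoke_test() {
  gpu_test_pod_manifest | "$KUBECTL" apply -f - &&
  "$KUBECTL" wait --for=jsonpath='{.status.phase}'=Succeeded pod/amd-gpu-smoke-test --timeout=600s &&
  "$KUBECTL" logs amd-gpu-smoke-test
}
```

Run run_gpu_smoke_test after the DeviceConfig rollout has settled, and delete the pod afterwards with kubectl delete pod amd-gpu-smoke-test. If the pod stays Pending, the node is most likely not yet advertising amd.com/gpu.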
 
