To use NVIDIA GPUs within a Kubernetes environment, NVIDIA provides an open-source GPU Operator (gpu-operator) that makes the GPUs allocatable to pods. Similarly, for nodes enabled with InfiniBand RDMA networking, NVIDIA provides a Network Operator (network-operator) that enables internode InfiniBand communication.
Prerequisites
Access to a Kubernetes cluster, either self-managed or created through the Crusoe Managed Kubernetes (CMK) offering. If you are unfamiliar with self-managed Kubernetes architectures, we recommend using CMK for ease of deployment.
Steps to Install and Validate Operators
1. Add NVIDIA Helm Repositories
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
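To confirm the repository was added and is reachable, you can list the NVIDIA charts it serves (an optional check; both operator charts used below should appear in the output):
helm search repo nvidia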
2. Install GPU Operator
helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.useOpenKernelModules=true
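The GPU Operator deploys its components (driver, container toolkit, device plugin, and validators) as pods in the namespace created above. As a quick check, watch them until they reach Running or Completed:
kubectl get pods -n gpu-operator -w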
3. Install Network Operator
helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace
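You can verify that the Network Operator controller is running in its namespace before moving on:
kubectl get pods -n nvidia-network-operator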
4. Create and apply the following NIC Cluster Policy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 23.10-3.2.6.0-0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: v0.3.5
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.8.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["101e"],
              "linkTypes": ["infiniband"],
              "isRdma": true
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.3.0
      imagePullSecrets: []
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.4
      imagePullSecrets: []
Save the above to nic_cluster_policy.yaml and apply it with kubectl apply -f nic_cluster_policy.yaml.
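Once the policy is applied, the Network Operator rolls out the OFED driver, SR-IOV device plugin, and secondary-network components defined above; this can take several minutes. To follow progress, inspect the status reported on the resource (the exact status fields vary by operator version, so treat this as a general check rather than a guaranteed output format):
kubectl get nicclusterpolicy nic-cluster-policy -o yaml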
5. After about 5 minutes, you should see nvidia.com/gpu and nvidia.com/hostdev resources listed as Allocatable in the output of kubectl describe node.
Allocatable:
cpu: 176
ephemeral-storage: 126353225220
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 990737488Ki
nvidia.com/gpu: 8
nvidia.com/hostdev: 8
pods: 110
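Beyond inspecting node capacity, you can schedule a test pod that requests both resource types. The manifest below is a minimal sketch, assuming an 8-GPU node like the one shown above; the pod name, image tag, and resource counts are illustrative rather than taken from this guide, and it only verifies that the resources are schedulable (attaching the InfiniBand devices to a secondary network via Multus requires an additional NetworkAttachmentDefinition, which is outside the scope of these steps).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-rdma-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Illustrative CUDA base image; any image with nvidia-smi available will do.
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/hostdev: 8
Apply it with kubectl apply -f and check kubectl logs gpu-rdma-test; the nvidia-smi output should list all eight GPUs.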