To use NVIDIA GPUs within a Kubernetes environment, NVIDIA provides an open-source GPU Operator (gpu-operator) that makes the GPUs allocatable to pods. Similarly, for nodes enabled with InfiniBand RDMA networking, NVIDIA provides a Network Operator (network-operator) that enables internode InfiniBand communication.
Prerequisites
Access to a Kubernetes cluster, either self-managed or created through the Crusoe Managed Kubernetes (CMK) offering. If you are unfamiliar with self-managed Kubernetes architectures, we recommend using CMK for ease of deployment.
Steps to Install and Validate Operators
1. Add NVIDIA Helm Repositories
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
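To confirm the repository was added and is reachable, you can list the NVIDIA charts it serves (an optional check; both operator charts used below should appear in the output):
helm search repo nvidia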
2. Install GPU Operator
helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.useOpenKernelModules=true
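The GPU Operator deploys its components (driver, container toolkit, device plugin, and validators) as pods in the namespace created above. As a quick check, watch them until they reach Running or Completed:
kubectl get pods -n gpu-operator -w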
3. Install Network Operator
helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace
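You can verify that the Network Operator controller is running in its namespace before moving on:
kubectl get pods -n nvidia-network-operator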
4. Create and apply the following NIC Cluster Policy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: 23.10-3.2.6.0-0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: v0.3.5
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.8.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": ["101e"],
              "linkTypes": ["infiniband"],
              "isRdma": true
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.3.0
      imagePullSecrets: []
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.4
      imagePullSecrets: []
Save the above to nic_cluster_policy.yaml and apply it with kubectl apply -f nic_cluster_policy.yaml.
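Once the policy is applied, the Network Operator rolls out the OFED driver, SR-IOV device plugin, and secondary-network components defined above; this can take several minutes. To follow progress, inspect the status reported on the resource (the exact status fields vary by operator version, so treat this as a general check rather than a guaranteed output format):
kubectl get nicclusterpolicy nic-cluster-policy -o yaml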
5. After about 5 minutes, you should see nvidia.com/gpu and nvidia.com/hostdev resources listed as Allocatable in the output of kubectl describe node.
Allocatable:
cpu: 176
ephemeral-storage: 126353225220
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 990737488Ki
nvidia.com/gpu: 8
nvidia.com/hostdev: 8
pods: 110
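Beyond inspecting node capacity, you can schedule a test pod that requests both resource types. The manifest below is a minimal sketch, assuming an 8-GPU node like the one shown above; the pod name, image tag, and resource counts are illustrative rather than taken from this guide, and it only verifies that the resources are schedulable (attaching the InfiniBand devices to a secondary network via Multus requires an additional NetworkAttachmentDefinition, which is outside the scope of these steps).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-rdma-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Illustrative CUDA base image; any image with nvidia-smi available will do.
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/hostdev: 8
Apply it with kubectl apply -f and check kubectl logs gpu-rdma-test; the nvidia-smi output should list all eight GPUs.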