Overview
When you add B200 nodes to an existing Crusoe Managed Kubernetes (CMK) cluster, the cuda-validator pod may fail to initialize. The logs for this pod show the following error:
Failed to allocate device vector A (error code system not yet initialized)!
[Vector addition of 50000 elements]
stream closed EOF for nvidia-gpu-operator/nvidia-cuda-validator-zl2ds (cuda-validation)
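To confirm that a validator pod is failing with this specific error, you can pull its logs and search for the message. The following is a minimal sketch, not an official procedure: the function names are hypothetical, the namespace and the container name cuda-validation are taken from the log line above, and the pod name must be replaced with the one in your cluster.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) if the log text on stdin contains the
# "system not yet initialized" failure described in this article.
has_cuda_init_error() {
  grep -q "system not yet initialized"
}

# Print the logs of the cuda-validation container for the pod named in $1.
# Assumes the GPU Operator runs in the nvidia-gpu-operator namespace.
validator_logs() {
  kubectl logs -n nvidia-gpu-operator "$1" -c cuda-validation
}
```

For example, validator_logs <nvidia-cuda-validator-xxxxx> | has_cuda_init_error && echo "B200 initialization error found" prints the message only when the error is present in the logs.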
Prerequisites
- Crusoe Managed Kubernetes (CMK)
- B200 VMs
- NVIDIA GPU Operator v25.3.0
Cause
Support for B200 GPUs was added in GPU driver version 570.133.20. GPU Operator v25.3.0 installed drivers older than 570.133.20, which do not support B200.
Steps
Step 1: Identify the GPU driver version installed in the cluster
- Run the following command against the nvidia-driver-daemonset pod on one of the B200 nodes and note the driver version in the output.
# kubectl exec -it -n nvidia-gpu-operator <nvidia-driver-daemonset-xxxxx> -- nvidia-smi
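The comparison against 570.133.20 can also be scripted. The sketch below is illustrative only: the function names are hypothetical, it assumes a sort that supports -V (GNU coreutils), and it assumes the driver pod runs in the nvidia-gpu-operator namespace. It reads the driver version with nvidia-smi's --query-gpu option rather than parsing the full table.

```shell
#!/usr/bin/env bash

# Minimum driver version with B200 support, as stated in the Cause section.
MIN_B200_DRIVER="570.133.20"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) to order the two strings.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Succeeds when the driver version in $1 is new enough for B200.
driver_supports_b200() {
  version_ge "$1" "$MIN_B200_DRIVER"
}

# Print the driver version reported inside the driver pod named in $1.
driver_version() {
  kubectl exec -n nvidia-gpu-operator "$1" -- \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
}
```

For example, driver_supports_b200 "$(driver_version <nvidia-driver-daemonset-xxxxx>)" || echo "Upgrade required" reports when the installed driver is too old.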
Step 2: If the driver version is lower than 570.133.20, upgrade the GPU Operator
- Because B200 support starts with driver version 570.133.20, upgrade the GPU Operator from v25.3.0 to v25.3.1, passing any custom values you provided during the initial installation.
# helm repo update
# helm upgrade gpu-operator nvidia/gpu-operator -n nvidia-gpu-operator --version 25.3.1 -f <custom_values.yaml>
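After the upgrade, it can help to confirm that the validator pods now finish successfully. The sketch below is one possible check, not a documented procedure: the function names are hypothetical, and the label selector app=nvidia-cuda-validator is an assumption about how the operator labels these pods, so verify it in your cluster with kubectl get pods --show-labels.

```shell
#!/usr/bin/env bash

# Succeeds only when $1 holds at least one pod phase and every
# whitespace-separated phase in it is "Succeeded".
all_succeeded() {
  local p count=0
  for p in $1; do
    [ "$p" = "Succeeded" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# Print the phases of the cuda-validator pods, space separated.
# The label selector is an assumption; adjust it to match your pods.
validator_phases() {
  kubectl get pods -n nvidia-gpu-operator -l app=nvidia-cuda-validator \
    -o jsonpath='{.items[*].status.phase}'
}
```

For example, all_succeeded "$(validator_phases)" && echo "cuda-validator completed" prints the message once every validator pod has completed. The driver version can also be rechecked by repeating the command from Step 1.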
Additional Resources
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html#v25-3-0