Introduction
This guide addresses a common issue where the NVIDIA GPU Operator's Node Feature Discovery (NFD) master pod enters a CrashLoopBackOff state due to a missing NodeFeatureGroup Custom Resource Definition (CRD). This issue typically manifests as GPU operator pods stuck in Init state and repeated crashes of the gpu-operator-node-feature-discovery-master pod. Users following this guide will learn to identify, diagnose, and resolve this configuration issue to restore full GPU operator functionality across their Kubernetes cluster.
Prerequisites
- Kubernetes cluster with RKE2 or similar distribution
- NVIDIA GPU Operator deployed in the cluster
- Administrative access to the cluster (kubectl with cluster-admin permissions; a quick check is shown after this list)
- Basic familiarity with Kubernetes concepts (pods, deployments, CRDs)
- Access to cluster logs and pod descriptions
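If you are unsure whether your credentials are sufficient, the following quick check (a sketch, assuming kubectl is already configured against the target cluster) confirms that you can manage CRDs and read pods in the gpu-operator namespace:
# Both commands should print "yes" for an account with cluster-admin permissions
kubectl auth can-i create customresourcedefinitions
kubectl auth can-i get pods -n gpu-operator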
Step-by-Step Instructions
1. Step 1: Identify the Issue
First, verify that you're experiencing the NodeFeatureGroup CRD issue:
kubectl get pods -n gpu-operator
# Look for pods in CrashLoopBackOff or Init states
# Specifically check for gpu-operator-node-feature-discovery-master pod
Expected symptoms:
- gpu-operator-node-feature-discovery-master pod in CrashLoopBackOff state
- Other GPU operator pods stuck in Init state
- High restart count on the master pod
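If the pod list alone is inconclusive, recent namespace events usually make the crash loop explicit. This is a minimal sketch; the exact event messages vary by cluster:
# Recent events in the namespace; look for repeated BackOff entries for the master pod
kubectl get events -n gpu-operator --sort-by=.lastTimestamp | tail -20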
2. Step 2: Verify Missing CRD
Confirm that the NodeFeatureGroup CRD is missing:
kubectl get crd | grep nodefeaturegroup
Expected result: No output (empty result indicates the CRD is missing)
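As an alternative to grepping, you can query the CRD by its full name; a NotFound error confirms it is missing (a sketch using the standard NFD CRD name):
# Returns "Error from server (NotFound)" when the CRD is absent
kubectl get crd nodefeaturegroups.nfd.k8s-sigs.io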
3. Step 3: Examine Pod Logs
Review the logs to confirm the root cause:
# Get the exact name of the failing pod
kubectl get pods -n gpu-operator | grep "node-feature-discovery-master"
# Check the logs (replace with actual pod name)
kubectl logs gpu-operator-node-feature-discovery-master-<pod-id> -n gpu-operator
Look for these error messages:
failed to list *v1alpha1.NodeFeatureGroup: the server could not find the requested resource (get nodefeaturegroups.nfd.k8s-sigs.io)
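If you prefer not to copy pod names, the logs can also be pulled through the deployment and filtered for the error. This is a sketch assuming the default deployment name used elsewhere in this guide:
# Filter the master logs for the NodeFeatureGroup error
# (add --previous if the current container has not logged anything yet)
kubectl logs deployment/gpu-operator-node-feature-discovery-master -n gpu-operator | grep -i nodefeaturegroup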
4. Step 4: Identify NFD Version
Determine the Node Feature Discovery version to use the correct CRD:
# Describe the failing pod to get the image version
kubectl describe pod <gpu-operator-node-feature-discovery-master-pod-name> -n gpu-operator | grep Image:
Note the version (e.g., v0.16.3) from the image tag for the next step.
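You can also read the image tag directly from the deployment spec instead of the pod description; a sketch assuming the default deployment name:
# Prints the full NFD image reference; the version is the tag after the colon
kubectl get deployment gpu-operator-node-feature-discovery-master -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'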
5. Step 5: Apply the Missing CRD (Recommended Solution)
Install the NodeFeatureGroup CRD for your specific NFD version:
# Replace v0.16.3 with your actual NFD version from Step 4
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml
Verification:
# Verify the CRD was installed successfully
kubectl get crd | grep nodefeaturegroup
Expected output:
nodefeaturegroups.nfd.k8s-sigs.io 2025-05-26T10:30:00Z
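Optionally, you can wait until the API server reports the new CRD as established before restarting anything (a sketch; the timeout value is arbitrary):
# Blocks until the CRD is accepted by the API server, or times out
kubectl wait --for condition=established crd/nodefeaturegroups.nfd.k8s-sigs.io --timeout=60s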
6. Step 6: Restart GPU Operator Components
Restart the GPU operator pods to apply the changes:
# Restart all GPU operator deployments
kubectl rollout restart deployment -n gpu-operator
# Alternatively, restart just the NFD master deployment
kubectl rollout restart deployment gpu-operator-node-feature-discovery-master -n gpu-operator
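To watch the restart complete rather than polling manually, kubectl can block on the rollout (a sketch assuming the default deployment name):
# Waits until the NFD master deployment has rolled out successfully
kubectl rollout status deployment/gpu-operator-node-feature-discovery-master -n gpu-operator --timeout=180s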
7. Step 7: Verify Resolution
Confirm that all pods are now running correctly:
# Check pod status
kubectl get pods -n gpu-operator
# Verify the master pod is running and check its logs
kubectl logs -f $(kubectl get pod -l role=master -n gpu-operator -o name) -n gpu-operator
Expected results:
- All pods should be in Running state
- No more NodeFeatureGroup CRD errors in logs
- GPU operator functionality restored across all nodes
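As an additional end-to-end check, you can confirm that NFD is labeling GPU nodes again. This is a sketch; the exact label set depends on your GPU Operator and NFD versions:
# NFD publishes PCI feature labels (vendor 10de = NVIDIA); GPU Operator adds nvidia.com labels on top
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
kubectl get nodes --show-labels | grep -o 'nvidia.com/gpu[^,]*' | sort -u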
Example
Scenario: GPU Operator Deployment on Multi-Node Cluster
A customer deployed NVIDIA GPU Operator on a 3-node RKE2 cluster with NVIDIA GPUs. After deployment, they noticed:
- Initial symptoms observed:
$ kubectl get pods -n gpu-operator
NAME                                                     READY   STATUS                  RESTARTS
gpu-operator-node-feature-discovery-master-6bb4867495    0/1     CrashLoopBackOff        41
nvidia-container-toolkit-daemonset-abc123                0/1     Init:0/1                0
nvidia-driver-daemonset-xyz789                           0/1     Init:CrashLoopBackOff   13
- CRD check revealed missing resource:
$ kubectl get crd | grep nodefeaturegroup
(no output)
- Logs confirmed the issue:
$ kubectl logs gpu-operator-node-feature-discovery-master-6bb4867495-vn2gp -n gpu-operator
failed to list *v1alpha1.NodeFeatureGroup: the server could not find the requested resource
- Resolution applied:
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml
customresourcedefinition.apiextensions.k8s.io/nodefeaturegroups.nfd.k8s-sigs.io created
$ kubectl rollout restart deployment -n gpu-operator
- Final verification:
$ kubectl get pods -n gpu-operator
NAME                                                     READY   STATUS    RESTARTS
gpu-operator-node-feature-discovery-master-6bb4867495    1/1     Running   0
nvidia-container-toolkit-daemonset-abc123                1/1     Running   0
nvidia-driver-daemonset-xyz789                           1/1     Running   0
Troubleshooting Common Issues
Issue 1: CRD URL Returns 404 Error
Problem: The GitHub URL for the CRD file returns a 404 error.
Solution:
- Verify the NFD version number from the pod image
- Check the NFD GitHub repository for the correct branch/tag structure
- Use the correct URL format:
https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v[VERSION]/deployment/base/nfd-crds.yaml
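Before applying, you can confirm the URL resolves with a plain HTTP check (a sketch; substitute your version):
# -f makes curl exit non-zero on HTTP errors such as 404
curl -fsI https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml | head -1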
Issue 2: Pods Still Failing After CRD Installation
Problem: Master pod continues to crash even after installing the CRD.
Solution:
- Wait 2-3 minutes for the CRD to be fully registered
- Force restart the specific pod:
kubectl delete pod <pod-name> -n gpu-operator
- Check for additional missing CRDs:
kubectl get crd | grep nfd
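The nfd-crds.yaml manifest ships several CRDs; the following sketch (using the upstream CRD names for NFD v0.16) confirms all of them registered:
# All three should be listed; a NotFound error points to a partially applied manifest
kubectl get crd nodefeatures.nfd.k8s-sigs.io nodefeaturerules.nfd.k8s-sigs.io nodefeaturegroups.nfd.k8s-sigs.io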
Issue 3: Configuration Method Not Working
Problem: Disabling the NodeFeatureGroup feature gate in the NFD master configuration (an alternative to installing the CRD) doesn't resolve the issue.
Solution:
- Ensure the ConfigMap edit was saved correctly
- Verify pod picked up the new configuration by checking environment variables
- Consider using the CRD installation approach instead
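To confirm whether the master pod is actually running with the configuration you edited, inspect its rendered arguments and the ConfigMaps it mounts. This is a sketch; the ConfigMap and feature-gate names vary between GPU Operator and NFD versions:
# Arguments the master container was started with
kubectl get deployment gpu-operator-node-feature-discovery-master -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# ConfigMaps that may carry the NFD master configuration
kubectl get configmap -n gpu-operator | grep node-feature-discovery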
Issue 4: Different GPU Operator Version
Problem: Your GPU Operator uses a different NFD version than examples.
Solution:
- Always check the actual image tag in your pod description
- Refer to the GPU Operator release notes for compatible NFD versions
- Test with the closest available NFD version if exact match isn't available
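To find the closest published NFD version, you can list the release tags straight from the upstream repository (a sketch using git; the GitHub releases page works as well):
# Lists published NFD tags, e.g. refs/tags/v0.16.3
git ls-remote --tags https://github.com/kubernetes-sigs/node-feature-discovery | grep 'refs/tags/v'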