Introduction
This guide addresses a common issue where the NVIDIA GPU Operator's Node Feature Discovery (NFD) master pod enters a CrashLoopBackOff state due to a missing NodeFeatureGroup Custom Resource Definition (CRD). This issue typically manifests as GPU operator pods stuck in Init state and repeated crashes of the gpu-operator-node-feature-discovery-master pod. Users following this guide will learn to identify, diagnose, and resolve this configuration issue to restore full GPU operator functionality across their Kubernetes cluster.
Prerequisites
- Kubernetes cluster with RKE2 or similar distribution
- NVIDIA GPU Operator deployed in the cluster
- Administrative access to the cluster (kubectl with cluster-admin permissions; a quick check is shown after this list)
- Basic familiarity with Kubernetes concepts (pods, deployments, CRDs)
- Access to cluster logs and pod descriptions
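If you are unsure whether your credentials are sufficient, the following quick check (a sketch, assuming kubectl is already configured against the target cluster) confirms that you can manage CRDs and read pods in the gpu-operator namespace:
# Both commands should print "yes" for an account with cluster-admin permissions
kubectl auth can-i create customresourcedefinitions
kubectl auth can-i get pods -n gpu-operator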
Step-by-Step Instructions
1. Step 1: Identify the Issue
First, verify that you're experiencing the NodeFeatureGroup CRD issue:
kubectl get pods -n gpu-operator
# Look for pods in CrashLoopBackOff or Init states
# Specifically check for gpu-operator-node-feature-discovery-master pod
Expected symptoms:
- gpu-operator-node-feature-discovery-master pod in CrashLoopBackOff state
- Other GPU operator pods stuck in Init state
- High restart count on the master pod
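If the pod list alone is inconclusive, recent namespace events usually make the crash loop explicit. This is a minimal sketch; the exact event messages vary by cluster:
# Recent events in the namespace; look for repeated BackOff entries for the master pod
kubectl get events -n gpu-operator --sort-by=.lastTimestamp | tail -20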
2. Step 2: Verify Missing CRD
Confirm that the NodeFeatureGroup CRD is missing:
kubectl get crd | grep nodefeaturegroup
Expected result: No output (empty result indicates the CRD is missing)
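As an alternative to grepping, you can query the CRD by its full name; a NotFound error confirms it is missing (a sketch using the standard NFD CRD name):
# Returns "Error from server (NotFound)" when the CRD is absent
kubectl get crd nodefeaturegroups.nfd.k8s-sigs.io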
3. Step 3: Examine Pod Logs
Review the logs to confirm the root cause:
# Get the exact name of the failing pod
kubectl get pods -n gpu-operator | grep "node-feature-discovery-master"
# Check the logs (replace with actual pod name)
kubectl logs gpu-operator-node-feature-discovery-master-<pod-id> -n gpu-operator
Look for these error messages:
failed to list *v1alpha1.NodeFeatureGroup: the server could not find the requested resource (get nodefeaturegroups.nfd.k8s-sigs.io)
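If you prefer not to copy pod names, the logs can also be pulled through the deployment and filtered for the error. This is a sketch assuming the default deployment name used elsewhere in this guide:
# Filter the master logs for the NodeFeatureGroup error
# (add --previous if the current container has not logged anything yet)
kubectl logs deployment/gpu-operator-node-feature-discovery-master -n gpu-operator | grep -i nodefeaturegroup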
4. Step 4: Identify NFD Version
Determine the Node Feature Discovery version to use the correct CRD:
# Describe the failing pod to get the image version
kubectl describe pod <gpu-operator-node-feature-discovery-master-pod-name> -n gpu-operator | grep Image:
Note the version (e.g., v0.16.3) from the image tag for the next step.
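You can also read the image tag directly from the deployment spec instead of the pod description; a sketch assuming the default deployment name:
# Prints the full NFD image reference; the version is the tag after the colon
kubectl get deployment gpu-operator-node-feature-discovery-master -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'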
5. Step 5: Apply the Missing CRD (Recommended Solution)
Install the NodeFeatureGroup CRD for your specific NFD version:
# Replace v0.16.3 with your actual NFD version from Step 4
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml
Verification:
# Verify the CRD was installed successfully
kubectl get crd | grep nodefeaturegroup
Expected output:
nodefeaturegroups.nfd.k8s-sigs.io 2025-05-26T10:30:00Z
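Optionally, you can wait until the API server reports the new CRD as established before restarting anything (a sketch; the timeout value is arbitrary):
# Blocks until the CRD is accepted by the API server, or times out
kubectl wait --for condition=established crd/nodefeaturegroups.nfd.k8s-sigs.io --timeout=60s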
6. Step 6: Restart GPU Operator Components
Restart the GPU operator pods to apply the changes:
# Restart all GPU operator deployments
kubectl rollout restart deployment -n gpu-operator
# Alternatively, restart just the NFD master deployment
kubectl rollout restart deployment gpu-operator-node-feature-discovery-master -n gpu-operator
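To watch the restart complete rather than polling manually, kubectl can block on the rollout (a sketch assuming the default deployment name):
# Waits until the NFD master deployment has rolled out successfully
kubectl rollout status deployment/gpu-operator-node-feature-discovery-master -n gpu-operator --timeout=180s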
7. Step 7: Verify Resolution
Confirm that all pods are now running correctly:
# Check pod status
kubectl get pods -n gpu-operator
# Verify the master pod is running and check its logs
kubectl logs -f $(kubectl get pod -l role=master -n gpu-operator -o name) -n gpu-operator
Expected results:
- All pods should be in Running state
- No more NodeFeatureGroup CRD errors in logs
- GPU operator functionality restored across all nodes
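As an additional end-to-end check, you can confirm that NFD is labeling GPU nodes again. This is a sketch; the exact label set depends on your GPU Operator and NFD versions:
# NFD publishes PCI feature labels (vendor 10de = NVIDIA); GPU Operator adds nvidia.com labels on top
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
kubectl get nodes --show-labels | grep -o 'nvidia.com/gpu[^,]*' | sort -u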
Example
Scenario: GPU Operator Deployment on Multi-Node Cluster
A customer deployed NVIDIA GPU Operator on a 3-node RKE2 cluster with NVIDIA GPUs. After deployment, they noticed:
- Initial symptoms observed:
$ kubectl get pods -n gpu-operator
NAME                                                     READY   STATUS                  RESTARTS
gpu-operator-node-feature-discovery-master-6bb4867495    0/1     CrashLoopBackOff        41
nvidia-container-toolkit-daemonset-abc123                0/1     Init:0/1                0
nvidia-driver-daemonset-xyz789                           0/1     Init:CrashLoopBackOff   13
- CRD check revealed missing resource:
$ kubectl get crd | grep nodefeaturegroup
(no output)
- Logs confirmed the issue:
$ kubectl logs gpu-operator-node-feature-discovery-master-6bb4867495-vn2gp -n gpu-operator
failed to list *v1alpha1.NodeFeatureGroup: the server could not find the requested resource
- Resolution applied:
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml
customresourcedefinition.apiextensions.k8s.io/nodefeaturegroups.nfd.k8s-sigs.io created
$ kubectl rollout restart deployment -n gpu-operator
- Final verification:
$ kubectl get pods -n gpu-operator
NAME                                                     READY   STATUS    RESTARTS
gpu-operator-node-feature-discovery-master-6bb4867495    1/1     Running   0
nvidia-container-toolkit-daemonset-abc123                1/1     Running   0
nvidia-driver-daemonset-xyz789                           1/1     Running   0
Troubleshooting Common Issues
Issue 1: CRD URL Returns 404 Error
Problem: The GitHub URL for the CRD file returns a 404 error.
Solution:
- Verify the NFD version number from the pod image
- Check the NFD GitHub repository for the correct branch/tag structure
- Use the correct URL format:
https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v[VERSION]/deployment/base/nfd-crds.yaml
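Before applying, you can confirm the URL resolves with a plain HTTP check (a sketch; substitute your version):
# -f makes curl exit non-zero on HTTP errors such as 404
curl -fsI https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.16.3/deployment/base/nfd-crds.yaml | head -1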
Issue 2: Pods Still Failing After CRD Installation
Problem: Master pod continues to crash even after installing the CRD.
Solution:
- Wait 2-3 minutes for the CRD to be fully registered
- Force restart the specific pod:
kubectl delete pod <pod-name> -n gpu-operator
- Check for additional missing CRDs:
kubectl get crd | grep nfd
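The nfd-crds.yaml manifest ships several CRDs; the following sketch (using the upstream CRD names for NFD v0.16) confirms all of them registered:
# All three should be listed; a NotFound error points to a partially applied manifest
kubectl get crd nodefeatures.nfd.k8s-sigs.io nodefeaturerules.nfd.k8s-sigs.io nodefeaturegroups.nfd.k8s-sigs.io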
Issue 3: Configuration Method Not Working
Problem: Disabling the NodeFeatureGroup feature gate in the NFD master configuration (an alternative to installing the CRD) doesn't resolve the issue.
Solution:
- Ensure the ConfigMap edit was saved correctly
- Verify pod picked up the new configuration by checking environment variables
- Consider using the CRD installation approach instead
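To confirm whether the master pod is actually running with the configuration you edited, inspect its rendered arguments and the ConfigMaps it mounts. This is a sketch; the ConfigMap and feature-gate names vary between GPU Operator and NFD versions:
# Arguments the master container was started with
kubectl get deployment gpu-operator-node-feature-discovery-master -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# ConfigMaps that may carry the NFD master configuration
kubectl get configmap -n gpu-operator | grep node-feature-discovery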
Issue 4: Different GPU Operator Version
Problem: Your GPU Operator uses a different NFD version than examples.
Solution:
- Always check the actual image tag in your pod description
- Refer to the GPU Operator release notes for compatible NFD versions
- Test with the closest available NFD version if exact match isn't available
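To find the closest published NFD version, you can list the release tags straight from the upstream repository (a sketch using git; the GitHub releases page works as well):
# Lists published NFD tags, e.g. refs/tags/v0.16.3
git ls-remote --tags https://github.com/kubernetes-sigs/node-feature-discovery | grep 'refs/tags/v'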