GPU Operator CrashLoopBackOff Due to Exhausted inotify Limits

Overview

On CMK worker nodes running a large number of pods, GPU Operator components such as nvidia-dcgm-exporter and nvidia-device-plugin may enter CrashLoopBackOff with the following error:

Error creating watcher: too many open files

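This occurs when the node's kernel inotify limits (fs.inotify.max_user_watches and fs.inotify.max_user_instances) are exhausted by the file watchers that the running pods create. Both limits are enforced per user rather than per process, so watchers from many pods count against the same budget. The "too many open files" message typically indicates that max_user_instances is the limit being hit. The default kernel values are often too low for dense Kubernetes environments and must be increased manually.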

Note: This is a known platform limitation. A Jira ticket has been filed to raise the default inotify limits in the CMK worker image.

Prerequisites

- SSH access to the affected CMK worker node(s)
- Sufficient OS-level permissions to modify kernel parameters (sudo)

Steps

Confirm the Error
- Check the logs of the crashing GPU Operator pod to confirm the inotify limit is the cause:

     kubectl logs <pod-name> -n <gpu-operator-namespace>

Look for Error creating watcher or too many open files in the output.
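When a pod is in CrashLoopBackOff, the current container may not have logged anything yet, so the previous container's logs (kubectl logs --previous) are often more useful. The following is a minimal sketch of a filter for the two error strings quoted above; the function name has_inotify_error is hypothetical and not part of any tool.

```shell
# Reads pod log text on stdin. Exits 0 if the text contains either error
# string quoted in this article, non-zero otherwise.
# has_inotify_error is an illustrative helper name, not an existing command.
has_inotify_error() {
  grep -Eq 'Error creating watcher|too many open files'
}

# Example usage (placeholders as in the step above):
#   kubectl logs <pod-name> -n <gpu-operator-namespace> --previous \
#     | has_inotify_error && echo "inotify exhaustion suspected"
```

A match only suggests inotify exhaustion. The limits and usage on the node still need to be checked in the next step.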

Check Current inotify Limits on the Affected Node
- SSH into the affected worker node and check the current values:

     sysctl fs.inotify.max_user_watches
     sysctl fs.inotify.max_user_instances
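The sysctl values show only the limits, not how much of them is in use. As a rough sketch, current usage can be estimated from /proc, where each inotify instance appears as a file descriptor symlink pointing at anon_inode:inotify. The function names below are hypothetical, the sketch assumes GNU find, and it should be run as root so that other users' processes are visible.

```shell
# Count inotify instances visible to the caller across all processes.
# Each instance shows up as /proc/<pid>/fd/<n> -> anon_inode:inotify.
count_inotify_instances() {
  { find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null || true; } \
    | wc -l | tr -d ' '
}

# List "<count> <pid>" pairs, highest consumers first.
inotify_instances_by_pid() {
  { find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null || true; } \
    | cut -d/ -f3 | sort | uniq -c | sort -rn
}

# Example usage on the worker node:
#   sudo bash -c "$(declare -f count_inotify_instances); count_inotify_instances"
#   sysctl -n fs.inotify.max_user_instances
```

If the instance count is close to fs.inotify.max_user_instances, that is consistent with the error above. Because the limits are per user, the total across all processes is an upper bound on any single user's usage.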

Increase inotify Limits
- Apply the following values on the affected node:

     sudo sysctl fs.inotify.max_user_watches=1048576
     sudo sysctl fs.inotify.max_user_instances=8192

Note: These changes take effect immediately but do not persist across a reboot. See Step 4 (Persist the Changes Across Reboots) to make them permanent.
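If a node has already been tuned to a higher value, setting the values above unconditionally would lower it. The sketch below raises a key only when its current value is below the target. The helper names below_target and raise_if_lower are hypothetical, and the targets are the values used in this article.

```shell
# Succeeds (exit 0) when the current value is numerically below the target.
below_target() {
  [ "$1" -lt "$2" ]
}

# Raise a sysctl key to the target only if it is currently lower.
raise_if_lower() {
  key=$1
  target=$2
  current=$(sysctl -n "$key")
  if below_target "$current" "$target"; then
    sudo sysctl -w "$key=$target"
  else
    echo "$key is already $current (>= $target), leaving as-is"
  fi
}

# Example usage on the affected node:
#   raise_if_lower fs.inotify.max_user_watches 1048576
#   raise_if_lower fs.inotify.max_user_instances 8192
```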

Persist the Changes Across Reboots
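- To ensure the values survive a reboot, write them to a drop-in file under /etc/sysctl.d/. The first command creates (or overwrites) the file and the second appends to it: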

     echo "fs.inotify.max_user_watches=1048576" | sudo tee /etc/sysctl.d/99-inotify.conf
     echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.d/99-inotify.conf
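As an alternative sketch, the file content can be generated in one pass and then reloaded with sysctl --system, which re-reads the sysctl configuration directories and so confirms that the file parses without waiting for a reboot. The function name render_inotify_conf is hypothetical; the file name and values are the ones used above.

```shell
# Print the drop-in file content. Arguments are optional and default to the
# values used in this article.
render_inotify_conf() {
  printf 'fs.inotify.max_user_watches=%s\n' "${1:-1048576}"
  printf 'fs.inotify.max_user_instances=%s\n' "${2:-8192}"
}

# Example usage on the affected node:
#   render_inotify_conf | sudo tee /etc/sysctl.d/99-inotify.conf
#   sudo sysctl --system
#   sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
```

Writing the file in a single tee call avoids a half-written file if the second command is skipped.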

Repeat Step 3 (Increase inotify Limits) and Step 4 (Persist the Changes Across Reboots) on each affected worker node.

Verify GPU Operator Components Recover
- After applying the new limits, confirm that the previously crashing pods come up healthy:

     kubectl get pods -n <gpu-operator-namespace>
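In a namespace with many pods, it can help to print only the pods that are not yet healthy. The sketch below filters the default kubectl get pods output (columns NAME, READY, STATUS, ...); the function name not_ready_pods is hypothetical. It skips pods in Completed status, on the assumption that one-shot pods such as validators are expected to end in that state.

```shell
# Reads "kubectl get pods --no-headers" output on stdin and prints
# NAME READY STATUS for pods that are neither Completed nor fully ready.
not_ready_pods() {
  awk '
    $3 == "Completed" { next }             # one-shot pods that finished
    {
      split($2, r, "/")                    # READY column, e.g. 0/1
      if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
    }
  '
}

# Example usage:
#   kubectl get pods -n <gpu-operator-namespace> --no-headers | not_ready_pods
```

Empty output means every remaining pod is Running with all of its containers ready. Adding -w to kubectl get pods instead streams status changes as pods restart.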

Resolution

The following describes how this issue was resolved in a confirmed case:

- Following resolution of an orphaned Volcano webhook issue, nvidia-dcgm-exporter and nvidia-device-plugin pods remained in CrashLoopBackOff. Pod logs showed Error creating watcher: too many open files.
- The affected worker node was running 100+ pods, exhausting the default inotify limits.
- The following values were applied on the affected nodes:

     sudo sysctl fs.inotify.max_user_watches=1048576
     sudo sysctl fs.inotify.max_user_instances=8192

All GPU Operator components came up healthy on both nodes immediately after the limits were increased.
