
GPU Operator CrashLoopBackOff Due to Exhausted inotify Limits

Matt Roark

Overview

On CMK worker nodes running a large number of pods, GPU Operator components such as nvidia-dcgm-exporter and nvidia-device-plugin may enter CrashLoopBackOff with the following error:

 
Error creating watcher: too many open files

This occurs when the node's kernel inotify limits (fs.inotify.max_user_watches and fs.inotify.max_user_instances) are exhausted by the file watchers that the running pods and node services create. The default kernel values are too low for dense Kubernetes environments and must be raised manually.
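
To get a rough sense of how many inotify instances are open on a node, one option is to count the matching file descriptors under /proc. The sketch below is illustrative, not part of the documented procedure: the function names are hypothetical, it assumes GNU find (for -lname), and it only sees every process's descriptors when run as root. The count is node-wide, while the kernel limit applies per user, so treat the comparison as an approximation.

```shell
# Hypothetical helper: count inotify instances visible to the caller.
# Each instance appears as an fd symlink whose target is "anon_inode:inotify".
count_inotify_instances() {
  find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: read the current per-user instance limit from procfs.
inotify_instance_limit() {
  cat /proc/sys/fs/inotify/max_user_instances
}

echo "inotify instances in use (all users): $(count_inotify_instances)"
echo "fs.inotify.max_user_instances:        $(inotify_instance_limit)"
```

A count close to or above the limit suggests the node is a candidate for the steps below.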

Note: This is a known platform limitation. A Jira ticket has been filed to address the default inotify limits in the CMK worker image.


Prerequisites

  • SSH access to the affected CMK worker node(s)
  • Sufficient OS-level permissions to modify kernel parameters (sudo)

Steps

  1. Confirm the Error
    • Check the logs of the crashing GPU Operator pod to confirm the inotify limit is the cause:
     kubectl logs <pod-name> -n <gpu-operator-namespace>
    • Look for Error creating watcher or too many open files in the output.
  2. Check Current inotify Limits on the Affected Node
    • SSH into the affected worker node and check the current values:
     sysctl fs.inotify.max_user_watches
     sysctl fs.inotify.max_user_instances
  3. Increase inotify Limits
    • Apply the following values on the affected node:
     sudo sysctl fs.inotify.max_user_watches=1048576
     sudo sysctl fs.inotify.max_user_instances=8192
    • Note that these changes are applied at runtime and will not persist across a reboot. See Step 4 to make them permanent.
  4. Persist the Changes Across Reboots
    • To ensure the values survive a reboot, add them to /etc/sysctl.d/:
     echo "fs.inotify.max_user_watches=1048576" | sudo tee /etc/sysctl.d/99-inotify.conf
     echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.d/99-inotify.conf
    • Repeat Steps 3 and 4 on each affected worker node.
  5. Verify GPU Operator Components Recover
    • After applying the new limits, confirm that the previously crashing pods come up healthy:
     kubectl get pods -n <gpu-operator-namespace>
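
When the persistence step has to be repeated on several nodes, it can help to write both settings in a single operation so that rerunning it never leaves duplicate lines in the file. The following is a minimal sketch under that assumption; the function name and its optional arguments are hypothetical, and the defaults mirror the values and path used in the steps above.

```shell
# Hypothetical helper: write both inotify settings to a sysctl drop-in file.
# Arguments (all optional): target path, max_user_watches, max_user_instances.
write_inotify_conf() {
  conf_path="${1:-/etc/sysctl.d/99-inotify.conf}"
  # A single printf with ">" overwrites the file, so reruns are idempotent.
  printf 'fs.inotify.max_user_watches=%s\nfs.inotify.max_user_instances=%s\n' \
    "${2:-1048576}" "${3:-8192}" > "$conf_path"
}
```

In a root shell on the node, calling write_inotify_conf with no arguments writes the default file; running sysctl --system afterwards reloads the settings from all system configuration directories without a reboot.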

Resolution

The following describes how this issue was resolved in a confirmed case:

  1. Following resolution of an orphaned Volcano webhook issue, nvidia-dcgm-exporter and nvidia-device-plugin pods remained in CrashLoopBackOff. Pod logs showed Error creating watcher: too many open files.
  2. The affected worker node was running 100+ pods, exhausting the default inotify limits.
  3. The following values were applied on the affected nodes:
   sudo sysctl fs.inotify.max_user_watches=1048576
   sudo sysctl fs.inotify.max_user_instances=8192
  4. All GPU Operator components came up healthy on both nodes immediately after the limits were increased.
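
To confirm the same outcome without scanning the pod list by eye, the output of kubectl get pods can be filtered for anything that is not yet healthy. This is a sketch only: the function name is hypothetical, and it assumes the default single-namespace column layout (NAME, READY, STATUS, ...), where STATUS is the third column. With --all-namespaces the columns shift by one and the filter would need adjusting.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# name and status of every pod that is neither Running nor Completed.
# Example use: kubectl get pods -n "$GPU_OPERATOR_NAMESPACE" | unhealthy_pods
# (GPU_OPERATOR_NAMESPACE is a placeholder variable you would set yourself.)
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Empty output means every listed pod reports Running or Completed.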
