Overview
On CMK worker nodes running a large number of pods, GPU Operator components such as nvidia-dcgm-exporter and nvidia-device-plugin may enter CrashLoopBackOff with the following error:
Error creating watcher: too many open files
This occurs when the node's kernel inotify limits — fs.inotify.max_user_watches and fs.inotify.max_user_instances — are exhausted by the number of running pods and file watchers on the node. The default kernel values are too low for dense Kubernetes environments and must be increased manually.
Note: This is a known platform limitation. A Jira has been filed to address default inotify limits in the CMK worker image.
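To see which processes hold the inotify instances on a node, the file descriptors under /proc can be scanned. The following is a minimal sketch and not part of the original procedure. The function name inotify_instances_by_pid and its optional proc_root argument are hypothetical, and it assumes a Linux node with find, awk, and sort available.

```shell
# Count inotify instances per process by scanning file descriptors.
# Each inotify instance shows up as an fd symlinked to "anon_inode:inotify".
# proc_root defaults to /proc; it is a parameter only so the function can be
# tried against a scratch directory (hypothetical helper, not a standard tool).
inotify_instances_by_pid() {
  local proc_root="${1:-/proc}"
  find "$proc_root"/[0-9]*/fd -maxdepth 1 -lname 'anon_inode:inotify' 2>/dev/null \
    | awk -F/ '{ count[$(NF-2)]++ } END { for (pid in count) print count[pid], pid }' \
    | sort -rn
}

# Usage: run from a root shell so other users' fd directories are readable.
# Output is one "<count> <pid>" line per process, largest count first.
# inotify_instances_by_pid | head
```

The kernel enforces fs.inotify.max_user_instances per user ID rather than per process, so the counts for all processes owned by the same user add up against that limit.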
Prerequisites
- SSH access to the affected CMK worker node(s)
- Sufficient OS-level permissions to modify kernel parameters (sudo)
Steps
- Confirm the Error
- Check the logs of the crashing GPU Operator pod to confirm the inotify limit is the cause:
kubectl logs <pod-name> -n <gpu-operator-namespace>
- Look for Error creating watcher or too many open files in the output.
- Check Current inotify Limits on the Affected Node
- SSH into the affected worker node and check the current values:
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_user_instances
- Increase inotify Limits
- Apply the following values on the affected node:
sudo sysctl fs.inotify.max_user_watches=1048576
sudo sysctl fs.inotify.max_user_instances=8192
- Note that these changes take effect at runtime and do not persist across a reboot. See Step 4 to make them permanent.
- Persist the Changes Across Reboots
- To ensure the values survive a reboot, add them to a file under /etc/sysctl.d/:
echo "fs.inotify.max_user_watches=1048576" | sudo tee /etc/sysctl.d/99-inotify.conf
echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.d/99-inotify.conf
- Repeat Steps 3 and 4 on each affected worker node.
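When the same change has to be repeated on several nodes, the persistence step can be wrapped in a small function that writes both settings at once. This is a sketch under stated assumptions rather than a required part of the procedure. The function name write_inotify_conf and its optional path argument are hypothetical; the file name and values are the ones used in this article.

```shell
# Write both inotify settings to a sysctl drop-in file in one step.
# conf_path defaults to the file used in this article; it is a parameter only
# so the function can be tried against a scratch path first (hypothetical helper).
write_inotify_conf() {
  local conf_path="${1:-/etc/sysctl.d/99-inotify.conf}"
  cat > "$conf_path" <<'EOF'
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=8192
EOF
}

# Usage, from a root shell on the node: write the file, then reload all
# sysctl.d drop-ins so the persisted values are applied immediately.
# write_inotify_conf && sysctl --system
```

Because the function overwrites the file instead of appending to it, running it a second time on the same node leaves a single copy of each setting.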
- Verify GPU Operator Components Recover
- After applying the new limits, confirm that the previously crashing pods come up healthy:
kubectl get pods -n <gpu-operator-namespace>
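In a namespace with many pods, the output of the command above can be filtered down to the pods that still need attention. The following is an illustrative sketch only. The function name unhealthy_pods is hypothetical, and it assumes the default kubectl column layout (NAME, READY, STATUS, ...) with headers suppressed.

```shell
# Print pods that are not healthy, reading `kubectl get pods --no-headers`
# output on stdin. A pod is reported when its STATUS is neither Running nor
# Completed, or when it is Running but not all of its containers are ready.
# (Hypothetical helper; assumes the default NAME READY STATUS column order.)
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }
  '
}

# Usage (namespace placeholder as in the steps above):
# kubectl get pods -n <gpu-operator-namespace> --no-headers | unhealthy_pods
```

Empty output suggests that every pod in the namespace is either Running with all containers ready or Completed.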
Resolution
The following describes how this issue was resolved in a confirmed case:
- Following resolution of an orphaned Volcano webhook issue, nvidia-dcgm-exporter and nvidia-device-plugin pods remained in CrashLoopBackOff. Pod logs showed Error creating watcher: too many open files.
- The affected worker node was running 100+ pods, exhausting the default inotify limits.
- The following values were applied on the affected nodes:
sudo sysctl fs.inotify.max_user_watches=1048576
sudo sysctl fs.inotify.max_user_instances=8192
- All GPU Operator components came up healthy on both nodes immediately after the limits were increased.