Last Updated: March 13, 2026
Introduction
Linux systems running containerized workloads, Kubernetes components, GPU operators, monitoring agents, or other services that rely heavily on filesystem watchers may encounter "too many open files" errors when the limit set by the kernel parameter fs.inotify.max_user_instances is exhausted.
This parameter sets the maximum number of inotify instances that a single user (real user ID) can create. The limit is per user rather than per process, so all services running as the same user, typically root, draw from the same pool. Many modern services - including kubelet, containerd, device plugins, logging agents, and application runtimes - rely on filesystem watchers to monitor directories and configuration changes.
The default Linux value for fs.inotify.max_user_instances is 128, which is often too low for nodes that run many containers and agents under the same user. When this limit is reached, services may fail to create file watchers and produce errors such as:
failed to create FS watcher for /var/lib/kubelet/device-plugins/: too many open files
Unable to start config watcher. error=Too many open files (os error 24)
When this occurs, applications or system services may fail to start correctly. In Kubernetes environments, this may result in nodes becoming partially unusable - GPUs may become non-allocatable and pods may fail to schedule.
This guide explains how to increase the fs.inotify.max_user_instances limit on Linux systems, including Kubernetes worker nodes and standalone virtual machines, to prevent these failures.
Note:
Our engineering team is aware of this limitation in the current node configuration. A future platform update is planned to include an updated default value in the base node image, which will eliminate the need for this manual workaround on newly created nodes.
Prerequisites
Before performing this procedure, ensure the following:
SSH access to the affected system (VM or node)
sudo privileges
Basic familiarity with Linux system administration
If troubleshooting a Kubernetes cluster, ensure:
kubectl is installed and configured
Administrative access to the cluster
You may identify affected systems by reviewing logs from services that rely on file watchers.
Example (Kubernetes):
kubectl logs -n nvidia-gpu-operator <pod-name>
Look for errors containing "too many open files".
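The two sample errors above differ in capitalization ("too many open files" versus "Too many open files"), so a case-sensitive search can miss one of them. The following is a minimal sketch of a case-insensitive filter; the function name find_watcher_errors is illustrative and not part of any tool.

```shell
# Print only the log lines that report "too many open files",
# regardless of capitalization. Reads log text from stdin.
find_watcher_errors() {
    grep -i 'too many open files'
}

# Example usage (not executed here):
#   kubectl logs -n nvidia-gpu-operator <pod-name> | find_watcher_errors
#   journalctl -u kubelet --no-pager | find_watcher_errors
```

If the filter prints nothing, the pod or service is probably failing for a different reason.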
Important:
If the system is part of a Kubernetes cluster, restarting services such as kubelet and containerd may temporarily disrupt workloads. Perform this change during a maintenance window when possible.
Step-by-Step Instructions
1. Verify the Current Limit
SSH into the affected worker node and check the current max_user_instances value.
ssh ubuntu@<node-public-ip>
cat /proc/sys/fs/inotify/max_user_instances
Example output:
128
If the value is 128, the node is using the Linux default, and the limit should be increased for Kubernetes workloads.
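To see how close a system is to the limit, it can help to count the inotify instances each process currently holds. Each instance appears as a file descriptor that links to anon_inode:inotify under /proc/<pid>/fd. The sketch below assumes GNU find and a standard Linux /proc layout. The function name count_inotify_by_pid and its optional proc-root argument are illustrative. Run it as root to see processes owned by other users.

```shell
# List "<instance-count> <pid>" for every process that holds at least one
# inotify instance, highest count first.
# $1 (optional): proc root to scan; defaults to /proc. The override is a
# hypothetical convenience that makes the function easy to test.
count_inotify_by_pid() {
    local proc_root="${1:-/proc}"
    local fd_dir pid count
    for fd_dir in "$proc_root"/[0-9]*/fd; do
        [ -d "$fd_dir" ] || continue
        pid="${fd_dir%/fd}"      # strip trailing /fd
        pid="${pid##*/}"         # keep only the PID component
        # Each inotify instance is an fd symlink pointing at anon_inode:inotify
        count=$(find "$fd_dir" -maxdepth 1 -lname 'anon_inode:inotify' 2>/dev/null | wc -l)
        if [ "$count" -gt 0 ]; then
            printf '%s %s\n' "$count" "$pid"
        fi
    done | sort -rn
}

# Example usage (not executed here):
#   sudo bash -c "$(declare -f count_inotify_by_pid); count_inotify_by_pid" | head
```

The limit applies per user, so the counts of all processes owned by the same user add up against max_user_instances.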
2. Increase the inotify max_user_instances Limit
Update the system configuration to increase the limit and make it persistent across reboots.
Add the new value to /etc/sysctl.conf:
echo "fs.inotify.max_user_instances = 8192" | sudo tee -a /etc/sysctl.conf
Apply the configuration immediately:
sudo sysctl -p
Verify the updated value:
cat /proc/sys/fs/inotify/max_user_instances
Expected output:
8192
Note:
The value 8192 is sufficient for most Kubernetes workloads.
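Appending with tee -a adds another copy of the line each time the command is re-run. As an alternative, the sketch below updates an existing entry in place or appends it only when it is missing. The helper name set_sysctl_entry and the drop-in file name 99-inotify.conf are assumptions chosen for illustration, and the helper relies on GNU sed.

```shell
# Write "key = value" into a sysctl configuration file, replacing an existing
# entry for the same key instead of appending a duplicate.
# $1: config file path, $2: sysctl key, $3: value
set_sysctl_entry() {
    local file="$1" key="$2" value="$3"
    local escaped_key
    touch "$file"
    # Escape dots so the key is matched literally in the regular expressions
    escaped_key=$(printf '%s' "$key" | sed 's/[.]/\\./g')
    if grep -Eq "^[[:space:]]*${escaped_key}[[:space:]]*=" "$file"; then
        sed -i -E "s|^[[:space:]]*${escaped_key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage as root (not executed here; the file name is hypothetical):
#   set_sysctl_entry /etc/sysctl.d/99-inotify.conf fs.inotify.max_user_instances 8192
#   sysctl --system    # reloads all sysctl configuration files
```

Keeping the setting in its own file under /etc/sysctl.d makes it easier to find and remove once the base image ships the new default.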
3. Restart Affected Services
Some services may need to be restarted to fully recover after the limit is increased.
For Standalone VMs
Restart the affected service(s). For example:
sudo systemctl restart <service-name>
If you are unsure which service is affected, restarting the application or runtime that was producing the error may resolve the issue.
For Kubernetes Nodes
After running sysctl -p, the new limit takes effect immediately in the kernel.
Pods that were previously failing (for example, in CrashLoopBackOff) will automatically retry. Since new inotify_init() calls will now succeed under the raised limit, these pods should recover on their own within a few minutes.
In most cases, no restart of kubelet or containerd is required.
Wait 2–3 minutes, then check whether failing pods have recovered:
kubectl get pods -A --field-selector spec.nodeName=<node-name> | grep -v Running
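The grep -v Running filter above still prints the header row and pods in the Completed state. The sketch below is a slightly stricter filter. It assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...), and the function name filter_unhealthy_pods is illustrative.

```shell
# Read `kubectl get pods -A` output on stdin and print "namespace/name status"
# for pods whose STATUS is neither Running nor Completed.
# Assumes STATUS is the fourth column, as in the default -A output.
filter_unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Example usage (not executed here):
#   kubectl get pods -A --field-selector spec.nodeName=<node-name> | filter_unhealthy_pods
```

A pod can report Running while some of its containers are not ready, so checking the READY column as well is still worthwhile.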
If pods have not recovered after several minutes, kubelet or containerd may themselves be in a degraded state. In that case, restart them.
First drain the node to safely evict pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Restart the services:
sudo systemctl restart kubelet
sudo systemctl restart containerd
Allow the node to accept workloads again:
kubectl uncordon <node-name>
4. Verify the Fix
Check that the affected workloads are now running correctly.
For example, if the nvidia-device-plugin was failing:
kubectl logs -n nvidia-gpu-operator nvidia-device-plugin-daemonset-<pod-id> --tail=20
The logs should no longer show "too many open files" errors.
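You can also verify that GPUs are now allocatable on the node. The GPU resource (for example, nvidia.com/gpu) is typically listed after cpu, memory, and several other entries, so print enough lines after the match to include it:
kubectl describe node <node-name> | grep -A 10 "Allocatable"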
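For a scripted check, the allocatable GPU count can be read directly with a JSONPath query instead of scanning the describe output. This sketch assumes the GPUs are exposed under the resource name nvidia.com/gpu, which is the usual name for the NVIDIA device plugin but is not confirmed by this article. The function names are illustrative.

```shell
# Print the allocatable GPU count for a node (empty if the resource is absent).
# Assumes the resource name nvidia.com/gpu; the dot is escaped for JSONPath.
gpu_allocatable() {
    local node="$1"
    kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed only when the node reports at least one allocatable GPU.
gpus_are_allocatable() {
    local count
    count=$(gpu_allocatable "$1")
    [ -n "$count" ] && [ "$count" -gt 0 ]
}

# Example usage (not executed here):
#   gpus_are_allocatable <node-name> && echo "GPUs allocatable" || echo "no allocatable GPUs"
```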
5. Repeat for Other Affected Nodes (Optional: Applies to Kubernetes worker nodes)
Repeat steps 1–4 for each affected worker node.
You can check the current value across nodes using:
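for node in $(kubectl get nodes -o name); do
  echo "$node: $(kubectl debug "$node" -it --image=busybox -- \
    cat /proc/sys/fs/inotify/max_user_instances 2>/dev/null)"
done
Note that kubectl debug leaves a debugging pod behind for each node. Delete these pods once you have collected the values.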
Common Issues
Issue 1: Changes Do Not Carry Over to Newly Created Nodes or VMs
Resolution: Changes made to /etc/sysctl.conf persist across reboots of the current system only. Any new nodes created by a node pool (for example, during scaling or replacement) and any new VMs provisioned from the base image will start with the default value of 128.
Reapply this workaround on each new system until a platform-level fix is included in the base image.
Issue 2: The Issue Persists After Increasing max_user_instances
Resolution: In some cases, the error may instead be caused by the file descriptor limit (LimitNOFILE) being too low. If the issue persists, refer to the companion KB article:
How-To Increase the Open Files (nofiles) ulimit on Kubernetes Worker Nodes
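Before moving to the companion article, it may help to confirm the file descriptor limit that an affected process is actually running with. The sketch below reads it from /proc/<pid>/limits. The function name nofile_limits and its optional proc-root argument are illustrative, and the use of pidof in the example is an assumption about the tools available on the node.

```shell
# Print the soft and hard "Max open files" limits of a running process.
# $1: PID, $2 (optional): proc root, defaults to /proc. The override is a
# hypothetical convenience that makes the function easy to test.
nofile_limits() {
    local pid="$1" proc_root="${2:-/proc}"
    awk '/^Max open files/ { print "soft=" $4 " hard=" $5 }' "$proc_root/$pid/limits"
}

# Example usage (not executed here):
#   nofile_limits "$(pidof kubelet)"
```

If the soft limit is low compared with the number of descriptors the process has open, the descriptor limit is the more likely cause.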