How-To Increase inotify max_user_instances on Kubernetes Worker Nodes and Crusoe VMs

Sagar Lulla

Last Updated: March 13, 2026

Introduction

Linux systems running containerized workloads, Kubernetes components, GPU operators, monitoring agents, or other services that rely heavily on filesystem watchers may encounter "too many open files" errors due to the kernel parameter fs.inotify.max_user_instances being exhausted.

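This parameter limits how many inotify instances a single user (UID) can have open at once. Each instance is represented by a file descriptor, and when the limit is reached the kernel rejects new instances with EMFILE, which most programs report as "too many open files" even though the regular open-file limit has not been hit. Many modern services - including kubelet, containerd, device plugins, logging agents, and application runtimes - rely on inotify to monitor directories and configuration changes.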

The default Linux value for fs.inotify.max_user_instances is 128, which is often too low for systems running many containers, operators, and agents. When this limit is reached, services may fail to create file watchers and produce errors such as:

failed to create FS watcher for /var/lib/kubelet/device-plugins/: too many open files
Unable to start config watcher. error=Too many open files (os error 24)

When this occurs, applications or system services may fail to start correctly. In Kubernetes environments, this may result in nodes becoming partially unusable - GPUs may become non-allocatable and pods may fail to schedule.
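To see which users are close to the limit, one rough approach is to count the inotify file descriptors visible under /proc. The sketch below assumes a Linux host with GNU stat. The function name count_inotify_by_uid and its optional directory argument are illustrative, not part of any existing tool. It groups by the owner of each /proc/<pid> directory, which normally matches the effective UID of the process, so treat the result as an approximation: an instance inherited by several processes is counted once per process.

```shell
# Count inotify file descriptors per owning UID by scanning <proc>/<pid>/fd.
# An inotify instance shows up as a symlink whose target is "anon_inode:inotify".
# The optional first argument (default /proc) is a hypothetical override for testing.
count_inotify_by_uid() {
    local proc_root="${1:-/proc}"
    local fd uid
    for fd in "$proc_root"/[0-9]*/fd/*; do
        [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ] || continue
        # The owner of the <pid> directory stands in for the process's user.
        uid=$(stat -c %u "${fd%/fd/*}" 2>/dev/null) || continue
        echo "$uid"
    done | sort -n | uniq -c | awk '{print $2, $1}'
}

# Prints one "UID COUNT" line per user that holds at least one inotify descriptor.
count_inotify_by_uid
```

Run it with sudo to cover every process; an unprivileged user can only inspect their own processes. A count near the value of fs.inotify.max_user_instances for one UID suggests that user is the one hitting the limit.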

This guide explains how to increase the fs.inotify.max_user_instances limit on Linux systems, including Kubernetes worker nodes and standalone virtual machines, to prevent these failures.

Note:
Our engineering team is aware of this limitation in the current node configuration. A future platform update is planned to include an updated default value in the base node image, which will eliminate the need for this manual workaround on newly created nodes.

Prerequisites

Before performing this procedure, ensure the following:

  • SSH access to the affected system (VM or node)

  • sudo privileges

  • Basic familiarity with Linux system administration

If troubleshooting a Kubernetes cluster, ensure:

  • kubectl is installed and configured

  • Administrative access to the cluster

You may identify affected systems by reviewing logs from services that rely on file watchers.

Example (Kubernetes):

kubectl logs -n nvidia-gpu-operator <pod-name>

Look for errors containing "too many open files".

Important:
If the system is part of a Kubernetes cluster, the fallback in step 3 (draining the node and restarting kubelet and containerd) temporarily evicts workloads from that node. Perform this change during a maintenance window when possible.

Step-by-Step Instructions

1. Verify the Current Limit

SSH into the affected worker node or VM and check the current max_user_instances value.

ssh ubuntu@<node-public-ip>

cat /proc/sys/fs/inotify/max_user_instances

Example output:

128

If the value is 128, the system is using the Linux default, and the limit should be increased for Kubernetes workloads.
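It can also help to look at the other inotify limits at the same time, since fs.inotify.max_user_watches is a separate limit whose exhaustion is usually reported as "no space left on device" rather than "too many open files". The following is a minimal sketch; the function name show_inotify_limits and its optional directory argument are illustrative only.

```shell
# Print every inotify limit with its current value.
# The optional first argument is a hypothetical override of the directory, for testing.
show_inotify_limits() {
    local dir="${1:-/proc/sys/fs/inotify}"
    local f
    for f in "$dir"/*; do
        [ -r "$f" ] || continue
        printf 'fs.inotify.%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

show_inotify_limits
```

On a typical Linux system this lists max_queued_events, max_user_instances, and max_user_watches. No root access is needed to read these values.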

2. Increase the inotify max_user_instances Limit

Update the system configuration to increase the limit and make it persistent across reboots.

Add the new value to /etc/sysctl.conf:

echo "fs.inotify.max_user_instances = 8192" | sudo tee -a /etc/sysctl.conf

Apply the configuration immediately:

sudo sysctl -p

Verify the updated value:

cat /proc/sys/fs/inotify/max_user_instances

Expected output:

8192

Note:

The value 8192 is sufficient for most Kubernetes workloads.
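Running the tee -a command more than once appends a duplicate line each time. That is harmless, since the last entry wins, but it clutters the file. As an alternative, the sketch below writes the setting exactly once, replacing any earlier assignment of the same key. The function name set_sysctl_entry and the drop-in file name used in the usage note are assumptions for illustration, not something this article's procedure requires.

```shell
# set_sysctl_entry KEY VALUE FILE: ensure FILE contains exactly one "KEY = VALUE" line.
set_sysctl_entry() {
    local key="$1" value="$2" file="$3" pattern
    touch "$file"
    # Escape the dots in KEY so they match literally in the regular expression.
    pattern="^[[:space:]]*$(printf '%s' "$key" | sed 's/\./\\./g')[[:space:]]*="
    # Keep every line that does not assign KEY (grep exits 1 if nothing is kept).
    grep -Ev "$pattern" "$file" > "$file.tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}
```

As root, this could be called as set_sysctl_entry fs.inotify.max_user_instances 8192 /etc/sysctl.d/99-inotify.conf (a hypothetical file name), followed by sysctl --system. Note that sysctl -p on its own reads only /etc/sysctl.conf, so a drop-in file under /etc/sysctl.d needs sysctl --system to be loaded.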

3. Restart Affected Services

Some services may need to be restarted to fully recover after the limit is increased.

For Standalone VMs

Restart the affected service(s). For example:

sudo systemctl restart <service-name>

If you are unsure which service is affected, restarting the application or runtime that was producing the error may resolve the issue.

For Kubernetes Nodes

After running sysctl -p, the new limit takes effect immediately in the kernel.

Pods that were previously failing (for example, in CrashLoopBackOff) will automatically retry. Since new inotify_init() calls will now succeed under the raised limit, these pods should recover on their own within a few minutes.

In most cases, no restart of kubelet or containerd is required.

Wait 2–3 minutes, then check whether failing pods have recovered:

kubectl get pods -A --field-selector spec.nodeName=<node-name> | grep -v Running
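The grep -v Running filter also prints the header line and pods in the Completed state, and it hides pods that are Running but not yet Ready. A slightly stricter filter can be sketched with awk, assuming the default kubectl get pods -A column layout (NAMESPACE, NAME, READY, STATUS). The function name unhealthy_pods is illustrative.

```shell
# Read "kubectl get pods -A" output on stdin and print pods that need attention:
# anything not Running/Completed, plus Running pods whose READY count is incomplete.
unhealthy_pods() {
    awk 'NR > 1 && $4 != "Completed" {
        split($3, ready, "/")
        if ($4 != "Running" || ready[1] != ready[2])
            print $1 "/" $2, $4, $3
    }'
}
```

Usage would be kubectl get pods -A --field-selector spec.nodeName=<node-name> | unhealthy_pods. Empty output means every pod on the node is either Completed or Running with all containers ready.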

If pods have not recovered after several minutes, kubelet or containerd may themselves be in a degraded state. In that case, restart them.

First drain the node to safely evict pods:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Restart the services:

sudo systemctl restart kubelet
sudo systemctl restart containerd

Allow the node to accept workloads again:

kubectl uncordon <node-name>

4. Verify the Fix

Check that the affected workloads are now running correctly.

For example, if the nvidia-device-plugin was failing:

kubectl logs -n nvidia-gpu-operator nvidia-device-plugin-daemonset-<pod-id> --tail=20

The logs should no longer show "too many open files" errors.

You can also verify that GPUs are now allocatable on the node:

kubectl describe node <node-name> | grep -A 10 "Allocatable"
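If you prefer a single number over scanning the describe output, the allocatable GPU count can be queried directly. This sketch assumes the GPUs are advertised under the resource name nvidia.com/gpu, which is what the NVIDIA device plugin normally registers; the function name gpu_allocatable is illustrative.

```shell
# Print the number of allocatable GPUs reported by a node.
# Assumes the resource name nvidia.com/gpu; dots in the key are escaped for jsonpath.
gpu_allocatable() {
    kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

Calling gpu_allocatable <node-name> should print the GPU count once the device plugin has recovered. Empty output means the node is not currently advertising that resource.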

5. Repeat for Other Affected Nodes (Optional: Applies to Kubernetes worker nodes)

Repeat steps 1–4 for each affected worker node.

kubectl debug leaves a completed debugging pod behind for each node, so delete those pods afterwards. You can check the current value across nodes using:

for node in $(kubectl get nodes -o name); do
  echo "$node: $(kubectl debug "$node" -it --image=busybox -- \
    cat /proc/sys/fs/inotify/max_user_instances 2>/dev/null)"
done

Common Issues

Issue 1: Changes Persist Only on the Current System

Resolution: The entry added to /etc/sysctl.conf survives reboots, but only on the system where it was added. VMs provisioned later from the base image may still use the default value of 128.

Reapply the configuration on each new system until the base image includes the updated default.

Issue 2: Changes Do Not Persist on Newly Created Nodes

Resolution: Any new nodes created by a node pool (for example during scaling or replacement) or new VMs provisioned from the base image will start with the default value of 128.

You will need to reapply this workaround until a platform-level fix is included in the base image.
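If your provisioning flow can run a script at first boot (an assumption; this article does not cover how to configure that), reapplying the workaround can be reduced to a small check that only raises the limit and never lowers it. The function name ensure_min_value below is illustrative.

```shell
# ensure_min_value FILE MIN: raise the integer stored in FILE to MIN if it is lower.
# A value that is already higher (for example, from a newer base image) is left alone.
ensure_min_value() {
    local file="$1" min="$2" current
    current=$(cat "$file") || return 1
    if [ "$current" -lt "$min" ]; then
        echo "$min" > "$file"
        echo "raised $file from $current to $min"
    else
        echo "left $file at $current"
    fi
}
```

Run as root, ensure_min_value /proc/sys/fs/inotify/max_user_instances 8192 changes the running kernel value only. The /etc/sysctl.conf entry from step 2 is still needed for the setting to survive a reboot.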

Issue 3: The Issue Persists After Increasing max_user_instances

Resolution: In some cases, the error may instead be caused by the file descriptor limit (LimitNOFILE) being too low. If the issue persists, refer to the companion KB article:

How-To Increase the Open Files (nofiles) ulimit on Kubernetes Worker Nodes
