Introduction
During high-concurrency pod scaling events (large pod fanouts), Kubernetes workloads using NFS volumes may experience intermittent mount failures. These failures surface as the Linux kernel error Required key not available (ENOKEY) or as API-level 409 Conflict errors during volume attachment.
The condition is triggered by rapid, concurrent DNS resolutions during a large pod scale-up. When many worker nodes simultaneously resolve the multi-homed VAST storage endpoint (nfs.crusoecloudcompute.com), divergent DNS lookups between the primary mount process and its sub-connection handlers (configured via nconnect=16) produce connection-state mismatches.
This article walks you through identifying the condition and upgrading your Crusoe CSI driver to the corrected production version. The fix switches the driver to a userspace Go resolver that resolves the endpoint exactly once and moves lookups to the node's local resolver, bypassing cluster-level CoreDNS entirely.
Prerequisites
- Running CMK Cluster Using NFS as a Storage Option
- kubectl CLI Installed and Configured With Your CMK Cluster's Kubeconfig (Get Kubeconfig)
- Helm CLI Installed
Instructions
Step 1: Identify the Error State
- Inspect the events in your namespace to look for FailedMount errors matching the ENOKEY error signature:
  kubectl get events -n <your-namespace> --sort-by='.metadata.creationTimestamp' | grep FailedMount
- Verify if the log or event message matches the following specific string pattern:
  MountVolume.SetUp failed for volume [...] : rpc error: code = Internal desc = ... Required key not available
- Check for concurrent volume attach conflicts on the Kubernetes API side using the controller logs:
  kubectl logs deployment/crusoe-csi-controller -n crusoe-system | grep "409 Conflict"
ℹ️ Note: The CSI driver namespace is typically crusoe-system. Older deployments may use kube-system. Substitute your actual namespace in the commands above and below.
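To gauge how widespread the failures are, the event check in Step 1 can be wrapped in a small counter. This is a minimal sketch: count_enokey_events is a hypothetical helper name, not part of kubectl or the Crusoe CSI driver, and it assumes each event is printed on a single line, as in the default kubectl get events output.

```shell
# Hypothetical helper (not part of kubectl or the Crusoe CSI driver).
# Reads event lines on stdin and prints how many carry both the
# FailedMount reason and the ENOKEY message text.
count_enokey_events() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep 'FailedMount' | grep -c 'Required key not available' || true
}

# Example use (replace the placeholder with your namespace):
#   kubectl get events -n <your-namespace> | count_enokey_events
```

A count that rises during a scale-up and falls to zero once kubelet retries succeed is consistent with the condition described in this article.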
Step 2: Update the Crusoe CSI Driver
The issue is mitigated in Crusoe CSI Driver Helm Chart version 0.5.0 (and later).
- Ensure the official Crusoe Helm repository is added and up to date:
  helm repo add crusoe-csi-driver https://crusoecloud.github.io/crusoe-csi-driver-helm-charts/charts
  helm repo update crusoe-csi-driver
- Verify that version 0.5.0 or later is listed in the repository:
  helm search repo crusoe-csi-driver/crusoe-csi-driver --versions
- Prepare an override_values.yaml so your API credential paths are mapped correctly during the upgrade:
  crusoe:
    secrets:
      crusoeApiKeys:
        secretName: "crusoe-api-keys"
        accessKeyPath: "CRUSOE_ACCESS_KEY"
        secretKeyPath: "CRUSOE_SECRET_KEY"
  - secretName: name of the Kubernetes Secret holding your Crusoe API credentials; must match the Secret in your CSI namespace.
  - accessKeyPath: the key within that Secret that stores your Crusoe access key.
  - secretKeyPath: the key within that Secret that stores your Crusoe secret key.
- Execute the upgrade against your CSI namespace:
  helm upgrade --install crusoe-csi-driver crusoe-csi-driver/crusoe-csi-driver \
    --namespace <your-csi-namespace> \
    --version 0.5.0 \
    -f override_values.yaml
- Confirm all DaemonSet pods are completely updated and in a Ready state:
  kubectl get daemonset crusoe-csi-node -n <your-csi-namespace>
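If you script the upgrade, it can help to check a chart version string against the 0.5.0 minimum before and after running it. The following is a minimal sketch: version_at_least is a hypothetical helper, and it assumes a sort that supports -V (GNU coreutils and recent BSD/macOS versions do). Reading the installed chart version out of the helm list output is left to you.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes sort(1) supports -V (version sort).
version_at_least() {
  # The smaller of the two versions sorts first; if that is the
  # required minimum ($2), then $1 is greater than or equal to it.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use with a version string copied from the CHART column of "helm list":
#   version_at_least "0.5.1" "0.5.0" && echo "chart includes the fix"
```

Version sort matters here because a plain string comparison would rank 0.10.0 below 0.5.0.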
Frequently Asked Questions
Q: Why did this issue occur primarily during large pod scale-ups ("fanouts")?
A: When hundreds of pods deploy simultaneously, they issue rapid back-to-back mount calls. With NFS configurations like nconnect=16 and remoteports=dns, each mount triggers a sequence of backend network connections. Under heavy parallel load, different stages of the same mount sequence can receive different IP addresses from the DNS load balancer pool. This IP drift breaks the secure handshake, and the Linux kernel returns a Required key not available (ENOKEY) error.
Q: Does this issue cause permanent data loss or disk corruption?
A: No. This is exclusively an identification and handshake issue at the network/mount layer. The underlying data stored on your flash volumes (c2-home, c2-datadisk, etc.) remains completely intact and uncorrupted.
Q: How does the CSI driver upgrade fix the root cause?
A: The upgraded driver uses a userspace Go resolver that resolves the VAST storage domain exactly once per mount request. It pins that specific set of IP addresses and passes them directly into the kernel mount command, so every stage of the mount uses the same addresses and the divergent DNS lookups cannot occur. It also decouples mounting from CoreDNS, so cluster DNS issues during scale events no longer affect volume availability.
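The resolve-once idea can be illustrated in a few lines of shell. This is only a sketch of the concept, not the driver's code (the driver's resolver is written in Go): pin_ips is a hypothetical helper, and the commented usage assumes a glibc-based node where getent ahostsv4 is available. How the pinned list is handed to the mount command is not shown, because the article does not specify it.

```shell
# Illustration only; this is not the Crusoe CSI driver's implementation.
# Hypothetical helper: reduce resolver output (one "IP ..." entry per line)
# to a single sorted, de-duplicated, comma-separated list.
pin_ips() {
  awk 'NF { print $1 }' | sort -u | paste -sd, -
}

# Resolve once, then reuse the same pinned list for every connection
# that belongs to the mount, instead of resolving again per connection:
#   PINNED_IPS="$(getent ahostsv4 nfs.crusoecloudcompute.com | pin_ips)"
```

Because the list is captured in one lookup, later stages of the mount cannot observe a different answer from the DNS pool.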
Q: My cluster uses a gang-scheduler (like Kueue) and jobs are still getting evicted before retries can complete. What temporary workaround is available?
A: Because kubelet and the CSI framework retry these mounts automatically, pods will eventually connect if given enough time. If your topology-aware or gang-scheduled jobs are timing out and getting evicted too quickly, increase your Kueue (or scheduler) waitForPodsReady timeout to 10m. That buffer lets the system clear the backlogged mounts until you can apply the permanent driver upgrade.
Example
A training cluster with 200 GPU worker nodes is scaling up after a scheduled maintenance window. As the cluster autoscaler provisions the nodepool and pods begin scheduling, you notice that roughly 10–15% of the pods are stuck in the ContainerCreating state with repeated FailedMount events:
MountVolume.SetUp failed for volume "pvc-abc123" : rpc error: code = Internal desc = Required key not available
The remaining pods eventually mount and become ready after kubelet retries, but the affected pods have already exceeded their scheduler's timeout and are evicted. After upgrading the CSI driver to v0.5.0 and redeploying the workload, all pods mount cleanly on the first attempt, even during the next large-scale fanout event.