Introduction
Pods on a CMK GPU node can fail to start with the status UnexpectedAdmissionError and an event like:
Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
This happens when the kubelet's internal GPU accounting drifts out of sync with what is actually free on the node. The kubelet device manager tracks GPU assignments in a local checkpoint file, separate from the NVIDIA device plugin's advertised capacity — so the node can report 10 allocatable GPUs while the kubelet privately believes fewer are free.
The known trigger is simultaneous teardown of multiple GPU pods on the same node. When several containers exit at once — a Deployment rollout, a mass Job preemption, an eviction storm — the container runtime can drop container records before the kubelet processes their exit events. The kubelet then never releases those GPU allocations, and they leak in its checkpoint.
This is most visible on high-density slice nodes (for example l40s-48gb.10x) running many single-GPU replicas of the same Deployment or Job, because that workload shape makes simultaneous same-node teardown likely. Affected pods never recover on their own, and the leak can compound across subsequent rollouts.
This article covers how to confirm the condition, recover the affected node, and prevent recurrence.
Prerequisites
- kubectl Access to Your CMK Cluster
- RBAC Permissions to Delete Pods and Cordon Nodes
- SSH Access to the Affected Node (Kubelet Restart Only)
Instructions
Step 1: Confirm the Symptom
List failing pods and check the failure reason:
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME> | grep UnexpectedAdmissionError
kubectl describe pod <POD_NAME> -n <NAMESPACE>
The describe output shows the Allocate failed due to requested number of devices unavailable for nvidia.com/gpu message. If multiple pods on the same node show this within a short window, you are likely looking at a leaked-allocation cascade rather than genuine capacity exhaustion.
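To see whether the failures cluster on one node, the same check can be tallied per node. The following is a minimal sketch, not an official tool: the function name tally_admission_errors is hypothetical, and it assumes the rejection reason is exposed in the pod's .status.reason field.

```shell
#!/usr/bin/env bash
# Tally pods rejected with UnexpectedAdmissionError, grouped by node.
# Reads "NODE REASON" rows on stdin and prints "<node> <count>", sorted by node name.
# Assumption: the rejection reason appears in .status.reason.
tally_admission_errors() {
  awk '$2 == "UnexpectedAdmissionError" { count[$1]++ }
       END { for (node in count) print node, count[node] }' | sort
}

# Live usage (requires cluster access):
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NODE:.spec.nodeName,REASON:.status.reason' \
#     | tally_admission_errors
```

A node with several rejected pods in a short window is the likely candidate for the checks in the next step.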
Step 2: Verify It Is an Accounting Leak, Not Real Exhaustion
Compare what the node advertises against what is actually scheduled:
kubectl describe node <NODE_NAME> | grep -A 8 "Allocated resources"
Count the GPU requests of pods in Running state on the node. If running pods request fewer GPUs than the node's allocatable count, yet new pods still fail admission, the kubelet checkpoint has leaked allocations.
ℹ️ Note: Pods in UnexpectedAdmissionError state hold no GPU; the failure occurs before allocation. They do, however, clutter scheduling and must be removed manually.
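The comparison in this step can be scripted rather than counted by hand. The sketch below is illustrative only: sum_gpu_requests is a hypothetical helper, it counts regular containers and ignores init containers, and it assumes kubectl prints multiple per-container values as a comma-separated list and missing values as <none>.

```shell
#!/usr/bin/env bash
# Sum nvidia.com/gpu requests from rows of "NAMESPACE NAME GPU" on stdin.
# GPU may be "<none>", a single integer, or a comma-separated list
# (one entry per container that requests a GPU). Init containers are not counted.
sum_gpu_requests() {
  awk '{
         n = split($3, parts, ",")
         for (i = 1; i <= n; i++) if (parts[i] ~ /^[0-9]+$/) total += parts[i]
       }
       END { print total + 0 }'
}

# Live usage (requires cluster access):
#   NODE=<NODE_NAME>
#   # GPUs the node advertises as allocatable:
#   kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'
#   # GPUs requested by Running pods on that node:
#   kubectl get pods -A --no-headers \
#     --field-selector "spec.nodeName=$NODE,status.phase=Running" \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,GPU:.spec.containers[*].resources.requests.nvidia\.com/gpu' \
#     | sum_gpu_requests
```

If the second number is lower than the first while new pods are still rejected, that matches the leaked-allocation pattern described above.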
Step 3: Delete the Stuck Pods
Failed-admission pods never retry. Delete them so their controllers create fresh replicas:
kubectl delete pod <POD_NAME> -n <NAMESPACE>
⚠️ Warning: If the replacement pods land on the same node, they may fail again until the leak is cleared in Step 4. Cordon the node first (kubectl cordon <NODE_NAME>) to force replacements onto healthy nodes.
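When many pods are stuck, deleting them one at a time is tedious. As a hedged sketch, the hypothetical helper below turns a pod listing into delete commands and only prints them, so they can be reviewed before anything is removed. It assumes the rejection reason is available in .status.reason.

```shell
#!/usr/bin/env bash
# Turn rows of "NAMESPACE NAME REASON" into kubectl delete commands for pods
# whose reason is UnexpectedAdmissionError. Prints commands only; runs nothing.
print_delete_commands() {
  awk '$3 == "UnexpectedAdmissionError" {
         printf "kubectl delete pod %s -n %s\n", $2, $1
       }'
}

# Live usage (requires cluster access):
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=<NODE_NAME> \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason' \
#     | print_delete_commands
```

After checking the printed commands, they can be run by appending | sh to the pipeline.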
Step 4: Clear the Leaked Allocations
Restarting the kubelet rebuilds the device manager checkpoint from the live container runtime state, dropping the leaked entries. If you have SSH access to the node:
sudo systemctl restart kubelet
Running GPU pods on the node survive the restart — the kubelet reconciles them from the runtime. Once the restart completes, uncordon the node:
kubectl uncordon <NODE_NAME>
If you do not have node access, or if the error recurs immediately, open a support ticket referencing this article — Crusoe Support can recover the node for you.
Step 5: Prevent Recurrence
The leak requires simultaneous multi-pod teardown on one node. Remove that precondition:
- Spread replicas of the same Deployment across nodes with pod anti-affinity or a topologySpreadConstraints rule keyed on kubernetes.io/hostname.
- Stagger rollouts with maxUnavailable: 1 in the Deployment strategy instead of a percentage.
- Scale down batch or low-priority Jobs gracefully rather than deleting them in bulk.
💡 Tip: Anti-affinity is the strongest mitigation on high-density slice nodes — a rolling restart of co-located replicas is the single most common trigger.
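The first two mitigations can be applied to an existing Deployment as a single patch. What follows is a sketch under stated assumptions rather than a recommended configuration: write_spread_patch is a hypothetical helper, the pods are assumed to carry an app label, and maxSkew: 1 with whenUnsatisfiable: ScheduleAnyway are illustrative choices (DoNotSchedule enforces the spread strictly but can leave pods Pending).

```shell
#!/usr/bin/env bash
# Write a strategic-merge patch that staggers rollouts and spreads replicas
# across nodes. $1 = output file, $2 = value of the pods' "app" label.
# Assumptions: the "app" label key, maxSkew, and whenUnsatisfiable values
# are illustrative and should be adapted to the actual Deployment.
write_spread_patch() {
  cat > "$1" <<EOF
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: $2
EOF
}

# Live usage (requires cluster access):
#   write_spread_patch gpu-spread-patch.yaml <APP_LABEL_VALUE>
#   kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
#     --patch-file gpu-spread-patch.yaml
```

Because the patch changes the pod template, applying it starts a new rollout, so it is safer to apply after the affected node has been recovered.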
Example
A team runs a 5-replica inference Deployment of single-GPU pods, several of which land on the same 10-GPU L40S node. A routine image update triggers a rolling restart that tears down all co-located replicas at once. Minutes later, new pods on that node begin failing with UnexpectedAdmissionError, and over the next several hours dozens of pods cascade into the same state — while kubectl describe node still shows free GPU capacity.
The team deletes the failed pods, cordons the node, restarts the kubelet to rebuild the checkpoint, then adds topologySpreadConstraints and maxUnavailable: 1 to the Deployment. Subsequent rollouts complete without admission failures.