How-To Resolve UnexpectedAdmissionError on CMK GPU Nodes

Introduction

Pods on a CMK GPU node can fail to start with the status UnexpectedAdmissionError and an event like:

Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected

This happens when the kubelet's internal GPU accounting drifts out of sync with what is actually free on the node. The kubelet device manager tracks GPU assignments in a local checkpoint file, separate from the NVIDIA device plugin's advertised capacity — so the node can report 10 allocatable GPUs while the kubelet privately believes fewer are free.

The known trigger is simultaneous teardown of multiple GPU pods on the same node. When several containers exit at once — a Deployment rollout, a mass Job preemption, an eviction storm — the container runtime can drop container records before the kubelet processes their exit events. The kubelet then never releases those GPU allocations, and they leak in its checkpoint.

This is most visible on high-density slice nodes (for example l40s-48gb.10x) running many single-GPU replicas of the same Deployment or Job, because that workload shape makes simultaneous same-node teardown likely. Affected pods never recover on their own, and the leak can compound across subsequent rollouts.

This article covers how to confirm the condition, recover the affected node, and prevent recurrence.

Prerequisites

kubectl access to your CMK cluster
RBAC permissions to delete pods and cordon nodes
SSH access to the affected node (kubelet restart only)

Instructions

Step 1: Confirm the Symptom

List failing pods and check the failure reason:

kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME> | grep UnexpectedAdmissionError
kubectl describe pod <POD_NAME> -n <NAMESPACE>

The describe output shows the "Allocate failed due to requested number of devices unavailable for nvidia.com/gpu" message. If multiple pods on the same node show this within a short window, you are likely looking at a leaked-allocation cascade rather than genuine capacity exhaustion.
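To see at a glance whether failures cluster on one node, the per-node count can be sketched as a small filter. This is a minimal sketch, not a supported tool: the function name count_admission_failures is hypothetical, and it assumes the failure reason is exposed in the pod's status.reason field, which is what the STATUS column of kubectl get pods normally reflects for rejected pods.

```shell
# Count UnexpectedAdmissionError pods per node.
# Reads whitespace-separated rows on stdin: NAMESPACE NAME NODE REASON
# (hypothetical helper; the column layout matches the usage shown below).
count_admission_failures() {
  awk '$4 == "UnexpectedAdmissionError" { count[$3]++ }
       END { for (node in count) print node, count[node] }' | sort
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.reason' \
#     | count_admission_failures
```

A node that shows several failures while its neighbors show none is a strong hint that the problem is local kubelet state rather than cluster-wide GPU shortage.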

Step 2: Verify It Is an Accounting Leak, Not Real Exhaustion

Compare what the node advertises against what is actually scheduled:

kubectl describe node <NODE_NAME> | grep -A 12 "Allocated resources"

Count the GPU requests of pods in Running state on the node. If running pods request fewer GPUs than the node's allocatable count, yet new pods still fail admission, the kubelet checkpoint has leaked allocations.
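Counting by hand is error-prone on a 10-GPU node, so the comparison can be sketched as a small script. This is a minimal sketch under stated assumptions: the function name sum_gpu_requests is hypothetical, it counts only regular containers (not init containers), and it assumes GPU requests appear under the nvidia.com/gpu resource name.

```shell
# Sum nvidia.com/gpu requests from rows on stdin.
# Expected columns: NAMESPACE NAME GPU, where GPU is a number, a
# comma-separated list of numbers (one per container), or "<none>".
sum_gpu_requests() {
  awk '{
    n = split($3, parts, ",")
    for (i = 1; i <= n; i++)
      if (parts[i] ~ /^[0-9]+$/) total += parts[i]   # skip "<none>"
  } END { print total + 0 }'
}

# Usage against a live cluster (NODE is a placeholder):
#   NODE=<NODE_NAME>
#   kubectl get pods -A --no-headers \
#     --field-selector "spec.nodeName=$NODE,status.phase=Running" \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,GPU:.spec.containers[*].resources.requests.nvidia\.com/gpu' \
#     | sum_gpu_requests
#   kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

If the first number is lower than the second and pods still fail admission on that node, the gap points to leaked allocations rather than real exhaustion.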

ℹ️ Note: Pods in UnexpectedAdmissionError state hold no GPU, because the failure occurs before allocation. They do, however, linger as failed pod objects and must be removed manually.

Step 3: Delete the Stuck Pods

Failed-admission pods never retry. Delete them so their controllers create fresh replicas:

kubectl delete pod <POD_NAME> -n <NAMESPACE>

⚠️ Warning: If the replacement pods land on the same node, they may fail again until the leak is cleared in Step 4. Cordon the node first (kubectl cordon <NODE_NAME>) to force replacements onto healthy nodes.
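When a cascade has left many failed pods behind, deleting them one at a time is tedious. The sketch below generates the delete commands for a single node so they can be reviewed before anything is run. The function name print_delete_commands is hypothetical, and the sketch assumes the failure reason is available in each pod's status.reason field.

```shell
# Print one "kubectl delete pod" command per UnexpectedAdmissionError pod
# on the given node. Reads rows on stdin: NAMESPACE NAME NODE REASON.
# Nothing is deleted here; the output is meant to be reviewed first.
print_delete_commands() {
  awk -v node="$1" '
    $3 == node && $4 == "UnexpectedAdmissionError" {
      printf "kubectl delete pod %s -n %s\n", $2, $1
    }'
}

# Usage against a live cluster:
#   kubectl cordon <NODE_NAME>
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.reason' \
#     | print_delete_commands <NODE_NAME>
```

Printing the commands instead of executing them is a deliberate choice: you can inspect the list, then pipe it to sh once it looks right.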

Step 4: Clear the Leaked Allocations

Restarting the kubelet rebuilds the device manager checkpoint from the live container runtime state, dropping the leaked entries. If you have SSH access to the node:

sudo systemctl restart kubelet

Running GPU pods on the node survive the restart — the kubelet reconciles them from the runtime. Once the restart completes, uncordon the node:

kubectl uncordon <NODE_NAME>

If you do not have node access, or if the error recurs immediately, open a support ticket referencing this article — Crusoe Support can recover the node for you.

Step 5: Prevent Recurrence

The leak requires simultaneous multi-pod teardown on one node. Remove that precondition:

Spread replicas of the same Deployment across nodes with pod anti-affinity or a topologySpreadConstraints rule keyed on kubernetes.io/hostname.
Stagger rollouts with maxUnavailable: 1 in the Deployment strategy instead of a percentage.
Scale down batch or low-priority Jobs gracefully rather than deleting them in bulk.
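The first two mitigations above can be applied to an existing Deployment with a single patch. The following is a minimal sketch, not a recommended production configuration: the function name build_rollout_patch is hypothetical, the "app" label key is an assumption about how your pod template is labeled, and whenUnsatisfiable is set to ScheduleAnyway so that pods are not left Pending when the cluster has few nodes.

```shell
# Print a JSON merge patch that sets maxUnavailable to 1 and asks the
# scheduler to spread matching pods across nodes.
# $1 is the value of the pod template's "app" label (assumed label key).
build_rollout_patch() {
  app_label="$1"
  cat <<EOF
{
  "spec": {
    "strategy": {
      "type": "RollingUpdate",
      "rollingUpdate": { "maxUnavailable": 1 }
    },
    "template": {
      "spec": {
        "topologySpreadConstraints": [
          {
            "maxSkew": 1,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": { "matchLabels": { "app": "${app_label}" } }
          }
        ]
      }
    }
  }
}
EOF
}

# Usage against a live cluster:
#   kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
#     --type merge -p "$(build_rollout_patch <APP_LABEL>)"
```

Because the patch changes the pod template, applying it starts a new rollout, so consider applying it while the affected node is still cordoned. A merge patch also replaces any existing topologySpreadConstraints list on the Deployment rather than appending to it.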

💡 Tip: Anti-affinity is the strongest mitigation on high-density slice nodes — a rolling restart of co-located replicas is the single most common trigger.

Example

A team runs a 5-replica inference Deployment of single-GPU pods, several of which land on the same 10-GPU L40S node. A routine image update triggers a rolling restart that tears down all co-located replicas at once. Minutes later, new pods on that node begin failing with UnexpectedAdmissionError, and over the next several hours dozens of pods cascade into the same state — while kubectl describe node still shows free GPU capacity.

The team deletes the failed pods, cordons the node, restarts the kubelet to rebuild the checkpoint, then adds topologySpreadConstraints and maxUnavailable: 1 to the Deployment. Subsequent rollouts complete without admission failures.
