How-To Recover a CMK Node Pool VM After a Stop/Start

Introduction

Node pool VMs in Crusoe Managed Kubernetes (CMK) are designed to be replaceable, not repairable.

When you want to rotate or remediate a worker node, the supported operation is to delete the VM — the node pool controller detects the gap and automatically provisions a fresh replacement that runs the full bootstrap pipeline and joins the cluster.

A stop/start, by contrast, is a power-cycle at the VM layer. The node pool controller treats a stopped worker as unhealthy and may provision a replacement while the original VM sits in SHUTOFF. When you later try to start the original VM, the replacement may already have consumed the underlying capacity, and the start fails with an out of stock error.

Even when the stop/start itself succeeds, the node can come back NotReady. Worker node bootstrap — including ephemeral NVMe initialization — runs once, at initial provisioning. A restarted VM can land on a different physical host where the ephemeral NVMe drives are blank, which breaks the /var/lib/containerd → /mnt/nvme/containerd symlink, takes down containerd, and prevents kubelet from starting.

This article covers how to recover from both outcomes: a stopped VM that won't start due to out of stock, and a restarted VM whose node is stuck in NotReady.

Prerequisites

Access to the Crusoe Console or Crusoe CLI
kubectl Access to the CMK Cluster
SSH Access to the Affected Worker Node

Instructions

Step 1: Identify Which Scenario You Are In

List the VMs in the affected node pool and note their states:

crusoe compute vms list

Look for VMs sharing your node pool prefix (for example, np-<NODE_POOL_ID>-2). Two patterns are possible:

The stopped VM shows SHUTOFF and new VMs with higher index suffixes appeared around the time of the stop — the controller has already replaced the node. Go to Step 2.
The same VM shows RUNNING but its Kubernetes node is NotReady in kubectl get nodes — the VM restarted but failed to rejoin the cluster. Go to Step 4.
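
The two checks above can be scripted. The following is a minimal sketch, not an official tool: the function names are made up for illustration, and it assumes only that each VM appears on its own line of the crusoe compute vms list output. The column layout of that output may differ between CLI versions, so the VM filter is a plain text match.

```shell
# Hypothetical helpers; names and the one-VM-per-line assumption are illustrative.

# Print the lines of a VM listing (read from stdin) that contain the node
# pool prefix given as $1, for example "np-<NODE_POOL_ID>".
pool_vms() {
  grep -F -- "$1" || true
}

# Print the names of nodes that are not Ready. Reads the output of
# `kubectl get nodes --no-headers` from stdin (columns: NAME STATUS ...).
# A cordoned but healthy node shows "Ready,SchedulingDisabled", so only the
# first comma-separated status word is compared.
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage (requires the crusoe CLI and kubectl to be configured):
#   crusoe compute vms list | pool_vms "np-<NODE_POOL_ID>"
#   kubectl get nodes --no-headers | not_ready_nodes
```

If pool_vms shows a SHUTOFF VM next to newer VMs, you are likely in the first pattern. If not_ready_nodes prints a node whose VM is RUNNING, you are likely in the second.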

Step 2: Recover From the Out-of-Stock Error (Replaced VM)

If you see an out of stock error when attempting to start the stopped VM, do not keep retrying the start. The error means the replacement VM has probably already taken the available capacity — the node pool has effectively healed itself.

Confirm the replacement nodes have joined the cluster and are healthy:

kubectl get nodes

ℹ️ Note: The replacement VMs are new Kubernetes nodes with new names. Any workloads pinned to the old node names (nodeSelector, affinity) need to be updated or rescheduled.
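
To find workloads still pinned to a replaced node by name, one option is to dump each pod's kubernetes.io/hostname nodeSelector and filter on the old node name. This is a sketch under assumptions: pods_pinned_to is a hypothetical helper, and it covers only hostname-based nodeSelector entries. Pods pinned through nodeAffinity rules need to be inspected separately.

```shell
# Hypothetical helper: reads tab-separated
# "namespace<TAB>pod<TAB>hostname-selector" lines from stdin and prints
# "namespace/pod" for every pod whose selector equals the node name in $1.
pods_pinned_to() {
  awk -F '\t' -v node="$1" '$3 == node { print $1 "/" $2 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeSelector.kubernetes\.io/hostname}{"\n"}{end}' \
#     | pods_pinned_to "<OLD_NODE_NAME>"
```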

Step 3: Re-Attach Disks and Verify Workloads (Replaced VM)

Disks that were attached to the stopped VM do not move to the replacement automatically.

⚠️ Warning: Do not delete the old stopped VMs until you have confirmed all attached disks are accounted for and re-attached. Deleting the VM while you are still mapping disk ownership risks losing track of which disk belonged where.

In the Crusoe Console, identify the disks attached to the stopped VM.
For independently attached (non-CSI) disks, detach them and re-attach them to one of the replacement VMs.
For disks provisioned by the Crusoe CSI driver through PVCs, do not move the disk by hand — let the CSI driver re-attach it when the pod is rescheduled onto a replacement node.

Verify that old VolumeAttachment objects have been cleaned up and that pods can bind their PVCs on the new nodes:

kubectl get volumeattachment
kubectl get pvc -A
kubectl describe pod <POD_NAME>
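
When many volumes are involved, it can help to filter VolumeAttachment objects down to those that still reference the old node. The sketch below assumes the listing is produced with the custom-columns layout shown in the usage comment; stale_attachments is a hypothetical helper name.

```shell
# Hypothetical helper: reads "NAME NODE ATTACHED" rows from stdin (no header)
# and prints the names of VolumeAttachments that still point at node $1.
stale_attachments() {
  awk -v node="$1" '$2 == node { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | stale_attachments "<OLD_NODE_NAME>"
```

An empty result suggests that no attachment still references the old node.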

Confirm the node remains Ready after the disks are attached.

Once everything is verified, delete the old SHUTOFF VMs from the Crusoe Console or CLI to stop incurring charges and keep the node pool clean.

Step 4: Recover a Running VM Stuck in NotReady

If the VM restarted but the node never rejoined the cluster, SSH in and check for the broken containerd symlink:

ls -la /var/lib/containerd && file /var/lib/containerd

If the bootstrap state was lost, the output will look like this:

lrwxrwxrwx 1 root root 20 Mar 11 09:51 /var/lib/containerd -> /mnt/nvme/containerd
/var/lib/containerd: broken symbolic link to /mnt/nvme/containerd
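
If you want to script this check, for example across several nodes, a dangling symlink can be detected with standard shell tests. This is a minimal sketch; is_broken_symlink is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds (exit 0) when $1 is a symlink whose target
# does not exist, which is the state shown in the output above.
# [ -L ] tests the link itself; [ -e ] follows the link to its target.
is_broken_symlink() {
  [ -L "$1" ] && [ ! -e "$1" ]
}

# Usage on the affected node:
#   if is_broken_symlink /var/lib/containerd; then
#     echo "containerd data directory symlink is broken"
#   fi
```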

Confirm the ephemeral NVMe drives are visible but have no filesystem:

lsblk -f
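
The same check can be made machine-readable with lsblk's key="value" output mode. The sketch below is illustrative only: blank_nvme_disks is a hypothetical helper, and it assumes the ephemeral drives appear as whole NVMe disks (names starting with nvme). A disk that carries a partition table rather than a filesystem also reports an empty FSTYPE, so treat the result as a hint rather than proof.

```shell
# Hypothetical helper: reads `lsblk -P -o NAME,FSTYPE,TYPE` output from stdin
# and prints the names of NVMe disks that report an empty FSTYPE.
blank_nvme_disks() {
  awk '/^NAME="nvme/ && / FSTYPE="" / && / TYPE="disk"/ {
         split($1, a, "\"")   # $1 looks like NAME="nvme0n1"
         print a[2]
       }'
}

# Usage on the affected node:
#   lsblk -P -o NAME,FSTYPE,TYPE | blank_nvme_disks
```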

Check the state of both services. In this state kubelet is inactive, and journalctl -u kubelet shows nothing useful because kubelet never started:

systemctl status containerd kubelet

The recommended recovery is to delete the VM (from the Crusoe Console or CLI, not just kubectl delete node) and let the node pool provision a replacement, which runs the full bootstrap pipeline (including NVMe initialization) on creation.

ℹ️ Note: If deleting the VM is operationally difficult in your environment — for example, your storage relies on per-node static IP allowlisting and a new node would require a lengthy allowlist update — open a support ticket before taking action. In-place recovery of the NVMe initialization is possible with Crusoe Support's assistance.

Step 5: Clean Up Autoscaler Side Effects

While GPU nodes are NotReady, the cluster autoscaler (if enabled) may scale up other node pools to absorb pending pods. After the GPU nodes recover, scale the inflated pool back to its intended size via the Console or CLI.

💡 Tip: Set explicit --nodes <min>:<max>:<pool> limits on your cluster autoscaler deployment for every node pool. Without explicit limits, a pool can be assigned a high default maximum and scale beyond what you intended during an incident.
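
Before applying new limits, it can be worth sanity-checking each value you plan to pass to --nodes. The following is a sketch only: check_nodes_spec is a hypothetical helper, gpu-pool is a made-up pool name, and how the pool identifier must be written depends on your autoscaler setup.

```shell
# Hypothetical helper: validates one "<min>:<max>:<pool>" value as passed to
# the cluster autoscaler's --nodes flag. Prints the parsed fields on success
# and returns a non-zero status when the value is malformed or min > max.
check_nodes_spec() {
  echo "$1" | awk -F ':' '
    NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 == "" { exit 1 }
    ($1 + 0) > ($2 + 0) { exit 1 }
    { printf "pool=%s min=%d max=%d\n", $3, $1, $2 }'
}

# Usage:
#   check_nodes_spec "1:4:gpu-pool" || echo "invalid --nodes value"
```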

Example

You are rotating two A100 worker nodes in a node pool ahead of a maintenance window. Instead of deleting them, you stop the VMs, intending to start them again later. When you try to start them, the Console returns an out of stock error.

Running crusoe compute vms list shows two new VMs in the same node pool, created minutes after the stop. kubectl get nodes confirms both replacements are Ready. One of the stopped VMs had two data disks attached, so you re-attach them to a replacement VM, confirm the pods mount their PVCs, and the node stays Ready. You then delete the two SHUTOFF VMs.

Going forward, your rotation procedure is to delete the VM instead of stopping it, letting the node pool create replacements automatically.
