Cluster Autoscaler Conflicts When Scaling Down a CMK Nodepool

Overview

When a CMK nodepool has a configured desired count, the platform actively works to maintain that number of running VMs. If a Cluster Autoscaler is also deployed in the cluster, manually stopping or deleting a node, or lowering the desired count, can set off a conflict loop: the nodepool begins to scale down, but the autoscaler detects pending pods and immediately provisions replacement nodes, overriding the manual change.

This conflict can result in capacity exhaustion within a customer's reserved allocation, "Out of Stock" errors when attempting to restart stopped VMs, and an inability to reduce the nodepool to a desired lower count without additional intervention.

Prerequisites

- Access to the Crusoe Console or API
- kubectl access to the CMK cluster
- Sufficient permissions to manage Kubernetes deployments (to scale down the autoscaler)

Steps

Identify the Conflict
- If you attempt to scale down a nodepool's desired count and new VMs continue to be provisioned immediately afterward, a Cluster Autoscaler deployment is likely overriding the change.
- Confirm by checking for an autoscaler deployment in the cluster:

     kubectl get deployments -n <autoscaler-namespace>
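
If the autoscaler's namespace is not known, the check above can be widened to every namespace. The following is a minimal sketch rather than an official procedure: the function names are hypothetical, and it assumes the deployment name contains the string "autoscaler", which may not hold for every installation. It also counts Pending pods, since those are what the autoscaler reacts to.

```shell
# Hypothetical helpers; the function names are illustrative, not part of any CLI.

# Print "<namespace> <name> <ready>" for every deployment whose name
# contains "autoscaler" (assumption: the deployment is named that way).
find_autoscaler_deployments() {
  kubectl get deployments --all-namespaces --no-headers \
    | awk 'tolower($2) ~ /autoscaler/ { print $1, $2, $3 }'
}

# Print the number of pods currently in the Pending phase, cluster-wide.
count_pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers \
    | awk 'NF { n++ } END { print n + 0 }'
}
```

An empty result from `find_autoscaler_deployments` only means nothing matched the naming assumption; it does not prove that no autoscaler is running.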

Temporarily Scale Down the Cluster Autoscaler
- Before making any nodepool changes, note the autoscaler deployment's current replica count (you will restore it later), then scale the deployment to 0 replicas to prevent it from interfering:

     kubectl scale deployment <autoscaler-deployment-name> --replicas=0 -n <autoscaler-namespace>

This stops the autoscaler from detecting resource deficits and spinning up replacement nodes while you perform recovery actions.
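
Because the autoscaler has to be restored to its original replica count afterward, it can help to record that count at the moment of pausing. The sketch below is one possible way to do so, under stated assumptions: `pause_autoscaler` and the state file are hypothetical conventions introduced here for illustration, not part of kubectl or the Crusoe platform.

```shell
# Hypothetical helper: save the deployment's replica count, then scale it to 0.
#   $1 = autoscaler deployment name
#   $2 = autoscaler namespace
#   $3 = path of a file in which to store the original replica count
pause_autoscaler() {
  # Read the currently configured replica count from the deployment spec.
  current="$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}')" || return 1

  # Do not overwrite a previously saved count if the deployment is already paused.
  if [ "$current" != "0" ]; then
    printf '%s\n' "$current" > "$3"
  fi

  kubectl scale deployment "$1" --replicas=0 -n "$2"
}
```

For example, `pause_autoscaler <autoscaler-deployment-name> <autoscaler-namespace> ./autoscaler-replicas.txt` would write the current count to `./autoscaler-replicas.txt` before scaling down.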

Perform the Intended Nodepool Change
- With the autoscaler paused, proceed with your nodepool scaling or VM management operation via the Crusoe Console or API (e.g., adjusting desired count, stopping or deleting a specific VM).

Re-Enable the Cluster Autoscaler
- Once the nodepool is in the desired state, restore the autoscaler deployment to its original replica count (the example uses 1; substitute the count the deployment had before it was paused):

     kubectl scale deployment <autoscaler-deployment-name> --replicas=1 -n <autoscaler-namespace>
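
Before re-enabling, it may be worth confirming from the cluster side that the expected number of nodes is Ready. The sketch below shows one way to do that and then restore the autoscaler. It assumes the original replica count was written to a file when the autoscaler was paused; that file and both function names are hypothetical, introduced here only for illustration.

```shell
# Hypothetical helper: print how many nodes report exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") and NotReady nodes are not counted.
count_ready_nodes() {
  kubectl get nodes --no-headers \
    | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: restore the autoscaler to a previously saved replica count.
#   $1 = autoscaler deployment name
#   $2 = autoscaler namespace
#   $3 = file containing the original replica count
resume_autoscaler() {
  replicas="$(cat "$3")" || return 1

  # Refuse to proceed unless the saved value is a non-negative integer.
  case "$replicas" in
    ''|*[!0-9]*) echo "no valid replica count in $3" >&2; return 1 ;;
  esac

  kubectl scale deployment "$1" --replicas="$replicas" -n "$2" || return 1

  # Wait for the restored replicas to roll out, giving up after two minutes.
  kubectl rollout status deployment "$1" -n "$2" --timeout=120s
}
```

Comparing the output of `count_ready_nodes` with the nodepool's desired count before calling `resume_autoscaler` gives a quick sanity check that the autoscaler will not immediately see a deficit.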

Resolution

The following describes how this conflict was resolved in a confirmed case:

A customer stopped a NotReady CMK worker node, intending to recover it. The nodepool's desired count of 2 triggered immediate provisioning of a replacement VM, consuming their remaining reserved H100 capacity.
Attempts to scale the nodepool's desired count down from 2 to 1 were repeatedly overridden by the Cluster Autoscaler, which detected pending pods and spun up new nodes each time.
Crusoe Support temporarily scaled the autoscaler deployment to 0 replicas, breaking the provisioning loop.
With the autoscaler paused, the original VM was successfully started and nodepool state was restored to the desired configuration.
The autoscaler was re-enabled and the cluster returned to normal operation.

