Cluster Autoscaler Conflicts When Scaling Down a CMK Nodepool

Matt Roark

Overview

When a CMK nodepool has a configured desired count, the platform actively works to maintain that number of running VMs. If a node is manually stopped or deleted while a Cluster Autoscaler is also deployed in the cluster, the two can enter a conflict loop: the nodepool attempts to scale down, but the autoscaler detects pending pods and immediately provisions replacement nodes, overriding the manual scale-down.

This conflict can result in capacity exhaustion within a customer's reserved allocation, "Out of Stock" errors when attempting to restart stopped VMs, and an inability to reduce the nodepool to a desired lower count without additional intervention.
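
Because the loop is driven by pending pods, it can help to look at what the autoscaler is reacting to. The following is a minimal sketch, assuming kubectl is already configured for the CMK cluster; the function name show_scale_up_pressure is a hypothetical helper, not part of any Crusoe tooling.

```shell
# Hypothetical helper: print the Pending pods that can trigger a scale-up,
# followed by the current node list, so the two can be compared.
show_scale_up_pressure() {
  echo "Pending pods:"
  # Field selectors on status.phase are a standard kubectl feature.
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
  echo "Nodes:"
  kubectl get nodes
}
```

If Pending pods appear each time a node is removed and new nodes join shortly afterward, the autoscaler is the likely source of the replacement VMs.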


Prerequisites

  • Access to the Crusoe Console or API
  • kubectl access to the CMK cluster
  • Sufficient permissions to manage Kubernetes deployments (to scale down the autoscaler)

Steps

  1. Identify the Conflict
    • If you lower a nodepool's desired count and new VMs are still provisioned immediately afterward, a Cluster Autoscaler deployment is likely overriding the change.
    • Confirm by checking for an autoscaler deployment in the cluster:
     kubectl get deployments -n <autoscaler-namespace>
  2. Temporarily Scale Down the Cluster Autoscaler
    • Before making any nodepool changes, note the autoscaler deployment's current replica count, then scale it to 0 replicas to prevent it from interfering:
     kubectl scale deployment <autoscaler-deployment-name> --replicas=0 -n <autoscaler-namespace>
    • This stops the autoscaler from detecting resource deficits and provisioning replacement nodes while you perform recovery actions.
  3. Perform the Intended Nodepool Change
    • With the autoscaler paused, proceed with your nodepool scaling or VM management operation via the Crusoe Console or API (e.g., adjusting the desired count, or stopping or deleting a specific VM).
  4. Re-Enable the Cluster Autoscaler
    • Once the nodepool is in the desired state, restore the autoscaler deployment to its original replica count (1 in this example; substitute the count you noted earlier if it differs):
     kubectl scale deployment <autoscaler-deployment-name> --replicas=1 -n <autoscaler-namespace>
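
The pause and restore steps above can be wrapped in a small script so the original replica count is not lost. This is a sketch under stated assumptions, not a Crusoe-provided tool: the function names pause_autoscaler and resume_autoscaler and the state-file argument are hypothetical, and the namespace and deployment name must be supplied by you.

```shell
# Hypothetical helpers for pausing and restoring a Cluster Autoscaler deployment.
# Arguments for both: <namespace> <deployment-name> <state-file>

# Record the current replica count in the state file, then scale to 0.
pause_autoscaler() {
  local ns="$1" deploy="$2" state_file="$3"
  local replicas
  replicas="$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')" || return 1
  [ -n "$replicas" ] || return 1
  # Do not overwrite a saved count if the deployment is already paused.
  if [ "$replicas" != "0" ]; then
    printf '%s\n' "$replicas" > "$state_file"
  fi
  kubectl scale deployment "$deploy" --replicas=0 -n "$ns"
}

# Scale the deployment back to the count recorded by pause_autoscaler.
resume_autoscaler() {
  local ns="$1" deploy="$2" state_file="$3"
  local replicas
  replicas="$(cat "$state_file")" || return 1
  kubectl scale deployment "$deploy" --replicas="$replicas" -n "$ns"
}
```

Call pause_autoscaler before the nodepool change and resume_autoscaler afterward, passing the same namespace, deployment name, and state-file path to both. Keeping the count in a file lets the restore step run from a later shell session.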

Resolution

The following describes how this conflict was resolved in a confirmed case:

  1. A customer stopped a NotReady CMK worker node, intending to recover it. The nodepool's desired count of 2 triggered immediate provisioning of a replacement VM, consuming their remaining reserved H100 capacity.
  2. Attempts to scale the nodepool's desired count down from 2 to 1 were repeatedly overridden by the Cluster Autoscaler, which detected pending pods and spun up new nodes each time.
  3. Crusoe Support temporarily scaled the autoscaler deployment to 0 replicas, breaking the provisioning loop.
  4. With the autoscaler paused, the original VM was successfully started and the nodepool was restored to the desired configuration.
  5. The autoscaler was re-enabled and the cluster returned to normal operation.
