
FAQ: CMK Cluster Autoscaler Skips Node Pool Due to Node Drift

Tanaya Atmaram Kambli

Last Updated: Mar 16, 2026

Introduction

This article explains why a node pool (for example, an A100 GPU pool) may stop scaling even when VMs appear as Running in Crusoe Cloud.

This situation occurs when a VM exists in the infrastructure layer but the corresponding node is missing from the Kubernetes cluster. As a result, the infrastructure and Kubernetes cluster become out of sync, which may cause the Cluster Autoscaler to skip the affected node pool.

You may also see the following error in the Cluster Autoscaler logs when it attempts to retrieve readiness information for the node pool:

Failed to find readiness information for <node-pool-id>
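To check whether a cluster is hitting this condition, the autoscaler logs can be searched for that message. The following is a minimal Python sketch, assuming the log text has already been saved or captured as a string. The function name and the sample pool ID are hypothetical, and the pattern assumes the pool ID is a single token with no whitespace.

```python
import re

# Matches the message quoted above and captures the node pool ID.
# Assumption: the ID is a single whitespace-free token.
READINESS_ERROR = re.compile(r"Failed to find readiness information for (\S+)")


def pools_with_readiness_errors(log_text: str) -> list[str]:
    """Return the distinct node pool IDs named in readiness errors, in first-seen order."""
    seen: dict[str, None] = {}
    for match in READINESS_ERROR.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    # Made-up sample lines; real log lines carry additional prefixes.
    sample = (
        "some unrelated line\n"
        "Failed to find readiness information for example-pool-id\n"
        "Failed to find readiness information for example-pool-id\n"
    )
    print(pools_with_readiness_errors(sample))
```

Any pool ID this returns is a candidate for the drift check described in the questions below.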


Question 1: Can a node showing as "Running" in Crusoe but not registered in Kubernetes cause the autoscaler to stop working?

Answer: 

Yes. If there is drift between:

  • VMs reported as Running in Crusoe Cloud, and

  • Nodes registered in Kubernetes,

the autoscaler will skip that node pool as a valid autoscaling target.

The autoscaler expects the infrastructure state and the Kubernetes node state to match. If a VM exists but its node object does not, the autoscaler treats the pool as inconsistent and does not scale it up to schedule pending pods.
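The drift described here amounts to a difference between two sets of names. Below is a small illustrative Python sketch of that comparison. It is not how the autoscaler itself performs the check; it assumes that each VM name in Crusoe Cloud matches its Kubernetes node name, and all names shown are hypothetical.

```python
def find_drift(vm_names, node_names):
    """Compare the two views and return (vms_without_node, nodes_without_vm)."""
    vms = set(vm_names)
    nodes = set(node_names)
    return sorted(vms - nodes), sorted(nodes - vms)


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    running_vms = ["a100-pool-1", "a100-pool-2", "a100-pool-3"]
    registered_nodes = ["a100-pool-1", "a100-pool-3"]

    vms_without_node, nodes_without_vm = find_drift(running_vms, registered_nodes)
    print("VMs with no node object:", vms_without_node)
    print("Nodes with no VM:", nodes_without_vm)
```

A non-empty first list corresponds to the case in this article: a VM that is Running but has no node registered in Kubernetes.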

 

Question 2: Why would a VM be "Running" in Crusoe but not appear in Kubernetes?

Answer:

One common cause is out-of-band node deletion from Kubernetes.

Example: Kubernetes audit logs show that a user manually deleted the node through a client (e.g., k9s), but the VM itself was never deleted from Crusoe Cloud. This results in:

  • VM: Running (Crusoe view)

  • Node object: Deleted (Kubernetes view)

This creates drift. Because the VM still exists at the infrastructure layer, Crusoe sees the expected number of instances running (n/n), so the node pool state remains RUNNING.
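If you have access to the cluster's audit logs, node deletions like the one in this example can be located by filtering the events. The sketch below assumes the audit log is available as JSON lines in the standard Kubernetes audit event format (fields such as verb, objectRef, user, and userAgent). How the log is obtained is outside the scope of this sketch, and the function name and sample values are hypothetical.

```python
import json


def node_deletions(audit_lines):
    """Yield (timestamp, username, user_agent, node_name) for completed node deletes."""
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue  # skip JSON values that are not audit events
        ref = event.get("objectRef") or {}
        if (
            event.get("verb") == "delete"
            and ref.get("resource") == "nodes"
            and event.get("stage") == "ResponseComplete"
        ):
            yield (
                event.get("requestReceivedTimestamp"),
                (event.get("user") or {}).get("username"),
                event.get("userAgent"),
                ref.get("name"),
            )


if __name__ == "__main__":
    # Made-up audit event for illustration only.
    sample = json.dumps({
        "stage": "ResponseComplete",
        "verb": "delete",
        "objectRef": {"resource": "nodes", "name": "example-node"},
        "user": {"username": "example-user"},
        "userAgent": "example-client/1.0",
        "requestReceivedTimestamp": "2026-01-01T00:00:00Z",
    })
    for record in node_deletions([sample]):
        print(record)
```

The userAgent field is what typically reveals which client issued the delete.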

 

Question 3: Should the node pool state change to UNHEALTHY if nodes fail to join?

Answer:

It depends on the scenario.

1. If a node fails to join the cluster:

In newer CMK versions:

  • The VM is automatically deleted

  • The instance group transitions to UNHEALTHY

2. If a node is deleted out-of-band in Kubernetes:

  • The VM still exists and is Running

  • The node pool sees expected VM count (n/n)

  • The node pool state remains RUNNING

  • Autoscaler may skip the pool due to drift

Currently, deleting nodes directly in Kubernetes does not affect node pool health state in Crusoe Cloud.

 

Question 4: How can I recover from this condition?

Answer:

You can resolve drift by:

  1. Identifying the orphaned VM (Running in Crusoe but missing in Kubernetes)

  2. Deleting the VM from Crusoe Cloud

  3. Allowing the autoscaler to recreate a clean node

After removing the drift, autoscaling should resume normally.
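Step 1 can be partly scripted. The following Python sketch takes the list of Running VM names for the pool (gathered from Crusoe Cloud by whatever means you normally use) together with the output of `kubectl get nodes -o json`, and reports the VMs that have no matching node. It assumes VM names match Kubernetes node names, the function names and sample values are hypothetical, and it deliberately does not delete anything.

```python
import json


def node_names_from_kubectl(nodes_json: str) -> set[str]:
    """Extract node names from the output of `kubectl get nodes -o json`."""
    doc = json.loads(nodes_json)
    return {item["metadata"]["name"] for item in doc.get("items", [])}


def orphaned_vms(running_vm_names, nodes_json: str) -> list[str]:
    """Return Running VMs that have no matching Kubernetes node, sorted by name."""
    registered = node_names_from_kubectl(nodes_json)
    return sorted(name for name in set(running_vm_names) if name not in registered)


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    vm_names = ["a100-pool-1", "a100-pool-2"]
    kubectl_output = json.dumps({
        "kind": "List",
        "items": [{"metadata": {"name": "a100-pool-1"}}],
    })
    for name in orphaned_vms(vm_names, kubectl_output):
        print("Candidate for deletion in Crusoe Cloud:", name)
```

Review the reported names before deleting any VM, since a node that is still in the process of joining the cluster would also appear in this list.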

Additional Resources

  1. Cluster Autoscaler
  2. Crusoe Managed Kubernetes
