FAQ: CMK Add-on Lifecycle and Upgrades

Introduction

Crusoe Managed Kubernetes (CMK) ships with a set of core add-ons that extend your cluster for GPU-accelerated and AI/ML workloads. For the core cluster add-ons - Cilium, the NVIDIA GPU Operator, and the NVIDIA Network Operator - Crusoe manages installation and critical upgrades, while add-on configuration is shared and routine version upgrades are yours to perform. Other Crusoe-provided add-ons, such as the Crusoe CSI driver, are installed and upgraded by you.

These responsibility boundaries are defined in the CMK Shared Responsibility Model. This article explains how add-on upgrades work, who is responsible for them, and how to safely upgrade or modify a managed add-on.

Prerequisites

Access to the Crusoe CLI or Crusoe Console (see Steps to Install the CLI)
An existing CMK cluster with one or more add-ons installed
helm and kubectl installed and configured (required for inspecting or upgrading add-ons)

How CMK Add-on Upgrades Work

Add-ons are installed as Helm releases when your cluster is created. After creation, they are not upgraded automatically - each add-on stays at its installed version until someone explicitly upgrades it. Routine upgrades are yours to perform for every add-on. For the core operators (Cilium, the NVIDIA GPU Operator, and the NVIDIA Network Operator), Crusoe may additionally coordinate a critical upgrade with you; for other add-ons such as the Crusoe CSI driver, all upgrades are yours to perform.

Two common assumptions are worth clarifying:

The Terraform provider does not push add-on updates. The add_ons parameter applies only at cluster creation. Updating the provider version, or changing add_ons on an existing cluster, will not upgrade or re-apply the add-on Helm releases already running in the cluster.
Kubernetes version upgrades do not update add-ons. Upgrading your cluster's Kubernetes version and upgrading your add-ons are separate activities; one does not trigger the other.

Who Is Responsible for Add-on Upgrades

Responsibility is shared. You are responsible for routine add-on upgrades - keeping versions current for compatibility, performance, and new features - and for any configuration changes you make. Crusoe is responsible for critical upgrades, such as a security fix or a driver-level bug fix.

How Crusoe Handles Critical Upgrades

Critical upgrades are handled on a case-by-case basis. When one is needed, the Crusoe team will reach out through Crusoe Support, explain the proposed change, and coordinate a maintenance window with you before anything is modified. You will be asked to flag any custom configurations so they can be preserved where possible. Nothing is changed without your awareness and consent.

Upgrading an Add-on Yourself

Add-on upgrades are performed with Helm, against the add-on's release in your cluster. The general pattern is the same for every add-on, though the exact chart, repository, namespace, and values differ:

Capture your current values so your configuration is preserved - for example, helm get values -n <namespace> <release>.
Update the chart repository and identify the target version.
Run helm upgrade to the new chart version, preserving your existing values rather than applying chart defaults (which can overwrite Crusoe-provided settings).
Verify the rollout: the release REVISION increments, the pods roll, and the new version is running.
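
The four steps above can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not a Crusoe-provided procedure: the helper name upgrade_addon and the values file name are illustrative, and the release, namespace, chart reference, and version are placeholders you would take from helm list -A and your chart repository.

```shell
# Sketch of the generic upgrade pattern. upgrade_addon is a hypothetical
# helper, not a Crusoe tool; all four arguments are placeholders.
upgrade_addon() (
  release=$1; namespace=$2; chart=$3; version=$4
  values_file="${release}-values.yaml"

  # 1. Capture the user-supplied values so they can be re-applied.
  helm get values "$release" -n "$namespace" -o yaml > "$values_file" || exit 1

  # 2. Refresh the local chart repository index.
  helm repo update || exit 1

  # 3. Upgrade to the target chart version, passing the captured values
  #    back in rather than falling back to chart defaults.
  helm upgrade "$release" "$chart" -n "$namespace" \
    --version "$version" -f "$values_file" || exit 1

  # 4. Verify: the REVISION should have incremented and the pods should roll.
  helm history "$release" -n "$namespace" --max 3 || exit 1
  kubectl get pods -n "$namespace"
)
```

A call would look like upgrade_addon <release> <namespace> <repo>/<chart> <version>. The function body runs in a subshell, so its variables do not leak into your session, and it stops at the first failing step.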

For a complete worked example, see How-To Upgrade Crusoe CSI Driver in CMK Cluster in Additional Resources. The GPU and Network Operators follow a more involved, controlled procedure - see the section below.

Customizing a Managed Add-on

You can modify a managed add-on's configuration and remain supported. Crusoe recommends coordinating with Support before making changes, and capturing your current values first so they can be reconciled during any future upgrade. For example, to capture your current CSI driver values:

helm get values -n crusoe-system crusoe-csi-driver
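
To keep a record you can compare against after a change, one option is to save both the user-supplied values and the full computed values to timestamped files. The sketch below assumes a hypothetical helper named snapshot_values; the output directory and file naming are illustrative choices, not a Crusoe convention.

```shell
# snapshot_values is a hypothetical helper; the directory layout and file
# names are illustrative.
snapshot_values() (
  release=$1; namespace=$2; outdir=${3:-.}
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$outdir" || exit 1

  # User-supplied overrides only (the settings applied on top of defaults).
  helm get values "$release" -n "$namespace" -o yaml \
    > "$outdir/$release-user-$stamp.yaml" || exit 1

  # All computed values: chart defaults merged with the overrides.
  helm get values "$release" -n "$namespace" --all -o yaml \
    > "$outdir/$release-all-$stamp.yaml" || exit 1

  # Print the path of the overrides file so it can be passed to helm later.
  echo "$outdir/$release-user-$stamp.yaml"
)
```

For the CSI driver this could be run as snapshot_values crusoe-csi-driver crusoe-system ./addon-values. Comparing two snapshots with diff shows what changed between captures.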

During a managed upgrade, Crusoe will attempt to preserve your custom settings unless they conflict with the upgrade or are causing cluster issues.

There is no fixed threshold for what counts as a "significant modification," which is why you should coordinate before modifying a managed add-on. As a guideline, minor supported configuration changes keep the add-on within the shared model, while substantial divergence from the Crusoe-provided configuration moves that add-on into customer-owned territory.

⚠️ Important: If a customer-owned add-on later affects cluster health, Crusoe will work to restore the cluster but will not debug the modified component itself.

NVIDIA GPU Operator and Network Operator

The GPU Operator and Network Operator follow the same shared model as Cilium, but require more care because they manage GPU and InfiniBand drivers that are core to the cluster. When you want to change or upgrade these operators, coordinate with Crusoe Support first, then perform the change yourself. (Crusoe handles critical upgrades separately - see "How Crusoe Handles Critical Upgrades.")

ℹ️ Note: When changing or upgrading the GPU or Network Operator, preserve your existing configuration (for example, helm upgrade --reuse-values) rather than applying default values, which can overwrite the Crusoe-provided settings. Running a driver or operator upgrade yourself can also leave you on a configuration Crusoe has not validated, so consult Crusoe Support before running a specific version. Upgrading these operators can interrupt running workloads - see "Will Upgrading an Add-on Cause Downtime?" below.
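
Before agreeing a change with Crusoe Support, it can help to preview what an operator upgrade would render without applying it. The following is a minimal sketch: preview_operator_upgrade is a hypothetical helper, and the release name, namespace, chart reference, and version are placeholders to confirm with helm list -A and with Support.

```shell
# preview_operator_upgrade is a hypothetical helper. --dry-run renders the
# upgrade without applying it; --reuse-values keeps the release's existing
# values instead of resetting them to chart defaults.
preview_operator_upgrade() (
  release=$1; namespace=$2; chart=$3; version=$4
  helm upgrade "$release" "$chart" -n "$namespace" \
    --version "$version" --reuse-values --dry-run
)
```

Removing --dry-run turns the preview into the real upgrade, so keep the flag in place until the maintenance window has been agreed.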

Will Upgrading an Add-on Cause Downtime?

It depends on the add-on:

Lighter add-ons (e.g., the CSI driver): An upgrade performs a rolling restart of the add-on's pods, one node at a time. Workloads generally keep running; you may see brief, intermittent delays in volume attach, detach, or mount operations until the new pods are ready. There is no full-cluster outage.
NVIDIA GPU Operator: Upgrading the GPU driver interrupts workloads on the nodes being updated.
NVIDIA Network Operator: Upgrading the InfiniBand driver is more disruptive - it can take the node offline for a period, because the underlying mlx5_core kernel module is shared with the node's Ethernet interface.
Changing the GPU or Network Operator configuration in a way that unloads and reloads the drivers causes the same interruption.

Because Cilium, the GPU Operator, and the Network Operator are core to the cluster's networking and GPU capabilities, treat their upgrades as higher-impact. Plan for the interruption - drain or reschedule affected workloads - and coordinate a maintenance window with Crusoe Support before proceeding. Avoid running these upgrades during critical workloads.
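
Draining a node ahead of a driver upgrade can be sketched with standard kubectl commands. This is an illustration only: the helper names drain_for_upgrade and restore_after_upgrade are hypothetical, the 10-minute timeout is an arbitrary example value, and the right flags depend on your workloads.

```shell
# Hypothetical helpers wrapping standard kubectl commands.
drain_for_upgrade() (
  node=$1
  # drain cordons the node first, then evicts its pods. DaemonSet-managed
  # pods are left in place; emptyDir data on evicted pods is discarded.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
)

restore_after_upgrade() (
  node=$1
  # Allow the scheduler to place pods on the node again.
  kubectl uncordon "$node"
)
```

Run drain_for_upgrade <node> before the upgrade touches that node and restore_after_upgrade <node> once the add-on pods on it are healthy again.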

Checking Which Add-ons Are Installed

To see which add-ons are running in your cluster, list the Helm releases across all namespaces:

helm list -A

Each add-on appears as a Helm release with its chart version, a REVISION number, and the date it was last deployed. A REVISION of 1 means the release has not been upgraded since it was installed at cluster creation - the revision increments each time you run helm upgrade, so it's a quick way to confirm whether an add-on is still at its originally installed version.
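
To pick out the releases that are still at REVISION 1, the table can be filtered with awk. The helper name releases_at_revision_1 is hypothetical; the sketch assumes the default table output of helm list -A, where the first three columns are NAME, NAMESPACE, and REVISION.

```shell
# releases_at_revision_1 is a hypothetical helper. It reads the default
# table output of `helm list -A` on stdin and prints namespace/name for
# every release whose REVISION column is 1.
releases_at_revision_1() {
  # The first three columns (NAME, NAMESPACE, REVISION) contain no spaces,
  # so awk's default whitespace splitting is sufficient. NR > 1 skips the
  # header row.
  awk 'NR > 1 && $3 == 1 { print $2 "/" $1 }'
}
```

Usage: helm list -A | releases_at_revision_1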

ℹ️ Note: This list includes more than the add-ons you opted into at creation. Alongside your selected add-ons (such as the GPU Operator, Network Operator, and CSI driver), Crusoe installs a set of baseline components on every cluster, for example the Cilium CNI, CoreDNS, and observability agents, each as its own Helm release. These are Crusoe-managed platform components. The list also includes anything you have installed yourself with Helm; you are free to install whatever you choose, but those releases are not Crusoe add-ons and are not covered by the add-on shared-responsibility model.

You can also inspect the pods for a specific add-on, for example:

kubectl get pods -n crusoe-system
kubectl get pods -n nvidia-gpu-operator
kubectl get pods -n nvidia-network-operator
kubectl get pods -n kube-system | grep cilium
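
When checking several namespaces at once, it can be quicker to print only the pods that are not healthy. The sketch below uses two hypothetical helpers, pods_not_running and check_addon_pods; it assumes the default kubectl get pods column order (NAME, READY, STATUS, ...) and only looks at the STATUS column.

```shell
# pods_not_running is a hypothetical helper. It reads `kubectl get pods
# --no-headers` output on stdin and prints every pod whose STATUS is
# neither Running nor Completed.
pods_not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# check_addon_pods is a hypothetical helper that applies the filter to
# each namespace passed as an argument.
check_addon_pods() {
  for ns in "$@"; do
    echo "== $ns"
    kubectl get pods -n "$ns" --no-headers | pods_not_running
  done
}
```

Usage: check_addon_pods crusoe-system nvidia-gpu-operator nvidia-network-operator. A namespace heading with nothing listed under it means every pod there reported Running or Completed.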

Can I Upgrade an Add-on Using the Console, CLI, or Terraform?

Not currently. Upgrading an installed add-on is done with Helm, against the add-on's release in your cluster. The other interfaces do not perform add-on upgrades:

Terraform - the add_ons parameter applies only at cluster creation. Changing it on an existing cluster does not upgrade a running add-on.
CLI - there is no supported command to upgrade add-ons on an existing cluster today.
Console - add-ons are selected at cluster creation; there is no in-Console action to upgrade an installed add-on afterward.

Crusoe is actively investing in managed add-on lifecycle and upgrade capabilities. Until those are available, upgrades follow the Helm process described in this article.

Do I Need to Recreate My Cluster to Install an Add-on I Missed?

No. Add-ons are installed as Helm releases, so a missed add-on can be installed on your existing cluster after creation - you do not need to recreate the cluster.

Is There an Additional Cost?

No. There is currently no additional charge for add-on management.

ℹ️ Note: When upgrading or modifying a managed add-on, keep the shared-responsibility boundaries above in mind - especially for Cilium, the GPU Operator, and the Network Operator, where changes can affect cluster networking or GPU scheduling. If you're unsure which action to take or want to coordinate a change, reach out to Crusoe Support for assistance.

Additional Resources
