
High I/O Wait and Pressure Stall on a VM Accessing a Shared Disk

Matt Roark

Introduction

When running workloads across multiple VMs attached to the same Crusoe Shared Disk, you may observe that one VM shows significantly higher CPU iowait and I/O pressure (as reported by Linux Pressure Stall Information, PSI) than its peers, even when all nodes are running identical workloads against the same volume.

This asymmetric behavior is caused by a hardware offload failure on the hypervisor hosting the affected VM. Shared disk traffic on Crusoe VMs is routed through a software-defined networking (SDN) layer on the hypervisor before it reaches the storage backend. Under certain conditions, traffic for a specific VM can get stranded in the hypervisor's software processing path instead of being handled directly by the NIC hardware. When this happens, every storage operation on the affected VM incurs significantly higher latency than on peers whose traffic takes the hardware fast path, and that extra latency shows up as elevated iowait and PSI on that node alone.
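To see why a per-operation latency increase produces so much extra iowait, consider a back-of-the-envelope model. The Python sketch below assumes a strictly serial workload (one operation in flight) and applies Little's law (throughput = operations in flight / latency). The latency figures are hypothetical, chosen only for illustration, and are not measured Crusoe values.

```python
# Illustrative model only: the latencies below are hypothetical, not measured
# Crusoe figures. It shows why higher per-operation latency on one VM turns
# into extra time spent waiting on I/O on that VM alone.


def stall_seconds(ops: int, latency_us: int) -> float:
    """Time a strictly serial workload (one op in flight) spends waiting on I/O."""
    return ops * latency_us / 1_000_000


def max_iops(queue_depth: int, latency_us: int) -> float:
    """Little's law: achievable ops/s = ops in flight / per-op latency."""
    return queue_depth * 1_000_000 / latency_us


if __name__ == "__main__":
    ops = 100_000  # hypothetical count of small synchronous NFS operations
    for label, latency_us in (("fast path (hypothetical)", 250),
                              ("slow path (hypothetical)", 1_000)):
        print(f"{label}: {stall_seconds(ops, latency_us):.1f} s waiting, "
              f"{max_iops(1, latency_us):,.0f} IOPS ceiling at queue depth 1")
```

With these hypothetical numbers, a 4x latency increase turns 25 s of waiting into 100 s for the same work. When the CPU has nothing else to run during that wait, the kernel accounts the time as iowait.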

Critically, this is a hypervisor-level condition scoped to a single VM. It does not indicate a problem with the shared disk itself, the storage backend, or the NFS client configuration inside the VM. Peer VMs on different hypervisors are unaffected.

Prerequisites

  • A Crusoe Shared Disk mounted on one or more VMs.

Symptoms

  • One VM in a multi-node cluster shows sustained high CPU iowait and/or memory PSI spikes while peer VMs on the same shared disk are clean.
  • NFS client configuration (nfsstat -m, vastnfs-ctl status) is identical between the affected and healthy VMs.
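To make the "one VM high, peers clean" pattern concrete, the following Python sketch flags VMs whose pressure metric sits far above the median of their peers. The thresholds (`factor` and `floor`) and the sample readings are hypothetical; tune them to your own healthy baseline.

```python
from statistics import median


def pressure_outliers(metric_by_vm: dict[str, float],
                      factor: float = 5.0,
                      floor: float = 5.0) -> list[str]:
    """Return VMs whose metric is far above the median of their peers.

    A VM is flagged when its value exceeds both `factor` x the peer median and
    an absolute `floor`, so a quiet cluster (all values near zero) flags nothing.
    The peer median excludes the VM itself, so an outlier does not inflate
    its own baseline.
    """
    flagged = []
    for vm, value in metric_by_vm.items():
        peers = [v for other, v in metric_by_vm.items() if other != vm]
        if not peers:
            continue  # nothing to compare against
        if value > max(factor * median(peers), floor):
            flagged.append(vm)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical io "some avg60" readings collected from each VM.
    readings = {"vm-a": 0.4, "vm-b": 0.6, "vm-c": 35.0}
    print(pressure_outliers(readings))  # ['vm-c']
```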

Instructions

Step 1: Confirm the Issue Is Isolated to a Single VM

Compare CPU iowait and PSI metrics across all VMs in your cluster. If the elevated I/O pressure is limited to a single node while peers on the same shared disk are healthy, this pattern is consistent with a per-VM hardware offload failure.
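One way to gather these numbers on each VM is to read the kernel's PSI file directly and derive iowait from /proc/stat. The Python sketch below assumes a Linux kernel with PSI enabled (4.20 or later, exposing /proc/pressure/io); the sampling interval and the script itself are illustrative, not a Crusoe-provided tool.

```python
import os
import time

PSI_IO = "/proc/pressure/io"  # Linux PSI interface (kernel 4.20+ with PSI enabled)
INTERVAL_S = 2  # hypothetical sampling interval for the iowait calculation


def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/* text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, value = field.split("=")
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def cpu_times(stat_text: str) -> list[int]:
    """Return the aggregate 'cpu' counters from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: list[int], after: list[int]) -> float:
    """Share of CPU time spent in iowait between two /proc/stat samples.

    Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    Guest time is already counted in user/nice, so only the first 8 are summed.
    """
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def read_text(path: str) -> str:
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    if os.path.exists(PSI_IO):
        for kind, values in parse_psi(read_text(PSI_IO)).items():
            print(f"io {kind}: avg10={values['avg10']} avg60={values['avg60']}")
    if os.path.exists("/proc/stat"):
        first = cpu_times(read_text("/proc/stat"))
        time.sleep(INTERVAL_S)
        second = cpu_times(read_text("/proc/stat"))
        print(f"iowait over {INTERVAL_S} s: {iowait_percent(first, second):.1f}%")
```

Running the same script on every VM and comparing the `avg10`/`avg60` values and the iowait percentage gives a quick per-node view alongside your dashboards.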

Run the following on both the affected VM and a healthy peer and compare the output:

nfsstat -m
vastnfs-ctl status

If mount options and driver status are identical between nodes, client-side misconfiguration is not the cause.
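Comparing nfsstat -m output by eye is error-prone when there are many options. The Python sketch below parses saved nfsstat -m output from two VMs and lists the options that differ. It assumes the usual nfsstat -m layout (a "<mount> from <server>:<path>" line followed by an indented "Flags:" line) and, as an assumption, ignores clientaddr, which holds each client's own IP address and is expected to differ between VMs. vastnfs-ctl status output is not parsed here; compare it with diff.

```python
import sys


def parse_nfsstat_m(text: str) -> dict[str, dict[str, str]]:
    """Map each mount point in `nfsstat -m` output to its mount options."""
    mounts: dict[str, dict[str, str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not line[0].isspace() and " from " in line:
            current = line.split(" from ", 1)[0]
            mounts[current] = {}
        elif stripped.startswith("Flags:") and current is not None:
            for opt in stripped[len("Flags:"):].strip().split(","):
                key, _, value = opt.partition("=")
                mounts[current][key] = value  # bare flags such as "hard" map to ""
    return mounts


def compare_mounts(a, b, ignore=frozenset({"clientaddr"})):
    """List (mount, option, value_a, value_b) for every option that differs."""
    diffs = []
    for mount in sorted(set(a) | set(b)):
        if mount not in a or mount not in b:
            diffs.append((mount, "<mount>", str(mount in a), str(mount in b)))
            continue
        for key in sorted((set(a[mount]) | set(b[mount])) - ignore):
            va = a[mount].get(key, "<absent>")
            vb = b[mount].get(key, "<absent>")
            if va != vb:
                diffs.append((mount, key, va, vb))
    return diffs


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: compare_nfs.py AFFECTED.txt HEALTHY.txt")
    else:
        with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
            diffs = compare_mounts(parse_nfsstat_m(fa.read()),
                                   parse_nfsstat_m(fb.read()))
        for diff in diffs:
            print(*diff, sep="\t")
        if not diffs:
            print("mount options match")
```

For example, save the output on each VM with nfsstat -m > affected.txt (and healthy.txt on the peer), then run the script (saved under the hypothetical name compare_nfs.py) with both files. "mount options match" means the client-side configuration is the same on both nodes.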

Step 2: Contact Crusoe Support

Open a support ticket and provide the following:

  • The affected VM name and ID.
  • The name and ID of the shared disk.
  • Output of nfsstat -m and vastnfs-ctl status from both the affected VM and a healthy peer.
  • If available, Grafana screenshots or timestamps showing the I/O pressure anomaly.
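A small collector can gather the command output for the ticket in one step. This Python sketch runs the Step 1 commands plus a read of /proc/pressure/io and prints a timestamped plain-text report. The command list is an assumption you can extend, and missing tools are recorded in the report rather than treated as errors.

```python
import socket
import subprocess
from datetime import datetime, timezone

# Commands from Step 1, plus the PSI file; adjust the list as needed.
DEFAULT_COMMANDS = [
    ["nfsstat", "-m"],
    ["vastnfs-ctl", "status"],
    ["cat", "/proc/pressure/io"],
]


def collect(commands, timeout: int = 30) -> str:
    """Run each command and return one plain-text report with a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    sections = [f"host: {socket.gethostname()}", f"collected: {stamp}"]
    for cmd in commands:
        header = "$ " + " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            body = proc.stdout + proc.stderr
            if proc.returncode != 0:
                body += f"[exit status {proc.returncode}]\n"
        except FileNotFoundError:
            body = "[command not found]\n"
        except subprocess.TimeoutExpired:
            body = f"[timed out after {timeout} s]\n"
        sections.append(f"{header}\n{body}")
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect(DEFAULT_COMMANDS))
```

Run it on the affected VM and on a healthy peer, and attach both reports to the ticket along with the VM and shared disk identifiers.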

The mitigation requires a networking-level change applied by Crusoe's engineering team and cannot be self-served.

Step 3: Mitigation Applied by Crusoe

Crusoe support will apply a networking-level configuration change to the affected VM's traffic path. No disruption to the VM's workload or network connectivity is expected during the change.

⚠️ Warning: If your VM has a dynamic public IP, this change is not persistent across VM restarts. If the VM is restarted and the issue returns, contact Crusoe support to reapply the mitigation. The change is scoped to the internal networking path used for shared disk traffic and does not affect your VM's public IP or general network connectivity.
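Because the mitigation may not survive a restart, it can help to watch for the issue returning. This Python sketch checks whether the io "some" avg10 value stays above a threshold for several consecutive samples. The threshold, sample count, and readings are hypothetical and should be set from your own healthy baseline.

```python
def some_avg10(psi_text: str) -> float:
    """Extract the 'some' avg10 value from /proc/pressure/io text."""
    for line in psi_text.splitlines():
        if line.startswith("some "):
            for field in line.split()[1:]:
                key, value = field.split("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def sustained_breach(samples, threshold: float, consecutive: int) -> bool:
    """True if `consecutive` back-to-back samples all exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical avg10 readings taken every 10 s after a VM restart.
    after_restart = [0.2, 0.1, 14.8, 22.5, 31.0, 27.4]
    print(sustained_breach(after_restart, threshold=10.0, consecutive=3))  # True
```

To watch a live VM, append `some_avg10(open("/proc/pressure/io").read())` to a list at a fixed interval and pass the list to `sustained_breach`; a sustained breach after a restart is the signal to ask Crusoe support to reapply the mitigation.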

Resolution

Crusoe VMs use NIC hardware offloading to accelerate network traffic. Under normal conditions, flows are installed directly into the NIC hardware and handled without CPU involvement. A bug in the hypervisor SDN stack can cause a specific VM's shared disk traffic to miss the hardware fast path and fall back to software processing, which adds latency to every storage operation. The networking change applied by Crusoe support simplifies that VM's traffic path and restores it to the hardware fast path.

Crusoe engineering is working toward a permanent fix that will eliminate the need for this manual mitigation entirely.
