Introduction
When running workloads across multiple VMs attached to the same Crusoe Shared Disk, you may observe that one VM experiences significantly higher CPU iowait and I/O pressure stall (PSI) compared to its peers — even when all nodes are running identical workloads against the same volume.
This asymmetric behavior is caused by a hardware offload failure on the hypervisor hosting the affected VM. Shared disk traffic on Crusoe VMs is routed through a software-defined networking layer on the hypervisor before reaching the storage backend. Under certain conditions, traffic for a specific VM can get stranded in the hypervisor's software processing path rather than being handled directly by the NIC hardware. When this happens, every storage operation on the affected VM incurs significantly higher per-operation latency compared to peers whose traffic is taking the hardware fast path — which manifests as elevated iowait and PSI on that node alone.
Critically, this is a hypervisor-level condition scoped to a single VM. It does not indicate a problem with the shared disk itself, the storage backend, or the NFS client configuration inside the VM. Peer VMs on different hypervisors are unaffected.
Prerequisites
- Crusoe Shared Disk Mounted on One or More VMs
Symptoms
- One VM in a multi-node cluster shows sustained high CPU iowait and/or memory PSI spikes while peer VMs on the same shared disk are clean.
- NFS client configuration (nfsstat -m, vastnfs-ctl status) is identical between the affected and healthy VMs.
Instructions
Step 1: Confirm the Issue Is Isolated to a Single VM
Compare CPU iowait and PSI metrics across all VMs in your cluster. If the elevated I/O pressure is limited to a single node while peers on the same shared disk are healthy, this pattern is consistent with a per-VM hardware offload failure.
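If you do not already have these metrics on a dashboard, the following is a minimal sketch for sampling them directly on each VM. It assumes a Linux kernel with pressure stall information enabled (so that /proc/pressure/io exists). The function names and the 2-second default sampling window are illustrative choices, not part of any Crusoe tooling.

```shell
#!/bin/sh
# Sketch: report I/O pressure (PSI) and CPU iowait for this VM.
# Function names and the 2-second default window are illustrative assumptions.

# Print the "some" avg10 value from a PSI file (default: /proc/pressure/io).
psi_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "${1:-/proc/pressure/io}"
}

# Given two aggregate "cpu" lines from /proc/stat, print the share of CPU time
# spent in iowait between them, as a percentage.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total; w[NR] = $6
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

# Sample /proc/stat twice and print one comparable line per VM.
report_io_pressure() {
    first=$(head -n 1 /proc/stat)
    sleep "${1:-2}"
    second=$(head -n 1 /proc/stat)
    echo "host=$(uname -n) io_psi_some_avg10=$(psi_some_avg10) iowait_pct=$(iowait_pct "$first" "$second")"
}

# PSI requires a kernel built with pressure stall information enabled.
if [ -r /proc/pressure/io ] && [ -r /proc/stat ]; then
    report_io_pressure
fi
```

Run the script on every VM while the workload is active and compare the printed lines. A single host whose values are persistently well above its peers matches the pattern described in this article. A single short sample can be noisy, so repeat it a few times before drawing conclusions.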
Run the following on both the affected VM and a healthy peer and compare output:
nfsstat -m
vastnfs-ctl status
If mount options and driver status are identical between nodes, client-side misconfiguration is not the cause.
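To make the comparison less error-prone than reading two terminals side by side, you can save each VM's output to a file and diff the files. The sketch below is one hedged way to do that. The function names and file names are hypothetical, and the clientaddr= normalization assumes that nfsstat -m may print a per-host client address in the mount flags, which would otherwise show up as a spurious difference.

```shell
#!/bin/sh
# Sketch: compare saved NFS client configuration from two VMs.
# Function and file names are illustrative assumptions.

# Replace the per-host clientaddr= value so it is not reported as a difference.
normalize_nfs_config() {
    sed 's/clientaddr=[^, ]*/clientaddr=<host>/g' "$1"
}

# Print a unified diff of two captures; return 0 when they match.
compare_nfs_config() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalize_nfs_config "$1" > "$a"
    normalize_nfs_config "$2" > "$b"
    diff -u "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Example usage (run the capture on each VM, then copy both files to one host):
#   { nfsstat -m; vastnfs-ctl status; } > "nfs-config-$(uname -n).txt" 2>&1
#   compare_nfs_config nfs-config-vm-a.txt nfs-config-vm-b.txt && echo "configs match"
```

An empty diff supports the conclusion that the client configuration is the same on both nodes. Any remaining differences are worth reviewing before you open a ticket.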
Step 2: Contact Crusoe Support
Open a support ticket and provide the following:
- The affected VM name and ID.
- The name and ID of the shared disk.
- Output of nfsstat -m and vastnfs-ctl status from both the affected VM and a healthy peer.
- Grafana screenshots or timestamps showing the I/O pressure anomaly, if available.
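The items above can be gathered into one text file per VM and attached to the ticket. The following is a rough sketch under a few assumptions: the helper names and default file name are hypothetical, the UTC timestamp and /proc/pressure/io snapshot are optional extras rather than something Crusoe support requires, and vastnfs-ctl may need elevated privileges on your system.

```shell
#!/bin/sh
# Sketch: bundle the requested command output into a single file for a support ticket.
# Helper names and the default file name are illustrative assumptions.

# Run a command under a labelled header; record it if the command is missing or fails.
capture() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(command not found: $1)"
    fi
    echo
}

# Write the diagnostics to the given path and print that path.
collect_diagnostics() {
    out="${1:-shared-disk-diag-$(uname -n).txt}"
    {
        echo "host: $(uname -n)"
        echo "collected_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo
        capture nfsstat -m
        capture vastnfs-ctl status
        # Optional extra: a point-in-time I/O pressure snapshot.
        capture cat /proc/pressure/io
    } > "$out"
    echo "$out"
}

# Example usage on each VM (affected and healthy peer):
#   collect_diagnostics
```

Collecting the file on both the affected VM and a healthy peer at roughly the same time gives support a like-for-like comparison. The VM ID and shared disk ID still need to be added to the ticket separately.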
The mitigation requires a networking-level change applied by Crusoe's engineering team and cannot be self-served.
Step 3: Mitigation Applied by Crusoe
Crusoe support will apply a networking-level configuration change to the affected VM's traffic path. No disruption to the VM's workload or network connectivity is expected during the change.
⚠️ Warning: If your VM has a dynamic public IP, this change is not persistent across VM restarts. If the VM is restarted and the issue returns, contact Crusoe support to reapply the mitigation. The change is scoped to the internal networking path used for shared disk traffic and does not affect your VM's public IP or general network connectivity.
Resolution
Crusoe VMs use NIC hardware offloading to accelerate network traffic — under normal conditions, flows are installed directly into the NIC hardware and handled without CPU involvement. A bug in the hypervisor SDN stack can cause a specific VM's shared disk traffic to miss the hardware fast path and fall back to software processing, introducing latency on every storage operation. The networking change applied by Crusoe support reduces the complexity of that VM's traffic path, restoring it to the hardware fast path.
Crusoe engineering is working toward a permanent fix that will eliminate the need for this manual mitigation entirely.