Overview
Crusoe Cloud is currently investigating an issue in which multiple customers have reported I/O errors on two of the NVMe drives (typically nvme2 and nvme3, though the device names may vary) in H100 SKU instances located in the us-east1-a and us-southcentral1-a regions. Example kernel log output:
[ 54.750100] systemd[1]: Hostname set to <test-nvme-vm.us-southcentral1-a.compute.internal>.
[ 2010.414862] blk_update_request: I/O error, dev nvme3n1, sector 511976448 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
[ 2010.415686] blk_update_request: I/O error, dev nvme3n1, sector 511976704 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 0
[ 2010.416890] blk_update_request: I/O error, dev nvme2n1, sector 511979520 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 2010.418006] blk_update_request: I/O error, dev nvme2n1, sector 511979776 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
[ 2010.438129] blk_update_request: I/O error, dev nvme3n1, sector 512012288 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 2010.438916] blk_update_request: I/O error, dev nvme2n1, sector 512012288 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
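To pull the affected device names out of kernel logs like those above, a short pipeline such as the following can help. This is a sketch: a sample log is embedded so the pipeline is self-contained; on a VM you would pipe `dmesg` or `/var/log/syslog` into it instead.

```shell
# Sample dmesg-style log embedded for illustration; on a VM, replace
# `printf '%s\n' "$sample_log"` with `dmesg` or `cat /var/log/syslog`.
sample_log='[ 2010.414862] blk_update_request: I/O error, dev nvme3n1, sector 511976448 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
[ 2010.416890] blk_update_request: I/O error, dev nvme2n1, sector 511979520 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0'

# Keep only the I/O error lines, extract the "dev <name>," field, de-duplicate
printf '%s\n' "$sample_log" \
  | grep 'I/O error' \
  | sed -n 's/.*dev \([a-z0-9]*\),.*/\1/p' \
  | sort -u
```

This prints one affected device name per line (here, nvme2n1 and nvme3n1).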
The root cause is vendor-specific and stems from a malfunction in a particular PCIe slot that connects to the NVMe drives. The issue is OS-agnostic: it is unrelated to any specific OS version and is instead triggered by high-intensity write operations to the affected drives. We are actively investigating with the vendor, have identified a potential fix, and are in the process of testing and validating it.
Prerequisites
- Crusoe Compute VM
- Sudo/root access
- nvme-cli (install with sudo apt install nvme-cli if not already present)
Workaround Steps
As an immediate workaround, you can identify the problematic NVMe drives and exclude them from your existing setup (RAID 0 or independent disks) to mitigate the issue. Note that this leaves you with 6 NVMe drives in total instead of 8, and therefore roughly 6 TB of storage capacity.
For Existing VMs with a RAID 0 setup:
- Back up any existing data on the NVMe drives.
- Identify the problematic drives: use dmesg or check /var/log/syslog for the drives exhibiting I/O errors.
- Record serial numbers: once you've identified the problematic drives, run sudo nvme list to retrieve and note their serial numbers.
- Comment out the startup lifecycle script at /usr/local/bin/crusoe/startup so that it contains only this line:
#!/bin/bash
- Destroy the existing RAID 0 setup:
- Unmount /raid0:
sudo umount /raid0
- Wipe the existing RAID filesystem:
sudo wipefs -a /dev/md127
- Stop the RAID 0 array:
sudo mdadm --stop /dev/md127
- Remove any remaining metadata on the NVMe devices:
sudo mdadm --zero-superblock /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
- Remove the /raid0 mountpoint to clear out any remnants:
sudo rm -rf /raid0
- Save the following script to a new .sh file, replacing the placeholder serial numbers ("SN1", "SN2") with those of the affected NVMe devices, then run it.
#!/bin/bash
# Serial numbers of the problematic drives, as reported by `sudo nvme list`
problematic_serials=("SN1" "SN2")
all_nvme_devices=("/dev/nvme0n1" "/dev/nvme1n1" "/dev/nvme2n1" "/dev/nvme3n1" "/dev/nvme4n1" "/dev/nvme5n1" "/dev/nvme6n1" "/dev/nvme7n1")
devices_to_include=()
for dev in "${all_nvme_devices[@]}"; do
    serial=$(sudo nvme id-ctrl "$dev" | grep -m 1 '^sn' | awk '{print $3}')
    if [[ ! " ${problematic_serials[*]} " =~ " ${serial} " ]]; then
        devices_to_include+=("$dev")
    fi
done
if [ ${#devices_to_include[@]} -eq 1 ]; then
    # Single healthy drive: format and mount it directly
    dev_name=${devices_to_include[0]}
    sudo mkfs.ext4 "$dev_name"
    sudo mkdir -p /nvme && sudo mount -t ext4 "$dev_name" /nvme
elif [ ${#devices_to_include[@]} -ge 2 ]; then
    # Two or more healthy drives: rebuild the RAID 0 array without the bad ones
    sudo mdadm --create --verbose /dev/md127 --level=0 --raid-devices=${#devices_to_include[@]} "${devices_to_include[@]}"
    sudo mkfs.ext4 /dev/md127
    sudo mkdir -p /raid0 && sudo mount -t ext4 /dev/md127 /raid0
else
    echo "no ephemeral drives detected"
fi
Verify by running
sudo mdadm --detail /dev/md127
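To confirm the rebuilt array contains the expected number of drives (6 after excluding the two problematic ones), you can parse the `mdadm --detail` output. A minimal sketch, with a sample excerpt embedded so it is self-contained; on the VM you would instead capture `detail=$(sudo mdadm --detail /dev/md127)`:

```shell
# Sample `mdadm --detail` excerpt, embedded for illustration only; the exact
# formatting may vary between mdadm versions.
detail='/dev/md127:
        Raid Level : raid0
      Raid Devices : 6
             State : clean'

# Pull the value after "Raid Devices : " and compare against the expected count
raid_devices=$(printf '%s\n' "$detail" | awk -F': ' '/Raid Devices/ {print $2}')
if [ "$raid_devices" -eq 6 ]; then
    echo "array has the expected 6 devices"
else
    echo "unexpected device count: $raid_devices"
fi
```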
Please note the following important points:
- Drive serial numbers: the affected NVMe drives may be assigned different device names on each VM boot. It is therefore crucial to record the serial numbers of the problematic drives to reliably identify them.
- Ephemeral drives: since these are ephemeral drives, any changes made within the VM will be lost upon a VM STOP operation, meaning you will need to reapply the workaround after every such restart. Restarting in place through sudo reboot or a reset through the Crusoe CLI will not wipe the ephemeral disks.
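After any restart, a quick check like the following can tell you whether the workaround needs to be reapplied. This is a sketch assuming the RAID 0 layout described above (/dev/md127 mounted at /raid0); adjust the paths for the single-drive case.

```shell
# Returns success (0) when the workaround needs to be reapplied, i.e. when
# the array device is missing or /raid0 is not a mounted filesystem.
needs_workaround() {
    [ -b /dev/md127 ] && mountpoint -q /raid0 && return 1
    return 0
}

if needs_workaround; then
    echo "RAID array not present; re-run the rebuild script"
else
    echo "RAID array mounted at /raid0"
fi
```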
We appreciate your patience as we work with our vendor to resolve this issue. Your understanding during this time is greatly valued, and we are committed to providing updates as we progress through testing and finalization of the fix. For any additional questions or concerns, please reach out to support using the Crusoe Support Portal.