Introduction
GPUDirect Storage (GDS) enables direct data transfers between storage and GPU memory, bypassing the CPU bounce buffer in system memory. Without GDS, data is staged in a CPU buffer before reaching the GPU, which adds latency and CPU overhead and reduces the effective GPU utilization available for compute.
For AI training workloads, enabling GDS can improve Model FLOPs Utilization (MFU) by reducing the time the GPU spends waiting on data rather than computing. This is particularly relevant for large model training, where checkpoint loading and dataset I/O are on the critical path.
Important: Enabling GDS on the node is only half of the equation. Your application must also use the cuFile API (part of the NVIDIA cuFile library) to perform direct reads and writes that bypass the CPU buffer. Frameworks and libraries that already support cuFile will automatically benefit from GDS once it is enabled on the node:
- NVIDIA DALI: native GDS support, enable with DALI_USE_GDS=1
- RAPIDS (cuDF, cuIO): GDS-aware by default when GDS is enabled
- PyTorch: supported via NVIDIA DALI or cuFile-backed data loaders
If you are writing custom data loading code, refer to the cuFile API documentation to implement direct storage access using cuFileRead/cuFileWrite instead of standard POSIX read/write.
Crusoe manages GPU drivers via the NVIDIA GPU Operator, which exposes a NvidiaDrivers custom resource per GPU type. This lets you enable GDS selectively for specific GPU types in a mixed cluster — without affecting other node pools.
Prerequisites
- kubectl configured with access to your Crusoe Kubernetes cluster
- NVIDIA GPU Operator installed (nvidia-gpu-operator namespace)
- Cluster admin permissions to patch custom resources
Step-by-Step Instructions
1. Identify the NvidiaDrivers resource for your GPU type
List the available NvidiaDrivers custom resources in your cluster:
kubectl get nvidiadrivers.nvidia.com -n nvidia-gpu-operator
Example output:
NAME   AGE
h100   5d
a100   5d
The resource name corresponds to your GPU type. Valid values are: h100, h200, b200, a100, l40s, a100-non-ib.
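Before patching, it can help to confirm that the name you plan to use is one of the GPU types listed above. The following is a minimal sketch: the function name is_valid_gpu_type is hypothetical, and the list is copied from this article rather than queried from the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the argument is one of the GPU types
# listed in this article. The list is static; it is not read from the cluster.
is_valid_gpu_type() {
  case "$1" in
    h100|h200|b200|a100|l40s|a100-non-ib) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use before patching (GPU_TYPE is a variable you set yourself):
# is_valid_gpu_type "$GPU_TYPE" || echo "unknown GPU type: $GPU_TYPE" >&2
```

A name that passes this check may still be absent from your cluster, so the kubectl get output remains the authoritative list.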
2. Enable GDS for your GPU type
Patch the NvidiaDrivers resource to enable the GDS component:
kubectl patch nvidiadrivers.nvidia.com <GPU-TYPE> -n nvidia-gpu-operator --type='json' -p='[
{
"op": "add",
"path": "/spec/gds",
"value": {
"enabled": true,
"image": "nvidia-fs",
"imagePullPolicy": "IfNotPresent",
"repository": "nvcr.io/nvidia/cloud-native",
"version": "<nvidia-fs-version>"
}
}
]'
Replace <GPU-TYPE> with your GPU type and <nvidia-fs-version> with the appropriate version for your driver.
Note: To find the correct nvidia-fs version for your GPU type and driver version, refer to the NVIDIA GPU Operator release notes or contact Crusoe support.
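If you patch more than one GPU type, generating the patch body from a function avoids copy-and-paste mistakes. The sketch below reproduces the patch from this step. The function name gds_patch_json and the variables GPU_TYPE and NVIDIA_FS_VERSION are hypothetical, and the version must still come from the release notes or Crusoe support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON patch from step 2 for a given nvidia-fs
# version. The caller supplies the version; no default is assumed.
gds_patch_json() {
  local version="${1:?usage: gds_patch_json <nvidia-fs-version>}"
  cat <<EOF
[
  {
    "op": "add",
    "path": "/spec/gds",
    "value": {
      "enabled": true,
      "image": "nvidia-fs",
      "imagePullPolicy": "IfNotPresent",
      "repository": "nvcr.io/nvidia/cloud-native",
      "version": "${version}"
    }
  }
]
EOF
}

# Example use (requires cluster access). --dry-run=server asks the API server
# to validate the patch without persisting it; drop the flag to apply it.
# kubectl patch nvidiadrivers.nvidia.com "$GPU_TYPE" -n nvidia-gpu-operator \
#   --type='json' -p "$(gds_patch_json "$NVIDIA_FS_VERSION")" --dry-run=server
```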
3. Verify GDS is enabled
Once patched, the NVIDIA GPU driver DaemonSet pods will be updated to include an additional container called nvidia-fs-ctr. Verify it is running:
kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
Then confirm the new container is present and running:
kubectl describe pod <driver-pod-name> -n nvidia-gpu-operator | grep nvidia-fs-ctr
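You should see nvidia-fs-ctr listed as a container. Because grep prints only the matching line, run the describe command without the filter (or add -A 10 to grep) to confirm that the container's State is Running.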
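For a scriptable check, the container state can also be read from the pod status with a JSONPath query. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes the container is named nvidia-fs-ctr as described in this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the start time of the nvidia-fs-ctr container in
# the given driver pod. Prints nothing if the container is not running.
fs_ctr_started_at() {
  local pod="$1"
  kubectl get pod "$pod" -n nvidia-gpu-operator \
    -o jsonpath='{.status.containerStatuses[?(@.name=="nvidia-fs-ctr")].state.running.startedAt}'
}

# Succeed only if the container reports a running state (non-empty start time).
fs_ctr_is_running() {
  [ -n "$(fs_ctr_started_at "$1")" ]
}

# Example use (requires cluster access):
# fs_ctr_is_running <driver-pod-name> && echo "nvidia-fs-ctr is running"
```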
4. (If needed) Increase containerd file descriptor limits
On some GPU types (notably B200), the nvidia-fs-ctr container may hit the default file descriptor limit, causing it to fail to start. If you observe pods stuck in CrashLoopBackOff or Error state, run the following to identify affected nodes:
kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o wide | grep -v Running
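When many driver pods are involved, a list of node names is easier to act on than a filtered pod table. The sketch below reads one line per pod containing the node name, a tab, and the ready flags of that pod's containers, then prints each node that has at least one container that is not ready. The function name is hypothetical, and pods that have not been scheduled or have no container statuses yet are not reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<node>\t<ready flags>" lines on stdin and print
# each node that has at least one container reporting ready=false.
nodes_with_unready_containers() {
  awk -F'\t' '$2 ~ /false/ { print $1 }' | sort -u
}

# Example use (requires cluster access):
# kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
#   -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.containerStatuses[*].ready}{"\n"}{end}' \
#   | nodes_with_unready_containers
```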
SSH into each affected node and run:
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<EOF | sudo tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=131072
EOF
sudo systemctl daemon-reload && sudo systemctl restart containerd
This increases the open file descriptor limit for containerd to 131072. After restarting, the nvidia-fs-ctr container should start successfully.
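To confirm that the override took effect, you can ask systemd for the limit it applies to the containerd unit. The following sketch parses the LimitNOFILE=<value> line printed by systemctl show. The function name nofile_at_least is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a "LimitNOFILE=<value>" line on stdin and succeed
# if the value is at least the given minimum, or is reported as "infinity".
nofile_at_least() {
  local min="$1" line="" value
  IFS= read -r line || true          # tolerate input without a trailing newline
  value="${line#LimitNOFILE=}"       # strip the property name
  if [ "$value" = "infinity" ]; then
    return 0
  fi
  # A non-numeric value makes the comparison fail, which counts as "not OK".
  [ "$value" -ge "$min" ] 2>/dev/null
}

# Example use on the node:
# systemctl show containerd -p LimitNOFILE | nofile_at_least 131072 && echo "limit OK"
```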
Example
The following shows enabling GDS on an h100 node pool in a cluster that also has a100 nodes. Because GDS is patched per GPU type, the a100 pool is unaffected.
$ kubectl get nvidiadrivers.nvidia.com -n nvidia-gpu-operator
NAME AGE
a100 10d
h100 10d
$ kubectl patch nvidiadrivers.nvidia.com h100 -n nvidia-gpu-operator --type='json' -p='[
{
"op": "add",
"path": "/spec/gds",
"value": {
"enabled": true,
"image": "nvidia-fs",
"imagePullPolicy": "IfNotPresent",
"repository": "nvcr.io/nvidia/cloud-native",
"version": "<nvidia-fs-version>"
}
}
]'
nvidiadrivers.nvidia.com/h100 patched
$ kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
NAME READY STATUS RESTARTS AGE
nvidia-driver-daemonset-h100-xxxxx 4/4 Running 0 2m
nvidia-driver-daemonset-a100-xxxxx 3/3 Running 0 10d