How To Enable GPU Direct Storage (GDS) on Crusoe GPU Instances

Rohit Kalmankar

Introduction

GPU Direct Storage (GDS) enables direct data transfers between storage and GPU memory, bypassing the CPU bounce buffer in system memory. Without GDS, data is first copied into a CPU buffer before it reaches the GPU, which adds latency and CPU overhead and leaves the GPU idle while it waits for data.

For AI training workloads, enabling GDS can improve Model FLOPs Utilization (MFU) by reducing the time the GPU spends waiting on data instead of computing. This is particularly relevant for large model training, where checkpoint loading and dataset I/O are on the critical path.

Important: Enabling GDS on the node is only half of the equation. Your application must also use the cuFile API (part of the NVIDIA cuFile library) to perform direct reads and writes that bypass the CPU buffer. Frameworks and libraries that already support cuFile will automatically benefit from GDS once it is enabled on the node:

  • NVIDIA DALI — native GDS support, enable with DALI_USE_GDS=1
  • RAPIDS (cuDF, cuIO) — GDS-aware by default when GDS is enabled
  • PyTorch — supported via NVIDIA DALI or cuFile-backed data loaders

If you are writing custom data loading code, refer to the cuFile API documentation to implement direct storage access using cuFileRead/cuFileWrite instead of standard POSIX read/write.

Crusoe manages GPU drivers via the NVIDIA GPU Operator, which exposes an NvidiaDrivers custom resource per GPU type. This lets you enable GDS selectively for specific GPU types in a mixed cluster without affecting other node pools.

Prerequisites

  • kubectl configured with access to your Crusoe Kubernetes cluster
  • NVIDIA GPU Operator installed in the nvidia-gpu-operator namespace
  • Cluster admin permissions to patch custom resources
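
The permission prerequisite can be checked before you start. The following is a minimal sketch built on kubectl auth can-i; the function name can_patch_drivers is hypothetical, and the default namespace is the one listed in the prerequisites above.

```shell
# Hypothetical helper (not part of any Crusoe or NVIDIA tooling):
# returns 0 if the current kubectl user may patch NvidiaDrivers resources.
can_patch_drivers() {
  ns="${1:-nvidia-gpu-operator}"
  # `kubectl auth can-i` prints "yes" or "no"; ignore its exit status here.
  answer="$(kubectl auth can-i patch nvidiadrivers.nvidia.com -n "$ns" 2>/dev/null || true)"
  [ "$answer" = "yes" ]
}

# Usage:
#   can_patch_drivers && echo "OK to proceed" || echo "missing patch permission"
```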

Step-by-Step Instructions

1. Identify the NvidiaDrivers resource for your GPU type

List the available NvidiaDrivers custom resources in your cluster:

kubectl get nvidiadrivers.nvidia.com -n nvidia-gpu-operator

Example output:

NAME       AGE
h100       5d
a100       5d

The resource name corresponds to your GPU type. Valid values are: h100, h200, b200, a100, l40s, a100-non-ib.
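
Before patching, it can help to check whether a resource already carries a GDS block, because a JSON Patch "add" on an existing path replaces the value stored there. A minimal sketch; show_gds_spec is a hypothetical helper name, and empty output is taken to mean that spec.gds is not set.

```shell
# Hypothetical helper: print the current spec.gds block of one NvidiaDrivers
# resource. Empty output suggests GDS has not been configured yet.
show_gds_spec() {
  gpu_type="$1"
  kubectl get nvidiadrivers.nvidia.com "$gpu_type" \
    -n nvidia-gpu-operator -o jsonpath='{.spec.gds}'
}

# Usage:
#   show_gds_spec h100
```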

2. Enable GDS for your GPU type

Patch the NvidiaDrivers resource to enable the GDS component:

kubectl patch nvidiadrivers.nvidia.com <GPU-TYPE> -n nvidia-gpu-operator --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/gds",
    "value": {
      "enabled": true,
      "image": "nvidia-fs",
      "imagePullPolicy": "IfNotPresent",
      "repository": "nvcr.io/nvidia/cloud-native",
      "version": "<nvidia-fs-version>"
    }
  }
]'

Replace <GPU-TYPE> with your GPU type and <nvidia-fs-version> with the appropriate version for your driver.

Note: To find the correct nvidia-fs version for your GPU type and driver version, refer to the NVIDIA GPU Operator release notes or contact Crusoe support.
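
To avoid hand-editing the JSON, the patch can be generated from a variable and validated with a server-side dry run before it is applied. This is a sketch under assumptions: build_gds_patch and dry_run_gds_patch are hypothetical helper names, and the version argument is whatever value you obtained from the release notes or Crusoe support.

```shell
# Hypothetical helper: emit the JSON patch from step 2 for a given
# nvidia-fs version (passed as the first argument).
build_gds_patch() {
  version="$1"
  cat <<EOF
[
  {
    "op": "add",
    "path": "/spec/gds",
    "value": {
      "enabled": true,
      "image": "nvidia-fs",
      "imagePullPolicy": "IfNotPresent",
      "repository": "nvcr.io/nvidia/cloud-native",
      "version": "${version}"
    }
  }
]
EOF
}

# Hypothetical wrapper: ask the API server to validate the patch without
# persisting it (--dry-run=server).
dry_run_gds_patch() {
  gpu_type="$1"
  version="$2"
  kubectl patch nvidiadrivers.nvidia.com "$gpu_type" -n nvidia-gpu-operator \
    --type=json --dry-run=server -p "$(build_gds_patch "$version")"
}

# Usage:
#   dry_run_gds_patch <GPU-TYPE> <nvidia-fs-version>
```

If the dry run succeeds, rerun the same kubectl patch command without --dry-run=server to apply the change.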

3. Verify GDS is enabled

Once patched, the NVIDIA GPU driver DaemonSet pods will be updated to include an additional container called nvidia-fs-ctr. Verify it is running:

kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset

Then confirm the new container is present and running:

kubectl describe pod <driver-pod-name> -n nvidia-gpu-operator | grep -A 15 nvidia-fs-ctr

You should see nvidia-fs-ctr listed as a container, with State: Running a few lines below its name.
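
For a scriptable version of this check, the container's readiness can be read from the pod status with a jsonpath query. A minimal sketch; gds_container_ready is a hypothetical helper name, and the container name nvidia-fs-ctr is the one given in this step.

```shell
# Hypothetical helper: succeed only if the nvidia-fs-ctr container in the
# given driver pod reports ready=true.
gds_container_ready() {
  pod="$1"
  # Print one "name=ready" line per container, then look for an exact match.
  kubectl get pod "$pod" -n nvidia-gpu-operator \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"="}{.ready}{"\n"}{end}' \
    | grep -qx 'nvidia-fs-ctr=true'
}

# Usage:
#   gds_container_ready <driver-pod-name> && echo "GDS container is ready"
```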

4. (If needed) Increase containerd file descriptor limits

On some GPU types (notably B200), the nvidia-fs-ctr container may hit the default file descriptor limit and fail to start. If you observe driver pods stuck in CrashLoopBackOff or Error state, list them together with the nodes they run on (the NODE column):

kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o wide | grep -v Running
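
If you need only the node names, for example to loop over them with SSH, the pod list can be reduced to the nodes that host a driver pod with at least one container that is not ready. A sketch under assumptions: nodes_with_unready_driver_pods is a hypothetical helper name, and it uses the same label selector as the command above.

```shell
# Hypothetical helper: print the unique node names whose driver pod has a
# container that is not ready (or has no container status yet).
nodes_with_unready_driver_pods() {
  kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset \
    --no-headers \
    -o custom-columns='NODE:.spec.nodeName,READY:.status.containerStatuses[*].ready' \
    | awk '$2 ~ /false/ || $2 == "<none>" { print $1 }' \
    | sort -u
}

# Usage:
#   nodes_with_unready_driver_pods
```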

SSH into each affected node and run:

sudo mkdir -p /etc/systemd/system/containerd.service.d

cat <<EOF | sudo tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=131072
EOF

sudo systemctl daemon-reload && sudo systemctl restart containerd

This increases the open file descriptor limit for containerd to 131072. After restarting, the nvidia-fs-ctr container should start successfully.
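
To confirm that the override took effect, you can query the limit systemd reports for the containerd unit. A minimal sketch to run on the node; containerd_nofile_at_least is a hypothetical helper name, and it assumes a systemd version that supports the --value option of systemctl show.

```shell
# Hypothetical helper: succeed if containerd's LimitNOFILE is at least the
# requested value (default 131072, the value set in the override above).
containerd_nofile_at_least() {
  want="${1:-131072}"
  have="$(systemctl show containerd --property=LimitNOFILE --value 2>/dev/null || true)"
  # systemd reports an unlimited setting as "infinity".
  [ "$have" = "infinity" ] && return 0
  # Reject empty or non-numeric output.
  case "$have" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$have" -ge "$want" ]
}

# Usage:
#   containerd_nofile_at_least 131072 && echo "limit OK" || echo "limit too low"
```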


Example

The following shows enabling GDS on an h100 node pool in a cluster that also has a100 nodes. Because GDS is patched per GPU type, the a100 pool is unaffected.

$ kubectl get nvidiadrivers.nvidia.com -n nvidia-gpu-operator
NAME    AGE
a100    10d
h100    10d

$ kubectl patch nvidiadrivers.nvidia.com h100 -n nvidia-gpu-operator --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/gds",
    "value": {
      "enabled": true,
      "image": "nvidia-fs",
      "imagePullPolicy": "IfNotPresent",
      "repository": "nvcr.io/nvidia/cloud-native",
      "version": "<nvidia-fs-version>"
    }
  }
]'
nvidiadrivers.nvidia.com/h100 patched

$ kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
NAME                                   READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-h100-xxxxx     4/4     Running   0          2m
nvidia-driver-daemonset-a100-xxxxx     3/3     Running   0          10d
