Introduction
On high-performance computing (HPC) clusters, warm-up delays, where basic operations such as conda activate or git status take minutes to complete, are typically caused by NFS metadata latency. This occurs when many GPU nodes simultaneously access thousands of small files (such as Python libraries) on a shared network disk, creating a "thundering herd" effect that exhausts the storage controller's metadata performance limits. This FAQ provides a tiered strategy to diagnose and eliminate these bottlenecks.
Question 1: How do I verify if the shared disk is the bottleneck?
Answer
A quick test is to time a metadata-heavy filesystem operation. Run the following command on a compute node:
time ls -R /path/to/shared/directory > /dev/null
If a recursive listing of a directory containing many small files (like a Conda environment or a Git repo) takes significantly longer on the shared drive than on a local disk, you are hitting a metadata ceiling.
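To make the comparison repeatable, the same listing can be timed against both a shared directory and a local copy of the same tree. The following is a minimal bash sketch; the function names `time_listing` and `compare_listing` and both example paths are illustrative assumptions, not part of any Crusoe tooling.

```shell
#!/usr/bin/env bash
# Sketch: time a recursive listing of two directories and print both results.

# Print the wall-clock seconds taken by `ls -R` on one directory.
time_listing() {
    local dir=$1
    local TIMEFORMAT='%R'   # bash: make `time` print only elapsed seconds
    { time ls -R "$dir" > /dev/null 2>&1; } 2>&1
}

# Compare a shared-disk directory against a local copy of the same tree.
compare_listing() {
    local shared_dir=$1 local_dir=$2
    echo "shared: $(time_listing "$shared_dir") s  ($shared_dir)"
    echo "local:  $(time_listing "$local_dir") s  ($local_dir)"
}

# Example (both paths are placeholders):
#   compare_listing /path/to/shared/directory /path/to/local/copy
```

Repeat runs may look faster because the NFS client caches file attributes, so the first run on a given node is usually the most representative.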
Question 2: Why is my IDE (VS Code/Cursor) freezing?
Answer
IDEs constantly scan folders for git status and IntelliSense. If your project is on a shared drive, the IDE hangs while waiting for the network to respond to thousands of tiny file lookups. Moving your active workspace to local storage usually stops the "Extension Host Unresponsive" errors immediately.
Question 3: How do I handle multi-node training if the storage is local?
Answer
The standard workflow is to keep your code and results on the shared drive, but copy your active environment and datasets to local NVMe storage on each node at the start of a job. You can automate this with a Slurm prologue script that runs a copy tool such as rsync.
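The staging step could look roughly like the sketch below. The function name `stage_to_local` and all paths are hypothetical; the function copies one directory tree from shared storage into a local root, using rsync when it is installed and falling back to `cp -a` otherwise.

```shell
#!/usr/bin/env bash
# Sketch: copy a directory from shared storage to node-local storage.

# $1: source directory on the shared disk
# $2: root directory on local NVMe storage
# Prints the path of the local copy on success.
stage_to_local() {
    local src=$1 dst_root=$2
    local dst
    dst="$dst_root/$(basename "$src")"
    mkdir -p "$dst" || return 1
    if command -v rsync > /dev/null 2>&1; then
        rsync -a "$src"/ "$dst"/      # archive mode keeps permissions and symlinks
    else
        cp -a "$src"/. "$dst"/        # fallback when rsync is not available
    fi || return 1
    echo "$dst"
}

# Example (placeholder paths):
#   DATA_DIR=$(stage_to_local /path/to/shared/datasets/my-dataset /path/to/local/nvme)
```

For a multi-node job the copy has to run once on every allocated node, for example from a prologue script or through `srun --ntasks-per-node=1`. Conda environments embed absolute paths, so a copied environment may need to be placed at the same path it was created under, or repackaged, before it works from the new location.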
Question 4: What is the recommended "Performance Tiering" strategy?
Answer
To maximize performance and cost efficiency, the recommended approach is to move through these three tiers:
Tier 1: Containerization (Best Practice): Package your environment using Apptainer (Singularity) or Enroot. By converting your environment into a single image file, you replace thousands of network handshakes with a single sequential read. This is the industry standard for multi-node training.
Tier 2: Local NVMe Caching: Use the high-speed local NVMe storage physically attached to every Crusoe GPU node. A Slurm prologue script can copy your active environment or datasets to local storage at the start of a job.
Tier 3: Disk Resizing: On the Crusoe platform, metadata performance scales with total disk capacity. If architectural changes are not possible, increasing your Shared Disk size via the Crusoe CLI will automatically grant the volume a higher metadata Quality of Service (QoS) ceiling.
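As an illustration of Tier 1, the sketch below wraps the two Apptainer commands involved: building a single image file and running a command inside it with GPU access. The wrapper names `build_image` and `run_in_image`, the image path, and the definition file name are assumptions made for this example; only `apptainer build`, `apptainer exec`, and the `--nv` flag come from Apptainer itself.

```shell
#!/usr/bin/env bash
# Sketch: package an environment as one image file and run inside it.

# $1: output image (.sif)
# $2: build source, such as a definition file or a docker:// URI
build_image() {
    apptainer build "$1" "$2"
}

# $1: image (.sif); remaining arguments: the command to run in the container.
# --nv exposes the host's NVIDIA GPUs and driver libraries to the container.
run_in_image() {
    local image=$1
    shift
    apptainer exec --nv "$image" "$@"
}

# Example (placeholder names):
#   build_image /path/to/shared/images/train-env.sif train-env.def
#   run_in_image /path/to/shared/images/train-env.sif python train.py
```

Because the image is one file, each node reads it sequentially from the shared disk instead of issuing a metadata lookup for every library file.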
Question 5: Why is there such a large performance difference between my /home folder and the local NVMe drive?
Answer
Local NVMe Storage: These are high-speed NVMe drives physically located inside each server. They are the fastest option but are node-local (files on Node A are not visible to Node B).
Shared Storage: This is a Network File System (NFS) that allows all nodes to see the same files simultaneously. While convenient, every file lookup must travel over the network to a storage controller. When dozens of nodes do this at once, the controller's metadata processing limit is reached, causing the slowness you see.
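If it is unclear which kind of storage backs a given path, the filesystem type can be queried directly. This is a small sketch assuming a Linux node with GNU coreutils `stat`; the function names `fs_type` and `report_fs_types` are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: show which filesystem type backs each path.

# Print the filesystem type name for one path (an NFS mount typically reports nfs).
fs_type() {
    stat -f -c %T "$1"
}

# Print a "path  type" line for every path given.
report_fs_types() {
    local path
    for path in "$@"; do
        printf '%-40s %s\n' "$path" "$(fs_type "$path")"
    done
}

# Example (the second path is a placeholder):
#   report_fs_types "$HOME" /path/to/local/nvme
```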
Question 6: When should I use local vs. shared storage?
Answer
| Operation | Recommended Path | Why? |
|---|---|---|
| Conda Env Management | Local NVMe | Prevents long wait times for environment activation/deletion. |
| Git Clone/Scanning | Local NVMe | Prevents IDEs from hanging during file scans. |
| Active Model Training | Local NVMe | Maximizes GPU utilization by preventing network data starvation. |
| Large Model Weights | Shared Disk | Best for long-term storage of large, static files. |
| Multi-Node Results | Shared Disk | Ensures all nodes can write to a single global state. |
Question 7: How do I confirm the network fabric is healthy?
Answer
If raw throughput is a concern, run a neper test between two nodes. If neper shows high speeds (e.g., >20 Gbps) but your filesystem remains sluggish, the issue is most likely storage metadata latency rather than the Crusoe network fabric.
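A two-node neper run could be sketched as follows, assuming neper's `tcp_stream` binary is already built and on the PATH of both nodes. The wrapper name `run_neper_client` and the host name in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run a neper TCP throughput test between two nodes.

# On the first node (server side), start the listener with no arguments:
#   tcp_stream

# On the second node (client side), connect to the server.
# $1: host name or IP address of the server node
run_neper_client() {
    local server=$1
    tcp_stream -c -H "$server"   # -c: client mode, -H: server host
}

# Example (placeholder host name):
#   run_neper_client node-a
```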