FAQ: Slow Performance and Hanging on Shared Storage

Introduction

On high-performance computing (HPC) clusters, warm-up delays, where basic operations such as conda activate or git status take minutes to complete, are typically caused by NFS metadata latency. This occurs when multiple GPU nodes simultaneously access thousands of small files (such as Python libraries) on a shared network disk, creating a thundering herd effect that exhausts the storage controller's metadata performance limits. This FAQ provides a tiered strategy to diagnose and eliminate these bottlenecks.

Question 1: How do I verify if the shared disk is the bottleneck?

Answer

A reliable test is to time a metadata-heavy filesystem operation, such as a recursive directory listing. Run the following command on a compute node:

time ls -R /path/to/shared/directory > /dev/null

If a recursive listing of a directory containing many small files (like a Conda environment or a Git repo) takes significantly longer on the shared drive than it does on a local disk, you are hitting a metadata ceiling.
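The comparison described above can be sketched as a small script. This is a minimal illustration, not an official tool: the function name time_listing_ms and the script name in the comment are hypothetical, and it assumes GNU date (for nanosecond timestamps), as found on typical Linux compute nodes. Pass it the shared directory and a copy of the same directory on local disk.

```shell
#!/usr/bin/env bash
# Sketch: compare recursive-listing time on a shared path vs. a local path.
# Assumes GNU date (%N gives nanoseconds), as on typical Linux nodes.

# Print the wall-clock time, in milliseconds, of `ls -R` on a directory.
time_listing_ms() {
    local dir="$1" start end
    start=$(date +%s%N)
    ls -R "$dir" > /dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Run the comparison only when both paths are given as arguments, e.g.
#   ./compare_listing.sh /path/to/shared/directory /path/to/local/copy
if [ "$#" -ge 2 ]; then
    echo "shared: $(time_listing_ms "$1") ms"
    echo "local:  $(time_listing_ms "$2") ms"
fi
```

Client-side caching can make a repeated run faster than the first, so it is worth running the comparison more than once and noting the first (cold) result for each path.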

Question 2: Why is my IDE (VS Code/Cursor) freezing?

Answer

IDEs continuously scan project folders to update Git status and IntelliSense indexes. If your project is on a shared drive, the IDE hangs while waiting for the network to respond to thousands of tiny file lookups. Moving your active workspace to local storage usually stops the "Extension Host Unresponsive" errors immediately.

Question 3: How do I handle multi-node training if the storage is local?

Answer

The standard workflow is to keep your code and results on the shared drive, but copy your active environment and datasets to the local NVMe storage on each node at the start of a job. You can automate this with a Slurm prologue script or an rsync step in your job script.
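The staging step could look something like the sketch below. The function name stage_to_local and every path in the comments are hypothetical placeholders, not Crusoe defaults. Substitute the mount points used on your cluster. The function falls back to cp if rsync is not installed.

```shell
#!/usr/bin/env bash
# Sketch: copy an environment or dataset from shared storage to node-local disk.
# All paths mentioned below are hypothetical placeholders.

stage_to_local() {
    local src="$1" dest="$2"
    mkdir -p "$dest" || return 1
    if command -v rsync > /dev/null 2>&1; then
        # -a preserves permissions, timestamps, and symlinks.
        rsync -a "$src"/ "$dest"/
    else
        # Fall back to cp when rsync is not installed.
        cp -a "$src"/. "$dest"/
    fi
}

# Example call inside a job script (placeholder paths):
#   stage_to_local /shared/envs/myenv "/local/nvme/$SLURM_JOB_ID/myenv"
# In a multi-node job, run it once per node, for example under
#   srun --ntasks-per-node=1 ...
```

Because local storage is per node, each node in a multi-node job needs its own copy. Results should still be written back to the shared drive before the job ends.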

Question 4: What is the recommended "Performance Tiering" strategy?

Answer

To maximize performance and cost efficiency, the recommended approach is to move through these three tiers:

Tier 1: Containerization (Best Practice): Package your environment using Apptainer (Singularity) or Enroot. By converting your environment into a single image file, you replace thousands of network handshakes with a single sequential read. This is the industry standard for multi-node training.
Tier 2: Local NVMe Caching: Use the high-speed local NVMe storage physically attached to every Crusoe GPU node. A Slurm prologue script can copy your active environment or datasets to local storage at the start of a job.
Tier 3: Disk Resizing: On the Crusoe platform, metadata performance scales with total disk capacity. If architectural changes are not possible, increasing your Shared Disk size via the Crusoe CLI will automatically grant the volume a higher metadata Quality of Service (QoS) ceiling.
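As a rough illustration of Tier 1, the sketch below wraps the two Apptainer commands involved: building a single image file once, then running commands from it on each node. The function names and the file names env.def and env.sif are hypothetical, and the sketch assumes apptainer is installed on the nodes. The helper returns 127 if it is not.

```shell
#!/usr/bin/env bash
# Sketch: build one image file, then run commands from it on each node.
# File names such as env.def and env.sif are hypothetical.

require_apptainer() {
    if ! command -v apptainer > /dev/null 2>&1; then
        echo "apptainer not found on PATH" >&2
        return 127
    fi
}

# Build a single image file from a definition file (done once).
build_image() {
    local image="$1" definition="$2"
    require_apptainer || return 127
    apptainer build "$image" "$definition"
}

# Run a command inside the image; --nv exposes the host's NVIDIA GPUs.
run_in_image() {
    local image="$1"
    shift
    require_apptainer || return 127
    apptainer exec --nv "$image" "$@"
}

# Example (placeholder names):
#   build_image env.sif env.def
#   run_in_image env.sif python train.py
```

The image can stay on the shared disk: each node then reads one large file sequentially instead of opening thousands of small ones.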

Question 5: Why is there such a large performance difference between my /home folder and the local NVMe drive?

Answer

Local NVMe Storage: These are high-speed NVMe drives physically located inside each server. They are the fastest option but are node-local (files on Node A are not visible to Node B).
Shared Storage: This is a Network File System (NFS) that allows all nodes to see the same files simultaneously. While convenient, every file lookup must travel over the network to a storage controller. When dozens of nodes do this at once, the controller's metadata processing limit is reached, causing the slowness you see.

Question 6: When should I use local vs. shared storage?

Answer

Operation	Recommended Path	Why?
Conda Env Management	Local NVMe	Prevents long wait times for environment activation/deletion.
Git Clone/Scanning	Local NVMe	Prevents IDEs from hanging during file scans.
Active Model Training	Local NVMe	Maximizes GPU utilization by preventing network data starvation.
Large Model Weights	Shared Disk	Best for long-term storage of large, static files.
Multi-Node Results	Shared Disk	Ensures all nodes can write to a single global state.

Question 7: How do I confirm the network fabric is healthy?

Answer

If raw throughput is a concern, run a neper test between two nodes. If neper shows high speeds (e.g., >20 Gbps) but your filesystem remains sluggish, the problem points to storage metadata latency rather than the Crusoe network fabric.

