NIXL Memory Registration Errors on InfiniBand-Enabled GPU Instances

Apeksha Khilari

Introduction

Workloads that use the NVIDIA Inference Xfer Library (NIXL) on Crusoe GPU instances with InfiniBand networking may encounter memory registration errors that cause them to fail. These errors are not caused by hardware faults, driver crashes, or network failures. They stem from a known behavior in how NIXL registers GPU memory for RDMA, which has been addressed by fixes in both the NIXL and UCX libraries.

This article explains what causes these errors, what they look like, and what steps you can take to resolve or mitigate them.

ℹ️ Note: For a full technical walkthrough of how Crusoe diagnosed this issue and drove the upstream fix, see the Crusoe engineering blog: How Crusoe's root cause analysis drove a 70% reduction in NIXL's memory footprint.

Prerequisites

  • InfiniBand-Enabled GPU VM
  • Workload Uses the NIXL Library
  • Memory Registration Failures or UCX/RDMA Errors During Workload Execution

Cause

NVIDIA ConnectX-7 InfiniBand NICs allocate host kernel memory, referred to as ICM or firmware pages, to support RDMA resources such as queue pairs, memory regions, and flow tables. By default, NIXL registers each GPU buffer on all available InfiniBand NICs at once. On an 8-NIC host (standard for HGX H100, H200, and B200 nodes), every buffer registration therefore produces 8 firmware-level registrations on the host.

On top of this fan-out, NIXL also overrides UCX's IB_PCI_RELAXED_ORDERING setting, changing it to try. On hardware that supports relaxed ordering, this causes UCX to create two memory regions (MRs) per NIC per buffer: a standard MR plus a relaxed-ordering MR. Combined, this results in up to 16 firmware-level registrations per buffer:

| NIXL Configuration | Registrations per Buffer |
|---|---|
| NIXL with UCX_IB_PCI_RELAXED_ORDERING=auto | 8 (one per HCA) |
| NIXL default (try) | 16 (two MRs × 8 NICs) |
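
The counts in the table follow from a simple multiplication: NICs per host times memory regions per NIC. A minimal shell sketch of that arithmetic is shown here; the function name regs_per_buffer is illustrative only and is not part of NIXL or UCX.

```shell
# Firmware-level registrations per buffer = NICs x memory regions (MRs) per NIC.
# regs_per_buffer is a hypothetical helper for illustration only.
regs_per_buffer() {
  nics=$1
  mrs_per_nic=$2
  echo $(( nics * mrs_per_nic ))
}

regs_per_buffer 8 2   # NIXL default (try): standard MR + relaxed-ordering MR on each of 8 NICs
regs_per_buffer 8 1   # UCX_IB_PCI_RELAXED_ORDERING=auto: one MR on each of 8 NICs
```

The two calls print 16 and 8, matching the table rows.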

At production-scale buffer sizes (40–80 GB per GPU), this can consume tens of gigabytes of pinned host memory through firmware-page allocations, potentially triggering host OOM conditions or exceeding Crusoe's per-VF firmware page cap — a host-side protection that limits how much host memory any single VM can drive through RDMA registrations.

Identifying the Error

When your workload encounters these errors, you will see output similar to the following in your application logs:

[*] Registering 8 memory regions with NIXL...
  UCX  ERROR ibv_reg_dmabuf_mr(address=0x7f7980000000, length=42949672960, access=0x10000f)
      failed: Resource temporarily unavailable
  UCX  ERROR failed to register address 0x7f7980000000 (cuda) length 42949672960
      dmabuf_fd 438 on md[11]=mlx5_5: Input/output error (md supports: host)
  Failed to ucp_mem_map: Input/output error
  registerMem: registration failed for the specified or all potential backends
  nixlBackendError: NIXL_ERR_BACKEND

ℹ️ Note: The dmabuf_fd number and md[] device index will vary depending on your environment.

Key indicators:

  • Resource temporarily unavailable or Input/output error on ibv_reg_dmabuf_mr
  • NIXL_ERR_BACKEND from the NIXL Python bindings
  • Errors referencing mlx5_* devices

This is not a NIC hardware failure or a network issue.
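
To check a saved log for the indicators listed above, a plain grep is usually enough. The sketch below assumes your application log is a plain-text file; the function names and the app.log path are placeholders, not part of any Crusoe, NIXL, or UCX tooling.

```shell
# Count log lines that match the key indicators from this article.
# Prints 0 when nothing matches.
count_nixl_reg_errors() {
  grep -E -c 'ibv_reg_dmabuf_mr|Failed to ucp_mem_map|NIXL_ERR_BACKEND' "$1"
}

# List the distinct mlx5_<n> devices named in the log.
affected_hcas() {
  grep -E -o 'mlx5_[0-9]+' "$1" | sort -u
}

# Example usage (app.log is a placeholder path):
#   count_nixl_reg_errors app.log
#   affected_hcas app.log
```

A non-zero count, together with one or more mlx5 devices in the second command's output, suggests the registration failure described in this article rather than a hardware or network fault.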

Instructions

Option 1: Upgrade NIXL and UCX (Recommended)

This issue has been resolved upstream. The fix introduces a new UCX parameter, UCX_MAX_HCA_PER_GPU, which NIXL sets to auto when it detects a compatible UCX version. This causes NIXL to register each buffer only on the topologically closest NICs rather than all 8, reducing the per-buffer host memory footprint by up to 70%, depending on SKU and workload size.

To pick up the fix, your workload needs both of the following:

  • NIXL v1.2.0 or later
  • UCX v1.21.0-rc1 or later

The fix requires no kernel, driver, or firmware changes.
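
One way to confirm that an environment is new enough is to compare the installed versions against NIXL v1.2.0 and UCX v1.21.0-rc1, the releases named in the Resolution section. The helper below is a hypothetical sketch that relies on sort -V (available in GNU coreutils). It strips any pre-release suffix such as -rc1 before comparing, so it treats 1.21.0-rc1 and 1.21.0 as equal. How you obtain the installed version strings (for example, ucx_info -v for UCX, or your package manager for NIXL) depends on how the libraries were installed.

```shell
# version_ge HAVE WANT: succeed if HAVE >= WANT.
# The body runs in a subshell (parentheses) so its variables do not leak.
version_ge() (
  have=${1%%-*}   # drop pre-release suffix, e.g. 1.21.0-rc1 -> 1.21.0
  want=${2%%-*}
  lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
  [ "$lowest" = "$want" ]
)

# Example with hard-coded version strings; substitute the versions
# actually installed in your environment.
if version_ge "1.21.0-rc1" "1.21.0" && version_ge "1.2.0" "1.2.0"; then
  echo "UCX and NIXL meet the minimum versions for the fix"
else
  echo "Upgrade required"
fi
```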

Option 2: Workaround If You Cannot Upgrade Immediately

If upgrading NIXL or UCX is not immediately possible, set UCX_IB_PCI_RELAXED_ORDERING=auto as a stopgap. NIXL respects this environment variable if it is set before the process starts and will not override it. This halves the number of firmware-level registrations per buffer, roughly doubling the GPU memory you can register before hitting an error. It does not eliminate the issue: you will still encounter the same error at a higher threshold.

export UCX_IB_PCI_RELAXED_ORDERING=auto

⚠️ Warning: Relaxed ordering can improve RDMA throughput on some hardware configurations. Setting this to auto may reduce transfer performance depending on your workload. Benchmark before adopting it as a permanent setting. Upgrading NIXL and UCX remains the recommended fix.
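
Because the warning above recommends benchmarking, it can help to scope the variable to a single command rather than exporting it for the whole shell, so that runs with and without the workaround can be compared side by side. The following is a minimal sketch using the standard env utility; run_with_workaround is a hypothetical wrapper name and my_workload.py is a placeholder.

```shell
# Run one command with the workaround applied, leaving the calling shell unchanged.
run_with_workaround() {
  env UCX_IB_PCI_RELAXED_ORDERING=auto "$@"
}

# Show the value the child process sees:
run_with_workaround sh -c 'echo "UCX_IB_PCI_RELAXED_ORDERING=$UCX_IB_PCI_RELAXED_ORDERING"'

# Example usage (placeholder workload):
#   run_with_workaround python3 my_workload.py
```

The variable is placed in the child's environment before it starts, which is the condition under which NIXL leaves the setting alone.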

For any further questions, contact Crusoe Support with:

  • A description of your workload and typical GPU memory registration size per operation
  • The error output from your logs
  • Your instance type and region

Resolution

This issue is caused by NIXL's default behavior of registering each GPU buffer across all 8 InfiniBand NICs simultaneously, combined with a UCX relaxed-ordering override that doubles the per-buffer registration count. The upstream fix in NIXL v1.2.0 and UCX v1.21.0-rc1 introduces UCX_MAX_HCA_PER_GPU, which limits registrations to the topologically closest NICs only — reducing the host memory footprint by up to 70%. The UCX_IB_PCI_RELAXED_ORDERING=auto workaround provides temporary relief by halving registrations but should only be used until an upgrade is possible.
