Earlier this year, I was fortunate to join Perplexity AI, where I finally got to use servers with the most powerful configuration available: AWS p5 instances, each equipped with 8 NVIDIA H100 GPUs interconnected via NVSwitch. What excited me even more was the ultra-high-speed 3200 Gbps network between servers. I thought it would be incredibly cool to write a program that could saturate the full 3200 Gbps of bandwidth!

Recently, I spent a week exploring this, built a small proof-of-concept program, and managed to achieve 97% of that bandwidth. The exploration turned out to be a lot of fun, and since there are very few articles and tutorials online about RDMA, EFA, libfabric, and high-performance networking in general, I decided to share what I learned during that week, both as a record for myself and as a beginner's tutorial.
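For a sense of scale, here is my own back-of-the-envelope arithmetic (the numbers, but not this calculation, come from the text): 97% of 3200 Gbps works out to roughly 388 GB/s of payload moving between a pair of servers.

```python
# Back-of-the-envelope check of what 97% of 3200 Gbps means.
NETWORK_GBPS = 3200        # aggregate EFA bandwidth of an AWS p5 instance
ACHIEVED_FRACTION = 0.97   # fraction of line rate the proof-of-concept reached

achieved_gbps = NETWORK_GBPS * ACHIEVED_FRACTION
achieved_gb_per_s = achieved_gbps / 8  # bits -> bytes

print(f"{achieved_gbps:.0f} Gbps = {achieved_gb_per_s:.0f} GB/s")
# → 3104 Gbps = 388 GB/s
```

In other words, sustaining that rate means pushing nearly 400 GB of data every second through the NICs, which is why every layer of the stack matters.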

Those familiar with MLSys might ask: Couldn’t this be done with just one line of PyTorch or NCCL code? Indeed, NCCL is very mature in terms of collective communication and is a cornerstone for large language model training and inference. However, I think collective communication has some limitations in other scenarios:

  1. Collective communication requires establishing a global communication domain (the MPI World). If you want to dynamically add, remove, or replace nodes in the cluster, you first have to stop the entire cluster.
  2. Collective communication uses a synchronous communication model. Whether implemented with blocking or non-blocking calls, it imposes a significant mental burden on me. I'm more comfortable with an asynchronous communication model like gRPC's.
  3. Most importantly, isn’t it fun to reinvent the wheel yourself?

Since my experimental environment is an AWS p5 cluster, some technical details mentioned in this article might only apply to AWS p5 clusters. However, I hope this article can still provide valuable references for other high-performance networking environments.

Because there’s quite a bit of content, I’ve split it into several articles, which you’re welcome to read:

During my exploration and writing process, I received help from the following friends, whom I’d like to thank: Brian Barrett and Shi Jin. I’m also very grateful to my company for allowing me to spend time exploring these technologies and providing such powerful hardware.

The following reference materials were particularly useful:

  • The official libfabric documentation, especially the introductory fi_intro(7) and fi_arch(7) man pages. For details of a specific API, refer to that API’s own documentation.
  • The aws-ofi-nccl code. This is AWS’s NCCL plugin, and it forms the communication foundation for all multi-node machine learning workloads running on AWS. The codebase contains many practical examples of using libfabric and the efa provider.
  • The fabtests code. This is libfabric’s test suite, which can also be used to benchmark libfabric performance.

The article’s code can be found on GitHub: https://github.com/abcdabcd987/libfabric-efa-demo