---
Quick Follow-up on Inter-node RL Weight Transfer
In the previous blog post, I walked through how we achieved cross-machine RL weight updates in just 2 seconds. This post is a quick follow-up with a few extra details:
- For Kimi-K2 (1T params), with 256 GPUs in BF16 training and 128 GPUs in FP8 inference, weight updates take less than 1.3 seconds.
- The parameter-update pipeline has been tuned a bit further, adding two parallelizable steps: the host-to-device (H2D) memcpy and a global communication barrier.
- I captured a PyTorch Profiler trace to get a visual breakdown of the update pipeline and see exactly where the time goes (see the sketch after this list).
- Added a few figures for easier intuition.
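To make the profiler bullet concrete, here is a minimal sketch of how such a trace can be captured with `torch.profiler`, with a host-to-device copy issued on a side CUDA stream so it can overlap other work. The `update_step` function, buffer sizes, and stream layout are illustrative stand-ins, not the actual update pipeline from the post.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-ins for the real pipeline (names and sizes are hypothetical).
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()                        # side stream for the H2D memcpy
host_buf = torch.randn(1024, 1024, pin_memory=True)      # pinned host staging buffer
dev_buf = torch.empty_like(host_buf, device=device)

def update_step():
    # Issue the host-to-device copy on the side stream so it can overlap
    # whatever runs on the default stream (broadcast, dequantization, ...).
    with torch.cuda.stream(copy_stream):
        dev_buf.copy_(host_buf, non_blocking=True)
    # ... other update work would run here on the default stream ...
    torch.cuda.current_stream().wait_stream(copy_stream)  # re-synchronize

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    update_step()
    torch.cuda.synchronize()

# Export a Chrome trace for a visual timeline (open in Perfetto or chrome://tracing),
# plus a quick table of where the time goes.
prof.export_chrome_trace("update_pipeline_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```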
---
Journey to 2-second Inter-node RL Weight Transfer
I just spent the past two weeks getting cross-machine parameter updates for Qwen3-235B (BF16 training, FP8 inference) to run in just 2 seconds (128 GPUs for training, 32 GPUs for inference). Instead of writing a “here’s the solution” kind of post, I want to share my exploration process and thoughts along the way. I’ll post a shorter, polished version on the company blog in a few days.
---
Harnessing 3200 Gbps Network: A Journey with RDMA, EFA, and libfabric
Earlier this year, I had the good fortune of joining Perplexity AI, where I finally got to use servers with the most powerful configuration available: AWS p5 instances, each equipped with 8 NVIDIA H100 GPUs interconnected via NVSwitch. What excited me even more was the ultra-high-speed 3200 Gbps network between servers. I thought it would be incredibly cool to write a program that could utilize the full 3200 Gbps of bandwidth!
Recently, I spent a week exploring this, developed a small proof-of-concept program, and managed to utilize 97% of the bandwidth. I found the exploration process quite interesting, and given how few articles and tutorials about RDMA, EFA, libfabric, and high-performance networking exist online, I decided to share what I learned during that week, both as a record for myself and as a beginner's tutorial.
Those familiar with MLSys might ask: Couldn’t this be done with just one line of PyTorch or NCCL code? Indeed, NCCL is very mature in terms of collective communication and is a cornerstone for large language model training and inference. However, I think collective communication has some limitations in other scenarios:
- Collective communication requires establishing a global communication domain (an MPI-style world). If you want to dynamically add, remove, or replace nodes in the cluster, you first have to stop the entire cluster (see the sketch after this list).
- Collective communication uses a synchronous communication model. Whether implemented in blocking or non-blocking mode, it poses a significant mental burden for me. I’m more comfortable with an asynchronous communication model like gRPC.
- Most importantly, isn't it fun to reinvent the wheel yourself?
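To make the first point concrete, here is a minimal `torch.distributed` sketch: the process group is created with a fixed world size, so changing cluster membership means tearing it down and re-initializing every rank. The environment variables follow the usual torchrun conventions; the addresses and values are placeholders.

```python
import os
import torch
import torch.distributed as dist

# A collective communicator is created with the full world known up front;
# any rank you might want to add later has to be part of this initialization.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")   # placeholder rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

# Collectives run against that fixed group; adding, removing, or replacing a node
# means destroying the group and re-initializing all participants.
t = torch.ones(1, device=f"cuda:{int(os.environ['LOCAL_RANK'])}")
dist.all_reduce(t)

dist.destroy_process_group()
```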
Since my experimental environment is an AWS p5 cluster, some technical details mentioned in this article might only apply to AWS p5 clusters. However, I hope this article can still provide valuable references for other high-performance networking environments.
Because there’s quite a bit of content, I’ve split it into several articles, which you’re welcome to read:
- Harnessing 3200Gbps Network (0): Introduction
- Harnessing 3200Gbps Network (1): RDMA and EFA
- Harnessing 3200Gbps Network (2): High-Performance Network System Design Philosophy
- Harnessing 3200Gbps Network (3): libfabric
- Harnessing 3200Gbps Network (4): Unidirectional SEND and RECV
- Harnessing 3200Gbps Network (5): Bidirectional SEND and RECV
- Harnessing 3200Gbps Network (6): GPUDirect RDMA WRITE
- Harnessing 3200Gbps Network (7): Queuing and Benchmark [97.433 Gbps (97.4%)]
- Harnessing 3200Gbps Network (8): Bus Topology
- Harnessing 3200Gbps Network (9): Using 32 Network Cards [287.089 Gbps (9.0%)]
- Harnessing 3200Gbps Network (10): Pre-benchmark Warmup [293.461 Gbps (9.2%)]
- Harnessing 3200Gbps Network (11): Multi-threading [355.301 Gbps (11.1%)]
- Harnessing 3200Gbps Network (12): CPU Core Pinning [1237.738 Gbps (38.7%)]
- Harnessing 3200Gbps Network (13): State Sharding [1522.567 Gbps (47.6%)]
- Harnessing 3200Gbps Network (14): Batch Posting [2589.488 Gbps (80.9%)]
- Harnessing 3200Gbps Network (15): Lazy Posting [3108.283 Gbps (97.1%)]
---
Potentials of Multitenancy Fine-Tuned LLM Serving
As open-source pre-trained Large Language Models (LLMs) become more powerful and permissive, more and more users are incorporating LLMs into their projects. An essential adaptation step is the integration of domain-specific documents into the pre-trained model, known as fine-tuning.
Often, the additional knowledge from domain-specific documents is minuscule compared to what the pre-trained model already knows. In such scenarios, the Low-Rank Adaptation (LoRA) technique proves valuable.
With LoRA, a fine-tuned model adds fewer than 0.1% of parameters to the pre-trained model. In concrete terms, this means a LoRA fine-tuned model increases storage by only 10~200 MB, depending on the configuration. From a computational standpoint, given the marginal increase in parameters compared to the pre-trained model, the additional computational load is relatively small.
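As a back-of-the-envelope check of those numbers, here is a small calculation under one assumed configuration (a 13B-class backbone with rank-8 adapters on the query and value projections); the exact percentage and size depend entirely on the model and the LoRA settings you pick.

```python
# Illustrative LoRA size estimate; the configuration below is an assumption, not from the post.
hidden = 5120          # hidden size of an assumed 13B-class backbone
layers = 40            # number of transformer layers
rank = 8               # LoRA rank
adapted_per_layer = 2  # e.g. only the query and value projections

# Each adapted (hidden x hidden) matrix gains A (rank x hidden) and B (hidden x rank),
# i.e. 2 * hidden * rank extra parameters.
lora_params = layers * adapted_per_layer * 2 * hidden * rank
backbone_params = 13e9

print(f"LoRA params: {lora_params / 1e6:.1f} M ({lora_params / backbone_params:.3%} of backbone)")
print(f"Storage at FP16: {lora_params * 2 / 1e6:.0f} MB")
# -> roughly 6.6 M extra parameters, ~0.05% of the backbone, ~13 MB at FP16
```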
Considering the minimal storage addition and computational overhead, I believe there’s potential in developing a multitenancy fine-tuned LLM serving service. This service could host thousands of LoRA models, all sharing the same backbone LLM. With batching, each user request would invoke a distinct fine-tuned model, thereby amortizing storage and computational costs across various models.
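The core mechanism can be sketched in a few lines of PyTorch: the backbone weight is shared across the whole batch, while each request in the batch carries its own low-rank (A, B) adapter pair. This is a toy, single-linear-layer illustration of the idea; the `batched_lora_linear` helper and all shapes are hypothetical, not any particular serving framework's API.

```python
import torch

def batched_lora_linear(x, W, A, B, scaling=1.0):
    """Shared backbone linear plus a per-request LoRA delta.

    x: (batch, d_in)      one token per request, for simplicity
    W: (d_out, d_in)      backbone weight, shared by every request
    A: (batch, r, d_in)   per-request LoRA down-projections
    B: (batch, d_out, r)  per-request LoRA up-projections
    """
    base = x @ W.t()                          # shared backbone path, one matmul for the batch
    down = torch.bmm(A, x.unsqueeze(-1))      # (batch, r, 1): per-request low-rank path
    delta = torch.bmm(B, down).squeeze(-1)    # (batch, d_out)
    return base + scaling * delta

# Illustrative sizes (not from the post): four requests, each with its own adapter.
batch, d_in, d_out, r = 4, 4096, 4096, 8
x = torch.randn(batch, d_in)
W = torch.randn(d_out, d_in)                  # one shared backbone weight
A = torch.randn(batch, r, d_in) * 0.01        # a different adapter per request
B = torch.randn(batch, d_out, r) * 0.01
print(batched_lora_linear(x, W, A, B).shape)  # torch.Size([4, 4096])
```

The backbone matmul is amortized over every request in the batch, and only the small per-request `A`/`B` tensors differ, which is exactly where the storage and compute savings come from.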
In my previous blog post, I delved into the batching effects in LLM serving. In this post, I’ll detail why multitenancy LoRA serving has immense potential.
---
Dissecting Batching Effects in GPT Inference
Machine learning models rely on batching to improve inference throughput, especially smaller computer vision models such as ResNet and DenseNet. GPT and other large language models (LLMs) are the hottest models around these days. Does batching still apply to them? Let's find out.
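As a teaser for the kind of measurement the post goes on to do, here is a toy sweep of forward-pass throughput versus batch size on a single transformer layer; the layer is a stand-in for a real GPT block, and the model size, sequence length, and iteration counts are arbitrary.

```python
import time
import torch
import torch.nn as nn

# A single transformer layer as a stand-in for a real GPT block (sizes are illustrative).
d_model, n_heads, seq_len = 1024, 16, 128
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.TransformerEncoderLayer(
    d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
).to(device).eval()

@torch.no_grad()
def tokens_per_second(batch_size, iters=20):
    x = torch.randn(batch_size, seq_len, d_model, device=device)
    for _ in range(3):                 # warmup
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * iters / elapsed

for bs in (1, 2, 4, 8, 16, 32):
    print(f"batch={bs:3d}  {tokens_per_second(bs):12.0f} tokens/s")
```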