In the previous chapter, by submitting new WRITE operations in batches, we achieved a transmission speed of 2589.488 Gbps, or 80.9% of the total 3200 Gbps bandwidth. The full 3200 Gbps is now within reach.

Notice that every time we submit a WRITE operation, PostWrite() calls ProgressPendingOps(). If the first operation in a batch cannot make progress, the subsequent operations in the same batch are unlikely to progress either. Yet each call to ProgressPendingOps() still pops the operation from the head of the queue, calls fi_writemsg(), and, on failure, pushes it back to the head. This round trip is pure wasted work. We can make PostWrite() lazier by not calling ProgressPendingOps() on every submission. Let’s name the program in this chapter 15_lazy.cpp.

Lazy Operation Submission

To reflect the laziness of operations, let’s change the function name from PostWrite() to LazyPostWrite() and remove the call to ProgressPendingOps().

void Network::LazyPostWrite(
    RdmaWriteOp &&write, std::function<void(Network &, RdmaOp &)> &&callback) {
  auto *op = new RdmaOp{
      .type = RdmaOpType::kWrite,
      .write = std::move(write),
      .callback = std::move(callback),
  };
  pending_ops.push_back(op);
  // Caller needs to poll completion to progress pending ops
}

In the main loop that follows, ProgressPendingOps() is invoked through PollCompletion(), so no further modifications are needed.

Results

As seen in the video above, we achieved 3108.283 Gbps, which is 97.1% of the total 3200 Gbps bandwidth. Compared to the 97.433 Gbps (97.4%) of a single NIC in Chapter 7 and the 94.751 Gbps (94.8%) achieved by libfabric’s own fi_rma_bw testing tool in Chapter 3, our program is now very close to full speed.

At this point, we have squeezed nearly all of the 3200 Gbps bandwidth out of the AWS p5 cluster. In this series of articles, we started from scratch: we learned about RDMA, EFA, and libfabric; gained insight into the design philosophy of high-performance network systems; wrote code to implement RECV, SEND, and GPUDirect RDMA WRITE step by step; built operation queues; scaled out to multiple NICs; applied a series of optimizations; and finally reached a transmission speed of 3108.283 Gbps (97.1%). I hope this series has been helpful to everyone.

Code for this chapter: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/15_lazy.cpp