In the previous chapter, we sharded the states across different threads to avoid synchronization between threads, which increased the transmission speed to 1522.567 Gbps. Unfortunately, this is still only 47.6% of the total 3200 Gbps bandwidth. Clearly, the CPU is still the bottleneck and we need to optimize further.

In the ContinuePostWrite() function, we submit only one new WRITE operation per call. After that, the CPU control flow leaves this function and enters other functions. This frequent switching between functions increases the CPU cache miss rate. We can try submitting multiple operations together to reduce the switching. Let’s name the program in this chapter 14_batch.cpp.

Batch Posting Operations

The modification in this chapter is very simple. We just add a loop to the ContinuePostWrite() function so that each call submits up to 16 operations; the loop exits early once the final repeat has been submitted.

struct RandomFillRequestState {
  // ...

  void ContinuePostWrite(size_t gpu_idx) {
    // Maximum number of WRITE operations to submit per call.
    constexpr int kBatchSize = 16;
    auto &s = write_states[gpu_idx];
    // Nothing left to submit for this GPU.
    if (s.i_repeat == total_repeat)
      return;
    auto page_size = request_msg->page_size;
    auto num_pages = request_msg->num_pages;
    auto &group = (*net_groups)[gpu_idx];
    for (int i = 0; i < kBatchSize; ++i) {
      auto net_idx = group.GetNext();
      group.nets[net_idx]->PostWrite(...);
      ++posted_write_ops[gpu_idx];

      // Advance the cursor: page first, then buffer, then repeat.
      if (++s.i_page == num_pages) {
        s.i_page = 0;
        if (++s.i_buf == buf_per_gpu) {
          s.i_buf = 0;
          // Stop mid-batch once the last repeat has been submitted.
          if (++s.i_repeat == total_repeat)
            return;
        }
      }
    }
  }
};
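
To see how the batch loop changes the number of times control enters the function, here is a minimal, network-free sketch. It is not the code from 14_batch.cpp: the WriteCursor, BatchPoster, PostStats, and RunToCompletion names are hypothetical, the PostWrite call is replaced by a counter, and the batch size is passed as a parameter so the two cases can be compared. Only the cursor-advancing logic mirrors the snippet above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-GPU write state; the field names
// follow the chapter's snippet, everything else is an assumption.
struct WriteCursor {
  std::size_t i_repeat = 0;
  std::size_t i_buf = 0;
  std::size_t i_page = 0;
};

// Result of driving the poster until all work has been submitted.
struct PostStats {
  std::size_t calls;   // how many times ContinuePostWrite() was entered
  std::size_t posted;  // how many simulated WRITE operations were submitted
};

class BatchPoster {
 public:
  BatchPoster(std::size_t total_repeat, std::size_t buf_per_gpu,
              std::size_t num_pages)
      : total_repeat_(total_repeat),
        buf_per_gpu_(buf_per_gpu),
        num_pages_(num_pages) {}

  bool Done() const { return s_.i_repeat == total_repeat_; }

  // Submits up to batch_size simulated operations, then returns.
  void ContinuePostWrite(std::size_t batch_size) {
    ++calls_;
    if (Done()) return;
    for (std::size_t i = 0; i < batch_size; ++i) {
      ++posted_;  // stands in for the real PostWrite call
      // Advance page, then buffer, then repeat, as in the chapter's loop.
      if (++s_.i_page == num_pages_) {
        s_.i_page = 0;
        if (++s_.i_buf == buf_per_gpu_) {
          s_.i_buf = 0;
          if (++s_.i_repeat == total_repeat_) return;  // early exit
        }
      }
    }
  }

  PostStats Stats() const { return PostStats{calls_, posted_}; }

 private:
  std::size_t total_repeat_, buf_per_gpu_, num_pages_;
  WriteCursor s_;
  std::size_t calls_ = 0;
  std::size_t posted_ = 0;
};

// Calls ContinuePostWrite() repeatedly until every operation is submitted.
PostStats RunToCompletion(std::size_t batch_size, std::size_t total_repeat,
                          std::size_t buf_per_gpu, std::size_t num_pages) {
  assert(batch_size > 0);  // a zero-sized batch would never make progress
  BatchPoster poster(total_repeat, buf_per_gpu, num_pages);
  while (!poster.Done()) poster.ContinuePostWrite(batch_size);
  return poster.Stats();
}
```

With 2 repeats of 3 buffers of 10 pages (60 operations in total), RunToCompletion(1, 2, 3, 10) enters the function 60 times, while RunToCompletion(16, 2, 3, 10) enters it only 4 times, submitting 16, 16, 16, and then 12 operations. The last call stops early because the final repeat completes mid-batch.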

Results

As you can see from the video above, after we submit up to 16 operations at once in ContinuePostWrite(), the transmission speed reaches 2589.488 Gbps, achieving 80.9% of the total 3200 Gbps bandwidth. That is about 1.7 times the previous 1522.567 Gbps. Now we are not far from the full speed of 3200 Gbps. In the next chapter, we will squeeze out more performance.
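
For reference, the utilization and speedup follow directly from the measured figures quoted in this chapter (no new measurements are introduced here):

```latex
\[
\frac{2589.488}{3200} \approx 0.809,
\qquad
\frac{1522.567}{3200} \approx 0.476,
\qquad
\frac{2589.488}{1522.567} \approx 1.70
\]
```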

Code for this chapter: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/14_batch.cpp