In the previous chapter, we solved the pre-test warmup issue, yet the transmission speed was still only 293.461 Gbps, just 9.2% of the total 3200 Gbps bandwidth. One plausible explanation is that a single CPU thread cannot keep up with polling the completion queues of all 32 network cards and submitting new operations, leaving the network cards idle. In this chapter, we’ll attempt to solve this problem using multi-threading. We’ll name our program 11_multithread.cpp.

Multi-threading

We plan to use 8 threads, with each thread responsible for one GPU and its corresponding 4 network cards.

To avoid data races when multiple threads read and write shared state, we need to change some fields of the server-side state machine to atomic types.

struct RandomFillRequestState {
  // ...
  std::atomic<State> state = State::kWaitRequest;
  std::atomic<size_t> posted_warmups = 0;
  std::atomic<size_t> cnt_warmups = 0;
  std::atomic<size_t> posted_write_ops = 0;
  std::atomic<size_t> finished_write_ops = 0;
  // ...
};

In the server-side main loop, we’ll create a thread for each GPU.

int ServerMain(int argc, char **argv) {
  // ...

  // Loop forever. Accept one client at a time.
  for (;;) {
    printf("------\n");
    // State machine
    RandomFillRequestState s(&nets, &net_groups, &cuda_bufs);
    // RECV for CONNECT
    nets[0].PostRecv(buf1, [&s](Network &net, RdmaOp &op) { s.OnRecv(net, op); });
    // RECV for RandomFillRequest
    nets[0].PostRecv(buf2, [&s](Network &net, RdmaOp &op) { s.OnRecv(net, op); });
    // Poll completions from multiple threads
    std::vector<std::thread> threads;
    threads.reserve(net_groups.size());
    for (size_t gpu_idx = 0; gpu_idx < net_groups.size(); ++gpu_idx) {
      // Start a thread for each GPU
      threads.emplace_back([&s, gpu_idx] {
        auto nets = (*s.net_groups)[gpu_idx].nets;
        while (s.state != RandomFillRequestState::State::kDone) {
          for (auto *net : nets) {
            net->PollCompletion();
          }
          switch (s.state) {
          case RandomFillRequestState::State::kWaitRequest:
            break;
          case RandomFillRequestState::State::kPostWarmup:
            s.PostWarmup(gpu_idx);
            break;
          case RandomFillRequestState::State::kWaitWarmup:
            break;
          case RandomFillRequestState::State::kWrite:
            s.ContinuePostWrite(gpu_idx);
            break;
          case RandomFillRequestState::State::kDone:
            break;
          }
        }
      });
    }
    for (auto &t : threads) {
      t.join();
    }
  }

  return 0;
}

That is the entire modification.

Results

From the video above, we can see that with multi-threading, our program achieved a transmission speed of 355.301 Gbps, utilizing 11.1% of the total bandwidth. Compared to the single-threaded 293.461 Gbps, this is a 21% improvement. However, this speed is still far below our target, and we must continue our efforts.

Chapter code: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/11_multithread.cpp