In the previous chapter, we managed to use all 8 GPUs and 32 network cards. However, the transmission speed was only 287.089 Gbps, just 9.0% of the total bandwidth of 3200 Gbps.

If we watch the previous chapter’s program as it runs, we’ll see that the transmission speed starts at around 100 Gbps, quickly rises to around 260 Gbps, and then slowly climbs to 287 Gbps. This is suspicious: it suggests the program suffers a significant latency at the start, after which it sustains a steady transmission speed. In this chapter, we’ll fix this slow startup. We’ll name this chapter’s program 10_warmup.cpp.

It’s easy to guess the cause: the first WRITE operation takes exceptionally long. And the reason it takes so long is that, apart from the first network card, the other network cards haven’t yet established connections with the client.

Warmup

The fix is also simple. Before starting the formal speed test, we introduce a warmup phase in which we have each network card send one WRITE operation, ensuring that connections are established on both ends. After the warmup phase ends, we begin the formal speed test.

We add some warmup-related states to the server-side state machine:

struct RandomFillRequestState {
  enum class State {
    kWaitRequest,
    kPostWarmup,  // Added
    kWaitWarmup,  // Added
    kWrite,
    kDone,
  };

  struct WriteState {
    bool warmup_posted = false;  // Added
    size_t i_repeat = 0;
    size_t i_buf = 0;
    size_t i_page = 0;
  };

  // ...
  size_t posted_warmups = 0;
  size_t cnt_warmups = 0;
  // ...
};

When the server receives a RANDOM_FILL request, we enter the kPostWarmup state:

struct RandomFillRequestState {
  // ...

  void HandleRequest(Network &net, RdmaOp &op) {
    // ...
    // Generate random data and copy to local GPU memory
    // ...

    // Prepare for warmup
    write_states.resize(connect_msg->num_gpus);
    state = State::kPostWarmup;
  }
};

Next, we add a PostWarmup(gpu_idx) function that posts one WRITE operation on each network card belonging to the given GPU. Once every GPU has posted its warmup operations, the state machine enters the kWaitWarmup state:

struct RandomFillRequestState {
  // ...

  void PostWarmup(size_t gpu_idx) {
    // Warmup the connection.
    // Write 1 page via each network
    auto &s = write_states[gpu_idx];
    if (s.warmup_posted) {
      return;
    }

    auto page_size = request_msg->page_size;
    auto &group = (*net_groups)[gpu_idx];
    for (size_t k = 0; k < group.nets.size(); ++k) {
      auto net_idx = group.GetNext();
      const auto &mr =
          connect_msg->mr((gpu_idx * nets_per_gpu + net_idx) * buf_per_gpu);
      auto write = RdmaWriteOp{ ... };
      group.nets[net_idx]->PostWrite(std::move(write),
                                     [this](Network &net, RdmaOp &op) {
                                       HandleWarmupCompletion(net, op);
                                     });
    }
    s.warmup_posted = true;
    if (++posted_warmups == connect_msg->num_gpus) {
      state = State::kWaitWarmup;
    }
  }
};

Then, in the warmup operation’s completion callback, we check whether all warmup operations have completed. If so, we enter the kWrite state:

struct RandomFillRequestState {
  // ...

  void HandleWarmupCompletion(Network &net, RdmaOp &op) {
    if (++cnt_warmups < connect_msg->num_nets) {
      return;
    }
    printf("Warmup completed.\n");

    // Prepare RDMA WRITE the data to remote GPU memory.
    printf("Started RDMA WRITE to the remote GPU memory.\n");
    total_write_ops = connect_msg->num_gpus * buf_per_gpu *
                      request_msg->num_pages * total_repeat;
    write_op_size = request_msg->page_size;
    write_states.resize(connect_msg->num_gpus);
    write_start_at = std::chrono::high_resolution_clock::now();
    state = State::kWrite;
  }
};

Finally, in the server’s main loop, if the state machine is in the kPostWarmup state, we call PostWarmup(gpu_idx):

int ServerMain(int argc, char **argv) {
    // ...
    while (s.state != RandomFillRequestState::State::kDone) {
      for (size_t gpu_idx = 0; gpu_idx < net_groups.size(); ++gpu_idx) {
        for (auto *net : net_groups[gpu_idx].nets) {
          net->PollCompletion();
        }
        switch (s.state) {
        case RandomFillRequestState::State::kWaitRequest:
          break;
        case RandomFillRequestState::State::kPostWarmup:  // Added
          s.PostWarmup(gpu_idx);
          break;
        case RandomFillRequestState::State::kWaitWarmup:  // Added
          break;
        case RandomFillRequestState::State::kWrite:
          s.ContinuePostWrite(gpu_idx);
          break;
        case RandomFillRequestState::State::kDone:
          break;
        }
      }
    }
    // ...
}

Results

As the video above shows, after adding the warmup phase, the transmission speed immediately reaches around 290 Gbps and then stays stable at that level. The final transmission speed is 293.461 Gbps, 9.2% of the total bandwidth of 3200 Gbps. We will continue to optimize this program in the next chapter.

This chapter’s code: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/10_warmup.cpp