Harnessing 3200Gbps Network (10): Pre-benchmark Warmup
In the previous chapter, we did manage to use all 8 GPUs and 32 network cards. However, the transmission speed was only 287.089 Gbps, just 9.0% of the total 3200 Gbps bandwidth.
If we watch the previous chapter’s program as it runs, we’ll see that the transmission speed starts at around 100 Gbps, quickly rises to around 260 Gbps, and then slowly climbs to 287 Gbps. This pattern is suspicious: it suggests that the program suffers a large one-time latency at the start, after which it sustains a steady transmission speed. In this chapter, we’ll solve this slow-start problem. We’ll name this chapter’s program 10_warmup.cpp.
It’s easy to guess what’s going on: the first WRITE operation takes exceptionally long, and it does so because, apart from the first network card, the other network cards have not yet established connections with the client.
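If we want to confirm this, we can time how long the first WRITE on a network card takes to complete. Below is a minimal sketch that assumes the Network::PostWrite / PollCompletion interface from the earlier chapters; MeasureFirstWrite is a hypothetical helper, not part of the demo code.
#include <chrono>
#include <cstdio>

// Hypothetical helper: post one WRITE and report how long it takes to
// complete. The first WRITE on a network card pays the connection-setup
// cost, so it should take far longer than later WRITEs on the same card.
void MeasureFirstWrite(Network &net, RdmaWriteOp write) {
  auto t0 = std::chrono::high_resolution_clock::now();
  net.PostWrite(std::move(write), [t0](Network &, RdmaOp &) {
    auto ms = std::chrono::duration<double, std::milli>(
                  std::chrono::high_resolution_clock::now() - t0)
                  .count();
    printf("WRITE completed in %.3f ms\n", ms);
  });
  // The caller must keep calling net.PollCompletion() until the callback fires.
}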
Warmup
The fix is also simple: before the formal speed test, we introduce a warmup phase in which we let each network card send a WRITE operation, so that connections are established on both ends. Once the warmup phase finishes, we begin the formal speed test.
We add some warmup-related states to the server-side state machine:
struct RandomFillRequestState {
  enum class State {
    kWaitRequest,
    kPostWarmup,  // Added
    kWaitWarmup,  // Added
    kWrite,
    kDone,
  };

  struct WriteState {
    bool warmup_posted = false;  // Added
    size_t i_repeat = 0;
    size_t i_buf = 0;
    size_t i_page = 0;
  };

  // ...
  size_t posted_warmups = 0;
  size_t cnt_warmups = 0;
  // ...
};
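To see how the new states fit together, here is a summary of the server-side state transitions with the warmup phase added (a summary of this chapter’s flow, not code from the repository):
// Server-side state transitions with the warmup phase:
//
//   kWaitRequest --(RANDOM_FILL request received)-->    kPostWarmup
//   kPostWarmup  --(every GPU posted warmup WRITEs)-->  kWaitWarmup
//   kWaitWarmup  --(all warmup WRITEs completed)-->     kWrite
//   kWrite       --(all benchmark WRITEs completed)-->  kDone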
When the server receives a RANDOM_FILL request, we enter the kPostWarmup state:
struct RandomFillRequestState {
  // ...
  void HandleRequest(Network &net, RdmaOp &op) {
    // ...
    // Generate random data and copy to local GPU memory
    // ...

    // Prepare for warmup
    write_states.resize(connect_msg->num_gpus);
    state = State::kPostWarmup;
  }
};
Next, we add a PostWarmup(gpu_idx) function that submits a WRITE operation on every network card associated with the given GPU. Once all GPUs have posted their warmup operations, the state machine enters the kWaitWarmup state:
struct RandomFillRequestState {
  // ...
  void PostWarmup(size_t gpu_idx) {
    // Warmup the connection.
    // Write 1 page via each network.
    auto &s = write_states[gpu_idx];
    if (s.warmup_posted) {
      return;
    }

    auto page_size = request_msg->page_size;
    auto &group = (*net_groups)[gpu_idx];
    for (size_t k = 0; k < group.nets.size(); ++k) {
      auto net_idx = group.GetNext();
      const auto &mr =
          connect_msg->mr((gpu_idx * nets_per_gpu + net_idx) * buf_per_gpu);
      auto write = RdmaWriteOp{ ... };
      group.nets[net_idx]->PostWrite(std::move(write),
                                     [this](Network &net, RdmaOp &op) {
                                       HandleWarmupCompletion(net, op);
                                     });
    }

    s.warmup_posted = true;
    if (++posted_warmups == connect_msg->num_gpus) {
      state = State::kWaitWarmup;
    }
  }
};
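For context, GetNext() is assumed to hand out the group’s network card indices in round-robin order, as in the earlier chapters. A hypothetical sketch of such a group (the field names are illustrative, not the repository’s) could look like this:
#include <cstddef>
#include <vector>

struct Network;  // defined in the earlier chapters

// Hypothetical sketch of a per-GPU network card group with round-robin
// selection, illustrating what GetNext() is assumed to do in PostWarmup().
struct NetworkGroupSketch {
  std::vector<Network *> nets;  // network cards serving this GPU
  size_t next = 0;              // round-robin cursor

  size_t GetNext() { return next++ % nets.size(); }
};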
In the warmup operation’s completion callback, we check whether all warmup operations have finished. Since each GPU posts one warmup WRITE per network card, we expect connect_msg->num_nets completions in total; once that count is reached, we enter the kWrite state:
struct RandomFillRequestState {
  // ...
  void HandleWarmupCompletion(Network &net, RdmaOp &op) {
    if (++cnt_warmups < connect_msg->num_nets) {
      return;
    }
    printf("Warmup completed.\n");

    // Prepare to RDMA WRITE the data to remote GPU memory.
    printf("Started RDMA WRITE to the remote GPU memory.\n");
    total_write_ops = connect_msg->num_gpus * buf_per_gpu *
                      request_msg->num_pages * total_repeat;
    write_op_size = request_msg->page_size;
    write_states.resize(connect_msg->num_gpus);
    write_start_at = std::chrono::high_resolution_clock::now();
    state = State::kWrite;
  }
};
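Note that write_start_at is set only after the warmup completes, so the warmup traffic is excluded from the measured bandwidth. As a rough sketch of that calculation (the exact reporting code may differ; finished_write_ops is a hypothetical counter of completed benchmark WRITEs):
// Sketch: bandwidth is measured only over the post-warmup WRITEs,
// because write_start_at is set in HandleWarmupCompletion().
auto elapsed_s = std::chrono::duration<double>(
                     std::chrono::high_resolution_clock::now() - write_start_at)
                     .count();
double gbps = finished_write_ops * write_op_size * 8 / elapsed_s / 1e9;
printf("Transfer speed: %.3f Gbps\n", gbps);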
Finally, in the server’s main loop, we call PostWarmup(gpu_idx) whenever the state machine is in the kPostWarmup state:
int ServerMain(int argc, char **argv) {
  // ...
  while (s.state != RandomFillRequestState::State::kDone) {
    for (size_t gpu_idx = 0; gpu_idx < net_groups.size(); ++gpu_idx) {
      for (auto *net : net_groups[gpu_idx].nets) {
        net->PollCompletion();
      }
      switch (s.state) {
      case RandomFillRequestState::State::kWaitRequest:
        break;
      case RandomFillRequestState::State::kPostWarmup:  // Added
        s.PostWarmup(gpu_idx);
        break;
      case RandomFillRequestState::State::kWaitWarmup:  // Added
        break;
      case RandomFillRequestState::State::kWrite:
        s.ContinuePostWrite(gpu_idx);
        break;
      case RandomFillRequestState::State::kDone:
        break;
      }
    }
  }
  // ...
}
Results
In the video above, we can see that with the warmup phase added, the transmission speed reaches around 290 Gbps right from the start and then stays stable around that level. The final transmission speed is 293.461 Gbps, or 9.2% of the total 3200 Gbps bandwidth. We will continue to optimize this program in the next chapter.
This chapter’s code: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/10_warmup.cpp