Harnessing 3200Gbps Network (12): CPU Core Pinning
In the previous chapter, we enabled multi-threading and created a thread for each GPU. However, the transmission speed was still only 355.301 Gbps, reaching just 11.1% of the total 3200 Gbps bandwidth.
If we carefully compare the execution of the multi-threaded version from the previous chapter with the single-threaded version from two chapters ago, we notice a problem: random number generation has become slower. Although the time spent generating random numbers is not counted in the transmission speed test, the slowdown itself is strange. Even in the multi-threaded version, random numbers are still generated by a single thread, so why would it get slower?
One possible reason is that when the operating system schedules multiple threads, it may migrate a thread to a different CPU core. As a result, the thread generating random numbers can bounce between cores, invalidating its CPU caches. It may even be migrated across NUMA nodes, adding latency to both host memory and GPU memory accesses.
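To see this effect in isolation, here is a minimal standalone sketch, not part of this series' code, that samples sched_getcpu() inside a CPU-bound loop and counts how often the kernel migrates the thread to a different core (the busy-work loop and the sampling interval are arbitrary choices for the demo). Running it pinned with taskset -c <core> should drive the count to zero, while running it unpinned on a busy machine may show frequent migrations.
#include <sched.h>
#include <cstdint>
#include <cstdio>

int main() {
  int last_cpu = sched_getcpu();
  uint64_t migrations = 0;
  volatile uint64_t sink = 0;
  for (uint64_t i = 0; i < 400'000'000; ++i) {
    sink = sink + i * i;         // Busy work standing in for random number generation
    if ((i & 0xFFFFF) == 0) {    // Sample the current CPU once in a while
      int cpu = sched_getcpu();
      if (cpu != last_cpu) {     // The scheduler moved this thread to a different core
        ++migrations;
        last_cpu = cpu;
      }
    }
  }
  std::printf("observed %llu migrations, finished on CPU %d\n",
              static_cast<unsigned long long>(migrations), last_cpu);
  return 0;
}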
Since this issue can slow down random number generation, it is very likely to also slow down polling the network completion queues and submitting new operations. To solve this problem, we can bind each thread to a fixed CPU core. That way, a thread will not hop between CPU cores or NUMA nodes while it generates random numbers or performs network transfers. We'll name this chapter's program 12_pin.cpp.
Pinning CPU Cores
During the bus topology detection in Chapter 8, we already distributed all physical CPU cores evenly across the 8 GPUs according to NUMA locality. Here, we only need to pick one CPU core for each GPU and bind that GPU's corresponding thread to it. Binding a thread to a CPU core can be done with the pthread_setaffinity_np() function.
int ServerMain(int argc, char **argv) {
  // ...
  // Multi-thread Poll completions
  std::vector<std::thread> threads;
  threads.reserve(num_gpus);
  for (size_t gpu_idx = 0; gpu_idx < net_groups.size(); ++gpu_idx) {
    const auto &cpus = topo_groups[gpu_idx].cpus;
    int preferred_cpu = cpus[cpus.size() / 2];
    threads.emplace_back([&s, preferred_cpu, gpu_idx] {
      // Pin CPU
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(preferred_cpu, &cpuset);
      CHECK(pthread_setaffinity_np(
                pthread_self(), sizeof(cpu_set_t), &cpuset) == 0);
      // Poll completions
      // ...
    });
  }
  // ...
}
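Because the CPU group for each GPU comes from the NUMA-aware topology detection in Chapter 8, any core we pick from cpus is already local to that GPU's NUMA node; taking the middle element is simply one deterministic choice. For reference, here is a hypothetical, self-contained variant of the same pattern (the core number 2 and the read-back check are illustrative and not part of 12_pin.cpp): it pins a std::thread with pthread_setaffinity_np(), then verifies the pin by reading the mask back with pthread_getaffinity_np() and reporting sched_getcpu().
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

int main() {
  const int preferred_cpu = 2;  // Arbitrary core chosen for this demo
  std::thread worker([preferred_cpu] {
    // Pin the current thread to a single CPU core
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(preferred_cpu, &cpuset);
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0) {
      std::fprintf(stderr, "pthread_setaffinity_np failed\n");
      return;
    }
    // Read the affinity mask back and report where the thread is actually running
    cpu_set_t readback;
    CPU_ZERO(&readback);
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &readback) == 0 &&
        CPU_ISSET(preferred_cpu, &readback)) {
      std::printf("pinned to CPU %d, currently on CPU %d\n",
                  preferred_cpu, sched_getcpu());
    }
  });
  worker.join();
  return 0;
}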
Results
After binding CPU cores, our program's transmission speed reached 1237.738 Gbps, 38.7% of the total 3200 Gbps bandwidth. This is a decent improvement, but we are still some distance from our goal, so further optimization is needed.
Code for this chapter: https://github.com/abcdabcd987/libfabric-efa-demo/blob/master/src/12_pin.cpp