Take a close look at the following code:

#include <mutex>
#include <string>
#include <unordered_map>

class Foo {
    std::mutex m_mutex;
    std::unordered_map<std::string, int> m_map;

public:
    // Takes the key by const reference.
    void pass_by_reference(const std::string &key, int value) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(key, value);
    }

    // Takes the key by value, so a copy is made at the call boundary.
    void pass_by_value(const std::string key, int value) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(key, value);
    }

    // Takes the key by reference but makes an explicit local copy.
    void useless_copy(const std::string &key, int value) {
        std::string copy(key);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(copy, value);
    }

    // Takes the key by reference, computes its hash, and discards the result.
    void useless_hash(const std::string &key, int value) {
        std::hash<std::string> hash_fn;
        (void) hash_fn(key);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(key, value);
    }
};

The only difference between the first two is pass-by-reference versus pass-by-value. As we all know, pass-by-reference should be more efficient than pass-by-value for large objects, since it avoids copying the data. Item 20 of Effective C++ says: prefer pass-by-reference-to-const to pass-by-value. It’s such a common best practice that every time you write a function signature, you add the reference symbol almost by reflex.

useless_copy passes by reference but creates a local copy. useless_hash calculates the hash value of the string and never uses it. I would expect all four functions to perform almost the same, with pass_by_reference perhaps slightly better.

Benchmark

I wrote a benchmark. Eight workers concurrently update the object 1,000,000 times. There are 1,000,000 strings, each 32 characters long. You can see the benchmark program here: https://gist.github.com/abcdabcd987/f0d99b2f21567654276ca14d7d1338db. And here is the result:
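
The actual program is in the gist above; the following is only a minimal sketch of its structure, assuming the Foo class from the beginning of this article and splitting the 1,000,000 insertions evenly across the eight workers (the key generation, thread layout, and timing code here are illustrative, not copied from the gist):

#include <chrono>
#include <cstdio>
#include <random>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::size_t num_keys = 1000000;   // 1,000,000 strings
    const std::size_t key_len  = 32;        // each 32 characters long
    const std::size_t workers  = 8;         // eight concurrent workers

    // Generate random keys up front so that key creation is not timed.
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<int> dist('a', 'z');
    std::vector<std::string> keys(num_keys);
    for (auto &key : keys) {
        key.resize(key_len);
        for (auto &ch : key) ch = static_cast<char>(dist(rng));
    }

    Foo foo;
    auto start = std::chrono::steady_clock::now();

    // Each worker inserts an interleaved slice of the key set.
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            for (std::size_t i = w; i < num_keys; i += workers)
                foo.pass_by_reference(keys[i], static_cast<int>(i));
        });
    }
    for (auto &t : threads) t.join();

    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("pass by reference: %.3f seconds\n", seconds);
}

The real benchmark presumably repeats this kind of loop for each of the four member functions and over several runs to get the mean and standard deviation reported below.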

pass by reference: 7.360 ± 0.077 seconds
pass by value    : 5.601 ± 0.040 seconds
useless copy     : 5.657 ± 0.051 seconds
useless hash     : 5.704 ± 0.065 seconds

The result shows that the other three are about 25% faster than pass_by_reference, which contradicts our intuition.

Cache

Whenever something abnormal like this happens, the processor cache is a good first suspect.

A reference in C++ is essentially a pointer, so pass_by_reference actually passes a pointer to the string. The content it points to is not accessed until std::unordered_map::emplace. When it is accessed for the first time, a cache miss is virtually guaranteed because there are far too many strings to fit in the cache, so the processor has to load the string from memory. Loading data from memory takes much longer than loading it from the cache.

What’s worse, the stall happens inside the critical section. All the other threads are waiting for the lock while the thread that holds it sits idle, waiting for the data to arrive from memory. That thread is such a dog in the manger!

pass_by_value and useless_copy avoid this issue because the processor fetches the string into the cache while creating the local copy, which happens outside the critical section, so other threads are not blocked.

useless_hash avoids this issue because computing the string’s hash requires reading its contents. Thus the string is in the cache before the critical section as well.

__builtin_prefetch

My friend Zheng Luo mentioned __builtin_prefetch after I posted this article. I ran the same benchmark (the gist above has been updated):

builtin prefetch : 5.853 ± 0.081 seconds
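
The prefetching variant is presumably just one more member function of Foo along these lines (a sketch consistent with the assembly listing below, not copied verbatim from the gist): the key’s data is touched with __builtin_prefetch before the lock is acquired.

    // Sketch of the prefetching variant.
    void builtin_prefetch(const std::string &key, int value) {
        // Hint the processor to pull the key's characters into the cache
        // before we make other threads wait by taking the lock.
        __builtin_prefetch(key.data());
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(key, value);
    }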

It also works perfectly! But I had a question: how many bytes are fetched into the cache? The GCC documentation doesn’t answer that. I found that __builtin_prefetch is translated into the prefetcht0 instruction:

 42:cache-prefetching.cc ****         std::lock_guard<std::mutex> lock(m_mutex);
5900                   .loc 17 42 0
5901 0016 488B06           movq    (%rsi), %rax
 41:cache-prefetching.cc ****         __builtin_prefetch(key.data());
5902                   .loc 17 41 0
5903 0019 8955EC           movl    %edx, -20(%rbp)
 42:cache-prefetching.cc ****         std::lock_guard<std::mutex> lock(m_mutex);
5904                   .loc 17 42 0
5905 001c 0F1808           prefetcht0  (%rax)

Then I turned to the Intel® 64 and IA-32 Architectures Software Developer Manuals:

The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, however, be a minimum of 32 bytes.

Section 4.3 Instructions (M-U), PREFETCHh.

It happens that the strings in my benchmark are exactly 32 bytes long. What if the strings are longer? Here is the result:

[Figure: cache-prefetching-size — running time of each variant as the string length increases]

The result shows that, as the string length goes up, the effect of __builtin_prefetch diminishes until it is no better than doing nothing. The performance drops significantly starting at 4KB. That’s because my processor has a 32KB 8-way L1 cache for each hyper-thread, so each way holds 32/8 = 4KB. __builtin_prefetch has the virtue that it helps when the data is small and doesn’t make things worse when the data is big. useless_copy and useless_hash, on the other hand, may cause extra cache misses and thus become slower when the data is big.
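
One could, hypothetically, keep prefetching effective for longer keys by issuing one hint per cache line instead of a single hint for the whole string. This variant is not part of the benchmark above; it assumes the typical 64-byte cache line, and the hints remain purely advisory:

    // Hypothetical variant: issue one prefetch hint for every 64-byte
    // cache line of the key so that long keys are covered as well.
    void prefetch_per_line(const std::string &key, int value) {
        const char *data = key.data();
        for (std::size_t offset = 0; offset < key.size(); offset += 64)
            __builtin_prefetch(data + offset);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.emplace(key, value);
    }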

Conclusion

  • “Premature optimization is the root of all evil.” Don’t optimize the cache control of your program just because you read this article. Just trust the hardware.
  • If you really want to optimize it:
    • Make sure the data is already in the cache before entering the critical section.
    • Maybe try __builtin_prefetch.

Afterword

Maybe you realized the cause as soon as you read the example in this article, but it was not that obvious in the code from my previous article, Sharding: A General Method to Improve Lock Granularity in a Brute-force Way. When I was writing that article, I ran into an abnormality I couldn’t understand: a KVSharded with only a single shard still ran faster than KVBigLock. I asked Jason Lau for advice. We tried several different approaches, and none of them worked:

  • Use templates to replace virtual functions
  • Run on different operating systems and different machines
  • echo 3 > /proc/sys/vm/drop_caches
  • Set thread affinity
  • Run profilers
  • Build with different instruction sets

He stayed up late and finally gave this reasonable explanation. Special thanks go to Jason Lau!