Example: Concurrent Hashmap

Suppose we need a concurrent hashmap. I know there are lots of libraries, but let's assume we need to write it ourselves. For simplicity, we only support three kinds of operations: get, put and remove.

Trivial Solution

We can use std::unordered_map, but it is not thread-safe. So, the most straightforward solution would be to have a big lock to protect it. It’s simple, just like this:

#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class KVBigLock {
    std::mutex m_mutex;
    std::unordered_map<std::string, int> m_map;

public:
    void put(const std::string &key, int value) {
        std::lock_guard lock(m_mutex);
        // insert_or_assign overwrites the old value; emplace would
        // silently keep the old value if the key already exists.
        m_map.insert_or_assign(key, value);
    }

    std::optional<int> get(const std::string &key) {
        std::lock_guard lock(m_mutex);
        auto it = m_map.find(key);
        if (it != m_map.end())
            return it->second;
        return {};
    }

    bool remove(const std::string &key) {
        std::lock_guard lock(m_mutex);
        return m_map.erase(key) > 0;
    }
};
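
Here is a minimal, hypothetical driver showing how the interface is meant to be used from several threads (all versions below share the same interface):

// A minimal usage sketch (hypothetical): several threads share one map.
#include <string>
#include <thread>
#include <vector>

int main() {
    KVBigLock kv;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&kv, t] {
            auto key = "key" + std::to_string(t);
            kv.put(key, t);
            if (auto v = kv.get(key))
                (void)*v;  // use the value
            kv.remove(key);
        });
    }
    for (auto &w : workers)
        w.join();
}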

So far so good: we have a working solution. However, it's slow because every operation, read or write, has to wait for the same lock. The problem is obvious: the lock is too big. We want finer lock granularity.

“Finest” Lock Granularity

How about adding a lock for every element, i.e., mapping std::string to std::pair<std::mutex, int>? The lock granularity should be fine enough then, shouldn't it? However, this is not even a correct solution.

Consider a remove and a get on the same key at the same time. To prevent other threads from reading the value, the remove function should hold the per-element lock throughout the removal. However,

The behavior of a program is undefined if a mutex is destroyed while still owned by any threads.

http://en.cppreference.com/w/cpp/thread/mutex

Even if nothing bad happens after the locked mutex is destroyed, the thread trying to fetch the key will access a std::pair value that has already been destroyed.

What's worse, consider a remove and a put at the same time. The two threads will concurrently modify the internal data structure of std::unordered_map, which is not protected by any lock.
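
To make the problems concrete, here is a sketch of what this flawed design might look like (the class name is made up, and only remove is shown; do not use this):

// A sketch of the broken per-element-lock design, for illustration only.
class KVPerElementLock {
    std::unordered_map<std::string, std::pair<std::mutex, int>> m_map;

public:
    bool remove(const std::string &key) {
        auto it = m_map.find(key);  // (1) the map itself is unprotected
        if (it == m_map.end())
            return false;
        std::lock_guard lock(it->second.first);  // (2) lock the element...
        return m_map.erase(key) > 0;  // (3) ...then destroy the mutex while
                                      //     lock still holds it: UB
    }
};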

Reader-Writer Lock

It's clear that we need to protect the map data structure itself, not the elements the map stores. How about a reader-writer lock? Concurrent get operations can share the lock, since concurrent calls to the const member std::unordered_map::find do not race with each other.

#include <shared_mutex>

class KVSharedLock {
    std::shared_mutex m_mutex;
    std::unordered_map<std::string, int> m_map;

public:
    void put(const std::string &key, int value) {
        std::lock_guard lock(m_mutex);  // exclusive for writers
        m_map.insert_or_assign(key, value);
    }

    std::optional<int> get(const std::string &key) {
        std::shared_lock lock(m_mutex);  // shared for readers
        auto it = m_map.find(key);
        if (it != m_map.end())
            return it->second;
        return {};
    }

    bool remove(const std::string &key) {
        std::lock_guard lock(m_mutex);  // exclusive for writers
        return m_map.erase(key) > 0;
    }
};

Will it run faster? It depends on the workload. If it's read-dominant, it should be faster. Otherwise, it might even be slower, because a reader-writer lock is more complicated than a simple exclusive lock. You can think of a reader-writer lock as a simple exclusive lock plus a condition variable.
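
To build intuition about that overhead, here is a rough sketch (not production code) of a reader-writer lock made from an exclusive lock plus a condition variable; note how almost every operation may touch the condition variable:

// A rough reader-writer lock sketch: one mutex, one condition variable,
// a reader counter, and a writer flag. Writers wait until no readers remain.
#include <condition_variable>
#include <mutex>

class RWLock {
    std::mutex m_mutex;
    std::condition_variable m_cond;
    int m_readers = 0;       // number of active readers
    bool m_writing = false;  // whether a writer holds the lock

public:
    void lock_shared() {
        std::unique_lock lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_writing; });
        ++m_readers;
    }

    void unlock_shared() {
        std::unique_lock lock(m_mutex);
        if (--m_readers == 0)
            m_cond.notify_all();  // a writer may be waiting
    }

    void lock() {
        std::unique_lock lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_writing && m_readers == 0; });
        m_writing = true;
    }

    void unlock() {
        std::unique_lock lock(m_mutex);
        m_writing = false;
        m_cond.notify_all();
    }
};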

Sharding

Notice that different elements have no logical relation to each other, so we can split the map into multiple shards by key without affecting correctness. As long as keys are distributed uniformly enough across the shards and the number of shards is large enough, the probability of two threads waiting for the same lock should be small.

#include <functional>
#include <stdexcept>
#include <vector>

class KVSharded {
    const size_t m_mask;
    std::vector<KVBigLock> m_shards;

    KVBigLock& get_shard(const std::string &key) {
        std::hash<std::string> hash_fn;
        auto h = hash_fn(key);
        // Equivalent to h % num_shard, since num_shard is a power of two.
        return m_shards[h & m_mask];
    }

public:
    KVSharded(size_t num_shard): m_mask(num_shard-1), m_shards(num_shard) {
        // The mask trick requires a nonzero power of two.
        if (num_shard == 0 || (num_shard & m_mask) != 0)
            throw std::runtime_error("num_shard must be a power of two");
    }

    void put(const std::string &key, int value) {
        get_shard(key).put(key, value);
    }

    std::optional<int> get(const std::string &key) {
        return get_shard(key).get(key);
    }

    bool remove(const std::string &key) {
        return get_shard(key).remove(key);
    }
};

The implementation is simple. We first compute the hash of the key and take it modulo the number of shards. As long as the hash values are uniformly distributed, the shard indices are as well. Because num_shard is a power of two, the bitwise AND with m_mask is exactly the modulo.
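
A quick, hypothetical check of the mask/modulo equivalence:

// When num_shard is a power of two, h & (num_shard - 1) == h % num_shard.
#include <cassert>
#include <cstddef>

int main() {
    const std::size_t num_shard = 32;  // 32 = 2^5
    for (std::size_t h : {0u, 1u, 31u, 32u, 12345u})
        assert((h & (num_shard - 1)) == h % num_shard);
}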

Benchmark

I ran a benchmark. The ratio of get, put, and remove operations was about 1:1:1. There were 1,000,000 string keys and 10,000,000 operations. The number of concurrent workers was eight, which is the number of logical processors on my machine. Each test ran 16 times. For comparison, I also ran libcuckoo, a lock-free hashmap implementation, and Intel TBB. The benchmark program can be found here: https://gist.github.com/abcdabcd987/53b7aa6fdb8f7dbe46798fa6df2f5871
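
The full code is in the gist; conceptually, the driver looks something like this simplified sketch (key generation and timing details differ in the real benchmark):

// A simplified, hypothetical sketch of the benchmark loop.
#include <random>
#include <string>
#include <thread>
#include <vector>

template <typename Map>
void bench(Map &map, int num_workers, long ops_per_worker) {
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&map, w, ops_per_worker] {
            std::mt19937_64 rng(w);
            std::uniform_int_distribution<int> key_dist(0, 999'999);
            std::uniform_int_distribution<int> op_dist(0, 2);  // 1:1:1 mix
            for (long i = 0; i < ops_per_worker; ++i) {
                auto key = std::to_string(key_dist(rng));
                switch (op_dist(rng)) {
                    case 0: map.put(key, 42); break;
                    case 1: map.get(key); break;
                    case 2: map.remove(key); break;
                }
            }
        });
    }
    for (auto &w : workers)
        w.join();
}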

[Figure: benchmark results]

The figure shows that the 32-shard version is about 80% faster than the big-lock version. The reader-writer lock version is even slower than the big lock, because get operations make up only one-third of the workload and the overhead of the reader-writer lock is too large. Intel TBB runs surprisingly slowly, which makes me wonder whether I used it incorrectly. Also, the sharded version is about 50% faster than the lock-free data structure.

It is an amazing result: we get the best performance without any clever tricks or convoluted code.

How many shards?

I also ran a benchmark on the number of shards.

[Figure: performance vs. the number of shards]

The performance improves quickly at first and converges at about 128 shards, which is not a big number, so the extra space overhead is small.

Can we calculate the optimal number of shards? I don't know; I guess it would be hard. Running benchmarks is probably the easiest way to find out.

Also notice that in our hashmap example there is no overhead beyond the tiny amount of extra space. In other cases, however, more factors need to be considered. For instance, with a fixed total buffer pool size, as the number of shards goes up, the size of each shard goes down, which leads to more frequent page replacement.

General & Brute-force

I love this sharding technique because it's a general method for improving lock granularity in a brute-force way. You have seen how easy it is:

  • Write a naive big lock version
  • Split it into several shards according to some keys

It can also be applied to many other cases.

For example, the database I'm working on didn't have a buffer pool before. Instead, the paging mechanism was coupled with each data structure. I wanted to refactor it, creating a buffer pool to manage paging for all data structures. And I did. It was great: it ran 20% faster than before on unit tests, and the code base looked clean and tidy. Everything was fine until I ran the concurrent tests. It was almost 10x slower than before.

Thanks to this answer on Zhihu, the Chinese Quora, I learned this sharding technique, and it works well. I'm also happy to find that MySQL uses this technique too.

Of course, you can always try to design some fancy lock-free data structure. But it takes time and effort to debug it and prove it correct. And it usually means designing the lock-free data structure for one specific implementation. For example, there are several page replacement algorithms. You need to protect not only the hashmap but also the metadata that the page replacement algorithm needs. Using locks, you can implement those algorithms as different derived classes and easily switch between them; the hashmap and the algorithm are two decoupled parts. Using a lock-free data structure, on the other hand, might require you to couple them together. I'm no expert in the lock-free world, so please correct me if I'm wrong.