A HashMap speedup tale: The number of allocations matters

I recently wrote bhashmap - a C11 hash table implementation library. This post will document one of the latest optimizations, which is in hindsight extremely simple to understand and implement, but which resulted in a ~30% runtime speedup.


Before talking about the optimization, let’s quickly get up to speed with how things worked before. I will focus only on the details relevant to this particular optimization.

At the core of every hash table implementation is a key-value pair, and here the case is no different: every such pair is represented by a HashPair struct:

struct HashPair {
    void *key;
    size_t keylen;
    const void *value;
    struct HashPair *next;
};

Looks simple enough. For every key-value pair we store a pointer to the key, its length (keylen), and a pointer to its associated value. The next pointer chains together pairs whose keys hash to the same bucket. For every key inserted into the map, bhashmap makes and stores a copy of it internally.

For example, after the following call:

/* Some heap-allocated string */
char *name = get_name();

/* Create new hash map of size 16 */
BHashMap *map = bhm_create(16);

/* Associate the name with the value 5000 */
size_t val = 5000;
bhm_insert(map, name, strlen(name) + 1, &val);

The caller is free to mess with the memory occupied by name, or even to release it entirely, as a copy of the key has been made. This is a deliberate design decision, and one of great relevance to this article.

That means that whenever a new key-value pair is to be inserted into the map, we have to do the following:

  1. Allocate memory for the new struct HashPair
  2. Allocate memory for a copy of the key that is to be stored
  3. Copy the bytes of the key over to the new buffer

The semi-pseudocode for that (with no regard for error checking, etc.) might look something like the following:

struct HashPair* create_pair(const void *key, const size_t keylen, const void *value) {
    struct HashPair *new_pair = malloc(sizeof(struct HashPair));

    *new_pair = (struct HashPair) {
        .key = malloc(keylen),
        .keylen = keylen,
        .value = value
    };

    memcpy(new_pair->key, key, keylen);

    return new_pair;
}

Makes sense. The code is self-explanatory: it just works, and it’s something I’d write without thinking twice.

However, the crucial detail here is that for every new pair, we perform two allocations (malloc calls): one for the struct itself, and one for the copy of the key. Every such malloc call also has to be followed up by a matching free later on. And malloc/free calls aren’t cheap - when inserting tens of thousands of pairs into the hash table, the overhead adds up and has a considerable effect on performance. The question then is - could we get the number of allocations down to just one? Of course we could - I wouldn’t be writing this article otherwise.

The optimization

Let’s get straight to the point:

By editing the definition of struct HashPair to utilize flexible array members, we can allocate the memory for the struct itself, as well as the memory needed to store the copy of the key in one fell swoop:

struct HashPair {
    size_t keylen;
    const void *value;
    struct HashPair *next;
    unsigned char key[];
};

We no longer store the key copy as a pointer to a separate allocation, but rather as an array right in the struct itself. This also improves locality: accessing the key no longer requires following a pointer to some other part of the heap - the key bytes sit immediately after the rest of the struct.

The create_pair implementation would now look like so:

struct HashPair* create_pair(const void *key, const size_t keylen, const void *value) {
    struct HashPair *new_pair = malloc(sizeof(struct HashPair) + keylen);

    *new_pair = (struct HashPair) {
        .keylen = keylen,
        .value = value
    };

    memcpy(new_pair->key, key, keylen);

    return new_pair;
}

Notice that in the first line, we allocate enough memory to hold the struct itself plus the copy of the key - in a single malloc call.
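One caveat worth mentioning: in principle, sizeof(struct HashPair) + keylen can overflow size_t for absurdly large keys. A defensive variant that guards the addition (my own sketch, not code from bhashmap, which also adds the error checks omitted above) might look like:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct HashPair {
    size_t keylen;
    const void *value;
    struct HashPair *next;
    unsigned char key[];
};

struct HashPair* create_pair_checked(const void *key, const size_t keylen, const void *value) {
    /* Refuse key lengths that would wrap around when added to the header size. */
    if (keylen > SIZE_MAX - sizeof(struct HashPair))
        return NULL;

    struct HashPair *new_pair = malloc(sizeof(struct HashPair) + keylen);
    if (new_pair == NULL)
        return NULL;

    new_pair->keylen = keylen;
    new_pair->value = value;
    new_pair->next = NULL;

    memcpy(new_pair->key, key, keylen);

    return new_pair;
}
```

In practice, a real hash table would likely reject such key lengths much earlier, but the check costs a single comparison.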


Every benchmark I’ve tested seems to indicate a ~33% runtime speed improvement compared to the unoptimized version, which is huge for such a simple optimization.

Here’s the unoptimized version’s average runtime of 16 iterations of the words_400k benchmark, which inserts 466,550 key-value pairs into the map.

$ ./buildrel/bench_words400k

Avg. time of 16 ITERATIONS: 312ms

And the optimized version:

$ ./buildrel/bench_words400k

Avg. time of 16 ITERATIONS: 204ms

A 34% runtime speed improvement!