I recently went on a mission to find (and perhaps build) a better dictionary. I’ve been looking at Dr. Askitis’ work on so-called HAT-tries, which are something akin to burst tries. It all seems reasonable enough, and his experimental results seem very promising. In particular, I was looking for an open source version of this data structure that didn’t preclude commercial use (as Askitis’ version does).
HAT-tries rely on another of Askitis’ data structures, the hash array map. Essentially, it’s a hash table, but instead of using a linked list of nodes to hold the key/value pairs that land in a particular slot, each slot is a single buffer of packed records: the number of items in the buffer, and then, for each pair, the length of the string key, the key bytes themselves, and the value. The idea is that this arrangement is much more conducive to caching (and hardware prefetching, since each slot is one contiguous slab of memory). Contrast this with the conventional approach, in which each slot is a linked list that must be traversed, one heap allocation at a time, to find the key/value pair being retrieved.
I should note that there are a couple of implementations that I looked at before venturing out to make my own. The design is actually relatively simple: hash a provided key to map it onto one of many internal buckets. If that bucket is empty, allocate enough memory to store 1) an integer count of the pairs in this buffer, 2) the integer length of the provided key, 3) a byte copy of the key itself, and 4) a byte copy of the value.
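To make the layout concrete, here is a minimal sketch of a lookup against such a packed slot. The names, field widths, and fixed slot count are illustrative rather than lifted from my actual code, and a simple FNV-1a hash stands in so the sketch is self-contained:

    #include <string.h>
    #include <stdint.h>

    // One illustrative layout (not necessarily the exact format in my
    // implementation): each non-empty slot is a single contiguous
    // allocation of the form
    //   [uint32_t count] then, per pair: [uint32_t key_len][key bytes][size_t value]
    struct hash_array_map {
        static const size_t num_slots = 4096;
        char* slots[num_slots]; // NULL means empty

        // Stand-in string hash (FNV-1a) to keep the sketch self-contained;
        // I used SuperFastHash in practice.
        static uint32_t hash(const char* key, uint32_t len) {
            uint32_t h = 2166136261u;
            for (uint32_t i = 0; i < len; ++i) {
                h ^= (unsigned char)key[i];
                h *= 16777619u;
            }
            return h;
        }

        // Walk the packed records of one slot, comparing keys in place:
        // a linear scan, but over a single contiguous slab of memory.
        bool find(const char* key, uint32_t key_len, size_t* value_out) const {
            const char* p = slots[hash(key, key_len) % num_slots];
            if (!p)
                return false;
            uint32_t count;
            memcpy(&count, p, sizeof(count));
            p += sizeof(count);
            for (uint32_t i = 0; i < count; ++i) {
                uint32_t len;
                memcpy(&len, p, sizeof(len));
                p += sizeof(len);
                if (len == key_len && memcmp(p, key, len) == 0) {
                    memcpy(value_out, p + len, sizeof(*value_out));
                    return true;
                }
                p += len + sizeof(size_t); // skip this key and its value
            }
            return false;
        }
    };

The point of the layout is that a lookup touches one slot pointer and then scans a single allocation, rather than chasing list nodes scattered across the heap.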
For the hash, I chose my personal favorite, SuperFastHash. I began by following the STL style of accepting a pluggable memory allocator, and when I didn’t see the performance I wanted, I switched to plain malloc and realloc as prescribed in the paper. Even then, performance fell short. Of course, I imagine that my implementation could be improved, but I felt it was certainly at least reasonable. I tried a number of alterations, including preallocating more memory than was needed in each slot in the hope that realloc would save me. No dice.
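For concreteness, the malloc/realloc append path looks roughly like this under the same illustrative layout as the lookup sketch above (the helper names are mine, and error handling is elided):

    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    // Walk the packed records to find the current end of a slot's buffer.
    static size_t buffer_size(const char* buf) {
        uint32_t count;
        memcpy(&count, buf, sizeof(count));
        const char* p = buf + sizeof(count);
        for (uint32_t i = 0; i < count; ++i) {
            uint32_t len;
            memcpy(&len, p, sizeof(len));
            p += sizeof(len) + len + sizeof(size_t);
        }
        return p - buf;
    }

    // Append one key/value record, growing the slot with realloc. Since
    // realloc(NULL, n) behaves like malloc(n), the empty-slot case needs
    // no special handling.
    static void slot_append(char** slot, const char* key, uint32_t key_len, size_t value) {
        uint32_t count = 0;
        size_t old_size = sizeof(count);
        if (*slot) {
            memcpy(&count, *slot, sizeof(count));
            old_size = buffer_size(*slot);
        }
        size_t rec_size = sizeof(key_len) + key_len + sizeof(value);
        char* buf = (char*)realloc(*slot, old_size + rec_size);
        char* p = buf + old_size;
        memcpy(p, &key_len, sizeof(key_len));
        p += sizeof(key_len);
        memcpy(p, key, key_len);
        p += key_len;
        memcpy(p, &value, sizeof(value));
        ++count;
        memcpy(buf, &count, sizeof(count));
        *slot = buf;
    }

Every append is a realloc plus a copy into the tail, which is exactly why I hoped preallocating slack space in each slot would help; in practice it didn’t move the needle.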
My benchmark was focused on speed. Memory was less of a concern for my needs, and so long as it stayed mostly unnoticeable (say, less than a couple hundred megabytes for a million keys), I was happy. I decided to give it a run against std::map (mostly to feel better about myself), and then tr1::unordered_map (mostly out of hubris). Although my rough implementation doesn’t (yet) include fancy-schmancy features like iterators, it barely edged out tr1::unordered_map for a small number of keys (fewer than 10k). When scaling up, however, the story was less than impressive.
This benchmark was performed using the first k words from Askitis’ distinct_1 data set, and the strings were loaded into memory before running the tests. These numbers are each the best of 10 consecutive runs (hoping to warm the cache and cast each container in the best possible light), with each container being a mapping of std::string to size_t. Each key was associated with the value 1, and every query verified that the resulting value was still 1. The query test was performed on the same input set of keys, and the random query test was identical except that std::random_shuffle was first applied to the vector of keys. Everything was compiled with g++-4.2 with flags -O3 -Wall (though other optimization levels had almost no impact). I also tried clang-2.1, and the results were very similar. I encourage you to run the same benchmark on your own system with your own compiler version.
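For reference, the timing loop has roughly this shape. This is an illustrative reconstruction rather than my exact harness; it reads whitespace-delimited keys from stdin instead of opening the distinct_1 file directly:

    #include <tr1/unordered_map>
    #include <algorithm>
    #include <ctime>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Load all keys into memory before timing anything.
        std::vector<std::string> keys;
        std::string word;
        while (std::cin >> word)
            keys.push_back(word);

        std::tr1::unordered_map<std::string, size_t> map;
        double best_insert = 1e30, best_query = 1e30;
        for (int run = 0; run < 10; ++run) { // best of 10 consecutive runs
            map.clear();
            clock_t t0 = clock();
            for (size_t i = 0; i < keys.size(); ++i)
                map[keys[i]] = 1; // every key maps to the value 1
            best_insert = std::min(best_insert, double(clock() - t0) / CLOCKS_PER_SEC);

            // (the random query variant applies std::random_shuffle to
            //  the keys here, before the query pass)
            t0 = clock();
            size_t ok = 0;
            for (size_t i = 0; i < keys.size(); ++i)
                ok += (map[keys[i]] == 1); // verify the value is still 1
            best_query = std::min(best_query, double(clock() - t0) / CLOCKS_PER_SEC);
            if (ok != keys.size())
                return 1;
        }
        std::cout << "insert: " << best_insert << "s, query: " << best_query << "s\n";
        return 0;
    }

The same loop was run against std::map, tr1::unordered_map, and my hash array map in turn, with only the container type changing.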
[Figure: Insertion Time Relative to std::tr1::unordered_map]
[Figure: Query Time Relative to std::tr1::unordered_map]
[Figure: Random Query Time Relative to std::tr1::unordered_map]
While tr1::unordered_map scaled better, the number of items in any one hash is relatively limited for the purposes of a HAT-trie (roughly in the range of 10k), so I think the hash array map has earned at least a trial when I test the HAT-trie itself. For those curious, my source is available on github.