Test driving cista's serializable hash map

I have been frequently running into the same problem where I need to produce a large, static mapping once, and then read it many times. Take my Valhalla live traffic implementation for example: TomTom started offering a new location referencing method with their Orbis product last year, where instead of sending OpenLR messages, they provide GERS or OSM identifiers along with start and end offsets in meters and a direction, for which you can then lookup affected edges in your road network (read more about it in TomTom’s docs). This is great, because OpenLR is just unnecessary overhead if the producer and consumer of a traffic API are using the same road network under the hood. However, this mapping of GERS ID -> [...valhalla_edges] is not trivial to construct because it involves sorting the valhalla edges topologically to correctly identify affected edges given the start and end offsets, and because we’re talking about tens of millions of unique GERS identifiers for a whole continent.

Rolling my own disk backed lookup

My first approach was to ship my own file format. I wanted something small and self-contained, so my goal became coming up with a simple file format that I could use to build the mapping once, and I could use for these GERS ID lookups, with little start up time required for the reader, and without needing to hold the entire mapping in memory. Cross-platform support was off the table, since I knew I would build the file on the same platform as the reader using it. The format I designed was dead simple:

  • a header containing information about
    • the map version
    • the ID type (since Orbis allows both GERS and OSM IDs)
    • the number of keys
    • the file format version
    • the production date
    • the start offset of the index
  • a data section containing the values (variable length lists of edge IDs)
  • an index consisting of [key, data_offset, length]

Since a single GERS entity often maps to tens, sometimes hundreds of edge IDs in Valhalla, the bulk of the data lives in the data section, while the index stays quite slim, even on 128-bit keys. That way, a good trade-off between memory footprint and start-up time would be to read the index into an in-memory hash map, and then memory map the data section.

Discovering cista

I ran a live traffic writer for Valhalla with this custom format in production for the better part of a year, and was quite satisfied with it (and happy about my learnings of course). However, over time I kept running into the same kind of problem when working on other applications, where I needed a write-once-read-many mapping from a single key to a sequence of values. I looked around, and quickly found out about cista: a general purpose header-only serialization library for C++ that supports serializing (almost) arbitrary data structures, with its own serializable hash map implementation. It sounded too good to pass up, so I gave it a try.

The API is very ergonomic, the interface of the hash map implementation (cista::offset::hash_map<KeyT, ValueT>) is similar to that of other implementations in C++ (STL, ankerl, abseil, etc.), and the serialization and deserialization steps are trivially done in fewer than 5 lines of code:

 1    // Serialize 
 2    {
 3        cista::buf mmap{cista::mmap{fname.c_str()}};
 4
 5        // WITH_INTEGRITY writes a hash of the key/value types into the file header 
 6        constexpr auto MODE = cista::mode::WITH_VERSION | cista::mode::WITH_INTEGRITY;
 7
 8        // write to file
 9        cista::serialize<MODE>(mmap, my_map);
10    }
11
12    // Deserialize 
13    {
14        auto memory = cista::mmap{fname.c_str(), cista::mmap::protection::READ};
15        constexpr auto MODE = cista::mode::WITH_VERSION | cista::mode::WITH_INTEGRITY;
16        auto my_map =
17            cista::deserialize<cista::offset::hash_map<KeyT, ValueT>,
18                               MODE>(memory);
19
20        // use it like any other hash map
21        // auto needle = my_map->find(some_key);
22    }

Thanks to the simple API, I was able to quickly create a small wrapper for it that I could use as a drop-in replacement for my own disk-backed mapping, and compared the two across construction time, file size, and query time. For my test, I used a OSM file for the Czech Republic (~3.2GB) to construct the mapping, and benchmarked it by querying the edge IDs for the GERS ID of each way in the file (~4.4 million successful lookups, ~6.6 million unsuccessful ones; note the majority of ways in OSM are not routable ways).

Construction timeFile sizeQuery time
custom implementation22s112MB11s
cista25s199MB11s

The advantages in terms of hard numbers aren’t that clear: query time is unchanged, construction takes a little longer, and the file size grew by a whopping 78%. But the ease of use is hard to ignore – construct your lookup like any other hash map, then save it to and read it from disk and query it like any other map. That’s pretty neat, and probably enough reason to retire my own format. Next time I run into this kind of problem, I’ll reach for cista first.