This post is about organizing data in differential dataflow. It is also about organizing organizing data, in that I got tired of writing so many different ways of organizing data and put together a modular framework for building different types of data layout. I think this is pretty cool, and you might think it is cool too. Or (for the more cynical) you could point out the obvious blunders I've made. Something good must come of it.
I'm optimistic!
Although you probably should read this in order for it to make sense, the structure is:

- The problem definition: In which we describe the interface that our datastructure needs to support.
- A bespoke implementation: A description of differential dataflow's current design, which has some issues and which we will want to improve.
- Growing pains: A list of issues with the current implementation. There are some performance problems, some expressivity problems, and many maintainability problems.
- A modular approach: In which we deconstruct the bespoke approach into generic components, which can be re-assembled as appropriate.
- Evaluation: Testing our new implementations in various settings, trying to draw out their up and down sides.
Hey, is differential dataflow fast or slow? Let's profile it.
Strongly connected components is one computation I run a lot. Because it is cool and complicated, but not because anyone ever asked for continually updated strongly connected components. Not once.
But if you fire up a 10M node, 20M edge computation, you get a profile that looks a bit like this:
What it is telling us, because seriously what the heck, is that we are spending a lot of time (20%) futzing around with memory management (`memmove`) and lots of time in methods called things like `extend` and `consolidate` and `done`. These are all methods related to building up collections, which makes sense because the `scc` computation produces a bunch of intermediate state (it is a really cool dataflow) but doesn't do much with it once it has it (one iteration is often enough, for random graphs at least).
So, building up collections is part of what is holding scc back.
What about something else? Let's look at a continually updated breadth first search computation. This is another reference computation; I'm going to do 1M nodes and 10M edges, and continually apply 1,000 edge updates to the computation.
This time around we see we spend lots of time in `advance`, which we might not have met yet but will meet soon. There are also some `seek_key` and `tidy_vals` and `tidy_keys`, all related to navigating around traces. This makes sense for bfs because there are relatively few changes to the computation as the edges change, but we do need to run around in our indexed data and check that things really have stayed the same.
So, seeking around in collections looking for keyed data is part of what is holding bfs back.
**Bold claim**: In order to improve differential dataflow, we need to improve how we manage data. Like, how we efficiently bundle it up and stash it, and how we navigate around in it.
Not shown: a screenshot of Instruments.app beachballing, because apparently it locked up OS X's screenshot-taking mechanism too. It's good to know you aren't the only one whose software isn't fully sorted yet.
Recall that differential dataflow tracks lots of four-tuples, `(key, val, time, diff)`, where `key` and `val` are probably ordered, `time` is kind of a mess, and `diff` is some sort of integer (I've put on my grown-up pants and switched it from an `i32` to an `isize`, because big data).
All of this stuff also applies more broadly if your scenario has more interesting relations with more dimensions (or fewer, as we will get to).
Without worrying about the details, I'm going to claim that we want to navigate the data in this order too: we want to sniff around by `key`, within each key we want to move around by `val`, and finally we want to crank through a bunch of `(time, diff)` pairs for each value. This may change as the underlying differential dataflow algorithms change, but let's take it as a given for now.
Our tuples arrive in batches of varying sizes. At the beginning of the computation we might get a billion edges (hi `twitter_rv`!), and as the computation proceeds these batches might wobble around and be as small as single records. In any case, we want to append these tuples to our collection, so that when we sniff around in the future, we see them.
There is some magic about compacting differences for equivalent times which you can read about, but for now let's ignore this.
So, we repeatedly accept batches of tuples, and need to support interactive exploration by keys and vals and .. not really by time, but we need access to all the `(time, diff)` pairs.
We'll start with the custom implementation, before we get fancy and generalize it. This should call out the intended structure of the implementation, and then we'll talk about swapping in clever bits more modularly. It is the same implementation as mentioned in the recent status report post, but let's review it anyhow.
Our first building block is going to be an immutable collection. If we are presented with a batch of data and are told "just this", how might we represent it?
There are a few choices, but I happened to choose the following:
struct Batch<K: Ord, V: Ord, T> {
keys: Vec<(K, usize)>,
vals: Vec<(V, usize)>,
times: Vec<(T, isize)>,
}
The idea here is that `keys` is ordered by the `K` field, and has as its second field an offset into `vals` where the values associated with that key end. The values start, of course, at the offset where the previous key's values end (or zero, if this is the first key). Similarly, each interval of `vals` is sorted by `V` and has an offset into `times` bounding the associated range of `(T, isize)` entries.
This works. Because things are sorted, we could use binary search or something to zip around. It has some performance issues that we will get to, but it is not so bad to navigate through.
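To spell out how the offsets get used, here is a sketch of reading the values for a given key (the `vals_for` helper is hypothetical, just for illustration; it is not part of the actual code):

```rust
// How the offsets are meant to be read: the previous key's end offset is our
// start, and our own offset is our end.
impl<K: Ord, V: Ord, T> Batch<K, V, T> {
    /// The `(V, usize)` entries associated with the `index`-th key.
    fn vals_for(&self, index: usize) -> &[(V, usize)] {
        let lower = if index == 0 { 0 } else { self.keys[index - 1].1 };
        let upper = self.keys[index].1;
        &self.vals[lower .. upper]
    }
}
```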
You could walk around in these fields on your own, but I thought it would be good to put together a trait that allowed us to hide the details behind a simple `Cursor` interface:
pub trait Cursor<K, V, T> {
// move the cursor around..
fn step_key(&mut self);
fn seek_key(&mut self, key: K);
fn step_val(&mut self);
fn seek_val(&mut self, val: V);
// see what the cursor points at..
fn valid_key(&self) -> bool;
fn valid_val(&self) -> bool;
fn key(&self) -> &K;
fn val(&self) -> &V;
// do stuff with the associated (time, diff) data.
fn map_times<F: FnMut(&T, isize)>(&self, func: F);
}
Actually the real trait is a bit more complicated (rewinding, peeking, etc).
Nonetheless, the trait above gives us some structure about what we'll need to be able to do with our data organization. Walk around, jump around, look at things.
Our ordered lists allow us to use binary search for keys and values, but there is something a bit more clever we can do too. Exponential or "galloping" search starts from an existing cursor and moves forward in exponentially increasing steps until it would pass the value, then takes exponentially decreasing steps until it gets there. This is the version I use:
/// Reports the number of elements to pass until `function` becomes false.
pub fn advance<T, F: Fn(&T)->bool>(slice: &[T], function: F) -> usize {
// start with no advance
let mut index = 0;
if index < slice.len() && function(&slice[index]) {
// advance in exponentially growing steps.
let mut step = 1;
while index + step < slice.len() && function(&slice[index + step]) {
index += step;
step = step << 1;
}
// advance in exponentially shrinking steps.
step = step >> 1;
while step > 0 {
if index + step < slice.len() && function(&slice[index + step]) {
index += step;
}
step = step >> 1;
}
index += 1;
}
index
}
The advantage here is that if you aren't going too far away, it can be pretty quick. The time taken is logarithmic in the distance to the target, and if that is small, zoom. If it isn't small, well shucks.
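For concreteness, this is the sort of call site you would see (a usage sketch; `seek_forward` is a hypothetical helper, not from the codebase):

```rust
// Move a position within a sorted slice forward to the first element
// not less than `target`, using the galloping `advance` from above.
fn seek_forward(keys: &[u32], pos: usize, target: u32) -> usize {
    pos + advance(&keys[pos ..], |key| *key < target)
}
```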
All this jumping around is fun, but it isn't cheap. Stomping around in large amounts of memory can be pretty painful. One thing that improves life is breaking the `keys` vector into two vectors:
keys: Vec<K>,
offs: Vec<usize>,
so that we can jump around in `keys` and then look things up in `offs`. Saving the 8 bytes of the `usize` entry helps a lot for small keys, as we jump around in a relatively smaller amount of memory while looking for the key.
This is most of what I wanted to say about the immutable collection. We are going to use it, but there isn't all that much to this implementation so far.
We have above an approach for immutable collections, but for some reason we don't get all of our immutable data at once. We keep getting more and more immutable collections handed to us, which probably doesn't seem like the immutability we were promised.
But, there are standard ways of dealing with this. The one I thought was most natural was just to keep the immutable collections in a list as they come in,
struct Spine<K: Ord, V: Ord, T> {
batches: Vec<Rc<Batch<K, V, T>>>,
}
Side notes: yes, we will merge the immutable collections and create new immutable collections. Also, the `Rc` there is a reference-counting smart pointer, which we use so we can share the batches without taking ownership from others.
We also need a newer and smarter cursor implementation which would keep its fingers in each of the immutable collections, perhaps by using their cursors. This new and smart cursor will need to present the appearance of one order on keys and values, probably by merging the information from each of its constituent cursors.
What is a good way of merging several cursors? I had a pretty horrible way before, but I came up with something that I find pretty tasteful. It has probably been invented before, so if you know the name, holler!
The rough idea is that we'll keep all of our cursors in a list, which is a smart place to keep things when you have many, but we keep them sorted first by each's current key, and then among cursors with the smallest key we keep them sorted by each's current value. Let's imagine we also write down how many cursors currently have the same key and within them the same value, so we don't have to constantly re-evaluate this.
struct CursorMerge<C: Cursor> {
cursors: Vec<C>,
same_keys: usize,
same_vals: usize,
}
The current key and value can be read out of the first element in the list. To navigate around, we need to manipulate some of the cursors, but we only need to work with the prefix of the cursors whose keys or values, appropriately, are smaller than the target. Often this is a relatively small prefix (Note: this only works as we seek forward, which is going to be how we look for keys).
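To make the mechanics concrete, here is a simplified, standalone sketch of that seek logic, tracking keys only (the real code works against the `Cursor<K, V, T>` trait above and also tracks runs of equal values; `KeyCursor` and `KeyMerge` are stand-ins mirroring `CursorMerge`, for illustration):

```rust
use std::cmp::Ordering;

// A stand-in cursor trait, just enough for the sketch.
trait KeyCursor {
    type Key: Ord;
    fn valid(&self) -> bool;
    fn key(&self) -> &Self::Key;
    fn seek_key(&mut self, key: &Self::Key);
}

struct KeyMerge<C: KeyCursor> {
    cursors: Vec<C>,    // kept sorted by current key, invalid cursors last
    same_keys: usize,   // how many leading cursors share the minimum key
}

impl<C: KeyCursor> KeyMerge<C> {
    fn seek_key(&mut self, key: &C::Key) {
        // Because the list is sorted by key, the cursors behind the target
        // form a prefix; only they need to move.
        let behind = self.cursors.iter()
            .take_while(|c| c.valid() && c.key() < key)
            .count();
        for cursor in &mut self.cursors[.. behind] {
            cursor.seek_key(key);
        }
        // Restore the ordering invariant. (The real code can get away with
        // merging just the disturbed prefix back in, rather than re-sorting.)
        self.cursors.sort_by(|x, y| match (x.valid(), y.valid()) {
            (true, true)   => x.key().cmp(y.key()),
            (true, false)  => Ordering::Less,
            (false, true)  => Ordering::Greater,
            (false, false) => Ordering::Equal,
        });
        // Recount the run of cursors sharing the new minimum key.
        self.same_keys = if self.cursors.first().map_or(false, |c| c.valid()) {
            let min = self.cursors[0].key();
            self.cursors.iter().take_while(|c| c.valid() && c.key() == min).count()
        } else { 0 };
    }
}
```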
I said I wasn't going to say much about these, and I'll try not to. When we need to, we merge immutable collections into new merged collections. The results are (i) one collection and one cursor, and (ii) less space used. As we do this, we can apply `advance_by` logic to each of the `time` fields, which gives us the potential to consolidate `diff` fields and occasionally cancel tuples out of the collection.
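The consolidation step itself is simple enough to sketch on its own (a standalone helper for illustration, not the actual differential dataflow code): once times have been advanced, equal times can have their diffs accumulated, and zero diffs dropped.

```rust
fn consolidate<T: Ord>(times: &mut Vec<(T, isize)>) {
    times.sort_by(|x, y| x.0.cmp(&y.0));
    for index in 1 .. times.len() {
        if times[index].0 == times[index - 1].0 {
            // fold the previous diff into this one, and mark it for removal.
            let prior = times[index - 1].1;
            times[index].1 += prior;
            times[index - 1].1 = 0;
        }
    }
    times.retain(|x| x.1 != 0);
}
```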
It's still append-only, kinda. Immutable is arguable at this point.
The bespoke approach above is fine. It has decent worst-case performance, but isn't going to win any awards (unless there are prizes for being impressed with oneself). There are a variety of ways we might want to tweak this implementation to accommodate structural properties of the data, or to take aim at better performance. Let's go through a few examples:
What if your collection isn't `(key, val)` pairs, but is just `key`s?

This is not uncommon in, for example, Datalog computations where the collections are all set-valued. The pervasive `distinct` operator doesn't know anything about keys and values, nor does half of the `semijoin` operator (which passes those records whose keys exist in a second set).
In this setting, while we could just take our `V` type to be `()`, the zero-space unit type, we feel pretty silly keeping around
vals: Vec<((), usize)>,
when we could have just put all of its offsets into `times` up in the `keys` vector.
Instead, it makes sense to have a different datastructure:
struct Batch<K, T> {
keys: Vec<(K, usize)>,
times: Vec<(T, isize)>,
}
Of course, we'll need a different cursor now too. Copy/paste?
Some collections don't change. Or, maybe they just have few enough times at which they change that we could tolerate an immutable collection for each time.
If we want to do this, it neither makes sense to record the times in the collection, nor to allow `vals` to describe ranges (as there will only be one `diff`). Instead, we'd like a struct like
struct Batch<K, V, T> {
time: T,
keys: Vec<(K, usize)>,
vals: Vec<(V, isize)>,
}
This saves us a bunch on the memory footprint (those times aren't small, nor is the `usize`). It also simplifies some of the cursor logic, which we'll have to rewrite of course. No copy/paste this time.
Ordered lists are pretty handy for a few reasons.
- We can search for elements in logarithmic time.
- We can merge lists quite efficiently.
- Flat data layout is appealing for copying, serialization.
One potential improvement is to use an order-preserving hash table. My personal favorite is Robin Hood hashing, which keeps all elements in sorted order of their hash codes, but puts elements no earlier than the index their hash code would suggest. It leaves elements in sorted order, but with some gaps that in essence absorb some of the pain of bucket collisions. It still meets most of the properties up above, at the cost of bloating intervals up to powers of two, but only applies when your type implements `Hash` in addition to `Ord`.
The changes are not wildly complicated:
struct Batch<K, V, T> {
keys: Vec<Option<(K, usize)>>,
vals: Vec<Option<(V, usize)>>,
times: Vec<(T, isize)>,
}
The only change here is that we've wrapped each of our entries in an `Option` type, which allows for the absence of data with its `None` variant. Of course, there is also a substantially different cursor to write now.
These two examples are a good start, but maybe what we actually want is hashing for the keys, and ordered lists for the values, because we jump around in keys mostly and then scan values. Or maybe we want hashing for that neat key-free version that `distinct` uses. Awesome!

Do I really have to write out the full cross-product of these variants? What about when we invent a new variant that doesn't use `isize` but instead has a binary polarity for the whole batch and uses the number of occurrences to indicate the count?
You could write all these things out, and that would be one way to spend several months.
The bespoke approach quickly becomes painful to maintain. There are obviously many improvements, but the cost of implementing them (and maintaining them, if there are bugs, lol) makes it not that exciting for me at least.
Let's do something smarter, and make things faster in the process.
Our various representations above varied mostly in which fields (`keys`, `vals`, `times`) we kept, and how we navigated through each of these layers. In fact, there is a pretty simple contract between the layers: each layer is segmented into distinct regions by `usize`d boundaries known one layer up. We can think about writing generic "layers" that we then plug on top of other layers.
Let me give you a taste of what this looks like:
struct OrdLayer<K: Ord, L: Layer> {
keys: Vec<(K, usize)>,
vals: L,
}
Ok, this generic struct has `keys` like before, but then it just has `vals`, which is some type `L: Layer`. What does that mean?

The generic struct means that we can plug in our favorite type that implements some `Layer` trait, and we get the ability to navigate around by key `K`, and perhaps then dive down into `vals` using the `usize` offsets associated with the keys. It's not yet specified what we are diving down into, but that's cool. Something polymorphic, I bet.
Let's do another one, with less cleverness:
struct BottomLayer<T: Ord> {
vals: Vec<(T, isize)>,
}
I've suggestively named the layer and its generic parameter to connect this with the last level of our first `Batch<K, V, T>` implementation. Speaking of which,
type Batch<K, V, T> = OrdLayer<K, OrdLayer<V, BottomLayer<T>>>;
Ooo. That was pretty easy, wasn't it? Let's do that value-free (ha) version that something like `distinct` would use:
type Batch<K, V, T> = OrdLayer<K, BottomLayer<T>>;
Oh neat. What about that hashing stuff? We could certainly write out something like
struct HashLayer<K: Ord+Hash, L: Layer> {
keys: Vec<(K, usize)>,
vals: L,
}
which lets us choose to use our Robin Hood hashing on a layer-by-layer basis. We could do
type Batch<K, V, T> = HashLayer<K, OrdLayer<V, BottomLayer<T>>>;
or
type Batch<K, V, T> = OrdLayer<K, HashLayer<V, BottomLayer<T>>>;
or
type Batch<K, V, T> = HashLayer<K, BottomLayer<T>>;
or whatever we want, as long as what we want is to type in lots of different batch definitions.
But there's more, isn't there? Just writing the structs is the easy part; there is code to write too.
Probably the scariest thing to worry about are the cursors. That is where all the logic exploiting our cleverness lives. But, cursors really aren't all that hard.
Here is the trait that I'm currently using:
pub trait Cursor {
type Key: Ord;
fn step(&mut self);
fn seek(&mut self, key: &Self::Key);
fn key(&self) -> &Self::Key;
fn valid(&self) -> bool;
fn reposition(&mut self, lower: usize, upper: usize);
}
The trait has an associated type `Key`, which is just the cursor's way of saying "you need to use this type to talk with me". It has `step` and `seek`, which do the things you would expect (jump around). It has `key` and `valid` methods to know what you are looking at. Finally, there is a `reposition` method that, let's be honest, you aren't supposed to call. That is for parent cursors to call when they jump around. Hands off.
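To make that concrete, here is roughly what an implementation might look like for a bottom-most layer of `(T, isize)` pairs (a sketch; the real implementations differ in detail, for example they would borrow the layer rather than own it):

```rust
struct BottomCursor<T: Ord> {
    vals: Vec<(T, isize)>,   // the layer's data
    pos: usize,              // current position
    bounds: (usize, usize),  // the interval handed down by the parent cursor
}

impl<T: Ord> Cursor for BottomCursor<T> {
    type Key = T;
    fn step(&mut self) { self.pos += 1; }
    fn seek(&mut self, key: &T) {
        // gallop forward within our bounds, using `advance` from earlier.
        self.pos += advance(&self.vals[self.pos .. self.bounds.1], |x| &x.0 < key);
    }
    fn key(&self) -> &T { &self.vals[self.pos].0 }
    fn valid(&self) -> bool { self.pos < self.bounds.1 }
    fn reposition(&mut self, lower: usize, upper: usize) {
        self.pos = lower;
        self.bounds = (lower, upper);
    }
}
```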
This moves us around one level, but there is nothing in the trait definition about diving down to lower values and such. How do we do that?
I've just kept it out of the trait definition, though perhaps it could live there too. It seemed easier just to do something like
pub struct OrdCursor<K, L: Cursor> {
layer: OrdLayer<K>,
pos: usize,
bounds: (usize, usize),
pub vals: L,
}
This cursor has a link to the layer, a current position, and bounds on the interval it is currently exploring (so you don't wander off into the `K`s for some other group, or whatever lives above you).
In addition, the cursor has a public field, `vals`, which is just the cursor for the next level down. You can use that as you see fit. Importantly, none of the `vals` navigation informs the `keys` cursor; once the `vals` cursor runs out of values to look at (or whatever), it just sits in an invalid state. The only protocol between the two is that as we advance in the parent `OrdCursor`, we need to reposition the child `vals` to the corresponding range of values (hence `reposition`; don't touch!).
The reality is that I'm going to have to implement the `Cursor` trait from way back above (with `step_key` and `seek_val`) for each of these types. No magic there, yet. But that code should be relatively simple, and mostly it just makes associations between the trait methods and those of the weird type that I've built. It exports an understanding of what are keys and what are values.
As long as all of our batches are the same type, their cursors will be the same type and we can just throw them in that `CursorMerge` struct, which was written to be generic over cursor implementation.
This part is pretty cool. I think.
We are going to have to merge some of our `Batch` collections. No sweat, but let's at least try and do a good job here. Batches are sorted by something (either key or hash), and we can walk through them doing a merge.
We all know how to write merge sort, maybe. We start with two cursors and repeatedly look at the next element in each cursor, and take the smaller or, if there is a tie, take them both. Actually, it's a bit more complicated with our batches: if there are tied keys, we need to merge the associated values. If there are tied values we need to merge the associated times. Ok, it's like merge sort but a bit weirder.
Actually, it is like merge sort but awesomer, in some few important ways.
Let me describe a variant of merge sort that works on arrays, and uses our `advance` method from up above. Remember that one? It skips forward in ordered sequences to find the first element greater than some target, using logarithmic steps.

Here is a merge sort fragment that compares two elements, and based on which one wins uses `advance` to zip through the one with lesser values, looking for a number `step` of elements to copy over. It then uses `extend_from_slice`, which Rust folks will know is one of the few ways to trick LLVM into using `memcpy`. Depending on the phase of the moon, of course.
```rust
use std::cmp::Ordering;

// merge sort, kinda. (elements need `Clone` so we can copy them into `target`.)
fn merge<T: Ord + Clone>(mut source1: &[T], mut source2: &[T], target: &mut Vec<T>) {
    target.reserve(source1.len() + source2.len());
    while source1.len() > 0 && source2.len() > 0 {
        match source1[0].cmp(&source2[0]) {
            Ordering::Less => {
                // gallop through the run of elements less than `source2[0]`, then bulk copy it.
                let step = 1 + advance(&source1[1..], |k| k < &source2[0]);
                target.extend_from_slice(&source1[..step]);
                source1 = &source1[step..];
            }
            Ordering::Equal => {
                target.push(source1[0].clone());
                target.push(source2[0].clone());
                source1 = &source1[1..];
                source2 = &source2[1..];
            }
            Ordering::Greater => {
                // gallop through the run of elements less than `source1[0]`, then bulk copy it.
                let step = 1 + advance(&source2[1..], |k| k < &source1[0]);
                target.extend_from_slice(&source2[..step]);
                source2 = &source2[step..];
            }
        }
    }
    target.extend_from_slice(source1);
    target.extend_from_slice(source2);
}
```
This works rather well if there are runs of similarly sized values in either input. Maybe these runs aren't very common in sorting benchmarks, but they show up quite a lot in our batches. Wait, really?
In our batches, the analogue of `extend_from_slice` doesn't just copy over some keys, it needs to move over all of the associated values, and for each of those values the associated times. Even if we only get a few keys in a run, we are likely to want to bulk copy a bunch of values and times.
Fortunately, we can make our `Layer` implementations expose a `copy_range` interface, which efficiently pulls over large blocks of data from existing layers. This makes merging a lot less painful, and we only have to write it once for each `Layer` implementation, rather than for each of our weird cross-product implementations. This should make merging hurt a lot less. I hope.
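Roughly, the shape of that interface is something like the following (the method name comes from the text above; the exact signature is an assumption on my part):

```rust
trait Layer {
    /// Append the entries in positions `lower .. upper` of `other` to `self`,
    /// recursing into child layers for the corresponding sub-ranges.
    fn copy_range(&mut self, other: &Self, lower: usize, upper: usize);
}
```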
Let's see how this goes. We are initially going to compare against differential dataflow's default implementation, which was the "not very impressive" implementation whose performance limitations we hinted at back at the beginning. We expect to see improvements here.
Specifically, we'd like to get speedy random access, high-throughput merging, and more efficient memory layout where appropriate. The things we will measure are
- Time to build collections from unsorted tuples.
- Time to merge collections for varying sizes.
- Time to look up query sets in collections, each of varying sizes.
We'll actually do a few flavors of queries, taking random query sets of varying sizes and sorting them to see how performance improves as we get larger batches of queries. All of our implementations are designed to go faster as we get larger batches of queries to look up.
As data, I'm not going to work too hard here and just introduce a bunch of `(u32, (usize, isize))` tuples, where the key ranges from `0` up to some limit, and there is just one `usize` value for each key. Our goal is to stress out the key layer, not the layers below it. I think we could probably do some more interesting measurements, but we would start to conflate a bunch of performance issues and likely not understand the behavior of these layers as easily.
Let's see!
To start, let's consider using Rust's `HashMap` to drive a layer. I'm not really sure how this would work modularly, but we can use it for the top layer (keys), and get a sense for how "random access" could look, and what building and merging such a structure looks like (in fairness, it is optimized for neither).
HashMap<u32, (usize, isize)>
For increasing numbers of keys, we are going to stash `(key, (0u64, 0i64))` in the hash map. This is less work than the other layers are going to do, as they have the potential of variable numbers of values to maintain, but since `HashMap` doesn't do that we won't measure it. Instead, we will build up the hash map, merge the hash map with itself, and seek around in the hash map. In each case we report elapsed times divided by the number of records involved, to tell us about the rate.
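For a sense of what is being measured, the build measurement amounts to something like this (a sketch; the harness, key counts, and timing details here are assumptions, not the actual benchmark code):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Roughly what the "build" column reports: nanoseconds per inserted record.
fn build_rate(keys: u32) -> u128 {
    let timer = Instant::now();
    let mut map: HashMap<u32, (usize, isize)> = HashMap::new();
    for key in 0 .. keys {
        map.insert(key, (0, 0));
    }
    timer.elapsed().as_nanos() / (keys as u128)
}

fn main() {
    for &keys in &[1_000u32, 10_000, 100_000] {
        println!("{} keys: {}ns per insert", keys, build_rate(keys));
    }
}
```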
keys | build | merge | seek |
---|---|---|---|
1,000 | 137ns | 82ns | 19ns |
10,000 | 102ns | 49ns | 24ns |
100,000 | 153ns | 108ns | 69ns |
1,000,000 | 256ns | 154ns | 77ns |
10,000,000 | 232ns | 177ns | 83ns |
100,000,000 | 283ns | 228ns | 177ns |
Hash maps are meant to have good `seek` latencies at the expense of building and merging overhead; hash map construction stabs around in memory, and stutters a bit once we get more and more keys. The performance drops off a bit once we get to 100 million keys, a result of stabbing around randomly in quite a bit of memory.
Our first candidate, and roughly equivalent to what we are currently using in differential dataflow, is the ordered layer. It is meant to have better sequential access behavior, meaning good building and merging but maybe worse random lookups (yup). We are going to be using:
OrdLayer<u32, WeightedLayer<usize>>
Let's start by looking at the time it takes to build up an instance of the layer, starting from randomly ordered data (explicitly shuffled). We'll go through each key size, where we just have a single value associated with each (as we are trying to test this layer, not others). There are two costs to building the layer: sorting the input tuples, and then feeding them into the layer. We will measure them separately, as well as the time to then merge the collection with itself.
keys | sort | build | merge |
---|---|---|---|
1,000 | 82ns | 9ns | 10ns |
10,000 | 84ns | 11ns | 16ns |
100,000 | 96ns | 24ns | 17ns |
1,000,000 | 136ns | 31ns | 23ns |
10,000,000 | 166ns | 40ns | 20ns |
100,000,000 | 180ns | 43ns | 24ns |
The measurements indicate that sorting dominates, and if we could improve this we would see some gains. Or perhaps viewed differently, the building process is relatively efficient, which is good news because merging looks more like building than it looks like sorting. This merge also happens to be a pessimal merge, in that we merge identical traces with each other, meaning we never get a chance to use our fancy `memcpy`. The performance is still pretty good, and is notably much better than the `HashMap` implementation. Especially if we sort out sorting.
Up next, let's evaluate the time it takes to issue a bunch of queries. We'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation can (and will) sort the queries in each batch to use one forward scan through the layer, which is important because our `advance` method only goes forward. Let's plot the average number of nanoseconds for a `seek_key()` call across all the keys, batched at different sizes:
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 61ns | 51ns | 56ns | 44ns | 44ns |
10,000 | 57ns | 51ns | 52ns | 50ns | 49ns |
100,000 | 207ns | 174ns | 112ns | 86ns | 67ns |
1,000,000 | 513ns | 484ns | 384ns | 261ns | 78ns |
10,000,000 | 1,168ns | 1,290ns | 1,004ns | 707ns | 86ns |
100,000,000 | 1,812ns | 2,014ns | 1,646ns | 1,185ns | 96ns |
We see a general trend toward lower latencies with higher batch sizes, though this only really shows up at larger batch sizes. As we get to processing `ALL` keys, we are essentially performing a sequential scan through the stored keys, which is pretty efficient. The latencies at the larger key counts are quite high, though, and notably much worse than the `HashMap` implementation (by about 10x).
For informational purposes, let's re-do this measurement where we don't charge the cost of sorting to the query. This makes some sense in cases where the keys are pre-sorted, as when they arrive in a pre-formed layer (which they do, in differential dataflow).
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 56ns | 40ns | 22ns | 9ns | 4ns |
10,000 | 56ns | 42ns | 28ns | 18ns | 4ns |
100,000 | 209ns | 119ns | 87ns | 34ns | 5ns |
1,000,000 | 494ns | 501ns | 364ns | 215ns | 5ns |
10,000,000 | 1,135ns | 1,257ns | 975ns | 651ns | 4ns |
100,000,000 | 1,661ns | 1,931ns | 1,538ns | 1,183ns | 4ns |
Here the improvement as batch sizes increase becomes much clearer, even for small numbers of keys. On the other hand, the performance for large numbers of keys stays pretty unbearable.
Our next candidate is the Robin Hood hashing layer,
HashLayer<u32, WeightedLayer<usize>>
This evaluation taught me something about how the FNV hash function works, or rather doesn't work on consecutive integers. It's fast on small inputs, though, so let's use it for now. I'm using a trait `Hashed` with two methods that reveal a hash value, and how many of the bits are legit. The latter is most commonly `64`, but can be less for `32` bit hash values (bear with me on this). Here is the default implementation for things that can be hashed.
impl<T: ::std::hash::Hash> Hashed for T {
fn hashed(&self) -> u64 {
let mut h: fnv::FnvHasher = Default::default();
self.hash(&mut h);
h.finish()
}
fn bits(&self) -> usize { 64 }
}
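For reference, the trait those two methods live on would look something like this (reconstructed from the description; a sketch rather than the actual definition):

```rust
pub trait Hashed {
    /// The hash value for this item.
    fn hashed(&self) -> u64;
    /// How many of the bits of `hashed()` are meaningful.
    fn bits(&self) -> usize;
}
```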
If we spin up the computation as with `OrdLayer`, assembling different numbers of distinct keys, we get numbers like:
keys | sort | build | merge |
---|---|---|---|
1,000 | 421ns | 165ns | 28ns |
10,000 | 129ns | 69ns | 27ns |
100,000 | 122ns | 104ns | 43ns |
1,000,000 | 95ns | 68ns | 45ns |
10,000,000 | 91ns | 64ns | 32ns |
100,000,000 | 98ns | 74ns | 37ns |
The sorting has improved by a factor of 2x over `OrdLayer`, for non-small key counts, but the build times have increased (which makes sense, as we are trying to be more clever). There are several potential improvements to the hash builder, mainly that it currently first stages keys in a second array to figure out where to place them. These numbers also call out the overhead my radix sorter has for small key sizes (there is a fixed cost proportional to `256`). Merging doesn't seem too bad!
Up next, let's evaluate the time it takes to issue a bunch of queries. Again, we'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation sorts the queries to use one forward scan through the layer. Let's plot the average number of nanoseconds for a `seek_key()` call across all the keys:
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 63ns | 69ns | 115ns | 118ns | 143ns |
10,000 | 43ns | 46ns | 84ns | 120ns | 151ns |
100,000 | 112ns | 89ns | 113ns | 145ns | 183ns |
1,000,000 | 145ns | 90ns | 120ns | 147ns | 218ns |
10,000,000 | 174ns | 158ns | 194ns | 216ns | 248ns |
100,000,000 | 279ns | 279ns | 309ns | 346ns | 324ns |
The enormous latencies for large key counts are much reduced. But, we see a different trend here than in `OrdLayer`: as the batch size gets bigger, things get slower. This seems mostly due to the fact that sorting has a cost, and the random look-ups we are doing are already pretty fast (for large key sizes and small batches, almost 10x faster than `OrdLayer`). The cost of sorting can be decreased with radix sorting, but my implementation isn't especially brilliant on small batches (i.e. those other than `ALL`). Rust's comparison-based sorting is also probably annoyed because there are two `fnv()` evaluations in the comparison function. I'm ashamed.
Let's check again on the case where the cost of sorting is not counted against the cost of doing a query, simulating performance when the data arrive pre-arranged.
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 51ns | 32ns | 35ns | 17ns | 19ns |
10,000 | 40ns | 21ns | 15ns | 16ns | 12ns |
100,000 | 82ns | 88ns | 91ns | 40ns | 17ns |
1,000,000 | 140ns | 65ns | 54ns | 55ns | 17ns |
10,000,000 | 163ns | 145ns | 133ns | 133ns | 28ns |
100,000,000 | 268ns | 240ns | 235ns | 227ns | 68ns |
These numbers look a lot better. If the queries come in sorted, we will be almost as happy as with `OrdLayer`, and much happier for the small batch sizes (by almost a factor of 10x for large key spaces).
But, these numbers still aren't as good as we would like. The `ALL` measurement is noticeably larger than the single-digit nanoseconds of the `OrdLayer` measurements, and the large key counts take more than a hundred nanoseconds on average; more than we'd expect.
It turns out that the janky performance can be attributed (I think) to poor distribution of hash values. If we instrument the implementation, some keys are displaced 40 positions from their intended location, and the variance of displacement is something like 24. These are much larger than they are supposed to be for Robin Hood Hashing (the variance is meant to be 1-ish).
If you dig into how the FNV hash function works, it treats its arguments as a sequence of bytes, and repeatedly x-ors in a byte and then multiplies by some large prime. Unfortunately, that prime is basically a large power of two plus some small junk; specifically, for 64 bit hashes it is
2^40 + 2^8 + 0xb3
The problem (for us) is that something as simple as consecutive integers differ only in their final byte, and by not very much. The result is that the high-order bits of the hash will differ only by some small multiple of 2^40, which doesn't quite reach the highest-order bits of the hash. If we look primarily at the high-order bits to determine intended locations, we'll have a bunch of collisions when our keys are consecutive integers.
For example, let's take the numbers `0 .. 128`, compute the FNV hash, and look at the top byte. There are 256 possible values (think hash buckets), so there should be plenty of space for 128 values. Unfortunately, they collide a lot. There are only 24 distinct top bytes that FNV produces on these numbers, so buckets will get five-ish entries on average. And worse, the occupied buckets form runs; here are the first six occupied buckets, and their counts:
bucket[0x0B]: 4
bucket[0x0C]: 6
bucket[0x0D]: 6
bucket[0x2B]: 4
bucket[0x2C]: 6
bucket[0x2D]: 6
...
So, all of those entries are just going to pile up, aren't they. At least, at this small scale. Apparently similar things happen at the larger scales too.
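If you want to poke at this yourself, something like the following reproduces the flavor of the experiment (a sketch: it hand-rolls FNV-1a rather than using the `fnv` crate, and byte-order details may differ from what `Hash` actually feeds the hasher):

```rust
use std::collections::BTreeMap;

// FNV-1a over a byte slice: 64-bit offset basis and prime (2^40 + 2^8 + 0xb3).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

fn main() {
    // Bucket the numbers 0 .. 128 by the top byte of their hash.
    let mut buckets: BTreeMap<u8, usize> = BTreeMap::new();
    for i in 0u64 .. 128 {
        *buckets.entry((fnv1a(&i.to_le_bytes()) >> 56) as u8).or_insert(0) += 1;
    }
    println!("{} distinct top bytes", buckets.len());
    for (bucket, count) in &buckets {
        println!("bucket[0x{:02X}]: {}", bucket, count);
    }
}
```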
One fix that seems crazy, but works pretty great, is to ask the nice graph processing people, who thoughtfully mapped all their graph identifiers down to small contiguous integers, to turn the identifiers into distinct random numbers. We can then just use this number as the hash, which is pretty handy (we skip the FNV evaluation, and radix sorting is cheaper because we have half the bytes in the key).
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Default)]
pub struct Uniform {
val: u32,
}
impl Hashed for Uniform {
fn hashed(&self) -> u64 { self.val as u64 }
fn bits(&self) -> usize { 32 }
}
Our layer is now going to be
HashLayer<Uniform, WeightedLayer<usize>>
which is roughly the same as our previous `HashLayer`, just with a different `hashed()` implementation. If we re-run these measurements with random `u32` identifiers rather than `0 .. nodes`, we see a maximum displacement of 10 (rather than 40), and a variance of 1 (rather than 24). That feels better.
**Aside**: If you have dense identifiers, you would usually try something like an array rather than a hash map. That makes sense if you have lots of the identifiers, but we need something that works even for smaller sets of keys. If each of our immutable collections carries around an index sized to the number of nodes, we will be blowing a lot of memory.
Let's look at how this version fares. Building and merging shouldn't be much different, as these are largely agnostic to collisions. However, we save on the sorting (because radix sort is smart enough to notice that `u32` has fewer bytes than `u64`), and we might save a bit on the FNV evaluation, which we aren't doing now.
keys | sort | build | merge |
---|---|---|---|
1,000 | 310ns | 213ns | 30ns |
10,000 | 68ns | 57ns | 26ns |
100,000 | 53ns | 92ns | 37ns |
1,000,000 | 44ns | 63ns | 36ns |
10,000,000 | 49ns | 65ns | 35ns |
100,000,000 | 52ns | 84ns | 31ns |
These sort times are about half what they are when we use `u64` hash values to sort. The radix sort is smart and understands how many bytes are in the key, based on its type, and only does that many passes. Building and merging both seem to take about the same amount of time as before, which makes sense (perhaps fewer FNV invocations, but not enough to matter).
When we start to do look-ups, we notice generally improved times.
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 41ns | 43ns | 65ns | 69ns | 76ns |
10,000 | 39ns | 24ns | 41ns | 67ns | 87ns |
100,000 | 68ns | 97ns | 60ns | 69ns | 105ns |
1,000,000 | 125ns | 61ns | 70ns | 83ns | 122ns |
10,000,000 | 138ns | 54ns | 61ns | 78ns | 132ns |
100,000,000 | 191ns | 89ns | 96ns | 127ns | 158ns |
As before, a lot of overhead here is due to sorting, but we've also cut into the actual query times. If we check out the times without counting sorting against the queries, we see just how much is sorting:
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 26ns | 19ns | 10ns | 11ns | 11ns |
10,000 | 34ns | 17ns | 14ns | 12ns | 8ns |
100,000 | 82ns | 23ns | 22ns | 37ns | 10ns |
1,000,000 | 65ns | 42ns | 27ns | 25ns | 6ns |
10,000,000 | 134ns | 42ns | 34ns | 33ns | 8ns |
100,000,000 | 189ns | 76ns | 65ns | 64ns | 9ns |
These times are now much closer to `OrdLayer` for large query batches, especially `ALL`, and are up to an order of magnitude faster than `OrdLayer` for queries against large key sizes. In fact, the apparently rough single-query batches seem to be an artifact of re-initializing the cursor more than anything; if we re-use the same cursor (the `HashCursor` supports forward and backward navigation) the times become roughly the same as the batch size `10` numbers. This makes some sense because 10 keys out of 100 million don't really have any more locality than one key out of 100 million. What doesn't make sense is why constructing a cursor breaks performance (it does very little).

Our times are also similar to, and occasionally better than, the `HashMap` times. My understanding of `HashMap` is that it cynically optimizes for "misses", by storing values separately from keys. This means that when we do successful look-ups and want the value, there is another memory dereference to do. We stash the values with the key, so once we've found the key we have the data at hand. That's my best interpretation of the difference between the two. We also don't use SipHash, which could make a difference. This isn't meant to be a critique of `HashMap`, so don't get too excited about the difference.
Let's try out what happens when we put a spurious bogus layer in place. We are going to use a collection of type
HashLayer<Uniform, OrdLayer<(), WeightedLayer<usize>>>
which just adds a bogus `()` value. This is meant to represent the world in which we just have keys, no values, but have to suffer due to an implementation that assumes we must have values.
keys | sort | build | merge |
---|---|---|---|
1,000 | 320ns | 204ns | 28ns |
10,000 | 68ns | 58ns | 27ns |
100,000 | 53ns | 82ns | 55ns |
1,000,000 | 44ns | 73ns | 43ns |
10,000,000 | 49ns | 75ns | 41ns |
100,000,000 | 52ns | 60ns | 67ns |
These numbers aren't actually that bad. I was hoping they would be worse, actually. Oh well. The build and merge seem to creep upwards slightly, except for some heroics building the 100 million element collection. But it isn't as bad as we might have worried.
Looking things up will not behave any differently until we actually try to track down some values, which we haven't done just yet (we've just been calling `seek` and confirming we found the key). But, it's a pretty easy experiment to do, so let's compare the two. I've done the version where we don't charge sorting to the query, to try and make the distinction the clearest.

First, the previous implementation without the bogus `OrdLayer<(),_>` gumming up the works:
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 41ns | 22ns | 16ns | 12ns | 13ns |
10,000 | 42ns | 19ns | 16ns | 14ns | 9ns |
100,000 | 142ns | 31ns | 35ns | 53ns | 10ns |
1,000,000 | 205ns | 63ns | 46ns | 43ns | 9ns |
10,000,000 | 233ns | 68ns | 74ns | 54ns | 9ns |
100,000,000 | 351ns | 146ns | 126ns | 113ns | 11ns |
Notice that these numbers are for sure worse than when we didn't look at any of the associated data. This is due to the additional dereference we need to do to find the data. It becomes less of a problem with `ALL` keys, because our sequential scan of the keys also turns into a sequential scan of the values.
Now, the implementation with the bogus layer containing `()` values:
keys \ batch | 1 | 10 | 100 | 1,000 | ALL |
---|---|---|---|---|---|
1,000 | 50ns | 27ns | 16ns | 14ns | 24ns |
10,000 | 54ns | 25ns | 23ns | 18ns | 13ns |
100,000 | 206ns | 129ns | 93ns | 62ns | 14ns |
1,000,000 | 287ns | 103ns | 87ns | 78ns | 11ns |
10,000,000 | 440ns | 128ns | 103ns | 119ns | 11ns |
100,000,000 | 515ns | 232ns | 191ns | 197ns | 13ns |
These numbers are definitely worse. Not catastrophic, but definitely worse. Add to this the fact that we hold on to an extra `usize` for each key, 800 MB for the 100 million key case, and I'm comfortable with the effort spent making it possible to use the leaner representation.
What about our fancy `memcpy` merging? Let's evaluate that by doing some merging on two different pairs of collections. The first pair of collections will have all the even keys in one collection, and all the odd keys in the other collection. The second pair of collections will have the first `nodes` keys in one collection, and the second `nodes` keys in the other collection. The former collections alternate keys and will be annoying to merge, whereas the latter collections can largely just copy their ranges over.
keys | alternating | contiguous |
---|---|---|
1,000 | 73ns | 36ns |
10,000 | 57ns | 22ns |
100,000 | 59ns | 23ns |
1,000,000 | 53ns | 26ns |
10,000,000 | 55ns | 25ns |
100,000,000 | 67ns | 31ns |
This shows off what sorts of gains we might pick up as we copy over larger regions. We have picked the very best possible case to show this off, so it remains to be seen how cool it is in practice.
It seems like we have a decent toolkit to start building up interesting index structures. My first plan is to try and slot in a
HashLayer<Key, OrdLayer<Val, WeightedLayer<Time>>>
into differential dataflow and see how that looks. Of course, there is a bunch of glue to write, and perhaps some re-factoring of interfaces if pain points get discovered.
It might also make sense to push this code as a separate crate, then everyone could .. I'm not sure what you actually do with this, really.