Bump downwards #37
Conversation
This changes `bumpalo`'s implementation from

* initializing the bump pointer at the start of the chunk, and
* incrementing the bump pointer to allocate an object

to

* initializing the bump pointer at the end of the chunk, and
* decrementing the bump pointer to allocate an object

This means that we are now rounding down to align the pointer, which is just masking off the bottom bits. Rounding up, which we used to have to do, required an addition that could overflow, which meant an extra conditional branch in the generated code.

Furthermore, once the bump pointer is decremented, it points directly at the allocated space. Previously, we had to save a copy of the original pointer in a temporary, update the bump pointer, and then return the temporary. That requires an extra register, so the new approach should help lower register pressure at call sites, producing slightly better code.

The decrement also requires fewer instructions to implement, which is better for code size, and, all else being equal, should imply a speedup in its own right as well.

Put all this together and allocation speeds up 3-19% depending on the workload! See the benchmark results below (a sketch of the downward-bumping fast path follows them).

Note that there is a ~4% regression in `realloc` performance. This is because the new, decrementing-the-bump-pointer implementation cannot grow the last allocation in place by only updating the bump pointer: the beginning of the allocation moves, so it has to copy the existing bytes even when it gets to reuse the original allocation's space. I think this is worth the trade-off for the speedup to allocation, however.

Benchmark Results

```
alloc/small             time:   [26.129 us 26.168 us 26.208 us]
                        thrpt:  [381.56 Melem/s 382.15 Melem/s 382.71 Melem/s]
                 change: time:   [-9.2069% -8.7900% -8.3936%] (p = 0.00 < 0.05)
                         thrpt:  [+9.1627% +9.6372% +10.141%]
                        Performance has improved.
Found 123 outliers among 1000 measurements (12.30%)
  51 (5.10%) high mild
  72 (7.20%) high severe

alloc/big               time:   [348.03 us 348.21 us 348.41 us]
                        thrpt:  [28.702 Melem/s 28.718 Melem/s 28.733 Melem/s]
                 change: time:   [-3.1144% -3.0057% -2.8915%] (p = 0.00 < 0.05)
                         thrpt:  [+2.9776% +3.0989% +3.2145%]
                        Performance has improved.
Found 150 outliers among 1000 measurements (15.00%)
  58 (5.80%) low mild
  46 (4.60%) high mild
  46 (4.60%) high severe

alloc-with/small        time:   [26.446 us 26.477 us 26.508 us]
                        thrpt:  [377.25 Melem/s 377.69 Melem/s 378.12 Melem/s]
                 change: time:   [-16.499% -16.191% -15.898%] (p = 0.00 < 0.05)
                         thrpt:  [+18.904% +19.318% +19.759%]
                        Performance has improved.
Found 57 outliers among 1000 measurements (5.70%)
  43 (4.30%) high mild
  14 (1.40%) high severe

alloc-with/big          time:   [313.26 us 313.75 us 314.35 us]
                        thrpt:  [31.811 Melem/s 31.872 Melem/s 31.922 Melem/s]
                 change: time:   [-6.5853% -6.2957% -6.0163%] (p = 0.00 < 0.05)
                         thrpt:  [+6.4014% +6.7187% +7.0495%]
                        Performance has improved.
Found 166 outliers among 1000 measurements (16.60%)
  70 (7.00%) low mild
  44 (4.40%) high mild
  52 (5.20%) high severe

format-realloc/format-realloc/10
                        time:   [84.850 ns 85.002 ns 85.162 ns]
                        thrpt:  [117.42 Melem/s 117.64 Melem/s 117.86 Melem/s]
                 change: time:   [+4.8825% +5.4527% +6.2553%] (p = 0.00 < 0.05)
                         thrpt:  [-5.8870% -5.1707% -4.6552%]
                        Performance has regressed.
Found 299 outliers among 1000 measurements (29.90%)
  1 (0.10%) low severe
  78 (7.80%) low mild
  22 (2.20%) high mild
  198 (19.80%) high severe

format-realloc/format-realloc/80
                        time:   [85.144 ns 85.353 ns 85.571 ns]
                        thrpt:  [934.89 Melem/s 937.29 Melem/s 939.58 Melem/s]
                 change: time:   [+4.6040% +5.5085% +6.1615%] (p = 0.00 < 0.05)
                         thrpt:  [-5.8039% -5.2209% -4.4014%]
                        Performance has regressed.
Found 168 outliers among 1000 measurements (16.80%)
  40 (4.00%) high mild
  128 (12.80%) high severe

format-realloc/format-realloc/270
                        time:   [84.940 ns 85.080 ns 85.225 ns]
                        thrpt:  [3.1681 Gelem/s 3.1735 Gelem/s 3.1787 Gelem/s]
                 change: time:   [+3.7967% +4.2268% +4.6452%] (p = 0.00 < 0.05)
                         thrpt:  [-4.4390% -4.0554% -3.6579%]
                        Performance has regressed.
Found 229 outliers among 1000 measurements (22.90%)
  8 (0.80%) low severe
  2 (0.20%) low mild
  11 (1.10%) high mild
  208 (20.80%) high severe

format-realloc/format-realloc/640
                        time:   [85.917 ns 86.199 ns 86.497 ns]
                        thrpt:  [7.3991 Gelem/s 7.4247 Gelem/s 7.4490 Gelem/s]
                 change: time:   [+2.2676% +3.1780% +3.8626%] (p = 0.00 < 0.05)
                         thrpt:  [-3.7190% -3.0801% -2.2173%]
                        Performance has regressed.
Found 169 outliers among 1000 measurements (16.90%)
  62 (6.20%) high mild
  107 (10.70%) high severe
```
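To make the pointer arithmetic above concrete, here is a minimal sketch of a downward-bumping fast path. This is illustrative only, not `bumpalo`'s actual code; the `Chunk` layout and names are assumptions made for the example.

```rust
use core::cell::Cell;

/// Toy chunk: `start` is the lowest usable address; the bump pointer
/// is initialized to the chunk's *end* and moves downward.
/// (Hypothetical layout for illustration, not bumpalo's internals.)
struct Chunk {
    start: usize,
    ptr: Cell<usize>,
}

impl Chunk {
    /// Try to carve out `size` bytes aligned to `align` (a power of two).
    fn try_alloc(&self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Bump down, then round *down* to `align` by masking off the
        // low bits: no overflow-prone addition, hence no extra branch.
        let new_ptr = self.ptr.get().checked_sub(size)? & !(align - 1);
        if new_ptr < self.start {
            return None; // chunk exhausted
        }
        self.ptr.set(new_ptr);
        // The updated bump pointer *is* the allocation's address, so no
        // temporary copy of the old pointer is needed (one fewer live
        // register than the bump-upward scheme).
        Some(new_ptr)
    }
}
```

The same picture explains the `realloc` regression: the last allocation now begins at the bump pointer, so growing it moves its start downward, and the existing bytes must be copied rather than extended in place.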
@fitzgen Wow, those are some nice numbers! It might also mean that I can use this for flatbuffers!
Great! :)
```rust
let new_ptr = footer.ptr.get();
// NB: we know it is non-overlapping because of the size check
// in the `if` condition.
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), new_size);
```
Have you tested how much we lose by using `ptr::copy` instead and not having the `new_size <= old_size / 2` check?
Ah, I see. If we are shrinking but not by a lot, we can just use the same pointer. We could reclaim it, but it is a lot of trouble for very few bytes saved. I agree with this implementation! 👍
Exactly, and since we are already doing this calculus, we might as well choose the threshold where we get to do a faster copy as well. That said, if you want to experiment with other implementations and benchmark them, I'm happy to accept results-driven PRs! :)
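For anyone following this thread, here is a sketch of why the halving check licenses `copy_nonoverlapping`. It assumes a downward-bumping allocator that shrinks the last allocation by sliding the bytes up to a new start; the function name and signature are illustrative, not `bumpalo`'s exact internals.

```rust
use core::ptr;

/// Shrink the last allocation of a downward-bumping allocator from
/// `old_size` to `new_size` bytes, reclaiming the freed low bytes.
/// (Illustrative sketch; alignment handling is omitted.)
unsafe fn shrink_last(old_ptr: *mut u8, old_size: usize, new_size: usize) -> *mut u8 {
    debug_assert!(new_size <= old_size / 2);
    // The allocation's start moves up by the number of bytes freed.
    let new_ptr = old_ptr.add(old_size - new_size);
    // Source [old_ptr, old_ptr + new_size) and destination
    // [new_ptr, new_ptr + new_size) overlap iff
    //   new_ptr < old_ptr + new_size
    //   <=> old_size - new_size < new_size
    //   <=> new_size > old_size / 2.
    // So `new_size <= old_size / 2` guarantees disjoint regions, making
    // the faster memcpy-style `copy_nonoverlapping` sound; without the
    // check we would need the memmove-style `ptr::copy`.
    ptr::copy_nonoverlapping(old_ptr, new_ptr, new_size);
    new_ptr
}
```

When the shrink is smaller than half, the regions would overlap, which is exactly the case where keeping the same pointer (as discussed above) avoids any copy at all.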
So I have tried bumping downwards in a linear allocator, and even though it saves a couple of instructions, it was always slower on Jaguar CPUs.