-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change SipHash implementation to an optimized assembly version #35735
Comments
I've updated /~https://github.com/alexcrichton/siphash-bench to include the highwayhash repo, and the results look like (on nightly):
That is, SipHash13 in our standard library still beats out everything. There's a lot of hashes in that repo, though, and I could be building it wrong. @brson could you clarify which one you were referring to? |
Huh, siphash in the highway hash repo is not AVX2 optimized. It's just compatible C. It's only SipTreeHash and HighwayTreeHash that are AVX2 optimized. Neither are the same algorithm (though I think any crypto-hash is fair-game for the default). But furthermore, the AVX2 optimized hashes are even slower for small inputs than siphash, so we would really need a way to determine the two cases. Frustrating. |
Here's what I think I want in an ideal world:
|
@alexcrichton In your benches c2_siphash24 is faster than rust_siphash24, or am I misreading? |
@brson yes I believe that c2 implementation is faster than the Rust implementation, no idea why though (just threw up some numbers) |
SipHash was specifically designed to be fast for small inputs and I think most of Google's effort for things like CityHash have been targeted around high throughput for long strings. |
Here's another person who implemented sse2/ssse3 siphash and if I understand what they mean correctly, on x86-64, their basic (no explict simd) version was the best. /~https://github.com/floodyberry/siphash It seems there's a few things pointing towards siphash not being that amenable for optimization using sse/avx? |
This is why Google made siptreehash: /~https://github.com/google/highwayhash#siptreehash. Note that also Java abandoned the whole "strong hash function" approach due to performance concerns, in favor of a totally different way of structuring the hash table buckets in the face of many collisions: https://bugs.openjdk.java.net/browse/JDK-8046170 |
Thanks for that link @briansmith. In the spirit of the original OP i guess the only thing to do here is to replace the Rust siphash with the C siphash if it is indeed faster, which @alexcrichton's benchmarks indicate may be the case, by a decent percent, assuming the C siphash13 performance compares to the Rust performance similarly to 2-4. |
Although if we were to add new C code to std that invalidates some of the logic in the |
One way to make rust's siphasher faster is to enable one-shot hashing, there's an ongoing rfc for one approach at that rust-lang/rfcs#1666 one-shot hashing would be that the SipHasher type receives a whole byte slice that it can hash into a final hash value in one go. Currently SipHasher is implemented to take chunks at a time, and must have logic that keeps track of non-multiple-of-8 remainders between each call to Hasher::write. The C implementation of siphash that was compared with does it all in one shot, so that's certainly a factor in its favour. I'm sure Rust would be equivalent if it did too. |
Small improvement to SipHasher Very small but constant improvement, the objective is to lower latency for u16, u32 and small strings. CC #35735 ``` ➜ siphash-bench git:(master) ✗ sudo nice -n -20 target/release/foo-648738a54f390643 --bench | tee benches.txt [sudo] password for arthurprs: running 62 tests test _same ... bench: 0 ns/iter (+/- 0) test _warmup ... bench: 0 ns/iter (+/- 0) test rust_siphash13::int_u16 ... bench: 12 ns/iter (+/- 1) test rust_siphash13::int_u32 ... bench: 14 ns/iter (+/- 0) test rust_siphash13::int_u64 ... bench: 11 ns/iter (+/- 1) test rust_siphash13::int_u8 ... bench: 11 ns/iter (+/- 1) test rust_siphash13::slice::_10 ... bench: 18 ns/iter (+/- 1) test rust_siphash13::slice::_100 ... bench: 42 ns/iter (+/- 2) test rust_siphash13::slice::_11 ... bench: 19 ns/iter (+/- 1) test rust_siphash13::slice::_12 ... bench: 21 ns/iter (+/- 3) test rust_siphash13::slice::_2 ... bench: 16 ns/iter (+/- 2) test rust_siphash13::slice::_200 ... bench: 68 ns/iter (+/- 3) test rust_siphash13::slice::_3 ... bench: 17 ns/iter (+/- 3) test rust_siphash13::slice::_4 ... bench: 18 ns/iter (+/- 1) test rust_siphash13::slice::_5 ... bench: 19 ns/iter (+/- 4) test rust_siphash13::slice::_6 ... bench: 19 ns/iter (+/- 1) test rust_siphash13::slice::_7 ... bench: 20 ns/iter (+/- 1) test rust_siphash13::slice::_8 ... bench: 16 ns/iter (+/- 1) test rust_siphash13::slice::_9 ... bench: 18 ns/iter (+/- 2) test rust_siphash13::str_::_10 ... bench: 18 ns/iter (+/- 1) test rust_siphash13::str_::_100 ... bench: 41 ns/iter (+/- 2) test rust_siphash13::str_::_11 ... bench: 19 ns/iter (+/- 1) test rust_siphash13::str_::_12 ... bench: 20 ns/iter (+/- 2) test rust_siphash13::str_::_2 ... bench: 16 ns/iter (+/- 1) test rust_siphash13::str_::_200 ... bench: 68 ns/iter (+/- 3) test rust_siphash13::str_::_3 ... bench: 17 ns/iter (+/- 1) test rust_siphash13::str_::_4 ... bench: 18 ns/iter (+/- 2) test rust_siphash13::str_::_5 ... bench: 19 ns/iter (+/- 6) test rust_siphash13::str_::_6 ... bench: 20 ns/iter (+/- 5) test rust_siphash13::str_::_7 ... bench: 23 ns/iter (+/- 1) test rust_siphash13::str_::_8 ... bench: 15 ns/iter (+/- 1) test rust_siphash13::str_::_9 ... bench: 17 ns/iter (+/- 1) test sip1b::int_u16 ... bench: 10 ns/iter (+/- 1) test sip1b::int_u32 ... bench: 9 ns/iter (+/- 1) test sip1b::int_u64 ... bench: 12 ns/iter (+/- 1) test sip1b::int_u8 ... bench: 7 ns/iter (+/- 0) test sip1b::slice::_10 ... bench: 12 ns/iter (+/- 1) test sip1b::slice::_100 ... bench: 33 ns/iter (+/- 2) test sip1b::slice::_11 ... bench: 13 ns/iter (+/- 0) test sip1b::slice::_12 ... bench: 12 ns/iter (+/- 1) test sip1b::slice::_2 ... bench: 10 ns/iter (+/- 0) test sip1b::slice::_200 ... bench: 62 ns/iter (+/- 2) test sip1b::slice::_3 ... bench: 10 ns/iter (+/- 1) test sip1b::slice::_4 ... bench: 9 ns/iter (+/- 0) test sip1b::slice::_5 ... bench: 10 ns/iter (+/- 1) test sip1b::slice::_6 ... bench: 10 ns/iter (+/- 0) test sip1b::slice::_7 ... bench: 11 ns/iter (+/- 0) test sip1b::slice::_8 ... bench: 11 ns/iter (+/- 1) test sip1b::slice::_9 ... bench: 12 ns/iter (+/- 1) test sip1b::str_::_10 ... bench: 15 ns/iter (+/- 1) test sip1b::str_::_100 ... bench: 37 ns/iter (+/- 3) test sip1b::str_::_11 ... bench: 16 ns/iter (+/- 1) test sip1b::str_::_12 ... bench: 14 ns/iter (+/- 1) test sip1b::str_::_2 ... bench: 13 ns/iter (+/- 1) test sip1b::str_::_200 ... bench: 67 ns/iter (+/- 5) test sip1b::str_::_3 ... bench: 14 ns/iter (+/- 2) test sip1b::str_::_4 ... bench: 12 ns/iter (+/- 1) test sip1b::str_::_5 ... bench: 13 ns/iter (+/- 1) test sip1b::str_::_6 ... bench: 13 ns/iter (+/- 0) test sip1b::str_::_7 ... bench: 16 ns/iter (+/- 1) test sip1b::str_::_8 ... bench: 14 ns/iter (+/- 1) test sip1b::str_::_9 ... bench: 15 ns/iter (+/- 1) test result: ok. 0 passed; 0 failed; 0 ignored; 62 measured ➜ siphash-bench git:(master) ✗ cargo benchcmp rust_siphash13:: sip1b:: benches.txt name rust_siphash13:: ns/iter sip1b:: ns/iter diff ns/iter diff % int_u16 12 10 -2 -16.67% int_u32 14 9 -5 -35.71% int_u64 11 12 1 9.09% int_u8 11 7 -4 -36.36% slice::_10 18 12 -6 -33.33% slice::_100 42 33 -9 -21.43% slice::_11 19 13 -6 -31.58% slice::_12 21 12 -9 -42.86% slice::_2 16 10 -6 -37.50% slice::_200 68 62 -6 -8.82% slice::_3 17 10 -7 -41.18% slice::_4 18 9 -9 -50.00% slice::_5 19 10 -9 -47.37% slice::_6 19 10 -9 -47.37% slice::_7 20 11 -9 -45.00% slice::_8 16 11 -5 -31.25% slice::_9 18 12 -6 -33.33% str_::_10 18 15 -3 -16.67% str_::_100 41 37 -4 -9.76% str_::_11 19 16 -3 -15.79% str_::_12 20 14 -6 -30.00% str_::_2 16 13 -3 -18.75% str_::_200 68 67 -1 -1.47% str_::_3 17 14 -3 -17.65% str_::_4 18 12 -6 -33.33% str_::_5 19 13 -6 -31.58% str_::_6 20 13 -7 -35.00% str_::_7 23 16 -7 -30.43% str_::_8 15 14 -1 -6.67% str_::_9 17 15 -2 -11.76% ``` from a modified hash-rs suite (preallocating maps and adding having slice/str variants) graph version: http://imgur.com/a/DuoI4 ``` ➜ hash-rs git:(rfc-extend-hasher) ✗ cargo benchcmp sip13:: sip13opt:: benches.txt name sip13:: ns/iter sip13opt:: ns/iter diff ns/iter diff % slice::mapcountdense_000000001 27,343 (36 MB/s) 26,401 (37 MB/s) -942 -3.45% slice::mapcountdense_000000002 28,982 (69 MB/s) 26,807 (74 MB/s) -2,175 -7.50% slice::mapcountdense_000000003 29,304 (102 MB/s) 27,360 (109 MB/s) -1,944 -6.63% slice::mapcountdense_000000004 30,411 (131 MB/s) 25,888 (154 MB/s) -4,523 -14.87% slice::mapcountdense_000000005 32,625 (153 MB/s) 27,486 (181 MB/s) -5,139 -15.75% slice::mapcountdense_000000006 34,920 (171 MB/s) 27,204 (220 MB/s) -7,716 -22.10% slice::mapcountdense_000000007 33,497 (208 MB/s) 28,330 (247 MB/s) -5,167 -15.43% slice::mapcountdense_000000008 31,153 (256 MB/s) 28,617 (279 MB/s) -2,536 -8.14% slice::mapcountdense_000000009 30,745 (292 MB/s) 29,666 (303 MB/s) -1,079 -3.51% slice::mapcountdense_000000010 31,509 (317 MB/s) 29,804 (335 MB/s) -1,705 -5.41% slice::mapcountdense_000000011 32,526 (338 MB/s) 30,520 (360 MB/s) -2,006 -6.17% slice::mapcountdense_000000012 32,981 (363 MB/s) 28,739 (417 MB/s) -4,242 -12.86% slice::mapcountdense_000000013 34,713 (374 MB/s) 30,348 (428 MB/s) -4,365 -12.57% slice::mapcountdense_000000014 34,635 (404 MB/s) 29,974 (467 MB/s) -4,661 -13.46% slice::mapcountdense_000000015 35,924 (417 MB/s) 30,584 (490 MB/s) -5,340 -14.86% slice::mapcountdense_000000016 31,939 (500 MB/s) 30,564 (523 MB/s) -1,375 -4.31% slice::mapcountdense_000000032 36,545 (875 MB/s) 34,833 (918 MB/s) -1,712 -4.68% slice::mapcountdense_000000064 44,691 (1432 MB/s) 43,912 (1457 MB/s) -779 -1.74% slice::mapcountdense_000000128 67,210 (1904 MB/s) 64,630 (1980 MB/s) -2,580 -3.84% slice::mapcountdense_000000256 110,320 (2320 MB/s) 108,713 (2354 MB/s) -1,607 -1.46% slice::mapcountsparse_000000001 29,686 (33 MB/s) 28,673 (34 MB/s) -1,013 -3.41% slice::mapcountsparse_000000002 32,073 (62 MB/s) 30,519 (65 MB/s) -1,554 -4.85% slice::mapcountsparse_000000003 33,184 (90 MB/s) 31,208 (96 MB/s) -1,976 -5.95% slice::mapcountsparse_000000004 34,344 (116 MB/s) 30,242 (132 MB/s) -4,102 -11.94% slice::mapcountsparse_000000005 34,536 (144 MB/s) 30,552 (163 MB/s) -3,984 -11.54% slice::mapcountsparse_000000006 35,791 (167 MB/s) 30,813 (194 MB/s) -4,978 -13.91% slice::mapcountsparse_000000007 36,773 (190 MB/s) 31,362 (223 MB/s) -5,411 -14.71% slice::mapcountsparse_000000008 33,101 (241 MB/s) 32,399 (246 MB/s) -702 -2.12% slice::mapcountsparse_000000009 34,025 (264 MB/s) 33,065 (272 MB/s) -960 -2.82% slice::mapcountsparse_000000010 34,755 (287 MB/s) 33,152 (301 MB/s) -1,603 -4.61% slice::mapcountsparse_000000011 35,682 (308 MB/s) 33,631 (327 MB/s) -2,051 -5.75% slice::mapcountsparse_000000012 36,422 (329 MB/s) 32,604 (368 MB/s) -3,818 -10.48% slice::mapcountsparse_000000013 37,561 (346 MB/s) 32,978 (394 MB/s) -4,583 -12.20% slice::mapcountsparse_000000014 38,476 (363 MB/s) 33,376 (419 MB/s) -5,100 -13.26% slice::mapcountsparse_000000015 39,202 (382 MB/s) 33,750 (444 MB/s) -5,452 -13.91% slice::mapcountsparse_000000016 34,898 (458 MB/s) 33,621 (475 MB/s) -1,277 -3.66% slice::mapcountsparse_000000032 39,767 (804 MB/s) 38,013 (841 MB/s) -1,754 -4.41% slice::mapcountsparse_000000064 47,810 (1338 MB/s) 46,332 (1381 MB/s) -1,478 -3.09% slice::mapcountsparse_000000128 64,519 (1983 MB/s) 63,322 (2021 MB/s) -1,197 -1.86% slice::mapcountsparse_000000256 101,042 (2533 MB/s) 99,754 (2566 MB/s) -1,288 -1.27% str_::mapcountdense_000000001 27,183 (36 MB/s) 24,007 (41 MB/s) -3,176 -11.68% str_::mapcountdense_000000002 28,940 (69 MB/s) 24,574 (81 MB/s) -4,366 -15.09% str_::mapcountdense_000000003 29,000 (103 MB/s) 24,687 (121 MB/s) -4,313 -14.87% str_::mapcountdense_000000004 29,822 (134 MB/s) 24,377 (164 MB/s) -5,445 -18.26% str_::mapcountdense_000000005 31,962 (156 MB/s) 25,184 (198 MB/s) -6,778 -21.21% str_::mapcountdense_000000006 32,218 (186 MB/s) 25,020 (239 MB/s) -7,198 -22.34% str_::mapcountdense_000000007 35,482 (197 MB/s) 27,705 (252 MB/s) -7,777 -21.92% str_::mapcountdense_000000008 28,643 (279 MB/s) 25,563 (312 MB/s) -3,080 -10.75% str_::mapcountdense_000000009 30,112 (298 MB/s) 26,773 (336 MB/s) -3,339 -11.09% str_::mapcountdense_000000010 31,554 (316 MB/s) 27,607 (362 MB/s) -3,947 -12.51% str_::mapcountdense_000000011 32,062 (343 MB/s) 27,770 (396 MB/s) -4,292 -13.39% str_::mapcountdense_000000012 32,258 (372 MB/s) 25,612 (468 MB/s) -6,646 -20.60% str_::mapcountdense_000000013 33,544 (387 MB/s) 26,908 (483 MB/s) -6,636 -19.78% str_::mapcountdense_000000014 34,681 (403 MB/s) 27,267 (513 MB/s) -7,414 -21.38% str_::mapcountdense_000000015 37,883 (395 MB/s) 30,226 (496 MB/s) -7,657 -20.21% str_::mapcountdense_000000016 30,299 (528 MB/s) 27,960 (572 MB/s) -2,339 -7.72% str_::mapcountdense_000000032 34,372 (930 MB/s) 32,736 (977 MB/s) -1,636 -4.76% str_::mapcountdense_000000048 38,610 (1243 MB/s) 36,437 (1317 MB/s) -2,173 -5.63% str_::mapcountdense_000000064 43,052 (1486 MB/s) 41,269 (1550 MB/s) -1,783 -4.14% str_::mapcountdense_000000128 64,059 (1998 MB/s) 62,007 (2064 MB/s) -2,052 -3.20% str_::mapcountdense_000000256 109,608 (2335 MB/s) 107,184 (2388 MB/s) -2,424 -2.21% str_::mapcountsparse_000000001 29,155 (34 MB/s) 26,151 (38 MB/s) -3,004 -10.30% str_::mapcountsparse_000000002 31,536 (63 MB/s) 27,787 (71 MB/s) -3,749 -11.89% str_::mapcountsparse_000000003 32,524 (92 MB/s) 27,861 (107 MB/s) -4,663 -14.34% str_::mapcountsparse_000000004 33,535 (119 MB/s) 27,585 (145 MB/s) -5,950 -17.74% str_::mapcountsparse_000000005 34,239 (146 MB/s) 27,520 (181 MB/s) -6,719 -19.62% str_::mapcountsparse_000000006 35,485 (169 MB/s) 27,437 (218 MB/s) -8,048 -22.68% str_::mapcountsparse_000000007 39,098 (179 MB/s) 30,465 (229 MB/s) -8,633 -22.08% str_::mapcountsparse_000000008 30,882 (259 MB/s) 29,215 (273 MB/s) -1,667 -5.40% str_::mapcountsparse_000000009 33,375 (269 MB/s) 29,301 (307 MB/s) -4,074 -12.21% str_::mapcountsparse_000000010 33,531 (298 MB/s) 29,008 (344 MB/s) -4,523 -13.49% str_::mapcountsparse_000000011 34,607 (317 MB/s) 29,800 (369 MB/s) -4,807 -13.89% str_::mapcountsparse_000000012 35,700 (336 MB/s) 28,380 (422 MB/s) -7,320 -20.50% str_::mapcountsparse_000000013 36,692 (354 MB/s) 29,350 (442 MB/s) -7,342 -20.01% str_::mapcountsparse_000000014 37,326 (375 MB/s) 29,285 (478 MB/s) -8,041 -21.54% str_::mapcountsparse_000000015 41,098 (364 MB/s) 33,073 (453 MB/s) -8,025 -19.53% str_::mapcountsparse_000000016 33,046 (484 MB/s) 30,717 (520 MB/s) -2,329 -7.05% str_::mapcountsparse_000000032 37,471 (853 MB/s) 35,542 (900 MB/s) -1,929 -5.15% str_::mapcountsparse_000000048 41,324 (1161 MB/s) 39,332 (1220 MB/s) -1,992 -4.82% str_::mapcountsparse_000000064 45,858 (1395 MB/s) 43,802 (1461 MB/s) -2,056 -4.48% str_::mapcountsparse_000000128 62,471 (2048 MB/s) 60,683 (2109 MB/s) -1,788 -2.86% str_::mapcountsparse_000000256 101,283 (2527 MB/s) 97,655 (2621 MB/s) -3,628 -3.58% ```
Since the SipHasher has been deprecated in the core library, I think the improvement is not important. So the issue could be closed. |
It doesn't seem like the customized version of SipHash brings any performance improvement over what we already have, so I'm just going to close this. |
It's crucial that our secure hash maps are as fast as we possibly can make them, and there's more we can do yet. This is worth dropping down to assembly or even C for.
The best candidate is /~https://github.com/google/highwayhash. Unfortunately, it is apache2 only. We would need to get them to dual-license, but let's not bug them until we're sure we have a patch that's acceptable on our end.
This implementation requires AVX2, which we can not count on any of our shipping targets to support. But it seems relatively simple to me to do cpu feature detection dynamically in some strategic place, then drop to the fast version.
So the steps here:
An obvious concern is the overhead of that flag, and of the FFI crossing.
cc @alexcrichton what you think?
cc @briansmith
The text was updated successfully, but these errors were encountered: