-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a faster stable sort algorithm #90545
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon. Please see the contribution instructions for more information. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit 858462c3317ec419fe2503d4aa3812ddb90b7079 with merge eef64e8caf17ac76641c91a2409cc93ba7e381b4... |
☀️ Try build successful - checks-actions |
Queued eef64e8caf17ac76641c91a2409cc93ba7e381b4 with parent 4061c04, future comparison URL. |
Before even considering performance improvements, which I personally find very important, I do hope the Rust maintainers get convincing fuzz test results, and possibly apply AddressSanitizer and MemorySanitizer. Furthermore, can the insertion sort be factored out or reused? Maybe precomputed sorting networks would be even better. |
Finished benchmarking commit (eef64e8caf17ac76641c91a2409cc93ba7e381b4): comparison url. Summary: This change led to small relevant mixed results 🤷 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never |
The git history contains merge commits, you'll have to rebase that. https://rustc-dev-guide.rust-lang.org/contributing.html#opening-a-pr |
The benchmarking code includes a fuzz test of sorts. Pass
Each run sorts over 900 randomly sized arrays between 0 and 100,000 in length.
I started out with that idea too, but insertion sort was the fastest. |
Thanks. I'm not sure how to do that. I have my fork locally. I think I made a mistake by fetching upstream changes between my commits. How do I fix it? |
Rebasing automatically removes merge commits.
Add
|
59ee0dd
to
cf6d57d
Compare
@wpwoodjr Thank you for the detailed answer! I just saw it now because I didn't get notifications. That sounds quite good so far. There is one thing that came to my mind recently: At the end of the paper it says:
The PR author from the current stable sort also mentioned Galloping:
I have not implemented it so far, but compared the non-galloping current stable sort to stable in place symmerge (which uses binary searching and probably could use galloping, too). In many cases symmerge will outperform the non-galloping current stable sort by factor 2 to 3. I assume, that after the implementation of galloping for stable sort, symmerge will not be faster anymore and stable sort will be more than 3 times faster than the current implementation for inputs, where runs are merged that have different lengths (the bigger distance between the lengths, the better). Since you are using the same current merge function, it would be interesting to know, if the galloping search could benefit in your approach, too. And if not, if your approach is still better than the current TimSort with Galloping. I'm not too deep into the topic, so I can't say it for sure :( I've also developed an hybrid approach to make current stable sort use constant memory instead of N/2 and still be fast. But this will also work with your code, so it's not important here. |
CC @orlp, who also had some improvements to the stable sort code that he was interested in PRing. |
Well, yesn't. I am implementing a complete new stable algorithm from scratch (glidesort), which goes way beyond "some improvements" (and thus would also be a significantly larger PR to review). I have a working version without panic safety, and am working on a version with full panic safety with less unsafe code (which is getting delayed due to personal circumstances). I recently held a talk about glidesort here: https://www.youtube.com/watch?v=2y3IK1l6PI4 The results are thus also much more impactful. On my machine (2021 Apple M1 Pro) the speedups are such that glidesort is faster than When dealing with a low cardinality dataset (under projection of the comparison operator), e.g. sorting by age stored in integer years, which will likely only have ~120 unique values, glidesort becomes faster still (benchmark is 256 unique values by masking integers with |
@orlp Congratulations to this algorithm, this sounds very promising! In your presentation you say, that GlideSort will assume cheap comparisons. Will it become slower than another algorithm, when the comprisons are not cheap (for example image comparisons)? It would be interesting to see tests when sizeof(T) >= 256 or even sizeof(T) >= 384, in my opinion, a sizeof(T) < sizeof(usize) is not so common and thus not so important for a sorting algorithm but of course a very good feature. Also, during my research I found your |
I did briefly talk about it on the "creating more parallelism slide": https://youtu.be/2y3IK1l6PI4?t=727 I tried symmerge but it was slower and more complicated. I use a simplified scheme based on symmerge that uses block swaps rather than rotations. This is slightly less optimal in terms of comparisons but the smaller, simpler code and better codegen for blockswaps compared to rotations more than made up for it. It also made implied block swaps simpler. The logic I use is very similar to symmerge. I find the "crossover point" using binary search: // Given sorted left, right of equal size n it finds smallest i such that
// for all l in left[i..] and r in right[..n-i] we have l > r. By vacuous truth
// n is always a solution (but not necessarily the smallest).
//
// Panics if left, right aren't equal in size or if is_less does.
pub fn crossover_point<T, F: Cmp<T>>(left: &[T], right: &[T], is_less: &mut F) -> usize { ... } Which then partitions an array as such: let minlen = left.len().min(right.len());
let left_skip = left.len() - minlen;
let i = crossover_point(
&left.as_mut_slice()[left_skip..],
&right.as_mut_slice()[..minlen],
is_less,
);
let (left0, left1) = left.split_at(left_skip + i).unwrap_abort();
let (right0, right1) = right.split_at(minlen - i).unwrap_abort();
(left0, left1, right0, right1) Then merging
If comparisons dominate runtime entirely glidesort pretty much uses the same number of comparisons as Note that if your comparison operator is exceptionally expensive like image comparison, you are much better suited with an algorithm that gets much closer to optimality than mergesort, like merge-insertion sort.
I honestly don't really care about benchmarking the performance for sorting such large structs/arrays, because if anyone actually cared for the performance of sorting those they would be using indirection. Constantly shuffling such large structures around during sorting will never be performant. But yes, with a fixed-size buffer glidesort is O(n log(n)²).
I am generally focusing on objects that fit in a couple machine words. E.g. strings, integers, an
I hold the exact opposite opinion, that
Neither of these functions are found in pdqsort, so I don't know. And traditional 'galloping' is mostly a waste of time for a system sort in my opinion. It is relatively complex, mostly improves very academic notions of pre-sortedness that have little to no real-world justification (note how the new paper you cited does not contain any real-world benchmarks, or even synthetic benchmarks at all), and slows down the hottest core loops. There are some tricks similar to galloping (which I'd rather call 'optimistic binary search') I still might fit in at some points out of the core loops, but I really doubt they would benefit much. Most of the points you bring up very much come from the old school of thinking about sorting that glidesort (and pdqsort) precisely tries to puncture. Real-world system sorting isn't about minimizing comparisons at all. It's not about moves either, for the most part. It's about minimizing time. And most of the time is gained or lost in (instruction-level) parallelism, branch mispredictions and cache misses. That's why glidesort is 3.5x faster than the current |
Thank you for the detailed answer, sounds really good to me!
I would agree to this, on my machine |
Ah alright, I thought you meant bytes. |
If we are discussing possible sort improvements, maybe vectorizing would be of interest? This might be possible to implement using portable SIMD (probably not, though, as in the linked blog post they use SIMD runtime dispatch, which would probably not work with Rust's portable SIMD). |
Vectorized quicksort / sorting networks have been around for quite a while, but they're pretty much limited to sorting small integers without custom comparison functions. Not very applicable to a generic sort for arbitrary types with arbitrary comparison functions. Maybe in the future we can specialize for it, but I wouldn't consider it high priority. |
Does it also do this for the end of the slice? One of the things that's nice about the current sort is that you can |
@wpwoodjr https://spin.atomicobject.com/2020/05/05/git-configurations-default/ |
Looking at the total comparisons done:
The result suggests a rather minimal overall change. Which is good. |
Running my test suite miri complains about UB:
|
As noted in the other PR I ran my benchmark suite, but only zen3, with this implementation and got: A statistically significant speedup in most cases, with mostly ascending_saw and descending_saw seeing regressions. Levelling out at around 6% for random u64 inputs hot. Strings see smaller improvements, but overall a slight improvement. You can find the full results here. |
☔ The latest upstream changes (presumably #107143) made this pull request unmergeable. Please resolve the merge conflicts. |
Yes, it will detect the sorted prefix collection, then merge in the second collection. |
This will take a bit longer than I anticipated, as the sort routines have been moved to the core library, and I am traveling next week. I will get back to it when I return. |
ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks! FYI: when a PR is ready for review, send a message containing |
@wpwoodjr @rustbot label: +S-inactive |
Hi everyone, this is my first PR.
This PR replaces slice's TimSort with a stable two stage merge sort having pre-sorted prefix optimization. The total running time is O(n * log(n)) worst-case.
Highlights:
Benchmark:
The benchmark was run on six OS's / CPUs:
For each OS/CPU, the benchmark was run on vecs of
i16
,i32
,i64
,i128
, andString
. The vecs were of random lengths from 0 to 10,000,000.The proposed two stage merge sort indicates a speedup of 13.1% for random unsorted data, with 95% of sort runs being faster. For random data, including unsorted and forward / reverse sorted variants, the results indicate a speedup of 20.2%, with 92% of sort runs being faster. Other data patterns are faster too, except for the sawtooth pattern, which is about the same. The number of comparisons performed is -2.92% less over all tests.
Note: The above benchmark figures have been corrected since my original post. The original post did not include sort results for
vecs
of length 0-100.Full results and benchmark code are at wpwoodjr/rust-merge-sort
The new sort
While not really new, the proposed sort benefits from:
The new sort builds on existing code wherever possible.