-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using counting_sort to improve integer_spread_sort #11
Comments
On Thu, May 19, 2016, 4:08 PM Morwenn notifications@github.com wrote:
|
The check For example in the test « Shuffled » there are one million elements ranging from 0 to 1000000 while in the test « Shuffled (16 values) » there are one million elements ranging from 0 to 15. Basically, unless when there is an explicit « 16 values », every value is unique, even though the size of the range never exceeds the number of values (except maybe for « Alternating » where it might be k = 2N). Even in spreadsort's best cases, we need to have something akin to k > 2N for counting sort to be slower. I can't tell how much k and N differ in real-world programs, I guess it differs depending on the application and that there isn't a single answer. |
Hi Morwenn, On Fri, May 20, 2016 at 7:49 AM Morwenn notifications@github.com wrote:
You're welcome to try increasing max_splits and (optionally) decreasing The Boost.Sort distribution has a tune.pl script that is designed to pick Your graph also suggests that spreadsort should use pdq sort as its |
Aw, I always forget that there are tuning constants as well as an automated program to generate them. I will try that it report the results when I have time. Concerning pdqsort, I actually already use it as a fallback for spreadsort in my library (the one used for the benchmarks), and it always improves it compared to using |
All my code is available under the zlib license, and could be relicensed under the Boost license. If that's not enough, I hereby also grant explicit permission to relicense it under Boost. If that is still not enough let me know what needs to be done - copyright is no issue here. On top of the github repo containing the full code + some of the benchmarks I wrote (which @Morwenn modified for the graphs above), I'm also working on a paper explaining pattern-defeating quicksort. It is not ready for publication, but the draft is in an acceptable enough of a state to learn about and discuss pdqsort's inner workings/design decisions. Speling erors and incorrektness may occur - the code in my repo is authorative, |
On Sun, May 22, 2016, 12:59 PM Morwenn notifications@github.com wrote:
Once integrated with pdqsort, the min sort size and log min split count —
|
It's in the process of evaluation for libstdc++, and the last I've heard of libc++ was a year ago, when they said they were working on a benchmark framework first, but were definitely interested. (Google libcxx/libstdc++ and pdqsort should reveal the relevant mailing list posts). I have not contacted or have been contacted by anyone in the MSVC team. I would still consider already implementing pdqsort into Boost.Sort, as progress tends to move slow with these projects, and MSVC is especially uncertain. |
Hi Orson, First off, pdqsort looks pretty good, and I'd be interested in including it
Feel free to copy the Boost.Sort repository, edit it to include pdqsort, On Sun, May 22, 2016 at 1:08 PM Orson Peters notifications@github.com
I was reading through your docs and encountered this: With single-pivot random selection, T(n, 1/4) is what you'd expect from
|
Not without first knowing what my role in the project would be, and what sort of edits would be considered appropriate and/or expected. |
To follow up on the re-licensing, here is pdqsort's implementation released by me under the boost license: https://gist.github.com/orlp/2c417ab76391b126e1d58bac4bf4af0f I only changed the bare minimum (prefixed BOOST_ to macros, put |
On Tue, May 31, 2016 at 9:10 AM Orson Peters notifications@github.com
|
All right, so if I understand correctly I
And if I understand correctly, I don't
Did I understand this correctly? P.S.: you didn't answer my question regarding edit access. |
On Wed, Jun 1, 2016 at 7:23 AM Orson Peters notifications@github.com
|
I apologize for the delay. I got distracted with other things, but now something interesting came up. I was made aware of a new paper on arXiv: BlockQuicksort: How Branch Mispredictions don't affect Quicksort. I have implemented this in an experimental branch of pdqsort: /~https://github.com/orlp/pdqsort/tree/block. I'm still working on benchmarking, tuning and 100% checking it's bug-free, but it's very promising so far. I want to get this finished before integration. |
On Tue, Jul 5, 2016, 6:30 AM Orson Peters notifications@github.com wrote:
|
Back to the original topic of this issue: I tried to use tune.pl to actually test spreadsort with optimal tuning, but wasn't able to run the script. My system was unable to find |
The README /~https://github.com/boostorg/sort/blob/develop/README.md On Mon, Aug 1, 2016 at 10:16 AM Morwenn notifications@github.com wrote:
|
What the status of algorithm's implementation? Can i help? |
Great! |
Alright, I believe I am done. These are the files you can merge: https://gist.github.com/orlp/a344e9e9783b4adf5cc61a26410f0dbd There are 3 files: pdqsort.hpp, pdqsort.png, sort.qbk. The first is self-explanatory, the second contains a little bechmark image included by the documentation I've written (read: mostly copy/pasted from my Github repo), and sort.qbk contains the section I've written. I'm still working on the paper for pattern-defeating quicksort, but when that is done I will notify you so you can add a reference to it somewhere. If anything is missing, let me know. |
Also you should prepare test and example before merging. Did you compare pdqsort with timsort? |
Considering every benchmark to date, timsort is cute compared to pdqsort :p Timsort can probably end up as a winner when the comparisons are the real expensive operation since it's been designed for this use case. We'd have to check that. |
Timsort is the default sort in some Java implementations; it's good for particular types of problems, especially mostly-sorted data, which is a common situation. With regards to pdqsort: I benchmarked replacing std::sort with pdqsort using my tune.pl script (included in the libs/sort dir in the boost release; you may want the fix I just made to tune.pl in the Boost.Sort repository to run it on linux). I saw substantial speedups on integer and float sorting, but non-trivial slowdowns with string sorting (specifically the stringsort and reversestringsort examples). You can either investigate those issues and try to fix them, or I can look into just using pdqsort for integer_sort and float_sort. Also, did you want pdqsort offered directly as option in Boost.Sort (as your documentation and namespacing suggest), or to just include it as a subsidiary algorithm in spreadsort? We might need a mini-review to include it as a prominent direct option, though I can check on that if you want. |
It would be nice if it was directly offered as an option in Boost.Sort. This is what you also said earlier (assuming the account @spreadsort isn't controlled by multiple people):
I will look into In the meantime you could just continue integrating only enabling pdqsort for integer/float cases while I track down what is going on for strings. |
Alright, I believe I have figured out the slowdown for strings. It has nothing to do with the ideas in pattern-defeating quicksort, but rather the ideas of BlockQuicksort I integrated into pattern-defeating quicksort. I give a very brief description of this technique at /~https://github.com/orlp/pdqsort#the-average-case. It seems that (in the way I've implemented it right now), it provides a tremendous speedup (50-80%) for the branchless case, such as comparing integers/floats, but a small but nontrivial slowdown in the case of branches in the comparison operator (~10%), which is the case for strings. I will look into alternative ways of implementing it to eliminate this, but I'm not sure if it's possible. You should be able to reproduce my findings by replacing
This is the old version before I had integrated the BlockQuicksort technique. Could you perform your benchmarks again with this function substituted in? If my hypothesis is correct you should see a significant slowdown of the integer/float case (compared to pdqsort with BlockQuicksort), but Now I go back into my dungeon and try to see if I can make BlockQuicksort fast even in the presence of branches... |
I've identified a micro-optimization that in my benchmarks puts Now we do this:
instead of:
The full code then becomes:
@spreadsort It would be nice if you could also benchmark this version. This should retain the extra speedups for branchless comparison operators while maintaining (almost) full speed for branches. |
In my testing that helped speed up reverse_string_sort by 2%, making it only 5% slower than std::sort, but string_sort was still 5% slower and it seemed to hurt other (non-string) times a little. My inclination is to just skip pdqsort for string_sort, and just put a warning in the documentation about the cases where it might be a bit slower. For references, this is my system info: 8GB RAM, Intel® Core™ i7-3612QM CPU @ 2.10GHz × 8 , Ubuntu 16.04.1 LTS 64-bit If all this sounds reasonable to you, I'll test it on large datasets and windows too. |
Forgot to mention: I'd prefer to keep your original code, as it is faster in non-string settings in my testing. |
@spreadsort I posted two different versions to try out - one older version and a newer version, did you benchmark both? And with There really never should be any reason for the older version to be slower than |
@spreadsort Also, numbers like 2% and 5% seem dangerously close to measurement error - are you making sure you do multiple trials, long enough benchmarks and letting each implementation 'warm up' in the cache? P.S.: how exactly do you run your benchmarks, so I could perhaps reproduce and see what's going on? |
I had benchmarked the newer change, which is about 5% slower, consistently (variation of <1% between runs). I just went back and benchmarked the older version, and it's roughly neutral vs. std::sort, except in some cases with more complex data structure sorting where it looks slightly faster. I find that carefully optimized code tends to perform better on O3 than O2, but that varies depending on the code. I'm building with the project default, whatever that is. |
@spreadsort The whole crux of the block quicksort is that it's branchless. It's 'straight' code the CPU can march through without ever stalling or undoing a branch mispredict. If the comparison function (e.g. Now in the perfect world there was some magic compiler directive @spreadsort One option is to include both old versions (block version, and regular), and select the block version when sorting integers or floats, the regular version otherwise. |
How about something like this: and only using the block sort if a built-in comparison like std::less (or std::greater) is being used for a built in integer or float type? Branches are very common in comparisons of anything else. |
Yeah, something like that. Although it may also be an option to expose two A third option is to expose both versions, as |
Considering the magnitude of the speed differences I'm seeing, that's a reasonable solution, as long as you document all 3. I actually see little benefit to pdqsort_branch vs std::sort (inside spreadsort), so I'm inclined to use just pdqsort_block where I know it's safe inside spreadsort (plain integer_sort and float_sort when they use std::less), and use std::sort everywhere else. This is worth a mini-review, with 3 potential options:
And both 2&3 would involve exposing pdqsort as a top-level call. At this point I'm thinking of option 2, but all will be possibilities in the review. |
@spreadsort But compare From studying the libstdc++ implementation, there really shouldn't be any scenario where That being said, I don't know under what circumstances spreadsort calls a fallback sorting algorithm, so it might be possible those patterns never occur. For me personally the prime goal would probably be to offer pattern-defeating quicksort as a sorting algorithm in Boost.Sort so the general public can find and use it. Exactly in what fashion spreadsort would use This is of course me being presumptuous in that pdqsort passes the mini-review. What exactly does this process look like? |
Now with the latest commit in /~https://github.com/orlp/pdqsort I've made it so that
However, you can always manually request branchless partitioning by calling |
That sounds good. I'll tell Francisco, who wants to run the review. |
@spreadsort As I know, Francisco is unavailable now, so... What the current pdqsort status? Can you review it? |
@spreadsort Why do you not want to include counting_sort to Boost.Sort? It's a very-very-very good algorithm fo sorting a large set of small values. And any algorithm can't beat counting_sort in this case. |
What's the difference between counting sort and a single iteration of
spreadsort? I believe spreadsort is the more general solution. I'm also
skeptical of the importance this relatively simple special case.
…On Thu, May 18, 2017, 11:29 AM Alexander ***@***.***> wrote:
@spreadsort </~https://github.com/spreadsort> Why do you not want to
include counting_sort to Boost.Sort? It's a very-very-very good algorithm
fo sorting a large set of small values. And any algorithm can't beat
counting_sort in this case.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11 (comment)>, or mute
the thread
</~https://github.com/notifications/unsubscribe-auth/AE2j3Of3X5ht7dP4lP0HEC_efLWYzzH3ks5r7GO9gaJpZM4IinXK>
.
|
I know of two flavours of counting sort:
If I'm not mistaken, a single iteration of spreadsort would be equivalent to the first of these algorithms. When I augmented spreadsort to use projections in my sorting library, I couldn't use the second counting sort because it does not handle projections. Well, I could dispatch on the |
I see; it's basically using the fact that it can just write out the values for the ints that it counted. Arguably the better solution is to just count the values and skip calling it a sort, just carry around the counts which are more compact in this case. |
To be honest, I don't know any real-world problem that spends non-trivial compute resources simply sorting integers (except when sorting a collection of objects on an integer member, which can be done with projections, but then we're not exactly sorting integers anymore) :p |
Hi, it's me again :)
No bug or memory leak today, but I was benchmarking some sorting algorithms again and was surprised by how fast a simple counting sort. Here are the results of a benchmark testing several sorting algorithms against collections of one million of integer elements with several data distributions:
While not always the fastest, counting sort tends to be insanely fast and always beats spreadsort in the benchmarks. Of course, counting sort has its share of problems, but I believe that it can be used to improve the integer-only version of spreadsort that does not take any custom comparison or shift function. Basically, we could add the following implementation somewhere (this version uses C++11, but it can be done in C++03):
Then we could modify the beginning of
spreadsort_rec
to use it as follows:Thanks to
is_sorted_or_find_extremes
, we already know the minimum and maximum values of the range[first, last)
, so we don't have to compute them again incounting_sort
. The check*max - *min <= last - first
ensures thatcounting_sort
will be called only if thestd::vector
it needs to allocate is no bigger than the original array, which means that spreadsort keeps a O(n) auxiliary memory guarantee (IIRC it is the current memory guarantee of spreadsort, right?). That check also ensures that the algorithm won't even run into the pathological worst cases of counting sort, so we only keep the good parts.In other words, we can significantly speed up some scenarios at the cost of a single condition that shouldn't even weight too much when it fails. If I find the time, I will try to compare a version of spreadsort with this trick against the current one and share the benchmark results here.
The text was updated successfully, but these errors were encountered: