-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul std collections and algorithm documentation #36318
Comments
I'm not against additional rigor, but I would rather have the API documentation err on the side of too little details than too many:
Also, spec'ing the number of moves/copies makes little sense (except as one of many factors influencing run time): C++ specifies it because moves and copies call constructors and are thus observable behavior, but in Rust they're plain |
I both agree and disagree here. I think it is worth it to document as much as it makes sense (that is "many details") but I think it makes sense to err in the "conservative" side, like for example specify conservative complexity guarantees (e.g. for stable sort "at most O(N) extra memory" doesn't prevent the implementation from switching to grail sort, which requires O(1) memory, for those cases in which it is faster).
Yes I agree in that these things are obvious, and repetitive. The tricky part will be achieving consensus on what should be documented here and what should not be documented, particularly in the effect section. My strawman tries to mention everything since that is one of the extremes, but probably something in between will be better. Having said this, I do think that having a terse bullet-point form of all the side-effects of an operation is useful even if they are already documented in the prose. For example for
Even though API docs aren't standards, guaranteeing O(N) complexity and changing that afterwards to O(N log N) is an API breaking change, so we want to be very conservative at first with what do we guarantee for each method. Tightening any of those guarantees can always be done later, but I think that should follow the RFC process, e.g., want to tighten the guarantee that stable sort requires O(1) extra memory instead of O(N)? Write a (small) RFC for it.
I should have said move/clone instead of move/copy but I still do think that it is worth it to guarantee or at least document this to allow users to reason about performance. Some algorithms are in place and do not require any memcpys, while others need to perform a memcpy of O(N). Again I am not suggesting to making these guarantees as tight as possible (or as tight as the current implementation), but guaranteeing that O(N) copies happen as opposed to O(N^2) copies tells users that:
I think a good example are the two kinds of sorting algorithms one generally encounters: unstable in-place, and stable with a buffer. The number of comparisons alone don't tell you the whole picture about both algorithms. For unstable in-place the number of swaps and the stack usage (O(1) vs O(logN)) tell you the larger story (*), while for stable sort you also need to know how much memory it allocates, how often is it going to copy elements into the buffer, whether it can survive OOM (in Rust obviously it cannot, but maybe in the future with Allocators it could be made to work without a buffer). (*) I think that with respect to stack usage constant factors don't matter "that much", since for a O(1) algorithm you know that if it works for zero elements it won't overflow the stack for N elements, but the same cannot be said about an algorithm with O(logN) stack usage. |
That is precisely my worry. Giving any guarantee at all is a commitment that needs to be made carefully and consciously, unless the bounds given are incredibly loose. Such loose bounds, on the other hand, can be harmful by confusing people and giving a wrong impression.
You are right that some of these metrics give a better idea of the algorithm used, but this requires giving bounds that aren't trivially true for any sensible implementation. Again, giving overly loose bounds (e.g., O(log N) stack space, O(N) scratch space, O(N²) comparisons and swaps worst case) would be misleading, and narrowing it down would be a large commitment that I do not believe people would be or should be willing to make. Guaranteeing details of the algorithm used (the raison d'être for these metrics) is precisely what we should be reluctant to do. The more I think and write about this topic, the less happy I am about giving precise asymptotic complexity bounds. Don't get me wrong, knowing that a program takes somewhere between O(1) and O(N log N) time and space is a very valuable sanity check that it won't blow up on larger data sets. We should give this information. But more details are both hard to provide, and may not give all that much detail. For realistic performance estimates, one needs more information than a programming language should be willing to guarantee, including information that can't be captured in asymptotic complexity at all. In short, it is information about the specific implementation your program is actually using, and that's the exact opposite of what API docs should do. I don't know if and how this information should be made available to those fine-tuning their code's performance, but I am increasingly convinced that API docs are the wrong place for these sorts of details. |
@rkruppe Perhaps it might be valuable to create another RFC to finalize asymptotic bounds for operations on standard library collections before putting them in the docs. Also, I think it would be nice if docs could mention significant dropping and ownership behavior (e.g. Vec::pop gives ownership of the popped item) and events that could cause unwanted behavior (e.g. killing another thread while that thread is executing Vec::drain could result in memory leaks). |
Also, I would rather err on the side of too many details as long as they are well organized. People can skim through stuff they don't care about, but not getting the information you need is maddening... |
I think it is worth remembering that currently we are at the opposite end of the spectrum, in which "almost nothing" is documented (a lot is documented, but not really in a method per method basis with strict guarantees that users can rely on). Maybe I started this discussion with the wrong foot by suggesting specific things to document. I guess it would be better to agree on general goals for the API documentation first (from the point-of-view of an user), and after we have consensus on that, try to formulate some more specific guidelines that achieve those goals (and then iterate). In my opinion, for a user reading the API documentation of some collection method or algorithm, it should be clear whether the method:
So what do I mean by the "opposite end of the spectrum above"? Consider finding a key in a HashSet, using
So going through the goals stated above, the current documentation basically fails at all of them. The API docs don't specify the complexity (so I don't know if calling it inside a loop will blow up my program), whether it is implemented recursively or not (i.e. if the stack could overflow), whether this operation can panic, can leak or allocate memory, ... Does this mean that Realistically though, I seriously doubt that anybody at this point would break any of the current guarantees that HashSet provides, and the same goes for the other collections. If anything, those guarantees might get better, but not worse. At least I would consider making e.g. the complexity worse a breaking change "even if the API doesn't promise anything". So IMO we might as well just go on and document this. This would not only serve as guidelines for users, but also as guidelines for those improving HashSet (what can/cannot be done without breakage). I think that how to exactly document this should probably be handled in an RFC, but at least agreeing on a set of goals for the documentation first would make it much easier to start writing such an RFC.
For me changing the complexity (among other things mentioned above, like memory usage, stack usage, allocating memory, panics, ...) is a breaking change even if it is not documented. If the std collections do not guarantee any of these things, this means to me that:
Cheapest compared to what? Comparisons while using stable sort? They are relatively cheap for types that are small and expensive to compare, and relatively expensive for types that are large and cheap to compare (like
Whether stack space used is O(1) or O(logN) is the difference between a stack overflow being possible or not (and thus the difference between a panic being possible or not, or at least that if the algorithm for N == 0 doesn't overflow the stack, it won't overflow the stack for any N). Maybe I am weird for caring about this, but I do like to know. And if I write some code that should never panic, and use some algorithm that uses O(1) stack space for this, I would not want this behavior to change in the next rustc upgrade under the covers. I agree that this information does leak implementation details of the algorithm, and I agree the need to be conservative to give implementors enough leeway to perform optimizations, but one cannot have it both ways.
I think we should give enough guarantees to allow people to reason about how and when to use the methods and classes in std collections without severely constraining the implementation. Documentation wise it is probably better to move some of the details to a "Non-binding implementation details" section in the API documentation comments that might add some extra information that might be relevant for those wanting to know exactly what is going on, but that is allowed to change between rustc versions.
Deciding which information to give exactly and how is a hard. Sadly we only have the API docs for this, since that is the reference. As mentioned above I think it would be better to have a non-binding section in the API docs that tells users implementation details that "are good to know" but that are allowed to change (e.g. the value of |
@gnzlbg What you said makes a lot of sense :) I have a question about your 5th point:
What do you mean by this? Also, I would like to add to the list: "Has interesting static side effects, such as giving up ownership of some piece of data."
I think space complexity is important (especially for collection types), but I have not come across examples where stack space complexity has caused my program to overflow... In my limited experience, this tends to be more a function of time complexity... Do you have a specific example where you have had problems?
Here, I disagree:
The reason is that it breaks modularity. If we put something in the docs, even if it is marked "non-binding", somebody will write code that depends on the non-binding behavior, and that code will break when the non-binding behavior changes. That said, I would like a section specifying what will not be standardized. For example, if for some reason it is not likely for the complexity of something to be specified, that should be well-noted. More generally, I think it is also important to specify what should be mention in the module-level docs. IMHO, this should be:
|
@gnzlbg I have to agree that hashing out the details is out of scope for a single issue and at this stage the focus should be on the general direction. Unfortunately that still leaves us with a major philosophical disagreement: What to guarantee and what not. I don't really disagree with this sentiment:
Certainly if someone filed an issue about a reasonable program that:
then I am pretty sure someone would try to fix that. At the same time, I am entirely opposed to enshrining such "reasonable" guarantees to a degree that terms like "breaking change" float around. We even reserve the right to make programs stop compiling as a side effect of tweaking type inference or adding trait implementations. It is simply not realistic to freeze all the myriad details that can factor into a program compiling and running at peak performance, and occasionally some metric will get worse, accidentally or deliberately as part of a change that's for the better in other aspects. So I am quite receptive to the idea of documenting these things as "non-binding"/"non-guarantees". This doesn't mean that you "cannot use std collections if [you] need stable behavior". Pragmatically, it just means that programs might occasionally get slower (and at other times faster). But this is nothing new and nothing that can be solved at the level of asymptotic complexity — codegen and optimization passes also have a huge effect and change much more frequently. For example, see #35408: This is not something that can be willed away with API docs and RFCs, it is an unfortunate fact of software engineering. |
But wouldn't this be labeled as a breaking change? I can't imagine something like this happening without at least an RFC? I'm just curious because this is a very strong statement... |
@mark-i-m See RFC 1122 for the official policy on language changes and RFC 1105 on library stuff like trait implementations. The people pulling the levers are responsible and careful enough that there aren't big changes dropping from the sky every other day (e.g., often there will be a crater run), but virtually every change anywhere in the API surface or certain parts of the compiler has the potential to break some code, even if it's usually hypothetical or pathological. |
Ah, I see what you mean now. Thanks On Sep 7, 2016 12:07 PM, "Robin Kruppe" notifications@github.com wrote:
|
I mean things like "this method changes the result of
Unstable sort is the typical example, e.g., if implemented using recursive quicksort it can overflow the stack, but if the implementation is a bit more clever and doesn't use recursion (or uses some sort of tail recursion), then that cannot happen. The guarantee is often O(1), and in some rare cases (dive and conquer algorithms) it might be O(logN), but the thing to read from there is "the implementation is allowed (or not) to use recursion, which as a consequence might overflow your stack". On a desktop this might be irrelevant (N would need to be too large), but on a tiny embedded system the amount of stack you get can be, well tiny.
There are some things that are good to know, and having to read the code to find them is painful. For example, average programmers know that So I believe such a section with "summary of relevant implementation details that might change" is still worth it in some cases. Stuff like the complexities do hint at the implementation details, but having them in bullet point form avoid you to guess. For example if the worst case complexity of
Some details are always important, some are often important, and some might be important in corner cases. My point being, not all details are equally important. We might not be willing to commit on all details (e.g. the exact growth factor of a vector), but this doesn't mean we shouldn't pass this information to users. Requiring everybody to "read the code" doesn't scale.
This happens independently of whether this information is inside some "do not rely on this" documentation section, or in the source code. That is, those users willing to rely on things they shouldn't rely on are going to do it independently of this. I don't think we should penalize all users by not including this information.
I think that if we mean the same thing by the unspecified/unstandardized section (what I was referring to as "non-biding"), that doesn't make much sense to me at the module level because that information is going to be highly method specific. Are you referring to something else?
Yes, for programs that are arguably already broken.
@rkruppe Maybe we are talking past each other, but IIUC your point is that knowing the algorithmic and memory complexities of the methods, and whether they panic and have some form of side-effect or not, is worth documenting, but not worth "guaranteeing"? And that users should still rely on these even if they are not guaranteed, but if they change and their programs break, then they should complain? I'm a bit lost. I also have the feeling that for some reason you think that I am proposing guaranteeing that sorting X ints will run in Y micro-seconds on a Xeon Z, which couldn't be farther off from what I am proposing. What I am proposing in a nutshell is guaranteeing the information required such that users know whether they can call sort in a loop of their application or not, whether that call can blow up their stack and segfault or not, and whether for some input they might get an OOM or not. In particular, most of this information is already documented and guaranteed for |
I think we are actually thinking of converse ideas. I mean a section in which we explicitly say "XYZ behavior/property is not going to be standardized and you should not rely on it" rather than "You should not rely on it, but XYZ is actually done using ABC". More concretely, we should say "Do not rely on any particular allocator" rather than "You should not rely on the allocator, but we happen to be using such and such". But I do agree that something like this could also be useful on a per-function basis.
I understand where you are coming from, but I disagree. Building software and assuming things that might change seems like a very wobbly foundation. If somebody is willing to change their design completely based on which sort we are using, it might be worth it to actually specify the properties of the sort we are using (or maybe they are making too many implementation specific assumptions)... As soon as the sorting algorithm changes, their code will break or become really slow. And the only fix is for them to change their design, which is exactly what we want to avoid by writing modular code. After thinking more about your earlier post, I still think it is a good idea to give some guarantees about the behavior of algorithms in the references. A library with such fundamental data structures needs to give corresponding fundamental guarantees, like complexity bounds, even if they are a bit loose. I honestly don't expect the bounds to change too much either (that is, the time and space complexity bounds). Most of these data structures are pretty standard, and their implementations are pretty well known. |
On Thu, Sep 8, 2016 at 2:48 AM, Not Mark notifications@github.com wrote:
I am just trying to see the documentation from the point-of-view of users Some information is IMO just good to know. You go on, use std::sort, and |
No, for perfectly reasonable programs too, if they are judged to occur rarely enough in the wild (that's one use of crater!) and be easy enough to fix (e.g. by explicitly disambiguating the method call or adding a type method).
That is exactly what I am saying, since, as you yourself later wrote:
The distinction that might confuse you is that I draw a line between a hard guarantee given in the API docs, which will influence all Rust implementations ever unless heroically changed against inertia and the ugliness of the words "breaking changes", versus so-called Quality of Implementation (QoI) issues. A C compiler that segfaults without output when there is any problem with the input program is a really shitty compiler, but it's in conformance with the C standard(s). Likewise, an algorithm that blows the stack on reasonable inputs really really ought to be fixed, but this is not a question of standards and guarantees, just an QoI issue. Furthermore, like the (absence of) diagnostics from the hypothetical C compiler, it "only" has to do with a specific implementation, not with the language.
Complexity bounds are neither necessary nor sufficent for knowing that. They are a crude approximation. Don't get me wrong, I have the greatest respect for complexity theory and I'm the first to nit pick the correctness of some bound. But for these questions they are heuristics at best (especially with the stack, where the upper limit is not just a constant but a quite small one). To give an example, there's a function in libcore right now ( Furthermore, I do not think adding hard guarantees adds too much value over just documenting implementation details in a non-binding manner. For answering question of the form "will this program work reasonably well" or "why does this algorithm not work well" on the day-to-day, the implementation details are almost as good. For long-term evolution, non-guarantees are not much worse:
So, in summary:
|
Thanks for opening it @gnzlbg! |
Worsening the complexity of an algorithm breaks programs silently at run-time without a warning. The programs weren't broken before, the language wasn't broken (no soundness issue), breakage cannot be fixed automatically, and warnings cannot be emitted. I don't see how the current guidelines for allowed breaking changes would allow worsening the complexity of an algorithm even if it wasn't documented without a major version bump.
I don't think that space and time complexity are QoI issues since without this information I cannot reason about my programs.
That's why in the first post I used in some of the examples the term exactly, as in "it allocates exactly 2*N memory". Exactly / at most, and similar terms give enough information to reason about methods like
Which existing guarantees? The API docs of most methods don't currently guarantee anything (not even time complexity).
As stated in the first post, every major change to the API of the standard library has to go through the RFC process. Guaranteeing anything that wasn't guaranteed before (like what is being proposed here) changes the API of the standard library and has to go through the RFC process as well. This issue was just filled to "raise the issue" that currently almost nothing is documented / guaranteed.
We disagree on what are implementation details. You think that the time complexity of an algorithm is an implementation detail that should not be guaranteed, and that this somehow allows higher quality implementations. I think that without these guarantees I cannot reason about the behavior of even the most trivial programs (is calling It's a bit moot to discuss subtler topics like space complexity or effects without agreeing on this first. Let's wait to see what ideas and thoughts other people have, maybe somebody has an idea about how to achieve consensus here. |
Asymptotic complexity is a very rough heuristic. No asymptotic complexity bound can ever tell you whether a concrete program with a concrete input will exceed a concrete number of time steps or bytes of heap/stack memory. Even if every single method in Therefore, whenever you have a resource limit you need to stay under, complexity bounds can't tell you whether you will manage. At most, they will give you a quick sanity check whether it's likely that it will work out okay. And that's good to have, but it doesn't completely solve your problem. It just tells you whether you can proceed to the actual testing, or scrap the code because it most likely won't pan out. And even for this they can be misleading! In short: Hard guarantees about complexity won't translate into hard guarantees about what people actually need to know. Thus I do not see the need to provide hard guarantees about complexity. (Note: Everywhere I say complexity, I don't necessarily mean asymptotic upper bounds, it encompasses everything that hides constants.)
On the contrary, the current guidelines don't mention the time and space used by a program at all. Behavioral changes in general are a bit brushed over, but RFC 1105 spends a paragraph on it. In particular, changes to stuff that isn't documented is ruled a minor breaking change (at least; we can always decide that a change would be too disruptive in practice). And that's assuming the time and space complexity fall under "behavioral" at all — I don't have an authoritative source but my impression is that they generally don't. Regardless of rules lawyering, the guiding principle is that a Rust program that works fine in 1.X should mostly work fine in 1.(X+N). There are numerous caveats to this, mostly relating to the practicality of preventing such a regression and the positive and negative effects the change will have on the programs that have actually been written so far. We are essentially debating where a change of complexity falls on this scale. Simply asserting that it can make a program stop working is not enough, every change can do that. At this point it would perhaps be more useful to let the philosophical debate rest and go search for practical examples of programs breaking because a standard library changed under their feet. I'd be especially interested in programs that got seriously hurt as trade off for an improvement for other programs, since a change that's just bad across the board will be fixed like any other regression.
Most phrases like this contain implicit constant factors. No sane API designer would ever guarantee the exact amount of stack space used, which is super volatile and highly dependent on compiler optimizations and even on the hardware and the ABI! Even in most algorithms which perform a heap allocation for a specific purpose you usually won't want to give exact numbers, for example to allow over-allocation or to tune buffer sizes for efficiency (a buffer for a stable sort might be an exception). But giving exact numbers without hidden constants is exactly what would be needed to be able to reason so precisely that no "breaking change" (in your wide sense of the word) happens.
|
On Thu, Sep 8, 2016 at 2:42 PM, Robin Kruppe notifications@github.com
You are the only one that has mentioned counting bytes or "timesteps" (do
Indeed, this is why the initial post discusses multiple guarantees beyond
C is the only low-level language besides Rust that doesn't guarantee the Your argument seems to be "I don't want to document these things because Can you give an example in which guaranteeing any complexity bound for a
It is trivial to find examples of algorithms in which the stack usage can The main argument you seem to have against guaranteeing any sort of space
You still want to at least provide whether the algorithm can allocate at 5 methods x 5 containers = complexity of 25 methods documented there + in Vec alone, however, has 35 methods (vs the 5 documented there) without This issue is about consistently documenting the API of each method. To |
The reason I am focusing on exact number is that those exact numbers are, at the end of the day, what makes or breaks a program. There was no watchdog killing that C program because
I am, as a matter of principle, opposed to guaranteeing anything unless it's demonstrably a win. That includes the guarantee being useful enough to be worth the hassle, and it being unlikely to cause problems later on. A formal guarantee is a very heavy hammer, it requires a rather heavy process to establish and must be kept in mind in the future. Consequently, I am mostly arguing that giving hard guarantees of the form that can be given reliably, cross-implementation, are not that useful for the purposes you mention, and thus not worth the hassle. Keep in mind that the alternative is not "never tell anyone anything about the implementation and maliciously make it worse once in a blue moon", the alternative is "keep making code perform as well as possible, give users an idea of the algorithms used, but permit ourselves to change them when useful". The program you refer to didn't only break because the C standard gave no guarantees (for that matter, the C++ standards guarantee time complexity but say nothing at all about stack space), it also broke because the
http://stackoverflow.com/questions/10134836/complexity-of-stdlistsplice-and-other-list-containers
Post-C++11 rules out randomized sorting algorithms with average-case O(N log N) time but a worse worst-case, though I am not quite willing to say that one of those would be "better" (even though it probably would be faster in some use cases).
Not giving a guarantee doesn't preclude anyone from optimizing the code. Conversely, if the current implementation doesn't conform to a proposed bound and nobody steps up to fix that, the proposal fails. And all that assuming the argument were true, which it isn't. Except tail recursive algorithms, converting to a loop usually requires moving the stuff that used to be on the stack to the heap. That's rarely faster, since it's basically the same work plus the work of allocating and deallocating.
But that, as I said before, is only a crude stand-in for "will this actually overflow". Additionally, I am opposed to documenting this growth behavior for most methods. Maybe for some selected methods, but not for enough that you can really look at an entire big program and say "yup this will never have unbounded stack use anywhere". Of course, wild recursion is rather fragile and should be avoided when reasonably possible, but again, that's a QoI issue for me. |
The choice of making size constant or linear results in two different containers, one is not better than the other. If you find a way to make splice constant time without changing size complexity you can go on an implement it, the standard just requires worst case linear time, which constant time satisfies. If you want to have constant time splice you need a different container (that among other things have different memory requirements).
The complexity of deque push_back is amortized constant time. The reason it is implemented as a linked list of blocks is because it guarantees that on insertion and removal iterators to the elements are never invalidated. Specifying whether iterators are invalidated or not in C++ is required for correctness. The same "type" of container with or without iterator invalidation is going to be implemented completely different, and have completely different performance and correctness guarantees. In this case, again, the tradeoff produces essentially different containers. The standard containers aim to be safe, and in general try not to invalidate iterators. People have reimplemented most of them without this guarantee, but even if their performance might be better in most cases, it is still not better in all cases (since they are implemented using completely different data-structures).
Worst case complexity is what makes programs explode, services go down, and introduces security vulnerabilities. This also rules out other algorithms like quicksort, which has worst case O(N^2). And this is a good thing, since std::algorithms should prevent your program from exploding.
Can you give an example? In my experience this isn't the case (e.g. iterative quicksort doesn't need any heap allocations nor recursion).
In a wild guess, I would say that currently, for almost all methods in the standard library, the space complexity is O(1). It is trivial to document, trivial to guarantee, and in general faster (without an example of a non-tail recursive algorithm of the std::lib for which this is not possible I stick to what I know). |
They are two different containers alright, but that doesn't mean one can't be better. The standard took list implementations that were useful and efficient for some use cases, and ruled them out. Conversely, if we found out that constant time splice was desirable for linked lists, a previous guarantee for constant time
Oh, right. Sorry.
I take it you're not a fan of hash maps, then? As I said, I wouldn't argue that they are necessarily better, but a (sufficiently rare) worst case doesn't need to rule out algorithms in all circumstances.
Iterative quicksort needs a stack. You can bound the size of that stack with O(log N) by always pushing the smaller half to the stack and processing the larger one right away (the same trick works in recursive implementations that make use of tail call optimization), but it's large and non-constant-sized scratch space. Other examples: Tree traversals without sibling/parent (depending on the traversal mode) links, depth-first/breadth-first search in graphs, any backtracking algorithm that needs to remember something non-trivial to restore a prior state.
(I assume you mean stack space.) You'd be wrong. Try the BTrees in |
@gnzlbg I still disagree on documenting implementation details, but not as strongly as I used to... if these need to be documented, I would like them to be placed somewhere not on the standard docs (a separate reference perhaps).
One could argue that memory safety is QoI issue, yet Rust go to great lengths to guarantee it... The fact is that giving guarantees about safety and liveliness is actually useful, though I agree that there is a limit to their usefulness. I think that the usefulness is worth it.
This is possibly an ideology difference. I would like the APIs I use to give me contracts.
Such an RFC would still be a breaking change even if it is not labeled so... This is a collections library, if you don't make certain guarantees explicit, people will very naturally assume them. Fundamentally, I am still not convinced that giving non-binding details is the direction I would want for a standard library.
Really? Vec, LinkedList, Deque, HashMap, HashSet, BTree, sorts... What do you mean by that statement? There are a lot of things in the standard library that people could make assumptions about if we don't tell them explicitly. Overall, I agree that this has kind of devolved into a philosophical debate, though... |
The complexity wasn't guaranteed, depending on the standard library implementation users were getting completely different containers. If you did not care, you did not care, but if you did care, this made the standard container useless, since whether it was good for your use case or not would change across platforms. The consensus was that consistent complexity across implementations was necessary, and that independently of which container was chosen as the default, users could always implement the other one. This was the right call. The O(1) size implementation was chosen as the standard one because all other standard containers have O(1) size. Whether that was the right call or not is debatable, but even for those that wanted to get O(1) splice, this was a win, since now they would know that the only way to get it was to use something different than std::list.
As I said, the art is in choosing the right bounds. You can choose worst case complexity O(NlogN), or O(Nlog^2N). The second is slightly worse, but might allow better algorithms to be implemented. Choosing the right bounds is not easy, but other standard libraries have done a lot of work for us already. Rust can learn from their experience and do better.
Of course, but there is a huge performance difference between using a stack and heap allocated memory (in particular for shorter sequences). The initial argument was that heap allocations were necessary (and thus the larger part of the Rust run-time, and syscalls), and that the cost of those would outweight the benefits. My point was only that they are not necessary.
Sure, and when Rust gets guaranteed tail call optimization O(1) space complexity bounds will be able to be upholded even when using recursive implementations. Still, even without guaranteed tail call optimization, doing so is still possible in Rust today.
I've worked a lot on binary, quad and octrees the past two years, and while it is trivial to implement tree traversals using recursion, it is not that hard to do so iteratively (with a loop, without extra stack storage like for quicksort). I've also read that another way to implement efficient traversals is using state machines, but for my use cases plain old loops were fine. Anyhow, even if it is slightly more complicated to implement these iteratively, none of these data-structures are in the standard library (the trees in the standard library are easy). So I would say that we should worry about which guarantees we provide for these when it comes time to move any of these into the std::collections (which might probably be never).
Even in BTrees, the majority of the methods don't do anything fancy (
Yes. I think that it would be better to put in the work to, e.g., try to come up with a strawman documentation of a particular collection like Vec. That would give us more concrete examples. Maybe coming up with bounds that make sense for everybody turns out to be easier than what @rkruppe think, but maybe he is right and it becomes very controversial and more hard than I think it is. @steveklabnik I would like to put in the work for this (e.g. come up with a strawman for documenting complexity and other guarantees with Vec as the use case). I would like to do it as a PR where we can easily discuss the specifics (the comments per line are just the right way to keep different discussions alive), and I would like to do so before writing an RFC (since I think having such an example for the RFC would be required anyways). Obviously the PR should not be merged until the RFC is accepted. Do you think this will be a good way to proceed? |
Implement a faster sort algorithm Hi everyone, this is my first PR. I've made some changes to the standard sort algorithm, starting out with a few tweaks here and there, but in the end this endeavour became a complete rewrite of it. #### Summary Changes: * Improved performance, especially on partially sorted inputs. * Performs less comparisons on both random and partially sorted inputs. * Decreased the size of temporary memory: the new sort allocates 4x less. Benchmark: ``` name out1 ns/iter out2 ns/iter diff ns/iter diff % slice::bench::sort_large_ascending 85,323 (937 MB/s) 8,970 (8918 MB/s) -76,353 -89.49% slice::bench::sort_large_big_ascending 2,135,297 (599 MB/s) 355,955 (3595 MB/s) -1,779,342 -83.33% slice::bench::sort_large_big_descending 2,266,402 (564 MB/s) 416,479 (3073 MB/s) -1,849,923 -81.62% slice::bench::sort_large_big_random 3,053,031 (419 MB/s) 1,921,389 (666 MB/s) -1,131,642 -37.07% slice::bench::sort_large_descending 313,181 (255 MB/s) 14,725 (5432 MB/s) -298,456 -95.30% slice::bench::sort_large_mostly_ascending 287,706 (278 MB/s) 243,204 (328 MB/s) -44,502 -15.47% slice::bench::sort_large_mostly_descending 415,078 (192 MB/s) 271,028 (295 MB/s) -144,050 -34.70% slice::bench::sort_large_random 545,872 (146 MB/s) 521,559 (153 MB/s) -24,313 -4.45% slice::bench::sort_large_random_expensive 30,321,770 (2 MB/s) 23,533,735 (3 MB/s) -6,788,035 -22.39% slice::bench::sort_medium_ascending 616 (1298 MB/s) 155 (5161 MB/s) -461 -74.84% slice::bench::sort_medium_descending 1,952 (409 MB/s) 202 (3960 MB/s) -1,750 -89.65% slice::bench::sort_medium_random 3,646 (219 MB/s) 3,421 (233 MB/s) -225 -6.17% slice::bench::sort_small_ascending 39 (2051 MB/s) 34 (2352 MB/s) -5 -12.82% slice::bench::sort_small_big_ascending 96 (13333 MB/s) 96 (13333 MB/s) 0 0.00% slice::bench::sort_small_big_descending 248 (5161 MB/s) 243 (5267 MB/s) -5 -2.02% slice::bench::sort_small_big_random 501 (2554 MB/s) 490 (2612 MB/s) -11 -2.20% slice::bench::sort_small_descending 95 (842 MB/s) 63 (1269 MB/s) -32 -33.68% slice::bench::sort_small_random 372 (215 MB/s) 354 (225 MB/s) -18 -4.84% ``` #### Background First, let me just do a quick brain dump to discuss what I learned along the way. The official documentation says that the standard sort in Rust is a stable sort. This constraint is thus set in stone and immediately rules out many popular sorting algorithms. Essentially, the only algorithms we might even take into consideration are: 1. [Merge sort](https://en.wikipedia.org/wiki/Merge_sort) 2. [Block sort](https://en.wikipedia.org/wiki/Block_sort) (famous implementations are [WikiSort](/~https://github.com/BonzaiThePenguin/WikiSort) and [GrailSort](/~https://github.com/Mrrl/GrailSort)) 3. [TimSort](https://en.wikipedia.org/wiki/Timsort) Actually, all of those are just merge sort flavors. :) The current standard sort in Rust is a simple iterative merge sort. It has three problems. First, it's slow on partially sorted inputs (even though #29675 helped quite a bit). Second, it always makes around `log(n)` iterations copying the entire array between buffers, no matter what. Third, it allocates huge amounts of temporary memory (a buffer of size `2*n`, where `n` is the size of input). The problem of auxilliary memory allocation is a tough one. Ideally, it would be best for our sort to allocate `O(1)` additional memory. This is what block sort (and it's variants) does. However, it's often very complicated (look at [this](/~https://github.com/BonzaiThePenguin/WikiSort/blob/master/WikiSort.cpp)) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison to `std::stable_sort` in C++. It can even beat `std::sort` on partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two. TimSort is another option. It allocates a buffer of size `n/2`, which is not great, but acceptable. Performs extremelly well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in [Rust](/~https://github.com/notriddle/rust-timsort), [C++](/~https://github.com/gfx/cpp-TimSort), and [D](/~https://github.com/dlang/phobos/blob/fd518eb310a9494cccf28c54892542b052c49669/std/algorithm/sorting.d#L2062). The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones. #### The new algorithm Choosing the best algorithm is not easy. Plain merge sort is bad on partially sorted inputs. TimSort is bad on random inputs and block sort is even worse. However, if we take the main ideas from TimSort (intelligent merging strategy of sorted runs) and drop galloping, then we'll have great performance on random inputs and it won't be bad on partially sorted inputs either. That is exactly what this new algorithm does. I can't call it TimSort, since it steals just a few of it's ideas. Complete TimSort would be a much more complex and elaborate implementation. In case we in the future figure out how to incorporate more of it's ideas into this implementation without crippling performance on random inputs, it's going to be very easy to extend. I also did several other minor improvements, like reworked insertion sort to make it faster. There are also new, more thorough benchmarks and panic safety tests. The final code is not terribly complex and has less unsafe code than I anticipated, but there's still plenty of it that should be carefully reviewed. I did my best at documenting non-obvious code. I'd like to notify several people of this PR, since they might be interested and have useful insights: 1. @huonw because he wrote the [original merge sort](#11064). 2. @alexcrichton because he was involved in multiple discussions of it. 3. @veddan because he wrote [introsort](/~https://github.com/veddan/rust-introsort) in Rust. 4. @notriddle because he wrote [TimSort](/~https://github.com/notriddle/rust-timsort) in Rust. 5. @bluss because he had an attempt at writing WikiSort in Rust. 6. @gnzlbg, @rkruppe, and @mark-i-m because they were involved in discussion #36318. **P.S.** [quickersort](/~https://github.com/notriddle/quickersort) describes itself as being universally [faster](/~https://github.com/notriddle/quickersort/blob/master/perf.txt) than the standard sort, which is true. However, if this PR gets merged, things might [change](https://gist.github.com/stjepang/b9f0c3eaa0e1f1280b61b963dae19a30) a bit. ;)
@gnzlbg : coming back to this thread after a long time, a bit tough to catch up.
So, last time I chatted with @rust-lang/libs about something like this, they basically said "we'll take it case by case". So, I think the right move here would be to make one PR per guarantee, and we get libs to sign off on it. No RFC needed. Thoughts, libs team? |
I'm guessing the motivation is simplifying the effort? or is there another reason? |
That's reasonable, I can send pull-requests for the individual guarantees. The only thing to be careful about is the order of these pull-request, since lots of guarantees build on-top of each other. |
I'm going to give this one a close, because it's not particularly actionable. If anyone wants to improve collections docs, please send in PRs for the improvements you want to see! |
The documentation of the
std::collections
:I really think that being more thorough with the documentation will help both find and fix bugs in the documentation and make the std library way better.
The solution I propose is to make all the relevant information accessible from all the methods.
The first thing to do would be to achieve consensus on which information should be available for each method. Taking inspiration from the C++ standard, and to kickstart the discussion, I would propose that each method / algorithm / function documentation comment should contain the following "fields" below the short and long descriptions. If a field doesn't make sense for some particular method, I would propose to still have it there but explicitly leave it blank.
Vec::push(t)
: amortized O(1): exactly 1 move/copy oft
ifVec::size() < Vec::capacity()
, otherwise worst case O(N): exactlyN + 1
moves/copies whereN == Vec::size()
.Vec::reserve(M)
: O(1) time ifM <= Vec::capacity()
, O(N) otherwise: exactly N moves/copies whereN == Vec::size()
.Vec::push(t)
: amortized O(1) stack space, otherwise worst case O(1) stack space and a single memory allocation of O(k*N) where k is an implementation-defined growth factor independent ofN
and whereN == Vec::size()
.Vec::reserve(M)
: O(1) stack space ifM <= Vec::capacity()
, otherwise a single allocation ofO(k*N)
where k is an implementation-defined growth factor independent ofN
andN == Vec::size()
.Vec::push(t)
: Before:size() == N
,capacity == M
, After:size() == N + 1
,capacity() == M
(amortized case) orcapacity() == k*M
(worst case, wherek
is an implementation defined growth factor),pop() == Some(t)
.Vec::reserve(N)
: Before:capacity() == M
, After:capacity() == max(N, M)
.Vec::push(t)
panics if OOM or if a copy of t is performed and copying panics.Vec::size()
never panics.I consider this the bare minimum level of "formalization" that at least the basic collections and algorithms in the standard library should have, but as you can see things become extremely verbose so one might want to explore how to factor out some parts of the documentation (e.g. that for
Vec
the growth-factor is a constant independent of the number of elements or the capacity of the vector) that are to be repeated in a lot of comments.So that's for my strawman proposal. I guess the steps to follow (please correct me if I'm wrong):
For a feeling of the current state of things:
HashMap::insert
doesn't mention that insertions are O(1) so if one lands there (e.g. via google) one is on its own. Relevant information should be visible right away.vec::insert
: in the amortized case requires O(1) stack memory and will perform exactly O(end - pos) moves but in the worst case it requires O(kN) memory and O(N) moves. None of this is "obvious" from the docs. We should make it obvious.std::slice::sort
complexity is O(NlogN) in the number of elements, but it mentions that it allocates approximately "2*n" elements. Does that mean at worst 2N? Can it be more? While this is better documented than other methods, it is still extremely vague. In particular, trivial stable sort implementations require 1/2N extra memory, a bad one at least N, and the good ones are in place (e.g. GrailSort requires O(1)). The rust standard library doesn't have an inplace unstable sorting algorithm (AFAIK), but the documentation of one implemented with something quicksort-based like intro-sort should say that it uses O(logN) stack space in some cases (I did not have any examples above with anything but O(1) stack space, but basically every recursive algorithm requires something different than O(1)).cc @GuillaumeGomez
The text was updated successfully, but these errors were encountered: