-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alter Once to remove unnecessary SeqCst usage #31650
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
@@ -71,7 +71,7 @@ impl Once { | |||
#[stable(feature = "rust1", since = "1.0.0")] | |||
pub fn call_once<F>(&'static self, f: F) where F: FnOnce() { | |||
// Optimize common path: load is much cheaper than fetch_add. | |||
if self.cnt.load(Ordering::SeqCst) < 0 { | |||
if self.cnt.load(Ordering::Acquire) < 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more efficient would be to use a Relaxed
load and only fence(Acquire)
if the count is in fact negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, in this case you would want to acquire to be on the load itself. This is because some architectures (ARM, AArch64) have a load-acquire instruction that is much faster than the fence instruction. Since this branch is going to be taken in the majority of cases, you would want to avoid the fence overhead here.
Memory orderings are tricky enough that I think that having comments explaining why the more relaxed bounds are correct would be very useful. |
@@ -102,11 +102,11 @@ impl Once { | |||
// calling `call_once` will return immediately before the initialization | |||
// has completed. | |||
|
|||
let prev = self.cnt.fetch_add(1, Ordering::SeqCst); | |||
let prev = self.cnt.fetch_add(1, Ordering::AcqRel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be Relaxed
with a fence(Acquire)
if the count is negative: since swap
and fetch_add
are atomic, no extra synchronization is necessary to make the number of threads that pass through the fetch_add seeing a positive integer exactly the number that will be read as the result of the swap
.
@gereeter Thanks a lot for the thorough review - really made me think about/learn more about the use of relaxed ordering, where I'd just been concerned with the sequential consistency aspect. Made changes which hopefully address your thoughts. |
Ah, looks like there's an issue filed for this: #27610 |
@Amanieu (continuing out of a now hidden comment) Pro
Pro
Neither of the two options would even allow LLVM to produce the optimal code on PowerPC and ARMv7, which would be an instruction sync only in the branch for the early exit. The fence would prevent LLVM from using an instruction sync at all, while the Given all that, I'm inclined to support switching back to a |
I'm personally very wary of relaxing any orderings in the standard library, especially if we're coming up with the design ourselves. These are notoriously hard to get right unfortunately. Along those lines, do you have any benchmarks to show the impact of these relaxed orderings? It would be good to get a handle on what sort of runtimes we're talking about. I suspect x86 will benefit, but the greatest benefit is likely from ARM. I also suspect that we only really need to relax one ordering in this function to get any real benefit, which is the first one. None of the orderings really have any contention which needs to be super fast, so I would personally prefer to see what our best perf increase would be and then move as much as possible back to |
@alexcrichton The benefit of relaxing SeqCst for the first instruction will likely be zero on x86 - the extra barrier for SeqCst comes on the write. Acquire/Release comes for free on x86, while the extra fence for sequential consistency adds (IIRC) about 100 cycles to each write. I can understand a fear of using Ordering::Relaxed - I do tend to think it can leave code harder to change safely. On other other hand I'd argue against insisting on full sequential consistency everywhere in the standard library that isn't absolutely performance critical. The difference between sequential consistency and acquire/release is not complicated for simple code like this, and given the broad scope of usage of standard library constructs, it seems a shame to spend cycles on obviously pointless work - even if the overall benefit is pretty minor. |
To me at least these orderings seem pretty non-trivial. I've found that to reason through these you basically have to reason about all orderings with respect to other orderings, and there's quite a few going on here. In terms of performance I think we'll only really get wins by fixing the first few (the ones that check to see if the initialization has already run and finished). Only modifying those would be much easier to reason about (to me at least) |
Yeah, I don't by any means intend to say that Acquire/Release vs Sequential Consistency is always trivial, just that it is in this particular case - if we were to ignore Relaxed and use Acquire/Release for everything, all the synchronization necessary to guarantee visibility of the results of the closure call occurs on the self.cnt variable. That being the case, sequential consistency is irrelevant, because to be useful over AcqRel it requires a minimum of two different atomic variables being altered in different threads. All that said, any performance gain from this is obviously going to be too minor to quibble extensively about, so I'm of course perfectly happy to change this so that it just uses Acquire on the early exit checks. I'm busy tonight but I'll sort it out as soon as I can :-). |
Ok, I think that making the first two operations unconditionally I'd want to confirm with others still, though. |
Since we're talking about optimizing the hot path, I think it would be nice to split the cold path of |
I would be ok doing so assuming that benchmarks are shown that it's a sizable improvement. |
I am not familiar enough with this code to have a strong opinion, and don't care to become so. ;) cc @aturon |
+1 for cold path split |
☔ The latest upstream changes (presumably #32325) made this pull request unmergeable. Please resolve the merge conflicts. |
This has been stuck in limbo for a while, so @rust-lang/libs should discuss it. I'm in favour of optimising it, but we should definitely be sure it is correct. I imagine there's other implementations of As @alexcrichton says, it seems like we could at the very least just weaken the first ordering (together with its associated |
The big issue here is that we need to construct a
|
@Amanieu note that after #32325 we no longer have a mutex at all, so that concern is essentially moot. In my opinion the only ordering which needs to change is this one which can likely be relaxed to |
My apologies, this completely fell off my radar. I shall try to take a look at it this weekend. |
OK, looks like it's a pretty simple change at this point. @alexcrichton is correct on the potential performance impact, although alas I have no multi-core ARM machine to test on. |
I ran some tests on ARM & AArch64 machines and the performance of load(Acquire) and load(SeqCst) are identical. However I still think this change is the right thing to do. |
The libs team discussed this during triage yesterday and the conclusion was that we don't want merge this at this time. This unfortunately reduces readability as the non- Along those lines, without evidence in favor of this, we'd like to keep the code as it is today, but thanks regardless for the PR @AlisdairO! |
Additionally, @alexcrichton looked at the ARM assembly, and apparently it was identical for both |
It seems that while load(Acquire) and load(SeqCst) does generate the exact same assembly on ARM, there is a difference on PowerPC: load(Acquire): ld 3, 0(3)
lwsync load(SeqCst): sync
ld 3, 0(3)
lwsync Since |
Perhaps, but if we're just fishing for platforms where this makes a difference then we'll of course find such a platform. It still stands that no performance measurements show an improvement and it decreases code readability, so until that state changes we'll likely leave as-is. |
I have a different opinion on readability. According to this paper, |
@jeehoonkang heh I think you're far more well versed on this topic than I, so I'm definitely willing to defer to you! I was mostly just probing for rationale on #44331 as the rule of thumb seems to be |
Just had a look at Once, and it looks like it's unnecessarily using SeqCst where acquire/release semantics would be sufficient. Would appreciate a review to be sure, but I can't see why it would require SeqCst.