improve the sleeping thread algorithm #5
It isn't very clear what the sleepy state does now: does it continuously yield to the OS scheduler like it currently does, or does it just hold a lock and go to sleep? The yield operation has felt like a major source of non-determinism and I'm looking forward to getting rid of it. |
By the way, an implementation of this algorithm is available in my latch-target-thread branch. The results so far are...not encouraging. It seems to perform terribly and not to help with CPU usage. =) However, I only finished hacking it up this morning and I haven't tuned or examined it at all. So it may well be buggy or maybe even incomplete. Regarding the RFC:
One thing I think would be very useful also is to try and clarify our benchmarks. I have been using the rayon-demo benchmarks, but if people have other suggestions that seems good. In the past, we've had some FF folks do experimentation, but it'd be good to get some other "real world" benchmarks. |
Update:
I made some improvements and this is no longer the case. Performance now seems to be comparable to the existing scheduler, if perhaps somewhat slower. CPU usage is indeed improved. I've been testing primarily on the If you take a look at the branch, you'll also see that I implemented a number of variations. I'd like to start doing some "more rigorous" benchmarking to try and decide between them. There are a number of things we can tune; I'll try to update the RFCs with a bit more detail on those factors -- although preliminary tinkering hasn't shown that much impact. Here are some examples: For Running with In terms of CPU usage:
|
Not sure if it makes a difference, but can you try disabling Turbo Boost? I imagine there might be some interactions between the CPU frequency and the spinning threads. |
@lnicola any advice on how to do that on Fedora linux? |
Check if you're using
If so, then:
If you're not using |
Fascinating. Disabling turbo boost yielded the following results. These are the results of `for n in 1 2 3; do cargo run --release -- life bench --size 1024 --skip-bridge; done`. I ran the command twice and I am reporting the second round of results. (The first round was basically comparable.)
So basically identical. |
How does this compare with what crossbeam-channel does? @stjepang |
The algorithm as described is flawed -- as I wrote, the 'new work' notifications are not guaranteed to be observed, but this means that it is possible for a job to be injected from outside the pool without actually waking any threads. As a result, all the threads can be sleeping and they'll never be awoken. This isn't a problem with jobs from within the pool because at least one thread (the injector) is awake. This seems correctable in a variety of ways, but it's too late for me to think about the best way to do it at this moment. =) |
OK, I did some experimenting this morning. Really, more than I should have, as I should be doing some rustc work right now. I came up with two approaches to solve the problem.

The first is to keep a counter of all "injected jobs". We would read the value of this counter upon entering the idle loop and then check that it has not changed before going to sleep. If you are careful with respect to the ordering of events, and you use seq-cst guarantees, you can then ensure that you would have observed either a new job in the queue or the counter changing. The downside of this approach is that (a) injected jobs are kind of special and -- more importantly -- (b) it requires an atomic increment per injected job even if no threads are sleepy. This seems bad in the steady state. It feels like a valid use case would be injecting jobs from outside the pool over time. (I'd eventually like to make threads outside the pool able to be more active participants, as well, and this seems like it would work against that.)

So I experimented with another approach. I brought back the "sleepy worker" concept from the existing sleeping pool. This means that workers fall asleep one at a time, coordinating via a per-thread-pool atomic. In particular, before actually going to sleep, they must first become "the sleepy worker" (of which there is only 1). This requires an atomic compare-exchange. When they then try to go to sleep, they release the "sleepy worker" state with another exchange. When we inject a new job, meanwhile, we can check if there is a sleepy worker and pre-emptively clear the flag. This means that when going to sleep, a sleepy worker will notice that they are no longer sleepy, and hence they will go back to searching for work again.

Experimentally, both of these approaches prevent the deadlock I was seeing, though I'd like to document them more carefully. I compared the performance on a handful of benchmarks and found that overall the sleepy worker performs best.
You can view a plot of my measurements here -- I'll try to put them in some more consumable form later. Each point is the output of a single
You can see that sleepy worker performs better than injection counter, and master seems to generally perform best of all (though the difference is small). The biggest change is the quicksort-par-bench benchmark. |
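To make the "sleepy worker" handshake from the comment above concrete, here is a minimal sketch. It is not the code on the branch: the `SleepyFlag` type, its method names, and the `NOBODY` sentinel are all hypothetical, and it assumes the injector pushes the job onto the queue before calling `announce_job`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sentinel meaning "no worker is currently sleepy" (hypothetical encoding).
const NOBODY: usize = usize::MAX;

/// Hypothetical per-pool state: the index of the single sleepy worker, if any.
pub struct SleepyFlag {
    sleepy: AtomicUsize,
}

impl SleepyFlag {
    pub fn new() -> Self {
        SleepyFlag { sleepy: AtomicUsize::new(NOBODY) }
    }

    /// Try to become "the sleepy worker". Fails if another worker holds the slot.
    pub fn try_get_sleepy(&self, worker: usize) -> bool {
        self.sleepy
            .compare_exchange(NOBODY, worker, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    /// Called right before blocking. Succeeds only if we are still the sleepy
    /// worker, i.e. nobody cleared the flag since `try_get_sleepy`.
    /// On failure the caller goes back to searching for work.
    pub fn try_fall_asleep(&self, worker: usize) -> bool {
        self.sleepy
            .compare_exchange(worker, NOBODY, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    /// Called after a job has been pushed. The load keeps the steady state
    /// (no sleepy worker) free of writes; the store kicks out a sleepy worker.
    pub fn announce_job(&self) {
        if self.sleepy.load(Ordering::SeqCst) != NOBODY {
            self.sleepy.store(NOBODY, Ordering::SeqCst);
        }
    }
}
```

If the flag is cleared between the two compare-exchanges, `try_fall_asleep` fails and the worker does another search round instead of blocking, which is the behaviour the comment describes.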
It's still a performance problem if that happens within, because we won't use all of the threads that we could. Not as bad as a hang making no progress though. Anyway, the "sleepy worker" solves both aspects, right? How is the idle CPU usage with this back in place? |
Correct. I argue in the RFC why this is unlikely to be a major problem, but it's still something to be minimized.
Partially correct. It does apply equally to both cases, but it's still possible for us to have fewer threads active than the "optimum". For example, if two new jobs appear very close to one another, and we have only one idle thread, both of them might conclude that no threads need to be re-awoken -- but the idle thread will only be able to handle one of those jobs. However, we do now guarantee that we will never have zero active threads when there are waiting jobs.
I've not tested, I'll take a look. I wouldn't expect a major impact but you never know. |
Reran the CPU usage benchmarks (with turbo boost disabled, this time). You can see a difference from before, but still markedly improved from master, at least for smaller sizes.
It's also worth pointing out that I haven't tried tuning the "number of rounds" values -- currently I have 16 rounds until sleepy, then 1 more round until asleep. Probably worth experimenting. Oh, and the no-op command from rayon-rs/rayon#642 still uses very little CPU overall (though slightly more):
> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000 --sleep 1
real 1.12
user 0.05
sys 0.05
> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000 --sleep 10
real 10.15
user 0.05
sys 0.05 |
Also, we could in principle allow workers to fall asleep more than one at a time with the same basic mechanism, but it would require us to use e.g. 1 bit per worker, which would put a hard cap on the number of workers. I didn't like the sound of that.
It occurs to me that some kind of pseudo-random counter that tries to pick a random number of rounds might help to "stagger" threads and lessen the chance of them blocking on one another. |
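As a rough illustration of the "staggering" idea in the comment above, the sketch below gives each worker its own small pseudo-random generator and adds a little jitter to the number of rounds before it becomes sleepy. The base of 16 rounds comes from the thread; the `XorShift64` type, the `MAX_JITTER` value, and the function name are assumptions made for the example, not anything from the branch.

```rust
/// Tiny xorshift generator; each worker would own one with a different seed.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Rounds until sleepy, as currently used on the branch per the thread.
pub const BASE_ROUNDS: u32 = 16;
/// Maximum extra rounds added at random (an assumed value for illustration).
pub const MAX_JITTER: u32 = 8;

/// Pick how many idle rounds this worker spins before becoming sleepy.
pub fn rounds_until_sleepy(rng: &mut XorShift64) -> u32 {
    BASE_ROUNDS + (rng.next_u64() % (MAX_JITTER as u64 + 1)) as u32
}
```

Each worker would call `rounds_until_sleepy` when it enters its idle loop, so that workers which went idle together are less likely to reach the sleepy transition on the same round.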
I tried playing with different numbers of rounds. The results surprised me.
OK, off to bed. =) |
I had an idea last night but I've not had time to try it out. I thought I'd jot it down before I forget, since I don't have time to try it out this morning. The goal would be to:
but without requiring a write in the "steady state" of jobs being injected regularly. Basically, you have two counters:
They both start out at zero.
Ideally, you would make these two 32-bit counters in a 64-bit word, so they can be easily read and manipulated atomically. One complication is how you handle rollover. Particularly w/ 32-bit counters, I think that is a real possibility. But it... seems like it should be possible somehow. I'm imagining that a SLEEPY job that would roll over has to also adjust the JOBS counter back to zero, and then that the workers going to sleep need to check if SLEEPY < S -- that indicates rollover and they should probably just start over and go sleepy again. (Technically, of course, they might delay so long that SLEEPY gets re-incremented back up to equal S, but that seems like a scenario we can discount as being a truly pathological scheduler. If we were really worried, we'd have a separate 64-bit "epoch" counter they could look at, I guess, that also gets incremented on rollover?) Anyway, as I said, I came up with this last night while drifting off to sleep, so maybe there's a flaw. Presuming it works, though, I think it should avoid steady state writes because -- basically -- if all threads are busy, then nobody is getting sleepy, and there is no need to increment JOBS, you just push the work onto the queue. (Something about this is bothering me; it feels like there is such overlap between the idle/sleeping job counters and these counters, but I don't yet see how to consolidate them.) |
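The packing of two 32-bit counters into one atomic 64-bit word, mentioned in the comment above, can be sketched as follows. This is only one plausible reading, not the protocol from the thread: the bit layout, the `Counters` type, its method names, and in particular the `may_sleep` check are assumptions, and rollover is simply ignored via wrapping arithmetic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed layout: SLEEPY in the high 32 bits, JOBS in the low 32 bits.
fn pack(sleepy: u32, jobs: u32) -> u64 {
    ((sleepy as u64) << 32) | jobs as u64
}

fn unpack(word: u64) -> (u32, u32) {
    ((word >> 32) as u32, word as u32)
}

/// Hypothetical pair of counters sharing one atomic word.
pub struct Counters(AtomicU64);

impl Counters {
    pub fn new() -> Self {
        Counters(AtomicU64::new(0))
    }

    /// Read both counters in one atomic load: (SLEEPY, JOBS).
    pub fn load(&self) -> (u32, u32) {
        unpack(self.0.load(Ordering::SeqCst))
    }

    /// A worker announces it is getting sleepy and remembers the new SLEEPY
    /// value (the "S" of the thread). Rollover handling is left out.
    pub fn get_sleepy(&self) -> u32 {
        let mut cur = self.0.load(Ordering::SeqCst);
        loop {
            let (sleepy, jobs) = unpack(cur);
            let s = sleepy.wrapping_add(1);
            match self.0.compare_exchange_weak(cur, pack(s, jobs), Ordering::SeqCst, Ordering::SeqCst) {
                Ok(_) => return s,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Announce a new job by bringing JOBS up to SLEEPY. When nobody has become
    /// sleepy since the last announcement this is a plain load with no write.
    /// Returns true if a write happened.
    pub fn announce_job(&self) -> bool {
        let mut cur = self.0.load(Ordering::SeqCst);
        loop {
            let (sleepy, jobs) = unpack(cur);
            if jobs == sleepy {
                return false;
            }
            match self.0.compare_exchange_weak(cur, pack(sleepy, sleepy), Ordering::SeqCst, Ordering::SeqCst) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Assumed pre-sleep check: a worker that got sleepy at `s` may block only
    /// if JOBS has not caught up to `s` in the meantime.
    pub fn may_sleep(&self, s: u32) -> bool {
        let (_, jobs) = self.load();
        jobs < s
    }
}
```

The point of the single word is that `announce_job` can compare the two counters and update JOBS in one compare-exchange, and that the busy steady state costs only a load.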
OK, so I got some more time to mess with this today. First of all, I implemented the algorithm described in the previous comment, with one slight variation which I edited into the comment: when announcing new jobs, we always CAS to make JOBS equal to SLEEPY. This forces all SLEEPY workers to cycle around one more time. This seems to be pretty important for overall performance, empirically.

While this algorithm remains my personal favorite (*), it doesn't really perform particularly differently from the rest. In particular, the The problem seems to be precisely this matter of not keeping enough threads awake. The sleep algorithm on master, when monitored with

A few updates: I've looked into the logs of what's happening. As best I can tell, with the algorithm as described here, we suffer at least sometimes from the expected race of "many jobs being published but only 1 idle worker". The jobs all expect that idle worker to service them, but it can only handle 1 job.

I went and re-read the Go scheduler comment and noticed one idea I had not considered before, which is to have the last idle thread, when it finds work, wake a replacement. I implemented this but it didn't seem to help much (still, it seems smart, so I kept it). I also stared at the "logs" that Rayon can generate. It's hard though to know how much to trust them because they just dump out with

One thing I am considering is trying to implement a better logging mechanism. The idea would be to have each thread kind of log events in a lightweight fashion (perhaps pushing to a thread-local vector), dump them out, and then later try to reconstruct the overall state (how many threads idle, sleeping, how many jobs lingering in queues). I've not thought too much about this but it seems like it'd be a super useful tool. It also seems like it might overlap a lot with what @wagnerf42 has proposed in #4, which could be good. It also seems like a fair amount of work.
=) Barring future improvements, though, we have a few choices. We could land the new scheduler roughly as is and accept that One other thing I should probably do is to try and update that measurement spreadsheet with other kinds of data, such as the benchmark results for different variants of this RFC. (*) Update: What are the major things I've explored thus far? There are two axes. First, how to detect idle workers? With a counter or a heuristic? I think a counter is probably better. It seems simpler, performs roughly the same, and it allows us to detect things like "when the last idle thread finds work". Second, how to avoid deadlock? I tried three variants:
So you can see why the latest variant is my favorite, I guess. |
OK, I did a bit more digging. I added a new-and-improved logging mechanism that lets us (a) measure without interfering with wall-clock times by deferring to a separate logging thread and (b) reproduce the state of the rayon workers at each point in time. Right now it produces a CSV file with the number of sleeping/idle/notified threads, number of pending jobs, along with the state of each worker. I'd like to connect this to a nifty chart to let us visualize what is going on.

Studying the data led to two tweaks to improve what seemed to be failure modes. Unfortunately, these didn't appear to improve performance on the quicksort-par-bench, but they still seem like good ideas:

I realized we can combine the "local queue is empty" heuristic with precise counters. So now, if we see that the local queue is non-empty, and there are sleeping threads, then we always try to wake a sleeping thread -- no matter if there are idle threads. The premise is "well those idle threads didn't seem to be consuming the things in my queue". This helps in particular to deal with the races where there may be a lot of jobs pushed but only a small number of idle threads, leading us to wake too few workers.

The other change I made is to tweak how the notification mechanism works. Since we have a per-worker-thread boolean state, we can now set it to false when a thread is notified (not when it actually awakens). We can also subtract the number of sleeping threads at that time. This fixes a problem I observed where many new jobs arrive and each re-notifies the same sleeping worker, since they didn't yet awake.

At this point, when I look over the data, everything seems to roughly be working "as it should". When new jobs come, we start to wake up workers, etc. We do see some temporary blockage of a large number of jobs waiting to be stolen (sometimes up to 10 or 12) but it seems like that is largely a result of threads not waking as fast as one might like.
In any case, got to run for this morning, maybe I'll hook the CSV up to gnuplot and try to get a nice figure that visualizes what's going on. I imagine that might help in spotting any other anti-patterns. |
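The logging approach described in the comment above (workers emit lightweight events, a separate thread folds them into a running state and writes CSV rows) might look roughly like this. The `Event` variants, the `Snapshot` fields, and the `spawn_logger` function are hypothetical and far simpler than whatever the branch actually records; the notified-thread count and per-worker state are left out.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event set; the real logger on the branch may differ.
#[derive(Clone, Copy, Debug)]
pub enum Event {
    ThreadIdle,      // a worker ran out of local work
    ThreadSleeping,  // an idle worker blocked
    ThreadWoke,      // a sleeping worker resumed searching (idle again)
    ThreadFoundWork, // an idle worker picked up a job
    JobPushed,
    JobPopped,
}

/// Running totals reconstructed from the event stream.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Snapshot {
    pub idle: i64,
    pub sleeping: i64,
    pub pending_jobs: i64,
}

impl Snapshot {
    fn apply(&mut self, event: Event) {
        match event {
            Event::ThreadIdle => self.idle += 1,
            Event::ThreadSleeping => {
                self.idle -= 1;
                self.sleeping += 1;
            }
            Event::ThreadWoke => {
                self.sleeping -= 1;
                self.idle += 1;
            }
            Event::ThreadFoundWork => self.idle -= 1,
            Event::JobPushed => self.pending_jobs += 1,
            Event::JobPopped => self.pending_jobs -= 1,
        }
    }
}

/// Spawn the logging thread. Workers clone the sender and emit events; the
/// formatting happens on the logger thread, off the workers' hot path.
/// The handle yields the CSV rows once every sender has been dropped.
pub fn spawn_logger() -> (mpsc::Sender<Event>, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut state = Snapshot::default();
        let mut rows = vec!["idle,sleeping,pending_jobs".to_string()];
        for event in rx {
            state.apply(event);
            rows.push(format!("{},{},{}", state.idle, state.sleeping, state.pending_jobs));
        }
        rows
    });
    (tx, handle)
}
```

Because every row is the full state after one event, the output can be fed directly to a plotting tool to see, for example, how long pending jobs sit in queues while threads are still asleep.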
OK, a few updates: I realized that my "Separate JOBS and SLEEPY counters" implementation was pretty bogus and prone to deadlock -- I've replaced it with a new, cleaner impl that (so far) seems to work fine. In the process (and crucially) it combines all the counters (i.e., also those tracking the number of idle/sleepy threads). I'm still testing this but it seems to be working now. Finally, though, while I still have a few more things to try, there is a distinct possibility I won't be able to recover the perf on In terms of next steps:
|
hi, sorry if I'm asking dumb questions here. I don't really know where to start with this. the good points are:
we used to use in a slightly different context so i'm not sure it would work here but I don't see why it would not. |
@wagnerf42 certainly worth a try! |
I haven't looked at the code, but this might serve as inspiration: https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs. |
@nikomatsakis do you have a target release in which you plan to improve this? Just wondering as there hasn't been much movement since the initial diagnosis. |
The latch-target-thread branch produces an enormous speedup for rav1e's threading, on my 2990wx using 56 tile threads. Before: INFO rav1e::stats > encoded 200 frames, 6.404 fps, 1052.40 Kb/s After: INFO rav1e::stats > encoded 200 frames, 9.243 fps, 1052.40 Kb/s |
Update: Sorry for the radio silence. I got overwhelmed for a while. But I've come back to this work in the last week or so. I've got some good news, though not perfect. The biggest concern when @cuviper and I talked last was the fact that certain benchmarks -- notably our parallel sort routine, but also the I spent some time investigating what is causing this slowdown and what we can do about it. Along the way, I did find one bug in the handling of idle threads. Fixing that reduces the slowdown on I also backported the "event logging" framework that I added on the I spent the last few days closely studying I'm not sure how much we can do about this slowdown. It feels somewhat inherent. Any changes that lead to more idle tasks hanging around will also increase latency. One thing I tinkered with was modifying the routines to try and make idle tasks spend a larger percentage of their time searching for tasks to steal, but I didn't have any success with that yet. I've not yet done a detailed look at the |
I'm pulling this into a separate comment.

Looking for help

My time is pretty limited -- is anybody interested in collaborating on this work? I'm enjoying it, but I'm also thinking it would go faster if somebody else wanted to help a bit. If so, ping me (on Discord, Zulip, or even gitter, although I don't notice pings there quite as well).

Next steps
|
well, I can re-run the sort benchmark and take a look at the logging part that's for sure. |
well, the latch branch is slightly faster on my desktop (5,37 vs 5,3). |
Huh, interesting! How many cores etc is your desktop? I've not tested with different values of
uh oh.
I am reminded that I have to update this RFC, which is woefully out of date in some particulars. |
hi niko, |
I didn't save the parameters I used for the previous test, but here's a newer one: On this CPU I would consider the difference between the last two to be noise. |
746: new scheduler from RFC 5 r=cuviper a=nikomatsakis

Implementation of the scheduler described in rayon-rs/rfcs#5 -- modulo the fact that the RFC is mildly out of date. There is a [walkthrough video available](https://youtu.be/HvmQsE5M4cY).

To Do List:

* [x] Fix the cargo lock
* [x] Address use of `AtomicU64`
* [x] Document the handling of rollover and wakeups and convince ourselves it's sound
* [ ] Adopt and document the [proposed scheme for the job event counter](#746 (comment))
* [ ] Review RFC and list out the places where it differs from the branch

Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
Co-authored-by: Josh Stone <cuviper@gmail.com>
The implementation has merged in rayon#746 -- I guess we should fix any inconsistencies here and merge as well... |
793: Release rayon 1.4.0 / rayon-core 1.8.0 r=cuviper a=cuviper

- Implemented a new thread scheduler, [RFC 5], which uses targeted wakeups for new work and for notifications of completed stolen work, reducing wasteful CPU usage in idle threads.
- Implemented `IntoParallelIterator for Range<char>` and `RangeInclusive<char>` with the same iteration semantics as Rust 1.45.
- Relaxed the lifetime requirements of the initial `scope` closure.

[RFC 5]: rayon-rs/rfcs#5

Co-authored-by: Josh Stone <cuviper@gmail.com>
I'm wondering, what's the current status of this? :) |
Ah, this shipped in rayon-rs/rayon#793. |
The last commit was WIP with an incomplete section, but I don't really expect to revisit and complete that at this point, so I just removed that. I'll merge it as-is. |
Rayon's existing approach to putting threads to sleep can lead to
excessive CPU usage (see e.g. rayon-rs/rayon#642). In the current
algorithm, threads can gradually put themselves to sleep if they don't
find work to do. They do this one at a time. But as soon as any
work arrives (or -- in fact -- even any work completes) all threads
awaken.
This RFC proposes an alternative algorithm, and explores some of the
design options and tradeoffs available. It does not claim to be
exhaustive and feedback is most certainly desired on alternative
approaches!