Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression with rustc 1.81 #203

Closed
plusvic opened this issue Nov 29, 2024 · 18 comments · Fixed by #204
Closed

Performance regression with rustc 1.81 #203

plusvic opened this issue Nov 29, 2024 · 18 comments · Fixed by #204

Comments

@plusvic
Copy link
Contributor

plusvic commented Nov 29, 2024

rustc 1.81 completely replaced the implementation of slice::sort_unstable_by_key. These changes are mentioned in the release notes, and more information can be found in the corresponding PR: rust-lang/rust#124032.

The new sorting algorithm was supposed to be faster in the general case, but in some specific cases it looks like it is way slower, and this has impacted the wasmtime crate via regalloc2. I've experienced a 10x times slowdown when compiling some WASM modules using wasmtime with rustc 1.81 (more details can be found in: VirusTotal/yara-x#209)

I tracked down the issue to be caused by the merge_bundles function, which uses sort_unstable_by_key. According to my investigation, sort_unstable_by_key is being called with slices that are mostly sorted. Apparently, two already sorted slices are being concatenated and then sorted again. This is confirmed by the following comment in the code of merge_bundles:

// Two non-empty lists of LiveRanges: concatenate and
// sort. This is faster than a mergesort-like merge into a new
// list, empirically.

This pattern, which worked well with the previous implementation of sort_unstable_by_key, seems to be pathologically bad for the new one. The assumption that concatenating the two lists and sorting them again is faster than merging the two sorted lists likely no longer holds true.

If you are curious about which WASM code is triggering this issue, it is an implementation of the "switch" statement in WASM using multiple nested blocks and a br_table instruction that jumps out the block that corresponds to the input value. The overall schema is shown below.

(func $switch_example (param $x i32)
  (block $default  ;; Default branch
    (block $case2   ;; Case 2
      (block $case1 ;; Case 1
        (block $case0 ;; Case 0
          ;; Use br_table to jump based on $x
          (local.get $x)    ;; Push x onto the stack
          (br_table $case0 $case1 $case2 $default) ;; Jump to the right case
        )
        ;; Code for Case 0
        ;; Add your case 0 logic here
        (br $default) ;; Continue to the end
      )
      ;; Code for Case 1
      ;; Add your case 1 logic here
      (br $default) ;; Continue to the end
    )
    ;; Code for Case 2
    ;; Add your case 2 logic here
    (br $default) ;; Continue to the end
  )
  ;; Default case code
)
@Amanieu
Copy link
Contributor

Amanieu commented Dec 1, 2024

cc @orlp @Voultapher (authors of the new sort implementation)

@cfallin
Copy link
Member

cfallin commented Dec 1, 2024

Fascinating, thanks for tracking this down!

At least at a surface level, I'm not super-inclined to rewrite part of the register allocator based on a performance bug in std; that feels like something that should be fixed in std. Sorting of already almost-sorted data, or of two concatenated sorted subsequences, should be fast (and hopefully will return to being fast in the future, at which time we'd want to throw away whatever ad-hoc less-optimized custom workaround that we write). I'd encourage you to report the bug upstream though, if not already, and I'd be interested to hear what they think.

@Voultapher
Copy link

Hi, thanks @plusvic for raising the issue. Looking at the std library documentation:

It is typically faster than stable sorting, except in a few special cases, e.g., when the slice is partially sorted.

If your input is nearly sorted or all you want to do is merging, slice::sort is the much better fit as explained in the documentation, it's a hybrid merge and quicksort so it can efficiently handle nearly sorted sub-slices.

@Voultapher
Copy link

Voultapher commented Dec 1, 2024

To expand a little more, @orlp and I are aware of the append + sort use case and did include it in our design and benchmarking, under the pattern name random_s95, 95% sorted + 5% unsorted unstable sort design document and stable sort design document.

The previous implementation of slice::sort_unstable did a partial insertion sort of up to 5 elements. So if the appended slice is at most 5 elements you get a pretty fast run-time. Something I suspect happened here. However this approach has a couple issues, it creates a rather arbitrary performance cliff, if you append 6 instead of 5 elements you suddenly get random pattern performance plus you wasted 5 near full scans of the input, as seen when looking at the number of comparisons performed:

image

The robust and generic fix is to run detection + lazy merging, exactly what driftsort does, which is the current implementation of slice::sort.


For questions about methodology design choices and evaluation, please go and read the design documents.

@cfallin
Copy link
Member

cfallin commented Dec 1, 2024

Thanks for the info @Voultapher. I definitely respect that you and @orlp spent a ton of time on this update. It's unfortunate that it has edge-cases where performance regresses; I've been in your position where one is trying to make a cross-ecosystem update so I definitely understand the frustration of seeing these regressions.

Unfortunately I don't have any cycles to spend on this right now; it was an unexpected and unplanned performance regression (from my point of view); and I have no way of reproducing it (the above Wasm is a "schema" but not an actual file to test). @plusvic since you were the reporter, would you be willing to take your regressing use-cases, and perhaps try the fix suggested above (sort instead of sort_stable) with a local patch of regalloc2? I'd be very happy to review a PR that makes this update if you find that it helps, and then we can cut releases and propagate it through to Cranelift and Wasmtime.

Thanks!

@plusvic
Copy link
Contributor Author

plusvic commented Dec 2, 2024

I can confirm that by using sort_by_key instead of sort_unstable_by_key, performance improves a lot in rustc 1.81. However, rustc 1.80 keeps being faster. These are the results I got:

Rust 1.81 sort_unstable_by_key
[2024-12-02T08:46:47Z INFO  yara_x::compiler] WASM module build time: 29.76301438s

Rust 1.81  sort_by_key
[2024-12-02T08:41:44Z INFO  yara_x::compiler] WASM module build time: 6.737162719s

Rust 1.80 sort_unstable_by_key
[2024-12-02T08:51:17Z INFO  yara_x::compiler] WASM module build time: 2.684071474s

Rust 1.80 sort_by_key
[2024-12-02T08:54:38Z INFO  yara_x::compiler] WASM module build time: 4.614773787s

In resume, for this particular case sort_by_key is faster than sort_unstable_by_key in rustc 1.81 by a factor of 4-5x. However, in rustc 1.80 is the other way around, sort_by_key is slower. rustc 1.80 still beats rust 1.81.

I also decided to give a try to a different solution, so I replaced this code:

self.bundles[to].ranges.extend_from_slice(&from_list[..]);
self.bundles[to]
   .ranges
   .sort_by_key(|entry| entry.range.from);

With this...

for entry in from_list.into_iter() {
    let pos = self.bundles[to]
        .ranges
        .binary_search_by_key(&entry.range.from, |entry| entry.range.from)
        .unwrap_or_else(|pos| pos);
    self.bundles[to].ranges.insert(pos, entry)
}

This ended up being 2x faster that the existing implementation, and performance doesn't vary with the rustc version.

[2024-12-02T09:30:42Z INFO  yara_x::compiler] WASM module build time: 1.088564278s

Of course this is a very specific case, in which we have a very long sorted slice, and we are adding only a handful of new items to it. I don't know if this is the most common case, probably not.

As my knowledge about the regalloc2 crate is close to zero, my opinion should be taken with a grain of salt, but I would say that the most robust solution here is implementing the mergesort-like merge that is mentioned in the comment. It may be slower than the current "append and re-sort" solution in particular cases, but in the general case it's guaranteed to depend linearly on the total number of items, something that current method can't guarantee.

If it turns out that in most cases the merge_bundles function is trying to merge a relatively short slice into a larger one, then the "binary search and insert" method I proposed above is also a good alternative.

@orlp
Copy link

orlp commented Dec 2, 2024

@plusvic Would it be possible to log and post the distribution of (left.len(), right.len())?

I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:

1, 3, 5, 7             2, 4, 6, 8

That said, I do think if this is so performance critical a proper merge algorithm instead of concatenating + sorting is the play.

@Voultapher
Copy link

@plusvic I find it surprising that Rust 1.80 sort_by_key is ~1.45x faster than the Rust 1.81 version of it. The relevant code for such a pattern stayed almost the same, run detection stayed very similar and merge only saw small changes, none of them should be able to account for such a big difference. Could you please share a repro repository that contains everything including documentation how to run it to observe the same changes you got, I'd like to dig into it.

In the meantime feel free to try out just using the merge function from the stdlib link.

@plusvic
Copy link
Contributor Author

plusvic commented Dec 2, 2024

This is what I get while logging the length of both slices. The first one is self.bundles[to].ranges, the second one is from_list.

1,1
1,1
1,1
1,1
1,2
1,1
1,2
1,5
1,1
2,1
3,6
3,1
4,1
5,1
1,1
2,1
6,3
1,1
2,1
3,1
4,1
5,1

... a lot of lines removed here, all of them were N, 1 where N goes from 6 to 49998.

49999,1
50000,1
9,1
9,1
10,1
11,1
12,1
13,1
14,1

While compiling less extreme code I got this:

1,1
1,1
1,2
1,1
1,1
1,1
1,2
1,1
1,2
1,1
1,2
1,1
1,2
2,2
1,4
1,5
1,1
1,1
1,1
1,6
2,2
1,4
1,5
1,6
7,2
1,9
1,1
7,10
1,17
1,1
1,18
1,19
5,2
1,7
1,8
1,9
10,2
1,12
1,1
1,13
1,1
2,1
3,1
20,14
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
34,1
35,1
36,1
8,1
37,1
38,1
39,1
8,1
2,1
3,1
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
4,1
40,1
41,1
42,1
8,1
43,1
44,1
45,1
8,1
2,1
1,1
2,1
3,1
4,1
3,1
1,1
2,1
3,1
4,1
4,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
9,7
2,1
3,1
1,1
2,1
3,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
4,1
5,1
2,1
3,1
1,1
2,1
3,1
4,1
5,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
46,1
47,1
48,1

I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:

1, 3, 5, 7 2, 4, 6, 8

My proposed alternative is not really a merge (at least not an efficient merge), but the result should be correct. It simply iterates the second vector looking for the index in the first vector where each item should be inserted, and does it (which implies displacing items in the first list to make room for the new item). The key of this implementation is that, when some item is not found in the first list, binary_search_by_key returns an error that contains the position where that item would be located if it had existed. No matter if the item existed or not, you always get a position where you can insert the item and the list remains sorted. This is an example with the input you suggested:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e3af50bcb2d8e8d1205d72e05ccbedd3

@plusvic
Copy link
Contributor Author

plusvic commented Dec 2, 2024

@Voultapher, you can find the project at /~https://github.com/VirusTotal/yara-x

I recommend compiling it with the logging feature: cargo build --release --features=logging. You'll get a command-line tool at ./target/release/yr. Run the tool as:

RUST_LOG=info ./target/release/yr scan big_rule.yar README.md

This runs the scan command with two files, the second one (README.md) is not relevant in this case, you can use any file (but the tool expects some file, so I use my own README.md file). The big_rule.yar file is the important one, it contains what you need to produce the WASM code that triggers this edge case. I'm attaching the big_rule.yar file in a ZIP.

big_rule.yar.zip

Let me know if you need anything else.

@orlp
Copy link

orlp commented Dec 2, 2024

My proposed alternative is not really a merge (at least not an efficient merge), but the result should be correct. It simply iterates the second vector looking for the index in the first vector where each item should be inserted, and does it (which implies displacing items in the first list to make room for the new item).

@plusvic Yes, that's a really inefficient merge :) I missed that you iterated over each item.

@plusvic
Copy link
Contributor Author

plusvic commented Dec 2, 2024

I've tried a naive merge with two iterators that advance on one slice or the other depending on which item is smaller, and adding the smaller item to a new vector, but that's slow due to the additional allocations.

The best solution I've found so far is handing the case in which we are adding only one item as a special case. As it was shown above, this is a very common case (not only in my edge case, but in general), so I think it's worth optimizing for that case.

I compiled a gigantic WASM module and saw a 6-8% performance improvement with respect to the current implementation, even with rustc 1.80.

rustc 1.80 current implementation
[2024-12-02T15:51:10Z INFO  yara_x::compiler::rules] WASM build time: 381.213840248s

rustc 1.80 with optimization
[2024-12-02T16:01:53Z INFO  yara_x::compiler::rules] WASM build time: 350.861894663s

I'll send a PR for your consideration.

@Voultapher
Copy link

Preliminary results

I was able to reproduce the issue locally, and patched in my own version of regalloc2 and tested a couple sort implementations:

All tests done with the same rustc version to avoid other effects related to the impact of a new Rust version.

Linux 6.11
rustc 1.85.0-nightly (28fc2ba71 2024-11-24)
AMD Ryzen 9 5900X 12-Core Processor (Zen 3 micro-architecture)
CPU boost enabled.
$ hyperfine --min-runs 3 "RUST_LOG=info ./target/release/yr scan big_rule.yar README.md"

# sort_unstable_by_key as done by upstream v0.10.2
  Time (mean ± σ):     22.050 s ±  0.183 s    [User: 21.863 s, System: 0.182 s]
  Range (min … max):   21.937 s … 22.261 s    3 runs

# sort_by_key
  Time (mean ± σ):      8.349 s ±  0.074 s    [User: 8.217 s, System: 0.197 s]
  Range (min … max):    8.289 s …  8.431 s    3 runs

# sort_by_cached_key
  Time (mean ± σ):     27.395 s ±  0.350 s    [User: 27.193 s, System: 0.173 s]
  Range (min … max):   26.993 s … 27.628 s    3 runs

# No sort
  Time (mean ± σ):      5.129 s ±  0.134 s    [User: 5.049 s, System: 0.147 s]
  Range (min … max):    4.978 s …  5.235 s    3 runs

# Old Rust 1.80 unstable sort from source via /~https://github.com/Voultapher/sort-research-rs
  Time (mean ± σ):      6.327 s ±  0.011 s    [User: 6.271 s, System: 0.143 s]
  Range (min … max):    6.317 s …  6.338 s    3 runs

# Old Rust 1.80 stable sort from source via /~https://github.com/Voultapher/sort-research-rs
  Time (mean ± σ):      8.052 s ±  0.049 s    [User: 7.868 s, System: 0.156 s]
  Range (min … max):    8.010 s …  8.105 s    3 runs

# No sort lto = "thin"
  Time (mean ± σ):      4.918 s ±  0.026 s    [User: 4.791 s, System: 0.185 s]
  Range (min … max):    4.891 s …  4.942 s    3 runs

These first results tell us a couple things, sort_unstable_by_key has regressed in this scenario, and the partial insertion sort of the previous version is the best fit of the implementations tested above, this matches the findings by @plusvic. Given the input pattern of "full sorted -> append 1 at the end" this matches expectations. In contrast to measured times by @plusvic I can't reproduce a ~1.5x difference between the older version of sort_by_key, given that in this scenario they run into the exact same logic and behavior with some very minor changes this is what I expected initially.

Logging the length of both sides if self.bundles[to].ranges.len() > 1_000, I see that it calls merge_bundles from len 1k all the way up to 50k, and presumably for all increments below 1k as well. So in effect we are calling sort with non-trivial input length 50 thousand times. It's a miracle the previous version was even as fast as it was. The object we are sorting has a size_of() == 12 which isn't that big and we are sorting by a u32 key, which allows the sort implementations to generate efficient comparisons.

We could quick and dirty fix it by adding a ``if self.bundles[to].ranges.len() == 1` and adding a special case for that, though it's not clear to me if that would be helpful generally or in specific cases as encountered by @plusvic. I think the real fix would be to prepare the bundles once and call merge_bundles once.

for i in 0..self.blockparam_outs.len() {
    let BlockparamOut {
        from_vreg, to_vreg, ..
    } = self.blockparam_outs[i];
   
    let to_bundle = self.ranges[self.vregs[to_vreg].ranges[0].index].bundle;
    let from_bundle = self.ranges[self.vregs[from_vreg].ranges[0].index].bundle;
   
    self.merge_bundles(from_bundle, to_bundle);
}

->

for i in 0..self.blockparam_outs.len() {
    let BlockparamOut {
        from_vreg, to_vreg, ..
    } = self.blockparam_outs[i];
   
    // build bundles
}

self.merge_bundles(from_bundle, to_bundle);

@plusvic
Copy link
Contributor Author

plusvic commented Dec 2, 2024

@Voultapher I've submitted a PR with exactly with the quick and dirty fix you mentioned. I believe it can help in many other cases, as I've compiled some other WASM code besides my specific edge-case and there are many calls where the vector being appended contains a single element.

@cfallin
Copy link
Member

cfallin commented Dec 2, 2024

I think the real fix would be to prepare the bundles once and call merge_bundles once.

Unfortunately this is not an option: the merges performed for each blockparam are separate merges in general, of separate bundles. (In regalloc2 terminology, a "bundle" is a collection of liveranges that do not conflict; in the general case, all the separate blockparams flowing into a block will be separate values and so do conflict.)

I'll take a look at the PR, thanks @plusvic!

@cfallin
Copy link
Member

cfallin commented Dec 2, 2024

@Voultapher a clarification question on this comment:

So in effect we are calling sort with non-trivial input length 50 thousand times. It's a miracle the previous version was even as fast as it was.

I'm not sure I understand the subtext here. You seem to be implying there is some quadratic behavior of sorts (separate sorts for steps of the same process). But there is one sort per bundle merge operation; these are logically separate bundles with nothing in common. And while we see here that there are commonly very small sort bundles being merged into existing large bundles, it also sometimes happens that we merge two large bundles (e.g. there are merge cases where we merge the reused-input of an x86 destructive src/dest instruction and both sides will be large). I did extensively profile and compare against a manual merge loop in the pre-1.80 world and found the sort was faster. Do you have another suggestion for the general case, aside from @plusvic's point fix?

@Voultapher
Copy link

Voultapher commented Dec 2, 2024

@cfallin keep in mind my register allocation knowledge and specifically regalloc2 knowledge is very limited, so if what I say is nonsense please attribute it to that.

This issue here arises only because to_bundle == LiveBundleIndex(4) for all 50054 blockparam_outs, if the thing that is being merged into keeps staying the same - no clue how general a case that is - doing so with a sort for each subset is O(N^2 x log(N)), which isn't great. If the thing we merge into is usually a different LiveBundleIndex for each loop iteration then the constants are much smaller and it doesn't show up as impactful, but that would also mean the fix in #204 is hyper specific and I'm not convinced it's a good idea. Inversely if this happens a lot, wouldn't it be possible to collect all the bundles that are planned to be merged into the same bundle and merge it at once?

@cfallin
Copy link
Member

cfallin commented Dec 2, 2024

Ah, yes, that's true, we could pre-bin merges by destination bundle and do one N-way merge (or sort).

If I get some more time to work on this in the next month I'll see if I can come up with a more general fix -- thanks @Voultapher and all for the ideas!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants