Performance regression in tight loop since rust 1.25 #53340
godbolt shows quite a big diff between 1.24 and 1.25:
I've scripted the checking of this across versions and the regression is present all the way through to nightly:

```
packed12le : 1.24.1  BASE 3.53
packed12le : 1.25.0  FAIL 4.84 (+37%)
packed12le : 1.26.2  FAIL 4.80 (+35%)
packed12le : 1.27.2  FAIL 4.81 (+36%)
packed12le : 1.28.0  FAIL 4.87 (+37%)
packed12le : 1.29.2  FAIL 4.77 (+35%)
packed12le : 1.30.0  FAIL 4.83 (+36%)
packed12le : beta    FAIL 4.83 (+36%)
packed12le : nightly FAIL 4.95 (+40%)
```

The 35-40% increase in runtime is very consistent.
The regression is still present in 1.31 and all the way to current nightly:
The same ~40% regression seen in the minimal test case is also seen in the full benchmark: http://chimper.org/rawloader-rustc-benchmarks/version-1.31.0.html
What's the performance if you compare this version with something based on the newer `chunks_exact`?
Using `chunks_exact` already gives a roughly 10% improvement.
Starting with 1.31 the `chunks_exact` API is available on stable.
@pedrocr just to clarify, did you update all occurrences of `chunks`/`chunks_mut` to be the exact version?
@bluss Both the inner and outer loop. Here's the full code:

```rust
fn decode_12le(buf: &[u8], width: usize, height: usize) -> Vec<u16> {
  let mut out: Vec<u16> = vec![0; width*height];

  for (row, line) in out.chunks_exact_mut(width).enumerate() {
    let inb = &buf[(row*width*12/8)..];

    for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks(3)) {
      let g1: u16 = i[0] as u16;
      let g2: u16 = i[1] as u16;
      let g3: u16 = i[2] as u16;

      o[0] = ((g2 & 0x0f) << 8) | g1;
      o[1] = (g3 << 4) | (g2 >> 4);
    }
  }
  out
}

fn main() {
  let width = 5000;
  let height = 4000;
  let mut buffer: Vec<u8> = vec![0; width*height*12/8];
  // Make sure we don't get optimized out by writing some data into the buffer
  for (i, val) in buffer.chunks_mut(1).enumerate() {
    val[0] = i as u8;
  }
  for _ in 0..100 {
    decode_12le(&buffer, width, height);
  }
}
```

I've also initialized the buffer with some data to avoid optimization. I think that became needed after one of the LLVM upgrades since 1.25.
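To make the bit layout concrete, here is a tiny worked example (the values are illustrative): every 3 input bytes carry two 12-bit little-endian samples.

```rust
fn main() {
    let (g1, g2, g3) = (0xABu16, 0xCDu16, 0xEFu16); // three packed bytes
    let first = ((g2 & 0x0f) << 8) | g1; // low nibble of g2 = high bits
    let second = (g3 << 4) | (g2 >> 4);  // high nibble of g2 = low bits
    assert_eq!(first, 0x0DAB);
    assert_eq!(second, 0x0EFC);
}
```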
There's still a `chunks` in there, why not try the exact version for that one too?
I did, and it doesn't make a difference; that's just the initialization, which doesn't take much time. I didn't change that one because it's not in the code that's actually being benchmarked.
Maybe change the `chunks` to `chunks_exact` on this line. I'm just curious.
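For reference, the suggested change would look something like this (a sketch with a hypothetical helper, same body as the code above; only the read side's `chunks(3)` becomes `chunks_exact(3)`):

```rust
fn decode_line(line: &mut [u16], inb: &[u8]) {
    // Both sides of the zip now use the exact-chunk iterators.
    for (o, i) in line.chunks_exact_mut(2).zip(inb.chunks_exact(3)) {
        let (g1, g2, g3) = (i[0] as u16, i[1] as u16, i[2] as u16);
        o[0] = ((g2 & 0x0f) << 8) | g1;
        o[1] = (g3 << 4) | (g2 >> 4);
    }
}
```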
Ah, right, missed that one. I'll check.
It either works extremely well or somehow LLVM figured out how to optimize away too much:
That's exactly what we want :)
The exact chunks code looks ok in godbolt. Nothing spectacular, just a clean inner loop with no redundant bounds checks, and a single loop exit conditional. A minor boring trick for the old code is to change the order to this:

```rust
let g3: u16 = i[2] as u16;
let g2: u16 = i[1] as u16;
let g1: u16 = i[0] as u16;

o[1] = (g3 << 4) | (g2 >> 4);
o[0] = ((g2 & 0x0f) << 8) | g1;
```

With the bounds checks for the highest indices (`i[2]` and `o[1]`) performed first, the checks for the lower indices can be elided, since a successful access at the highest index proves the slice is at least that long.
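Another way to get the same elision (a hypothetical variant, not from this thread) is to narrow the slices up front, so a single explicit length check per side covers all of the later indexing:

```rust
// Hypothetical variant: one up-front slice-length check per side lets
// the compiler prove i[0]..i[2] and o[0]..o[1] are all in bounds.
fn decode_pair(o: &mut [u16], i: &[u8]) {
    let i = &i[..3];
    let o = &mut o[..2];
    let (g1, g2, g3) = (i[0] as u16, i[1] as u16, i[2] as u16);
    o[0] = ((g2 & 0x0f) << 8) | g1;
    o[1] = (g3 << 4) | (g2 >> 4);
}
```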
The 10% improvement was already interesting, but the 40% one definitely made me want to use this. I have almost 50 of these decode functions. The bounds check trick is interesting, but it's a little disappointing that the compiler doesn't figure that out itself. I've intentionally kept the code clean instead of trying to make it fast by being clever; that seemed to pay off well in terms of productivity and maintainability. But I don't think this closes the regression itself, or does it? I assume `chunks`/`chunks_mut` should ideally perform as well as the exact versions.
I agree, it looks like we should try to fix the performance of `chunks`/`chunks_mut` as well.
It can't make that change itself since it would change the visible behavior of the program: the panic message includes the index.
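As a small illustration (not from the thread), the failing index is part of the observable panic output, so reordering or merging the checks could change which index gets reported for a too-short slice:

```rust
fn main() {
    let v = [1u8, 2, 3];
    // Panics with "index out of bounds: the len is 3 but the index is 10".
    // Because the message names the failing index, the compiler can't
    // reorder the checks for i[0], i[1], i[2] without possibly changing
    // which index a panic would report.
    let _x = v[10];
}
```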
@sfackler ah, that's annoying but makes perfect sense, thanks.
1.32 maintains the same performance:
1.33 maintains the same regression:
I don't know if this is useful but 1.34/1.35 and current beta/nightly still have the same regression:
1.38 recovers roughly half of this regression:
As an update, it seems the latest Rust versions have gotten this down to only a ~4% penalty:
You say the modern versions have almost closed the performance gap; can you look at the modern assembly and see what might still be worse than 1.24, and what has improved since 1.25?
My knowledge of x86 assembly is rudimentary, sorry.
I have a potential fix in #86823; if you want to benchmark it you can grab the try build.
Does the build system save its artifacts anywhere? I don't think I can currently build rustc on this machine.
You can use /~https://github.com/kennytm/rustup-toolchain-install-master to install CI builds as a rustup toolchain.
That tool worked really well. It gave me an installed toolchain and then the benchmarking automation just worked. Here's the result of running that branch compared to the other recent toolchains:
Hopefully the patch doesn't have any soundness issues, because it seems to fix things completely. The same code now becomes ~8% faster.
Just as confirmation, here are the full results:
Some recent versions were actually able to reach 20%+ performance improvements, but now we're back to 0. Possibly there's a different performance improvement/regression going on.
This might be related to whether it auto-vectorizes. This version gets a nice one: https://rust.godbolt.org/z/TMKvzozjv

```llvm
%58 = zext <4 x i8> %57 to <4 x i16>, !dbg !283
%59 = shl nuw <4 x i16> %45, <i16 8, i16 8, i16 8, i16 8>, !dbg !314
%60 = and <4 x i16> %59, <i16 3840, i16 3840, i16 3840, i16 3840>, !dbg !314
%61 = or <4 x i16> %60, %32, !dbg !316
%62 = shl nuw nsw <4 x i16> %58, <i16 4, i16 4, i16 4, i16 4>, !dbg !317
%63 = lshr <4 x i16> %45, <i16 4, i16 4, i16 4, i16 4>, !dbg !318
%64 = or <4 x i16> %62, %63, !dbg !319
```

So that would plausibly be the reason for the 40% improvement mentioned in #53340 (comment). But peeling the last iteration from the non-exact version is probably hard, so the auto-vectorization is lost when the exact variants aren't used.
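Conceptually (a sketch, not the standard library's actual implementation), `chunks_exact` splits the remainder off before the loop, so the hot loop sees a fixed chunk size and an exact trip count, which is what makes it friendly to the vectorizer:

```rust
// Why chunks_exact vectorizes well: the remainder is split off up
// front, so every iteration of the main loop handles a full chunk.
fn split_for_exact_chunks(buf: &[u8], chunk: usize) -> (&[u8], &[u8]) {
    let rem = buf.len() % chunk;
    buf.split_at(buf.len() - rem) // (vectorizable body, scalar tail)
}
```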
This can probably be closed. The regression seems to be definitively solved since 1.55, and performance now hovers around a 10-20% improvement depending on the version:
Nice. Adding a codegen test might be useful though, since it seems like a fickle optimization.
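A rough sketch of what such a codegen test could look like, in the FileCheck style used by rustc's codegen test suite (the directives and vector widths here are illustrative assumptions, not a verified test):

```rust
// compile-flags: -O

// CHECK-LABEL: @unpack_12le
#[no_mangle]
pub fn unpack_12le(out: &mut [u16], inb: &[u8]) {
    // The inner loop should auto-vectorize: check for vector shifts/ors.
    // CHECK: shl nuw <4 x i16>
    // CHECK: or <4 x i16>
    for (o, i) in out.chunks_exact_mut(2).zip(inb.chunks_exact(3)) {
        let (g1, g2, g3) = (i[0] as u16, i[1] as u16, i[2] as u16);
        o[0] = ((g2 & 0x0f) << 8) | g1;
        o[1] = (g3 << 4) | (g2 >> 4);
    }
}
```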
I've finally gotten around to doing some proper benchmarking of rust versions for my crate:
http://chimper.org/rawloader-rustc-benchmarks/
As can be seen in the graph on that page, there's a general performance improvement over time, but there are some very negative outliers. Most (maybe all) of them seem to be very simple loops that decode packed formats. Since Rust 1.25 those have seen 30-40% performance degradations. I've extracted a minimal test case that shows the issue:
Here's a test run on my machine: