Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regressions of compiled code over the last year #47561

Open
Marwes opened this issue Jan 19, 2018 · 19 comments
Open

Performance regressions of compiled code over the last year #47561

Marwes opened this issue Jan 19, 2018 · 19 comments
Labels
A-libtest Area: `#[test]` / the `test` library A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Marwes
Copy link
Contributor

Marwes commented Jan 19, 2018

Started on an update to /~https://github.com/Marwes/combine after it being dormant for a while. When I ran the benchmarks to check that my changes hadn't regressed the performance I noticed that performance had regressed by ~28% (~116% with incremental compilation!) since the last time ran benchmarks (somewhere around September 2016).

I ran the benchmarks again against an old version of the library to be able to compile it with older rustc's but the regression is the same in the master branch as well.

cargo bench --bench http against /~https://github.com/Marwes/combine/tree/v2.3.2

cargo-0.18.0-nightly (a73a665 2017-02-14)

test http_requests_large ... bench:     439,961 ns/iter (+/- 30,684)
test http_requests_small ... bench:      87,508 ns/iter (+/- 5,173)

rustc 1.19.0-nightly (554c685 2017-06-14)

test http_requests_large ... bench:     475,989 ns/iter (+/- 10,477)
test http_requests_small ... bench:      95,175 ns/iter (+/- 23,751)

rustc 1.22.0-nightly (3681220 2017-09-06)

test http_requests_large ... bench:     494,088 ns/iter (+/- 27,462)
test http_requests_small ... bench:     102,798 ns/iter (+/- 67,446)

rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=0)

test http_requests_large ... bench:     551,065 ns/iter (+/- 420,621)
test http_requests_small ... bench:     112,375 ns/iter (+/- 2,098)

rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=1)

test http_requests_large ... bench:   1,001,847 ns/iter (+/- 40,639)
test http_requests_small ... bench:     188,091 ns/iter (+/- 1,958)

I'd like to bisect this further but the two tools I found for this do not appear to work in this case, is there any other tool that can be used for this?

/~https://github.com/kamalmarhubi/rust-bisect (Outdated)
/~https://github.com/Mark-Simulacrum/bisect-rust/tree/master/src (Only last 90 days)

@pnkfelix pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 19, 2018
@eddyb
Copy link
Member

eddyb commented Jan 19, 2018

@michaelwoerister
Copy link
Member

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
What happens if you do RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0 cargo bench?

cc @alexcrichton

@Marwes
Copy link
Contributor Author

Marwes commented Jan 19, 2018

(These are on a different machine so numbers won't be comparable to the ones above)

env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0

rustc 1.17.0-nightly (62eb605 2017-02-15)

test http_requests_large ... bench:     247,219 ns/iter (+/- 68,700)
test http_requests_small ... bench:      51,941 ns/iter (+/- 9,127)

rustc 1.25.0-nightly (0f9c784 2018-01-17)

test http_requests_large ... bench:     325,950 ns/iter (+/- 48,330)
test http_requests_small ... bench:      62,812 ns/iter (+/- 12,001)

Without specifying Ccodegen-units=1

rustc 1.25.0-nightly (0f9c784 2018-01-17)

test http_requests_large ... bench:     420,282 ns/iter (+/- 75,396)
test http_requests_small ... bench:      85,946 ns/iter (+/- 13,067)

rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

@Marwes
Copy link
Contributor Author

Marwes commented Jan 19, 2018

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.

Yeah I figured ThinLTO might be one of the culprits but I don't believe there is a way to use full LTO atm (#47521). Depending on how much ThinLTO affects the runtimes though that might still be a problem however.

@michaelwoerister
Copy link
Member

247 vs 325 microseconds is indeed quite the slowdown. I would be interesting to see where all the additional time goes. Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

Yes, that is out of date since a few weeks ago, I think. cc @alexcrichton & @rust-lang/cargo

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

At some point we'll probably have incremental ThinLTO, at which point runtime performance should be quite a bit better. However, I'm not sure if that will ever be the default since it will cause longer compile times than regular incremental compilation. Incremental compilation strives to give you code that is "fast enough" and otherwise clearly prioritizes short compile times.

@Marwes
Copy link
Contributor Author

Marwes commented Jan 22, 2018

Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Tried gleaning something from profiler output I didn't spot anything as the runtime is spread out in a lot of different places and inlining tears through the code completely. I may give it another shot but I don't expect much.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

It is unlikely to do with the benchmark framework. I am comparing these numbers to /~https://github.com/Geal/nom_benchmarks/blob/master/http/nom-http/src/main.rs which also uses it and that benchmark hasn't regressed. (I actually use /~https://github.com/bluss/bencher/ but switching to the builtin one shows no change in runtime).

Added another result to #47561 (comment) . Forcing full lto helps a bit but is not the sole problem.

rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

@kennytm kennytm added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Jan 22, 2018
@kennytm
Copy link
Member

kennytm commented Jan 22, 2018

I've checked this from 1.13.0 to the current nightly and it is obvious that the regression is introduced in 1.19.0 and 1.20.0. The speed is recovering in more recent version (with CGU=1 and ThinLTO=off) but still haven't come back to the 1.19.0 level.

Graphs

Large result

Small result

Raw data
Version Large LRange Small SRange
25-δ 298793 26302 61109 8759
24-β.6 320784 33707 63668 6996
23 312017 38840 63081 6354
22 330144 30876 65170 6320
21 328676 41291 65799 8213
20 336373 37812 65845 5612
19 287879 24456 56188 9432
18 242170 25084 49207 5695
17 250096 22134 50041 5387
16 #N/A #N/A #N/A #N/A
15 266197 29355 52775 6734
14 264196 24070 52792 5694
13 263274 26703 51758 5650

@arielb1
Copy link
Contributor

arielb1 commented Jan 22, 2018

@kennytm

Cool. Could you try to bisect to the specific commit? I might then take a go at investigating it.

@kennytm
Copy link
Member

kennytm commented Jan 22, 2018

@arielb1 Probably not to a specific commit, but I'm now doing a date-based bisection which should point to a narrow-enough commit range.

@kennytm
Copy link
Member

kennytm commented Jan 22, 2018

From the timing of nightlies between 2017-04-24 to 2017-07-17:

Graphs

Large tests
Small tests

Raw data
Nightly Large LRange Small SRange Commit
2017-04-24 237376 29677 47239 6068  
2017-04-25 241729 22569 48152 4449 63c7721
2017-04-26 283963 35674 57046 6098 2b4c911
2017-04-27 283284 31492 56515 7331  
2017-04-29 280613 38855 57309 5711  
2017-05-05 288286 22754 57488 5544  
2017-05-06 280634 44019 56483 6311 f420965
2017-05-09 271223 36183 53666 7654 f1140a3
2017-05-10 271546 29414 54456 4247  
2017-05-13 278337 29537 55530 6867 e17a122
2017-05-14 304265 44141 60732 8017 826d8f3
2017-05-16 300772 34152 60028 7134  
2017-06-07 302343 34978 61109 7028  
2017-06-17 295258 29874 58697 5658  
2017-06-20 281801 24845 58202 5233 0414594
2017-06-21 325467 42166 63920 7960 4450779
2017-06-22 323663 35073 64315 6130 622e7e6
2017-06-23 343497 43136 69235 6408 ab5bec2
2017-06-24 342053 36216 67890 8065 229d0d3
2017-06-25 352200 41641 71437 8251 c9bb935
2017-06-27 352002 44692 69970 6756  
2017-06-29 344796 30291 69683 7940 69c65d2
2017-06-30 365836 43236 72866 7919 3bfc18a
2017-07-01 360115 42362 73310 7862  
2017-07-02 357479 35203 72631 8314  
2017-07-03 368134 47205 73680 7208  
2017-07-04 377648 45819 75442 8086  
2017-07-05 383190 46242 75982 8814  
2017-07-06 382141 45204 74689 8027 3610a70
2017-07-07 329873 35233 65957 6922 696412d
2017-07-17 329498 39955 66395 5884  

@arielb1
Copy link
Contributor

arielb1 commented Jan 23, 2018

Regression at nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely #40123 (LLVM 4.0 Upgrade))

Not much we can do about that without a fairly deep investigation. We're about to update to LLVM 5.0 or 6.0, so let's hope that fixes the regression

Regression at nightly-2017-06-21 (+36µs): 0414594...4450779 (???)
Improvement at nightly-2017-07-07 (-48µs): 3610a70...696412d (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))

This looks like #42313 and #42727 - the first try at allocator integration had some perf issues - missing inline attribute, which were fixed by #42727.

Regression at nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely #41920 (remove the #[inline] attribute from drop_in_place))
Regression at nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely #42771 (mark calls in the unwind path as !noinline))

This sort of makes sense, but is weird. I might look into it

@jkordish jkordish added A-libtest Area: `#[test]` / the `test` library C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jan 24, 2018
@Marwes
Copy link
Contributor Author

Marwes commented Feb 7, 2018

@arielb1 A regression from #41920 seems consistent with my observations. I have another variant of the large benchmark above where the error type does not need drop (the normal error contains a Vec). The version that contains that benchmarks requires a recent version of rust so I am not able to compare the results but it does appear to have regressed less than one that needs drop.

@nikomatsakis
Copy link
Contributor

triage: P-medium

@rust-highfive rust-highfive added the P-medium Medium priority label Feb 22, 2018
@steveklabnik
Copy link
Member

Triage; It's been a while. @Marwes lots has changed since you've filed this issue; do you still track this stuff on combine?

@Marwes
Copy link
Contributor Author

Marwes commented Sep 24, 2019

Reran on the old branch so I could use 1.19

 name                 1.19 ns/iter  1.39 ns/iter  diff ns/iter  diff %  speedup
 http_requests_large  525,200       629,625            104,425  19.88%   x 0.83
 http_requests_small  104,955       123,803             18,848  17.96%   x 0.85

Regression still seems to exist.

@saethlin
Copy link
Member

saethlin commented Oct 9, 2022

The regression still exists, but I think I've been able to root-cause this to some kind of change in LLVM's inlining (in combination with codegen-units?).

Poking through the benchmark executables with perf annotate indicates that previously after LLVM was done we had about 3 functions left, and now for whatever reason we have a lot more than 3, and some look suspiciously small to me (i.e. they fit on a screen but are a lot more than a trivial getter/setter), but LTO has little effect so the problem is not that these functions are not eligible for inlining, the problem is that the heuristics are concluding they shouldn't be inlined.

If I compile now on nightly:

rustc 1.66.0-nightly (8796e7a9c 2022-10-08)
binary: rustc
commit-hash: 8796e7a9cfd4c5c4f1de15ec1c53994ddf288665
commit-date: 2022-10-08
host: x86_64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.2

With

RUSTFLAGS="-Ccodegen-units=1 -Cllvm-args=-inline-threshold=100000" cargo bench

I can beat the old benchmark times. All you need is a sufficiently large inline threshold, and the threshold above is high enough to inline the whole benchmark into one function.

This regression still exists, but I don't know what action we could take here. I don't think there is some particular pattern which isn't being optimized here which we could implement logic for in LLVM. It seems likely that tuning the inliner heuristic(s) in order to benefit these benchmarks would pessimize others.

@rustbot label +I-prioritize

@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Oct 9, 2022
@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Oct 9, 2022
@saethlin
Copy link
Member

I'm sifting through old issues. I think this situation should be discussed because I'm not sure there is anything for us to do here on account of this being an inlining heuristic issue, and also this issue's age. It's been 4 years and 9 or 10 LLVM releases since this issue was opened. What are the odds we manage to fix this regression without creating other issues?

@rustbot label -I-prioritize +I-compiler-nominated

@rustbot rustbot added I-compiler-nominated Nominated for discussion during a compiler team meeting. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 10, 2022
@gustav3d

This comment was marked as abuse.

@apiraino
Copy link
Contributor

issue discussed during last week's T-compiler meeting (notes). Follow-up on this Zulip topic

@rustbot label -I-compiler-nominated

@rustbot rustbot removed the I-compiler-nominated Nominated for discussion during a compiler team meeting. label Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-libtest Area: `#[test]` / the `test` library A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
Status: No status
Development

No branches or pull requests