Performance regressions of compiled code over the last year #47561

Marwes · 2018-01-19T00:48:25Z

Started on an update to /~https://github.com/Marwes/combine after it being dormant for a while. When I ran the benchmarks to check that my changes hadn't regressed the performance I noticed that performance had regressed by ~28% (~116% with incremental compilation!) since the last time ran benchmarks (somewhere around September 2016).

I ran the benchmarks again against an old version of the library to be able to compile it with older rustc's but the regression is the same in the master branch as well.

`cargo bench --bench http` against /~https://github.com/Marwes/combine/tree/v2.3.2

cargo-0.18.0-nightly (a73a665 2017-02-14)

test http_requests_large ... bench:     439,961 ns/iter (+/- 30,684)
test http_requests_small ... bench:      87,508 ns/iter (+/- 5,173)

rustc 1.19.0-nightly (`554c685` 2017-06-14)

test http_requests_large ... bench:     475,989 ns/iter (+/- 10,477)
test http_requests_small ... bench:      95,175 ns/iter (+/- 23,751)

rustc 1.22.0-nightly (`3681220` 2017-09-06)

test http_requests_large ... bench:     494,088 ns/iter (+/- 27,462)
test http_requests_small ... bench:     102,798 ns/iter (+/- 67,446)

rustc 1.25.0-nightly (`6828cf9` 2018-01-06) (CARGO_INCREMENTAL=0)

test http_requests_large ... bench:     551,065 ns/iter (+/- 420,621)
test http_requests_small ... bench:     112,375 ns/iter (+/- 2,098)

rustc 1.25.0-nightly (`6828cf9` 2018-01-06) (CARGO_INCREMENTAL=1)

test http_requests_large ... bench:   1,001,847 ns/iter (+/- 40,639)
test http_requests_small ... bench:     188,091 ns/iter (+/- 1,958)

I'd like to bisect this further but the two tools I found for this do not appear to work in this case, is there any other tool that can be used for this?

/~https://github.com/kamalmarhubi/rust-bisect (Outdated)
/~https://github.com/Mark-Simulacrum/bisect-rust/tree/master/src (Only last 90 days)

The text was updated successfully, but these errors were encountered:

eddyb · 2018-01-19T13:18:57Z

cc @michaelwoerister @arielb1 @nagisa

michaelwoerister · 2018-01-19T14:08:33Z

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
What happens if you do RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0 cargo bench?

cc @alexcrichton

Marwes · 2018-01-19T15:07:36Z

(These are on a different machine so numbers won't be comparable to the ones above)

env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0

rustc 1.17.0-nightly (`62eb605` 2017-02-15)

test http_requests_large ... bench:     247,219 ns/iter (+/- 68,700)
test http_requests_small ... bench:      51,941 ns/iter (+/- 9,127)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17)

test http_requests_large ... bench:     325,950 ns/iter (+/- 48,330)
test http_requests_small ... bench:      62,812 ns/iter (+/- 12,001)

Without specifying Ccodegen-units=1

rustc 1.25.0-nightly (`0f9c784` 2018-01-17)

test http_requests_large ... bench:     420,282 ns/iter (+/- 75,396)
test http_requests_small ... bench:      85,946 ns/iter (+/- 13,067)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

Marwes · 2018-01-19T15:23:27Z

The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.

Yeah I figured ThinLTO might be one of the culprits but I don't believe there is a way to use full LTO atm (#47521). Depending on how much ThinLTO affects the runtimes though that might still be a problem however.

michaelwoerister · 2018-01-22T11:26:44Z

247 vs 325 microseconds is indeed quite the slowdown. I would be interesting to see where all the additional time goes. Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .

Yes, that is out of date since a few weeks ago, I think. cc @alexcrichton & @rust-lang/cargo

Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)

At some point we'll probably have incremental ThinLTO, at which point runtime performance should be quite a bit better. However, I'm not sure if that will ever be the default since it will cause longer compile times than regular incremental compilation. Incremental compilation strives to give you code that is "fast enough" and otherwise clearly prioritizes short compile times.

Marwes · 2018-01-22T11:52:15Z

Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.

Tried gleaning something from profiler output I didn't spot anything as the runtime is spread out in a lot of different places and inlining tears through the code completely. I may give it another shot but I don't expect much.

Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?

It is unlikely to do with the benchmark framework. I am comparing these numbers to /~https://github.com/Geal/nom_benchmarks/blob/master/http/nom-http/src/main.rs which also uses it and that benchmark hasn't regressed. (I actually use /~https://github.com/bluss/bencher/ but switching to the builtin one shows no change in runtime).

Added another result to #47561 (comment) . Forcing full lto helps a bit but is not the sole problem.

rustc 1.25.0-nightly (`0f9c784` 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

test http_requests_large ... bench:     373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench:      78,019 ns/iter (+/- 14,368)

kennytm · 2018-01-22T20:20:39Z

I've checked this from 1.13.0 to the current nightly and it is obvious that the regression is introduced in 1.19.0 and 1.20.0. The speed is recovering in more recent version (with CGU=1 and ThinLTO=off) but still haven't come back to the 1.19.0 level.

Graphs

Raw data

Version	Large	LRange	Small	SRange
25-δ	298793	26302	61109	8759
24-β.6	320784	33707	63668	6996
23	312017	38840	63081	6354
22	330144	30876	65170	6320
21	328676	41291	65799	8213
20	336373	37812	65845	5612
19	287879	24456	56188	9432
18	242170	25084	49207	5695
17	250096	22134	50041	5387
16	#N/A	#N/A	#N/A	#N/A
15	266197	29355	52775	6734
14	264196	24070	52792	5694
13	263274	26703	51758	5650

arielb1 · 2018-01-22T20:31:05Z

@kennytm

Cool. Could you try to bisect to the specific commit? I might then take a go at investigating it.

kennytm · 2018-01-22T20:42:29Z

@arielb1 Probably not to a specific commit, but I'm now doing a date-based bisection which should point to a narrow-enough commit range.

kennytm · 2018-01-22T23:25:22Z

From the timing of nightlies between 2017-04-24 to 2017-07-17:

Regression at nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely LLVM 4.0 Upgrade #40123 (LLVM 4.0 Upgrade))
Regression at nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely remove the #[inline] attribute from drop_in_place #41920 (remove the #[inline] attribute from drop_in_place))
Regression at nightly-2017-06-21 (+36µs): 0414594...4450779 (???)
Regression at nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely mark calls in the unwind path as !noinline #42771 (mark calls in the unwind path as !noinline))
Improvement at nightly-2017-07-07 (-48µs): 3610a70...696412d (likely Switch to rust-lang-nursery/compiler-builtins #42899 (Switch to rust-lang-nursery/compiler-builtins) or rustc: Implement stack probes for x86 #42816 (rustc: Implement stack probes for x86))

Graphs

Raw data

Nightly	Large	LRange	Small	SRange	Commit
2017-04-24	237376	29677	47239	6068
2017-04-25	241729	22569	48152	4449	`63c7721`
2017-04-26	283963	35674	57046	6098	`2b4c911`
2017-04-27	283284	31492	56515	7331
2017-04-29	280613	38855	57309	5711
2017-05-05	288286	22754	57488	5544
2017-05-06	280634	44019	56483	6311	`f420965`
2017-05-09	271223	36183	53666	7654	`f1140a3`
2017-05-10	271546	29414	54456	4247
2017-05-13	278337	29537	55530	6867	`e17a122`
2017-05-14	304265	44141	60732	8017	`826d8f3`
2017-05-16	300772	34152	60028	7134
2017-06-07	302343	34978	61109	7028
2017-06-17	295258	29874	58697	5658
2017-06-20	281801	24845	58202	5233	`0414594`
2017-06-21	325467	42166	63920	7960	`4450779`
2017-06-22	323663	35073	64315	6130	`622e7e6`
2017-06-23	343497	43136	69235	6408	`ab5bec2`
2017-06-24	342053	36216	67890	8065	`229d0d3`
2017-06-25	352200	41641	71437	8251	`c9bb935`
2017-06-27	352002	44692	69970	6756
2017-06-29	344796	30291	69683	7940	`69c65d2`
2017-06-30	365836	43236	72866	7919	`3bfc18a`
2017-07-01	360115	42362	73310	7862
2017-07-02	357479	35203	72631	8314
2017-07-03	368134	47205	73680	7208
2017-07-04	377648	45819	75442	8086
2017-07-05	383190	46242	75982	8814
2017-07-06	382141	45204	74689	8027	`3610a70`
2017-07-07	329873	35233	65957	6922	`696412d`
2017-07-17	329498	39955	66395	5884

arielb1 · 2018-01-23T18:33:24Z

Regression at nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely #40123 (LLVM 4.0 Upgrade))

Not much we can do about that without a fairly deep investigation. We're about to update to LLVM 5.0 or 6.0, so let's hope that fixes the regression

Regression at nightly-2017-06-21 (+36µs): 0414594...4450779 (???)
Improvement at nightly-2017-07-07 (-48µs): 3610a70...696412d (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))

This looks like #42313 and #42727 - the first try at allocator integration had some perf issues - missing inline attribute, which were fixed by #42727.

Regression at nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely #41920 (remove the #[inline] attribute from drop_in_place))
Regression at nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely #42771 (mark calls in the unwind path as !noinline))

This sort of makes sense, but is weird. I might look into it

Marwes · 2018-02-07T14:46:33Z

@arielb1 A regression from #41920 seems consistent with my observations. I have another variant of the large benchmark above where the error type does not need drop (the normal error contains a Vec). The version that contains that benchmarks requires a recent version of rust so I am not able to compare the results but it does appear to have regressed less than one that needs drop.

nikomatsakis · 2018-02-22T15:34:33Z

triage: P-medium

steveklabnik · 2019-09-22T15:07:06Z

Triage; It's been a while. @Marwes lots has changed since you've filed this issue; do you still track this stuff on combine?

Marwes · 2019-09-24T18:31:40Z

Reran on the old branch so I could use 1.19

 name                 1.19 ns/iter  1.39 ns/iter  diff ns/iter  diff %  speedup
 http_requests_large  525,200       629,625            104,425  19.88%   x 0.83
 http_requests_small  104,955       123,803             18,848  17.96%   x 0.85

Regression still seems to exist.

saethlin · 2022-10-09T18:33:51Z

The regression still exists, but I think I've been able to root-cause this to some kind of change in LLVM's inlining (in combination with codegen-units?).

Poking through the benchmark executables with perf annotate indicates that previously after LLVM was done we had about 3 functions left, and now for whatever reason we have a lot more than 3, and some look suspiciously small to me (i.e. they fit on a screen but are a lot more than a trivial getter/setter), but LTO has little effect so the problem is not that these functions are not eligible for inlining, the problem is that the heuristics are concluding they shouldn't be inlined.

If I compile now on nightly:

rustc 1.66.0-nightly (8796e7a9c 2022-10-08)
binary: rustc
commit-hash: 8796e7a9cfd4c5c4f1de15ec1c53994ddf288665
commit-date: 2022-10-08
host: x86_64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.2

With

RUSTFLAGS="-Ccodegen-units=1 -Cllvm-args=-inline-threshold=100000" cargo bench

I can beat the old benchmark times. All you need is a sufficiently large inline threshold, and the threshold above is high enough to inline the whole benchmark into one function.

This regression still exists, but I don't know what action we could take here. I don't think there is some particular pattern which isn't being optimized here which we could implement logic for in LLVM. It seems likely that tuning the inliner heuristic(s) in order to benefit these benchmarks would pessimize others.

@rustbot label +I-prioritize

saethlin · 2022-10-10T12:04:53Z

I'm sifting through old issues. I think this situation should be discussed because I'm not sure there is anything for us to do here on account of this being an inlining heuristic issue, and also this issue's age. It's been 4 years and 9 or 10 LLVM releases since this issue was opened. What are the odds we manage to fix this regression without creating other issues?

@rustbot label -I-prioritize +I-compiler-nominated

apiraino · 2022-10-17T16:05:11Z

issue discussed during last week's T-compiler meeting (notes). Follow-up on this Zulip topic

@rustbot label -I-compiler-nominated

pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 19, 2018

kennytm added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Jan 22, 2018

jkordish added A-libtest Area: `#[test]` / the `test` library C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jan 24, 2018

nagisa mentioned this issue Jan 25, 2018

codegen-units + ThinLTO is not as good as codegen-units = 1 #47745

Open

rust-highfive added the P-medium Medium priority label Feb 22, 2018

rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Oct 9, 2022

nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Oct 9, 2022

rustbot added I-compiler-nominated Nominated for discussion during a compiler team meeting. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 10, 2022

This comment was marked as abuse.

Sign in to view

rustbot removed the I-compiler-nominated Nominated for discussion during a compiler team meeting. label Oct 17, 2022

epage added this to testing-devex backlog Oct 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regressions of compiled code over the last year #47561

Performance regressions of compiled code over the last year #47561

Marwes commented Jan 19, 2018

eddyb commented Jan 19, 2018

michaelwoerister commented Jan 19, 2018

Marwes commented Jan 19, 2018 •

edited

Loading

Marwes commented Jan 19, 2018

michaelwoerister commented Jan 22, 2018

Marwes commented Jan 22, 2018

kennytm commented Jan 22, 2018

arielb1 commented Jan 22, 2018

kennytm commented Jan 22, 2018

kennytm commented Jan 22, 2018

arielb1 commented Jan 23, 2018 •

edited

Loading

Marwes commented Feb 7, 2018

nikomatsakis commented Feb 22, 2018

steveklabnik commented Sep 22, 2019

Marwes commented Sep 24, 2019

saethlin commented Oct 9, 2022 •

edited

Loading

saethlin commented Oct 10, 2022

This comment was marked as abuse.

apiraino commented Oct 17, 2022

Performance regressions of compiled code over the last year #47561

Performance regressions of compiled code over the last year #47561

Comments

Marwes commented Jan 19, 2018

cargo bench --bench http against /~https://github.com/Marwes/combine/tree/v2.3.2

cargo-0.18.0-nightly (a73a665 2017-02-14)

rustc 1.19.0-nightly (554c685 2017-06-14)

rustc 1.22.0-nightly (3681220 2017-09-06)

rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=0)

rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=1)

eddyb commented Jan 19, 2018

michaelwoerister commented Jan 19, 2018

Marwes commented Jan 19, 2018 • edited Loading

env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0

rustc 1.17.0-nightly (62eb605 2017-02-15)

rustc 1.25.0-nightly (0f9c784 2018-01-17)

rustc 1.25.0-nightly (0f9c784 2018-01-17)

rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

Marwes commented Jan 19, 2018

michaelwoerister commented Jan 22, 2018

Marwes commented Jan 22, 2018

rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

kennytm commented Jan 22, 2018

arielb1 commented Jan 22, 2018

kennytm commented Jan 22, 2018

kennytm commented Jan 22, 2018

arielb1 commented Jan 23, 2018 • edited Loading

Marwes commented Feb 7, 2018

nikomatsakis commented Feb 22, 2018

steveklabnik commented Sep 22, 2019

Marwes commented Sep 24, 2019

saethlin commented Oct 9, 2022 • edited Loading

saethlin commented Oct 10, 2022

This comment was marked as abuse.

apiraino commented Oct 17, 2022

`cargo bench --bench http` against /~https://github.com/Marwes/combine/tree/v2.3.2

rustc 1.19.0-nightly (`554c685` 2017-06-14)

rustc 1.22.0-nightly (`3681220` 2017-09-06)

rustc 1.25.0-nightly (`6828cf9` 2018-01-06) (CARGO_INCREMENTAL=0)

rustc 1.25.0-nightly (`6828cf9` 2018-01-06) (CARGO_INCREMENTAL=1)

Marwes commented Jan 19, 2018 •

edited

Loading

rustc 1.17.0-nightly (`62eb605` 2017-02-15)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

rustc 1.25.0-nightly (`0f9c784` 2018-01-17) (codegen-units=1 -Z thintlto=no lto=true)

arielb1 commented Jan 23, 2018 •

edited

Loading

saethlin commented Oct 9, 2022 •

edited

Loading