src: improve base64 encoding performance #39701
Conversation
The improvement (you're sure you tested with -O3 in both cases, right?) is somewhat surprising, but sure.
-O3 is the default in node builds, which I did not touch.
I'm not as surprised, considering this reads more bytes per loop iteration (and in one go) than before, which is not something I'm sure a compiler would necessarily know to do.
Yeah, I of course believe you and assumed it was run with -O3 (otherwise I wouldn't have LGTMd). I'm just surprised about:
It sounds like a pretty safe and straightforward form of loop unrolling, which is a fairly simple/standard optimisation.
// Read in chunks of 8 bytes for as long as possible
while (i < n64) {
  const uint64_t dword = *reinterpret_cast<const uint64_t*>(src + i);
Some architectures don't support unaligned 64-bit access, so this could end up being somewhat slower on those. But still an optimization on most systems!
That's why I was curious to check on ARM. From what I'm seeing, though, ARMv7 and newer support unaligned reads, with ARMv7 supposedly lacking support for only two instructions?
Have you tried reading 1 or 4 bytes at a time until 8-byte alignment is reached?
Alright, after more wrangling than I wanted, I managed to get some armv7 binaries compiled and ran the benchmarks on a Pi 2B before and after the changes in this PR:
I don't have an armv8 board available to test on.
I can run the benchmark on a Pi 4B. Should I compare with master or 16.6.1?
Compared with 16.6.1:
Can you link the issue where this is too slow in the real world? Not sure 15% perf improvement on a synthetic benchmark warrants this change.
"Synthetic" benchmarks are all we have at the moment, after the benchmarking WG was dechartered/decommissioned. IMO base64 encoding/decoding is not something that I'd exactly call an uncommon task, especially within the realm of web applications. In general, any sensible changes that provide a measurable improvement should be welcomed, especially considering the number of performance hits we continue to take on over time as features get added and as V8 evolves. The improvements all add up. Anyway, as far as base64 encoding goes, I also have an alternative PR here that you may be more interested in. 🤷♂️ |
Closing in favor of #39775 |
Benchmark results on a Core i7-3770K:
I was tempted to run the benchmark on ARM out of curiosity, but at the moment I don't have a machine with a new enough build environment (or one whose OS I can easily swap), and trying to cross-compile node.js with a Linaro aarch64 toolchain is impossible.