Very poor performance of trigonometric functions #19284
When you say "custom js code", what do you mean exactly?

It is custom JS code inside an EM_ASM_ block, kept as close as possible to the C++ one.
Yes, although the difference here is that JavaScript has trig functions builtin (e.g. Math.atan2), but IIRC WebAssembly does not. So your C/C++ code will call into atan2() code which is implemented in userspace in terms of lower-level math functions: /~https://github.com/emscripten-core/emscripten/blob/main/system/lib/libc/musl/src/math/atan.c
@JeromeDesfieux Out of curiosity, did you try with the linker flag …?
@kripken As anticipated, using the linker flag …
I tried using -O3 instead of -O2 and I tried adding ….

I must admit that this makes our use case for WebAssembly compromised... Do you have any ideas about what I can do to improve this? I guess I would need to rewrite that musl part of the code with a faster (and less accurate) algorithm? Or is it a possibility that WebAssembly provides a builtin implementation of those functions?

Thanks for your help. Note that I will try using other math libs (like gml or gcem). I will post the bench results. (I also found this StackOverflow discussion where I posted a link to this issue: …)
@kripken it looks like the implementation of …
In fact the autotest I am doing executes multiple (a lot of) computations inside the same block I measure. So there is an overhead for each call to cos/sin if using …. On the other hand, the custom EM_ASM code wraps the whole loop (so only one EM_ASM overhead for the entire test, my goal being to measure the JavaScript time).
Ah ok, yeah, that makes sense, thanks.
@JeromeDesfieux Interesting, thanks for the info.

Overall, I think the question is whether native code has some ability wasm doesn't. I mean, it's possible atan etc. in your libc is using some inline assembly CPU magic that can't be expressed in wasm.

To get a more apples-to-apples comparison, maybe provide the atan etc. implementation in the source files you build. That is, don't rely on libc to implement it, but use some C++ library that you can compile to both native and wasm. (Just adding source files to compile+link should work - they will override the libc versions.)

If there is still a difference, then perhaps there is something we can improve on compiling that library. If there isn't a difference, then you would be able to get good wasm performance by using that math library instead of the default math code in emscripten's libc (which is basically musl, and like all codebases has some compromises between size and speed etc. - so you may be able to do better, for your use case, with another math library).
@JeromeDesfieux: When benchmarking native vs. WebAssembly math functions, keep in mind that in emscripten speed is sometimes traded for code size. See e.g. #15544. Disclaimer: I don't know the implementation details of the musl trigonometric functions.
Thanks all for your help. I changed my autotest to make it more accurate (bigger dataset, more iterations) and to target only one trig function (cosine). I tried other implementations (gcem and the V8 one) with very similar results --> WebAssembly is always 30-40% slower than native C++. Note that I have other benchmarks about arithmetic where wasm is very close to native C++, so I guess there is something very specific about those trigonometric functions.

When I tested the V8 algorithm, I copied the whole cosine code into my cpp so it is exactly the same code that compiles to native and wasm, but I still see this +30%. Looking at the code, I can see that it uses a lot of binary operations (&, >>, <<, ...). Example of such a macro used in the cosine computation:

```cpp
#define GET_HIGH_WORD(i, d)                      \
  do {                                           \
    uint64_t bits = base::bit_cast<uint64_t>(d); \
    (i) = bits >> 32;                            \
  } while (false)

// with bit_cast being:
template <class Dest, class Source>
inline Dest bit_cast(Source const& source) {
  static_assert(sizeof(Dest) == sizeof(Source),
                "source and dest must be same size");
  Dest dest;
  memcpy(&dest, &source, sizeof(dest));
  return dest;
}
```
I would expect bitwise operations like … to be fast. What wasm does ….

(Otherwise, I think inspecting the machine code in both native and wasm is really the only way to understand a 30-40% slowdown. Comparing to other VMs might also be helpful to get context.)
Use handwritten asm.js to speed up critical code using trigonometric functions.

```js
// Original benchmark code
(function() {
  let benchmarkData = [];
  const nbData = 1000;
  for(let i=0; i<nbData; ++i)
    benchmarkData[i] = i;
  let out = 0;
  const start = Date.now();
  for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
  {
    const sinI = Math.sin(benchmarkData[i]);
    for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
      out += Math.atan2(sinI, Math.cos(benchmarkData[j]));
  }
  const end = Date.now();
  console.log("[JS] (out is " + out + ")");
  return end-start; // takes 135ms in Chrome on my machine
})();
```

Below is my handwritten asm.js. It's 3 times faster than the former JavaScript.

```js
(function(console) {
  "use strict";
  const nbData = 1000;
  var benchmarkData = new Float64Array(16384);
  for(let i=0; i<nbData; ++i)
    benchmarkData[i] = i;

  function asmModuleInit(stdlib, foreign, heap) {
    "use asm";
    // shared variables
    var fround = stdlib.Math.fround;
    var sin = stdlib.Math.sin;
    var cos = stdlib.Math.cos;
    var atan2 = stdlib.Math.atan2;
    var benchmarkData = new stdlib.Float64Array(heap);
    // function declarations
    function benchmark(len) {
      len = len | 0;
      var i=0, j=0;
      var out=0.0, sinI=0.0;
      len = len << 3;
      for(i=0; (i|0) != (len|0); i=i+8|0) {
        sinI = sin(+benchmarkData[i >> 3]);
        for(j=0; (j|0) != (len|0); j=j+8|0)
          out = out + atan2(sinI, cos(+benchmarkData[j >> 3]));
      }
      return out;
    }
    // export function
    return { benchmark: benchmark };
  }

  const asmModule = asmModuleInit(window, {}, benchmarkData.buffer);
  var len = nbData-1|0;
  asmModule.benchmark(len); // warm up because chrome doesn't compile AOT
  console.time("asm.js code");
  asmModule.benchmark(len);
  console.timeEnd("asm.js code"); // takes 44ms in Chrome on my machine
})(console);
```
Why use JavaScript functions when LLVM provides intrinsics for most of them? https://llvm.org/docs/LangRef.html#llvm-atan2-intrinsic
I stumbled over this in the context of https://crrev.com/c/6083877 (a V8-side change to how calls to imported JS … are handled). Emscripten should probably change its JavaScript code when using …:

… which prevents us from using the Well-Known-Imports optimization linked above. Without the JS wrapper function, performance is substantially better (on an x64 workstation and a V8 ToT build: ~5300ms down to ~2600ms). That is, the Wasm module should instead just directly import the …
Also a note on these kinds of micro-benchmarks. In this case, for accurately measuring peak performance (and comparing it to native code), one has to increase the iteration count (to tier up to the optimizing compiler, and dilute the compilation time overhead in the measurement), or, better, directly pass ….
Whether to use …
Some performance numbers on x64 with a tip-of-tree d8 release build and a slightly modified version of the benchmark in the original post (compiled with …):
Note that (i) …. Code size: …
Obviously, relative code size improvements would be smaller with larger, more realistic programs. Given those performance numbers, it probably makes sense to enable ….
On November 11 (so before these optimizations landed) I did my own benchmarks:
(JSC was on my iPhone 13)
Yes, this would be really nice. One tricky aspect is that the current JS library functions (e.g. JS_cos) can also be called from JS, so we would need some new mechanism to signal that these are only available as a wasm import.
@sbc100 Alternatively, perhaps we could add an acorn optimization for this pattern:

```js
function foo(x) { return bar(x) }
[..]
var wasmImports = {
  [..]
  foo,
```

=>

```js
function foo(x) { return bar(x) }
[..]
var wasmImports = {
  [..]
  bar, // only this changed
```

That is, we know that function identity does not matter in wasm imports, so we can optimize such pass-through functions to their target. (Other optimizations can then remove ….)

However, I'm not sure how important this is, given ….
Yes, or perhaps acorn could even inline all otherwise-unused functions into the imports struct. I wonder if closure does this already, and if not, then why not?
Inlining isn't enough, as the pass-through function would remain. And the pass-through can't be removed unless we know function identity does not matter - which we know about wasm imports, but in general, the code could compare the function reference to something else and see the difference that the wrapper makes. |
See the comments at the top of `emscripten/js_math.h` for why JS versions of these functions are not needed. As a followup I plan to map `jsmath.c` functions to `em_math.h` functions instead of using EM_JS here. See emscripten-core#19284
A couple of … (emscripten/src/library_math.js, line 22 in 1171ada). Is that going to be enough to make this work? E.g. …
Unlike the current EM_JS implementations, the `em_math.h` functions map directly to `Math.xxx` without a wrapper function. The other advantage of doing it this way is that we avoid duplicate implementations of all these functions. Fixes: emscripten-core#19284
Nice, thanks!
Yes, that should work fine, just tried that out locally.
Emscripten version: 3.1.35
Tests done on a Windows desktop, i7 12th gen, 32GB RAM (native compiler is MSVC 2022; browser is Chrome for the wasm tests).
Build configuration is RelWithDebInfo (O2) for both native and wasm builds.
I am doing some benchmarking and I am very surprised by the results I get regarding the performance of the trigonometric functions (math.h). Basically I wrote a program doing a combination of atan2, cos and sin that I execute on test data.
I have the following results on my computer:
I tried many other approaches for this test (including a Catch2 benchmark and using a non-sorted dataset) and I always get very poor results with WebAssembly (approx. 50% slower than JS).
Is there a known reason for that? Can I improve it?
(Please find attached the source code)
Benchmark_trigo.zip