
Very poor performance of trigonometric functions #19284

Closed
JeromeDesfieux opened this issue May 3, 2023 · 24 comments · Fixed by #23147
Comments

@JeromeDesfieux
Contributor

JeromeDesfieux commented May 3, 2023

Emscripten version: 3.1.35
Tests done on a Windows desktop (i7 12th gen, 32 GB RAM); native compiler is MSVC 2022, browser is Chrome for the wasm tests.
Build configuration is RelWithDebInfo (O2) for both native and wasm builds.

I am doing some benchmarking and I am very surprised by the results I get regarding the performance of the trigonometric functions (math.h).

Basically, I wrote a program doing a combination of atan2, cos and sin that I execute on test data.
I get the following results on my computer:

  • Native C++: 1.7 ms
  • WebAssembly: 3 ms
  • JavaScript: 2 ms

I tried many other approaches for this test (including a Catch2 benchmark and a non-sorted dataset) and I always get very bad results with WebAssembly (approx. 50% slower than JS).

Is there a known reason for that? Is there anything I can do to improve it?

(Please find attached the source code)
Benchmark_trigo.zip

@sbc100
Collaborator

sbc100 commented May 3, 2023

When you say JavaScript: 2 ms are you referring to the output of emcc -sWASM=0, or some other implementation written directly in JS?

@JeromeDesfieux
Contributor Author

It is custom JS code inside an EM_ASM block, kept as close as possible to the C++ version.
If my understanding is correct, that code is executed as-is by the JavaScript engine. Is that right?

@sbc100
Collaborator

sbc100 commented May 3, 2023

Yes, although the difference here is that JavaScript has trig functions built in (e.g. Math.atan2), but IIRC WebAssembly does not. So your C/C++ code will call into atan2() code that is implemented in userspace in terms of lower-level math functions: /~https://github.com/emscripten-core/emscripten/blob/main/system/lib/libc/musl/src/math/atan.c
Without investigating further, I would guess that this is part of the reason for the differences you are seeing.
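To illustrate what a "userspace" implementation means here, the following is a hedged sketch (not musl's actual algorithm) of a truncated Taylor series for sine. This is the kind of multi-operation code a wasm module must execute where a JS engine can instead dispatch to a single tuned builtin:

```javascript
// Illustrative only: a truncated Taylor series for sin(x),
// x - x^3/3! + x^5/5! - x^7/7!, written in nested (Horner-like) form.
// Assumption: |x| is small enough that four terms give good accuracy.
function taylorSin(x) {
  const x2 = x * x;
  return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)));
}
```

For small inputs this tracks Math.sin closely (for x = 0.5 the error is below 1e-8); a production libm additionally performs argument reduction and uses higher-degree polynomials, which is where much of the cost goes.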

@kripken
Member

kripken commented May 3, 2023

@JeromeDesfieux Out of curiosity, did you try with the linker flag -sJS_MATH ? That makes the wasm use JS's math functions. The overhead of calling out to JS might make it not worth it (this option is usually for code size), but it's worth comparing.

@JeromeDesfieux
Contributor Author

@kripken As anticipated, using the linker flag -sJS_MATH gives worse results (wasm is then 3 times slower than JS).

@JeromeDesfieux
Contributor Author

JeromeDesfieux commented May 4, 2023

I tried using O3 instead of O2, and I tried adding the -msimd128 compiler flag, without noticeable improvements.

I must admit that this compromises our use case for WebAssembly...
Our application is a graphics app that does a lot of trigonometry every frame. Most of these trig computations are done on the CPU during a kind of pre-culling phase that we can't push to the GPU...

Do you have any ideas about what I can do to improve this? I guess I would need to rewrite that musl code with a faster (and less accurate) algorithm? Or could WebAssembly provide built-in implementations of those functions?

Thanks for your help

Note that I will try using other math libraries (like gml or gcem). I will post the benchmark results.

(I also found this Stack Overflow discussion, where I posted a link to this issue:
https://stackoverflow.com/a/76171112/19077851)
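As one data point for the "faster but less accurate" route mentioned above, here is a sketch of a classic Bhaskara-style sine approximation. This is illustrative, not a drop-in musl replacement, and it assumes the input has already been range-reduced to [0, π]:

```javascript
// Bhaskara I's sine approximation: a rational function that is exact
// at 0, pi/2 and pi, with worst-case error around 0.0016 in between.
// Assumption: x is already reduced to [0, pi]; a real replacement
// would need argument reduction for arbitrary inputs.
function fastSin(x) {
  const p = x * (Math.PI - x);
  return (16 * p) / (5 * Math.PI * Math.PI - 4 * p);
}
```

Whether the roughly 1e-3 accuracy loss is acceptable depends entirely on the application; for pre-culling it may well be, but that is a judgment call for each use case.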

@dschuff
Member

dschuff commented May 5, 2023

@kripken it looks like the implementation of -sJS_MATH just calls directly out to the relevant JS math functions via EM_JS, so it seems very odd that it's noticeably worse than a custom EM_ASM block that does the same thing. That seems worth investigating, since that is exactly what we'd want in this situation.

@JeromeDesfieux
Contributor Author

JeromeDesfieux commented May 5, 2023

In fact, the test I am doing executes many computations (a lot of them) inside the same measured block:

        auto start = std::chrono::high_resolution_clock::now();
        double out{0.};
        for(auto i=0 ; i < benchmarkData.size()-1 ; i+=1)
        {
            const double sinI = sin(benchmarkData[i]);
            for(auto j=0 ; j < benchmarkData.size()-1 ; j+=1)
                out += sinI * cos(benchmarkData[j]);
        }
        auto end = std::chrono::high_resolution_clock::now();

So there is an overhead for each call to cos/sin when using -sJS_MATH.

On the other hand, the custom EM_ASM code wraps the whole loop (so only one EM_ASM overhead for the entire test; my goal is to measure JavaScript time):

        double time_taken = EM_ASM_DOUBLE(({
                let benchmarkData = [];
                const nbData = 10000;
                for(let i=0; i<nbData; ++i)
                    benchmarkData[i] = i;

                let out = 0;
                const start = Date.now();
                for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
                {
                    const sinI = Math.sin(benchmarkData[i]);
                    for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
                        out += sinI * Math.cos(benchmarkData[j]);
                }
                const end = Date.now();
                console.log("[JS] (out is " + out + ")");

                return end-start;
            }));

@dschuff
Member

dschuff commented May 5, 2023

Ah ok yeah that makes sense, thanks.

@kripken
Member

kripken commented May 5, 2023

@JeromeDesfieux Interesting, thanks for the info.

Overall, I think the question is whether native code has some ability wasm doesn't. I mean, it's possible atan etc. in your libc is using some inline assembly CPU magic that can't be expressed in wasm. To get a more apples-to-apples comparison, maybe provide the atan etc. implementation in the source files you build. That is, don't rely on libc to implement it, but use some C++ library that you can compile to both native and wasm. (Just adding source files to compile+link should work - they will override the libc versions.)

If there is still a difference, then perhaps there is something we can improve on compiling that library. If there isn't a difference, then you would be able to get good wasm performance by using that math library instead of the default math code in emscripten's libc (which is basically musl, and like all codebases has some compromises between size and speed etc. - so you may be able to do better, for your use case, with another math library).

@gl84
Contributor

gl84 commented May 7, 2023

@JeromeDesfieux: When benchmarking native vs. WebAssembly math functions, keep in mind that in Emscripten speed is sometimes traded for code size.

See e.g. #15544.

Disclaimer: I don't know the implementation details for musl trigonometric functions.

@JeromeDesfieux
Contributor Author

Thanks all for your help.

I changed my test to make it more accurate (bigger dataset, more iterations) and to target only one trig function (cosine).

I tried other implementations (gcem and the V8 one) with very similar results: WebAssembly is always 30-40% slower than native C++. Note that I have other benchmarks on arithmetic where wasm is very close to native C++, so I guess there is something very specific about these trigonometric functions.

When I tested the V8 algorithm, I copied the whole cosine code into my cpp so that exactly the same code compiles to native and wasm, but I still see this +30%. Looking at the code, I can see that it uses a lot of binary operations (&, >>, <<, ...).
Could those operators be the source of the slowness?

Example of such macro used in cosine computation:

#define GET_HIGH_WORD(i, d)                      \
  do {                                           \
    uint64_t bits = base::bit_cast<uint64_t>(d); \
    (i) = bits >> 32;                            \
  } while (false)

// with bit_cast being:
template <class Dest, class Source>
inline Dest bit_cast(Source const& source) {
    static_assert(sizeof(Dest) == sizeof(Source),
                  "source and dest must be same size");
    Dest dest;
    memcpy(&dest, &source, sizeof(dest));
    return dest;
}
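For reference, the same bit-level extraction can be sketched in JavaScript with typed-array views (a hypothetical helper, mirroring what the C++ macro above does; assumes a little-endian host):

```javascript
// Reinterpret a double's 64 bits and return the upper 32, like
// GET_HIGH_WORD above. On a little-endian host the high word of the
// Float64 sits at Uint32 index 1 of the same buffer.
function getHighWord(d) {
  const buf = new ArrayBuffer(8);
  new Float64Array(buf)[0] = d;
  return new Uint32Array(buf)[1];
}
```

For example, getHighWord(1.0) is 0x3FF00000: sign bit 0, biased exponent 0x3FF, top mantissa bits 0. This is plain data movement and bit shifting, which is exactly the category of operation that compiles to cheap wasm instructions.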

@kripken
Member

kripken commented May 9, 2023

I would expect bitwise operations like << to run at full speed in wasm. It's things that require sandboxing, such as memory accesses and calls, that can be slower.

What does base::bit_cast lower into in wasm?

(Otherwise, I think inspecting the machine code in both native and wasm is really the only way to understand a 30-40% slowdown. Comparing to other VMs might also be helpful to get context.)

@anonyco

anonyco commented Nov 18, 2023

Use handwritten asm.js to speed up critical code using trigonometric functions.

// Original benchmark code
(function() {
	let benchmarkData = [];
	const nbData = 1000;
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	let out = 0;
	const start = Date.now();
	for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
	{
		const sinI = Math.sin(benchmarkData[i]);
		for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
			out += Math.atan2(sinI, Math.cos(benchmarkData[j]));
	}
	const end = Date.now();
	console.log("[JS] (out is " + out + ")");

	return end-start; // takes 135ms in Chrome on my machine
})();

Below is my handwritten asm.js. It's 3 times faster than the plain JavaScript above.

(function(console) {
	"use strict";
	
	const nbData = 1000;
	var benchmarkData = new Float64Array(16384);
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	function asmModuleInit(stdlib, foreign, heap) {
		"use asm";
		// shared variables
		var fround = stdlib.Math.fround;
		var sin = stdlib.Math.sin;
		var cos = stdlib.Math.cos;
		var atan2 = stdlib.Math.atan2;
		var benchmarkData = new stdlib.Float64Array(heap);
		
		// function declarations
		function benchmark(len) {
			len = len | 0;
			var i=0, j=0;
			var out=0.0, sinI=0.0;
			len = len << 3;
			for(i=0; (i|0) != (len|0); i=i+8|0) {
				sinI = sin(+benchmarkData[i >> 3]);
				for(j=0; (j|0) != (len|0); j=j+8|0)
					out = out + atan2(sinI, cos(+benchmarkData[j >> 3]));
			}
			return out;
		}
		
		// export function
		return { benchmark: benchmark };
	}
	const asmModule = asmModuleInit(window, {}, benchmarkData.buffer);

	var len = nbData-1|0;
	asmModule.benchmark(len); // warm up because chrome doesn't compile AOT 
	
	console.time("asm.js code");
	asmModule.benchmark(len);
	console.timeEnd("asm.js code"); // takes 44ms in Chrome on my machine
})(console);

@bangbangsheshotmedown

bangbangsheshotmedown commented Nov 11, 2024

@danleh

danleh commented Dec 12, 2024

I stumbled on this in the context of https://crrev.com/c/6083877 (a V8-side change to how calls to imported JS Math.* functions are optimized). CC'ing @jakobkummerow, the expert on this.

Emscripten should probably change the JavaScript code it emits with -sJS_MATH=1, because right now it won't be optimized by V8 as well as it could be. @kripken
Concretely, I see with -sJS_MATH=1 (and -g for readability):

function JS_cos(x) { return Math.cos(x) }
...
var wasmImports = {
  JS_cos,
  ...
}

which prevents us from using the Well-Known-Imports optimization linked above. Without the JS wrapper function performance is substantially better (on an x64 workstation and V8 ToT build: ~5300ms down to 2600ms). That is, the Wasm module should instead just directly import the Math.* functions:

var wasmImports = {
  JS_cos: Math.cos,
  ...
}

Also a note on these kinds of micro-benchmarks. To accurately measure peak performance (and compare it to native code), one has to increase the iteration count (to tier up to the optimizing compiler, and to dilute the compilation-time overhead in the measurement), or better, pass --no-liftoff directly when running in d8.
In other cases (e.g., ones that depend on inlining or speculative optimizations), Liftoff is needed for collecting feedback.
Finally, Wasm in V8 doesn't have on-stack replacement, so even with many iterations we might never tier up to the optimizing compiler unless the benchmark function is entered multiple times.
Summarizing: accurately measuring microbenchmarks is tricky, so we need to be careful.
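The warm-up-and-repeat advice above can be sketched as a small timing helper (a hypothetical function, not part of Emscripten or d8; assumes only Date.now, and real measurements would still want the d8 flags described):

```javascript
// Sketch: call the hot function several times first so the engine can
// collect feedback and tier up to its optimizing compiler, then average
// many timed runs to dilute remaining compilation overhead.
function measure(fn, iters = 50) {
  for (let i = 0; i < 5; i++) fn(); // warm-up: allow tier-up before timing
  const start = Date.now();
  for (let i = 0; i < iters; i++) fn();
  return (Date.now() - start) / iters; // mean ms per call
}
```

Note the caveat above still applies: without on-stack replacement, the timed function must be entered multiple times (as here), not merely loop for a long time inside a single call.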

@danleh

danleh commented Dec 12, 2024

Whether to use -sJS_MATH=1 by default is an orthogonal question. The considerations seem to be (see also https://emscripten.org/docs/tools_reference/settings_reference.html#js-math):

  • performance (with engine optimization, it might be even performance neutral or positive, see numbers below),
  • code size (using JS_MATH is smaller),
  • portability (performance across other engines),
  • difference in results? (JS implementation vs. libc++/cmath versions), and also
  • whether the default should be different for asm.js vs. Wasm (right now it's off-by-default for both).

Some performance numbers on x64 with a tip-of-tree d8 release build and a slightly modified version of the benchmark in the original post (compiled with emcc -O3 ..., version 4171ae2):

| variant | runtime (ms) |
| --- | --- |
| native | 2992 |
| pure JS implementation | 2723 |
| asm.js (default) | 6744 |
| asm.js (with JS_MATH) | 2641 |
| Wasm (default, no additional d8 flags) | 3549 |
| Wasm (default, d8 flag --no-liftoff) | 2845 |
| Wasm (with JS_MATH as upstream, d8 flag --no-liftoff) | 5360 |
| Wasm (with JS_MATH + fix as proposed in previous comment, d8 flag --no-liftoff) | 2633 |

Note that (i) --no-liftoff is important to properly compare peak performance, and (ii) with the well-known-imports optimization in V8, the JS_MATH version is even faster than the userspace musl implementation.

Code size:

| variant | JS code + Wasm binary (bytes) |
| --- | --- |
| asm.js (default) | 365870 |
| asm.js (with JS_MATH) | 353227 (-12643 or -3.5%) |
| Wasm (default) | 214347 |
| Wasm (with JS_MATH) | 206745 (-7602 or -3.5%) |

Obviously relative code size improvements would be smaller with larger, more realistic programs.

Given those performance numbers, it probably makes sense to enable JS_MATH for asm.js by default. For Wasm, that would depend on whether similar optimizations/performance is achieved in other engines.

@CryZe
Contributor

CryZe commented Dec 12, 2024

For Wasm, that would depend on whether similar optimizations/performance is achieved in other engines.

On November 11 (so before these optimizations landed) I did my own benchmarks:

| Variant | Time Per Call |
| --- | --- |
| V8 Native wasm sin | 4.22ns |
| V8 Math.sin import | 6.37ns |
| SM Native wasm sin | 4.0ns |
| SM Math.sin import | 5.35ns |
| JSC Native wasm sin | 8.2ns |
| JSC Math.sin import | 9.57ns |
| Native sin | 3.5ns |

(JSC was on my iPhone 13)

@sbc100
Collaborator

sbc100 commented Dec 12, 2024

var wasmImports = {
  JS_cos: Math.cos,
  ...
}

Yes, this would be really nice. One tricky aspect is that the current JS library functions (e.g. JS_cos) can also be called from JS, so we would need some new mechanism to signal that a function is only available as a wasm import.
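One possible shape for such a mechanism, sketched here as an assumption rather than a proposal: bind the builtin once and use that single reference both for JS callers and in the import object, so no wrapper function exists at all.

```javascript
// Hypothetical sketch: a single bound reference serves both JS code
// and the wasm import object, so the engine sees Math.cos directly.
const JS_cos = Math.cos;   // still callable from JS as before
const wasmImports = {
  JS_cos,                  // imported directly, no pass-through wrapper
};
```

Because the import is the builtin itself, an engine with a well-known-imports optimization can recognize it; JS-side callers are unaffected since they call the same function object.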

@kripken
Member

kripken commented Dec 12, 2024

@sbc100 Alternatively, perhaps we could add an acorn optimization for

function foo(x) { return bar(x) }
[..]
var wasmImports = {
[..]
  foo,

=>

function foo(x) { return bar(x) }
[..]
var wasmImports = {
[..]
  bar, // only this changed

That is, we know that function identity does not matter in wasm imports, so we can optimize such pass-through functions to their target. (Other optimizations can then remove foo, if it has no other uses - but it would be fine to have other uses, and leave it.)

However I'm not sure how important this is, given JS_MATH is mainly useful for code size. But maybe that "skip pass-through functions in wasm imports" pattern could help other things?

@sbc100
Collaborator

sbc100 commented Dec 12, 2024

Yes, or perhaps acorn could even inline all otherwise-unused functions into the imports object. I wonder if closure does this already, and if not, why not?

@kripken
Member

kripken commented Dec 12, 2024

Inlining isn't enough, as the pass-through function would remain. And the pass-through can't be removed unless we know function identity does not matter, which we know for wasm imports; but in general, the code could compare the function reference to something else and observe the difference the wrapper makes.
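The identity point can be seen directly in JS: a wrapper behaves exactly like Math.sin but is a distinct function object, which any code that compares references can observe (illustrative sketch):

```javascript
// A pass-through wrapper: same results, different identity.
function wrapper(x) { return Math.sin(x); }

console.log(wrapper(1) === Math.sin(1)); // true: behavior is identical
console.log(wrapper === Math.sin);       // false: identity differs
```

Wasm imports only ever call the imported function, never compare it, which is why substituting the target there is safe while a general JS-level rewrite is not.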

sbc100 added a commit to sbc100/emscripten that referenced this issue Dec 12, 2024
See the comments at the top of `emscripten/js_math.h` for why JS
versions of these functions are not needed.

As a followup I plan to map `jsmath.c` functions to `em_math.h`
functions instead of using EM_JS here.

See emscripten-core#19284
@sbc100
Collaborator

sbc100 commented Dec 12, 2024

A couple of -sJS_MATH cleanups: #23144 #23143. Once those two land, I have a followup that has -sJS_MATH use library_math.js, and those already don't have wrapper functions, so it might be enough:

emscripten_math_cos: 'Math.cos',

Is that going to be enough to make this work? e.g.

var mysin = Math.sin;
var Imports = {
  sin: mysin,
}

sbc100 added a commit to sbc100/emscripten that referenced this issue Dec 13, 2024
Unlike the current EM_JS implementations, the `em_math.h` functions map
directly to `Math.xxx` without a wrapper function.

The other advantage of doing it this way is that we avoid duplicate
implementations of all these functions.

Fixes: emscripten-core#19284
@sbc100 sbc100 closed this as completed in 5efa98e Dec 13, 2024
@danleh

danleh commented Dec 13, 2024

Nice, thanks!

Is that going to be enough to make this work? e.g.

var mysin = Math.sin;
var Imports = {
  sin: mysin,
}

Yes, that should work fine; I just tried it out locally.
