
Very poor performance of trigonometric functions #19284

Closed
JeromeDesfieux opened this issue May 3, 2023 · 24 comments · Fixed by #23147
Comments

@JeromeDesfieux
Contributor

JeromeDesfieux commented May 3, 2023

Emscripten version: 3.1.35
Tests done on a Windows desktop (i7 12th gen, 32 GB RAM); native compiler is MSVC 2022, browser is Chrome for the wasm tests.
Build configuration is RelWithDebInfo (O2) for both native and wasm builds.

I am doing some benchmarking and I am very surprised by the results I get regarding the performance of the trigonometric functions (math.h).

Basically, I wrote a program doing a combination of atan2, cos and sin that I execute on test data.
I get the following results on my computer:

  • Native C++: 1.7 ms
  • WebAssembly: 3 ms
  • JavaScript: 2 ms

I tried many other approaches for this test (including a Catch2 benchmark and a non-sorted dataset) and I always get very bad results with WebAssembly (approx. 50% slower than JS).

Is there a known reason for that? Is there anything I can do to improve it?

(Please find attached the source code)
Benchmark_trigo.zip

@sbc100
Collaborator

sbc100 commented May 3, 2023

When you say JavaScript: 2 ms are you referring to the output of emcc -sWASM=0, or some other implementation written directly in JS?

@JeromeDesfieux
Contributor Author

It is custom JS code inside an EM_ASM block, kept as close as possible to the C++ version.
If my understanding is correct, that code is executed as-is by the JavaScript engine. Is that right?

@sbc100
Collaborator

sbc100 commented May 3, 2023

Yes, although the difference here is that JavaScript has trig functions built in (e.g. Math.atan2), but IIRC WebAssembly does not. So your C/C++ code will call into atan2() code that is implemented in userspace in terms of lower-level math functions: /~https://github.com/emscripten-core/emscripten/blob/main/system/lib/libc/musl/src/math/atan.c
Without investigating further, I would guess that this is part of the reason for the differences you are seeing.
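To illustrate what a "userspace" implementation means here, the following is a hedged sketch (not musl's actual algorithm) of a truncated Taylor series for sine. This is the kind of multi-operation code a wasm module must execute where a JS engine can instead dispatch to a single tuned builtin:

```javascript
// Illustrative only: a truncated Taylor series for sin(x),
// x - x^3/3! + x^5/5! - x^7/7!, written in nested (Horner-like) form.
// Assumption: |x| is small enough that four terms give good accuracy.
function taylorSin(x) {
  const x2 = x * x;
  return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)));
}
```

For small inputs this tracks Math.sin closely (for x = 0.5 the error is below 1e-8); a production libm additionally performs argument reduction and uses higher-degree polynomials, which is where much of the cost goes.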

@kripken
Member

kripken commented May 3, 2023

@JeromeDesfieux Out of curiosity, did you try with the linker flag -sJS_MATH ? That makes the wasm use JS's math functions. The overhead of calling out to JS might make it not worth it (this option is usually for code size), but it's worth comparing.

@JeromeDesfieux
Contributor Author

@kripken As anticipated, using the linker flag -sJS_MATH gives worse results (wasm is then 3 times slower than JS).

@JeromeDesfieux
Contributor Author

JeromeDesfieux commented May 4, 2023

I tried using O3 instead of O2, and I tried adding the -msimd128 compiler flag, without noticeable improvements.

I must admit that this compromises our use case for WebAssembly...
Our application is a graphics app that does a lot of trigonometry every frame. Most of these trig computations are done on the CPU during a kind of pre-culling phase that we can't push to the GPU...

Do you have any ideas about what I can do to improve this? I guess I would need to rewrite that musl code with a faster (and less accurate) algorithm? Or could WebAssembly provide built-in implementations of those functions?

Thanks for your help

Note that I will try using other math libraries (like gml or gcem). I will post the benchmark results.

(I also found this Stack Overflow discussion, where I posted a link to this issue:
https://stackoverflow.com/a/76171112/19077851)
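As one data point for the "faster but less accurate" route mentioned above, here is a sketch of a classic Bhaskara-style sine approximation. This is illustrative, not a drop-in musl replacement, and it assumes the input has already been range-reduced to [0, π]:

```javascript
// Bhaskara I's sine approximation: a rational function that is exact
// at 0, pi/2 and pi, with worst-case error around 0.0016 in between.
// Assumption: x is already reduced to [0, pi]; a real replacement
// would need argument reduction for arbitrary inputs.
function fastSin(x) {
  const p = x * (Math.PI - x);
  return (16 * p) / (5 * Math.PI * Math.PI - 4 * p);
}
```

Whether the roughly 1e-3 accuracy loss is acceptable depends entirely on the application; for pre-culling it may well be, but that is a judgment call for each use case.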

@dschuff
Member

dschuff commented May 5, 2023

@kripken it looks like the implementation of -sJS_MATH just calls directly out to the relevant JS math functions via EM_JS, so it seems very odd that it's noticeably worse than a custom EM_ASM block that does the same thing. That seems worth investigating, since that is exactly what we'd want in this situation.

@JeromeDesfieux
Contributor Author

JeromeDesfieux commented May 5, 2023

In fact, the test I am doing executes many computations (a lot of them) inside the same measured block:

        auto start = std::chrono::high_resolution_clock::now();
        double out{0.};
        for(auto i=0 ; i < benchmarkData.size()-1 ; i+=1)
        {
            const double sinI = sin(benchmarkData[i]);
            for(auto j=0 ; j < benchmarkData.size()-1 ; j+=1)
                out += sinI * cos(benchmarkData[j]);
        }
        auto end = std::chrono::high_resolution_clock::now();

So there is an overhead for each call to cos/sin when using -sJS_MATH.

On the other hand, the custom EM_ASM code wraps the whole loop (so only one EM_ASM overhead for the entire test; my goal is to measure JavaScript time):

        double time_taken = EM_ASM_DOUBLE(({
                let benchmarkData = [];
                const nbData = 10000;
                for(let i=0; i<nbData; ++i)
                    benchmarkData[i] = i;

                let out = 0;
                const start = Date.now();
                for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
                {
                    const sinI = Math.sin(benchmarkData[i]);
                    for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
                        out += sinI * Math.cos(benchmarkData[j]);
                }
                const end = Date.now();
                console.log("[JS] (out is " + out + ")");

                return end-start;
            }));

@dschuff
Member

dschuff commented May 5, 2023

Ah ok yeah that makes sense, thanks.

@kripken
Member

kripken commented May 5, 2023

@JeromeDesfieux Interesting, thanks for the info.

Overall, I think the question is whether native code has some ability wasm doesn't. I mean, it's possible atan etc. in your libc is using some inline assembly CPU magic that can't be expressed in wasm. To get a more apples-to-apples comparison, maybe provide the atan etc. implementation in the source files you build. That is, don't rely on libc to implement it, but use some C++ library that you can compile to both native and wasm. (Just adding source files to compile+link should work - they will override the libc versions.)

If there is still a difference, then perhaps there is something we can improve on compiling that library. If there isn't a difference, then you would be able to get good wasm performance by using that math library instead of the default math code in emscripten's libc (which is basically musl, and like all codebases has some compromises between size and speed etc. - so you may be able to do better, for your use case, with another math library).

@gl84
Contributor

gl84 commented May 7, 2023

@JeromeDesfieux: When benchmarking native vs. WebAssembly math functions, keep in mind that in Emscripten speed is sometimes traded for code size.

See e.g. #15544.

Disclaimer: I don't know the implementation details for musl trigonometric functions.

@JeromeDesfieux
Contributor Author

Thanks all for your help.

I changed my test to make it more accurate (bigger dataset, more iterations) and to target only one trig function (cosine).

I tried other implementations (gcem and the V8 one) with very similar results: WebAssembly is always 30-40% slower than native C++. Note that I have other benchmarks on arithmetic where wasm is very close to native C++, so I guess there is something very specific about these trigonometric functions.

When I tested the V8 algorithm, I copied the whole cosine code into my cpp so that exactly the same code compiles to native and wasm, but I still see this +30%. Looking at the code, I can see that it uses a lot of binary operations (&, >>, <<, ...).
Could those operators be the source of the slowness?

Example of such macro used in cosine computation:

#define GET_HIGH_WORD(i, d)                      \
  do {                                           \
    uint64_t bits = base::bit_cast<uint64_t>(d); \
    (i) = bits >> 32;                            \
  } while (false)

// with bit_cast being:
template <class Dest, class Source>
inline Dest bit_cast(Source const& source) {
    static_assert(sizeof(Dest) == sizeof(Source),
                  "source and dest must be same size");
    Dest dest;
    memcpy(&dest, &source, sizeof(dest));
    return dest;
}
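For reference, the same bit-level extraction can be sketched in JavaScript with typed-array views (a hypothetical helper, mirroring what the C++ macro above does; assumes a little-endian host):

```javascript
// Reinterpret a double's 64 bits and return the upper 32, like
// GET_HIGH_WORD above. On a little-endian host the high word of the
// Float64 sits at Uint32 index 1 of the same buffer.
function getHighWord(d) {
  const buf = new ArrayBuffer(8);
  new Float64Array(buf)[0] = d;
  return new Uint32Array(buf)[1];
}
```

For example, getHighWord(1.0) is 0x3FF00000: sign bit 0, biased exponent 0x3FF, top mantissa bits 0. This is plain data movement and bit shifting, which is exactly the category of operation that compiles to cheap wasm instructions.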

@kripken
Member

kripken commented May 9, 2023

I would expect bitwise operations like << to run at full speed in wasm. It's things that require sandboxing, such as memory accesses and calls, that can be slower.

What does base::bit_cast lower into in wasm?

(Otherwise, I think inspecting the machine code in both native and wasm is really the only way to understand a 30-40% slowdown. Comparing to other VMs might also be helpful to get context.)

@anonyco

anonyco commented Nov 18, 2023

Use handwritten asm.js to speed up critical code using trigonometric functions.

// Original benchmark code
(function() {
	let benchmarkData = [];
	const nbData = 1000;
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	let out = 0;
	const start = Date.now();
	for(let i=0 ; i < benchmarkData.length-1 ; i+=1)
	{
		const sinI = Math.sin(benchmarkData[i]);
		for(let j=0 ; j < benchmarkData.length-1 ; j+=1)
			out += Math.atan2(sinI, Math.cos(benchmarkData[j]));
	}
	const end = Date.now();
	console.log("[JS] (out is " + out + ")");

	return end-start; // takes 135ms in Chrome on my machine
})();

Below is my handwritten asm.js. It's 3 times faster than the plain JavaScript above.

(function(console) {
	"use strict";
	
	const nbData = 1000;
	var benchmarkData = new Float64Array(16384);
	for(let i=0; i<nbData; ++i)
		benchmarkData[i] = i;

	function asmModuleInit(stdlib, foreign, heap) {
		"use asm";
		// shared variables
		var fround = stdlib.Math.fround;
		var sin = stdlib.Math.sin;
		var cos = stdlib.Math.cos;
		var atan2 = stdlib.Math.atan2;
		var benchmarkData = new stdlib.Float64Array(heap);
		
		// function declarations
		function benchmark(len) {
			len = len | 0;
			var i=0, j=0;
			var out=0.0, sinI=0.0;
			len = len << 3;
			for(i=0; (i|0) != (len|0); i=i+8|0) {
				sinI = sin(+benchmarkData[i >> 3]);
				for(j=0; (j|0) != (len|0); j=j+8|0)
					out = out + atan2(sinI, cos(+benchmarkData[j >> 3]));
			}
			return out;
		}
		
		// export function
		return { benchmark: benchmark };
	}
	const asmModule = asmModuleInit(window, {}, benchmarkData.buffer);

	var len = nbData-1|0;
	asmModule.benchmark(len); // warm up because chrome doesn't compile AOT 
	
	console.time("asm.js code");
	asmModule.benchmark(len);
	console.timeEnd("asm.js code"); // takes 44ms in Chrome on my machine
})(console);

@bangbangsheshotmedown

bangbangsheshotmedown commented Nov 11, 2024

@danleh

danleh commented Dec 12, 2024

I stumbled on this in the context of https://crrev.com/c/6083877 (a V8-side change to how calls to imported JS Math.* functions are optimized). CC'ing @jakobkummerow, the expert on this.

Emscripten should probably change the JavaScript code it emits with -sJS_MATH=1, because right now it won't be optimized by V8 as well as it could be. @kripken
Concretely, I see with -sJS_MATH=1 (and -g for readability):

function JS_cos(x) { return Math.cos(x) }
...
var wasmImports = {
  JS_cos,
  ...
}

which prevents us from using the Well-Known-Imports optimization linked above. Without the JS wrapper function performance is substantially better (on an x64 workstation and V8 ToT build: ~5300ms down to 2600ms). That is, the Wasm module should instead just directly import the Math.* functions:

var wasmImports = {
  JS_cos: Math.cos,
  ...
}

Also a note on these kinds of micro-benchmarks. To accurately measure peak performance (and compare it to native code), one has to increase the iteration count (to tier up to the optimizing compiler, and to dilute the compilation-time overhead in the measurement), or better, pass --no-liftoff directly when running in d8.
In other cases (e.g., ones that depend on inlining or speculative optimizations), Liftoff is needed for collecting feedback.
Finally, Wasm in V8 doesn't have on-stack replacement, so even with many iterations we might never tier up to the optimizing compiler unless the benchmark function is entered multiple times.
Summarizing: accurately measuring microbenchmarks is tricky, so we need to be careful.
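The warm-up-and-repeat advice above can be sketched as a small timing helper (a hypothetical function, not part of Emscripten or d8; assumes only Date.now, and real measurements would still want the d8 flags described):

```javascript
// Sketch: call the hot function several times first so the engine can
// collect feedback and tier up to its optimizing compiler, then average
// many timed runs to dilute remaining compilation overhead.
function measure(fn, iters = 50) {
  for (let i = 0; i < 5; i++) fn(); // warm-up: allow tier-up before timing
  const start = Date.now();
  for (let i = 0; i < iters; i++) fn();
  return (Date.now() - start) / iters; // mean ms per call
}
```

Note the caveat above still applies: without on-stack replacement, the timed function must be entered multiple times (as here), not merely loop for a long time inside a single call.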

@danleh

danleh commented Dec 12, 2024

Whether to use -sJS_MATH=1 by default is an orthogonal question. The considerations seem to be (see also https://emscripten.org/docs/tools_reference/settings_reference.html#js-math):

  • performance (with engine optimization, it might be even performance neutral or positive, see numbers below),
  • code size (using JS_MATH is smaller),
  • portability (performance across other engines),
  • difference in results? (JS implementation vs. libc++/cmath versions), and also
  • whether the default should be different for asm.js vs. Wasm (right now it's off-by-default for both).

Some performance numbers on x64 with a tip-of-tree d8 release build and a slightly modified version of the benchmark in the original post (compiled with emcc -O3 ..., version 4171ae2):

| variant | runtime (ms) |
| --- | --- |
| native | 2992 |
| pure JS implementation | 2723 |
| asm.js (default) | 6744 |
| asm.js (with JS_MATH) | 2641 |
| Wasm (default, no additional d8 flags) | 3549 |
| Wasm (default, d8 flag --no-liftoff) | 2845 |
| Wasm (with JS_MATH as upstream, d8 flag --no-liftoff) | 5360 |
| Wasm (with JS_MATH + fix as proposed in previous comment, d8 flag --no-liftoff) | 2633 |

Note that (i) --no-liftoff is important to properly compare peak performance, and (ii) with the well-known-imports optimization in V8, the JS_MATH version is even faster than the userspace musl implementation.

Code size:

| variant | JS code + Wasm binary (bytes) |
| --- | --- |
| asm.js (default) | 365870 |
| asm.js (with JS_MATH) | 353227 (-12643 or -3.5%) |
| Wasm (default) | 214347 |
| Wasm (with JS_MATH) | 206745 (-7602 or -3.5%) |

Obviously relative code size improvements would be smaller with larger, more realistic programs.

Given those performance numbers, it probably makes sense to enable JS_MATH for asm.js by default. For Wasm, that would depend on whether similar optimizations/performance is achieved in other engines.

@CryZe
Contributor

CryZe commented Dec 12, 2024

For Wasm, that would depend on whether similar optimizations/performance is achieved in other engines.

On November 11 (so before these optimizations landed) I did my own benchmarks:

| Variant | Time Per Call |
| --- | --- |
| V8 Native wasm sin | 4.22ns |
| V8 Math.sin import | 6.37ns |
| SM Native wasm sin | 4.0ns |
| SM Math.sin import | 5.35ns |
| JSC Native wasm sin | 8.2ns |
| JSC Math.sin import | 9.57ns |
| Native sin | 3.5ns |

(JSC was on my iPhone 13)

@sbc100
Collaborator

sbc100 commented Dec 12, 2024

var wasmImports = {
  JS_cos: Math.cos,
  ...
}

Yes, this would be really nice. One tricky aspect is that the current JS library functions (e.g. JS_cos) can also be called from JS, so we would need some new mechanism to signal that a function is only available as a wasm import.
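One possible shape for such a mechanism, sketched here as an assumption rather than a proposal: bind the builtin once and use that single reference both for JS callers and in the import object, so no wrapper function exists at all.

```javascript
// Hypothetical sketch: a single bound reference serves both JS code
// and the wasm import object, so the engine sees Math.cos directly.
const JS_cos = Math.cos;   // still callable from JS as before
const wasmImports = {
  JS_cos,                  // imported directly, no pass-through wrapper
};
```

Because the import is the builtin itself, an engine with a well-known-imports optimization can recognize it; JS-side callers are unaffected since they call the same function object.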

@kripken
Member

kripken commented Dec 12, 2024

@sbc100 Alternatively, perhaps we could add an acorn optimization for

function foo(x) { return bar(x) }
[..]
var wasmImports = {
[..]
  foo,

=>

function foo(x) { return bar(x) }
[..]
var wasmImports = {
[..]
  bar, // only this changed

That is, we know that function identity does not matter in wasm imports, so we can optimize such pass-through functions to their target. (Other optimizations can then remove foo, if it has no other uses - but it would be fine to have other uses, and leave it.)

However I'm not sure how important this is, given JS_MATH is mainly useful for code size. But maybe that "skip pass-through functions in wasm imports" pattern could help other things?

@sbc100
Collaborator

sbc100 commented Dec 12, 2024

Yes, or perhaps acorn could even inline all otherwise-unused functions into the imports object. I wonder if closure does this already, and if not, why not?

@kripken
Member

kripken commented Dec 12, 2024

Inlining isn't enough, as the pass-through function would remain. And the pass-through can't be removed unless we know function identity does not matter, which we know for wasm imports; but in general, the code could compare the function reference to something else and observe the difference the wrapper makes.
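The identity point can be seen directly in JS: a wrapper behaves exactly like Math.sin but is a distinct function object, which any code that compares references can observe (illustrative sketch):

```javascript
// A pass-through wrapper: same results, different identity.
function wrapper(x) { return Math.sin(x); }

console.log(wrapper(1) === Math.sin(1)); // true: behavior is identical
console.log(wrapper === Math.sin);       // false: identity differs
```

Wasm imports only ever call the imported function, never compare it, which is why substituting the target there is safe while a general JS-level rewrite is not.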

sbc100 added a commit to sbc100/emscripten that referenced this issue Dec 12, 2024
See the comments at the top of `emscripten/js_math.h` for why JS
versions of these functions are not needed.

As a followup I plan to map `jsmath.c` functions to `em_math.h`
functions instead of using EM_JS here.

See emscripten-core#19284
@sbc100
Collaborator

sbc100 commented Dec 12, 2024

A couple of -sJS_MATH cleanups: #23144 #23143. Once those two land, I have a followup that has -sJS_MATH use library_math.js, and those already don't have wrapper functions, so it might be enough:

emscripten_math_cos: 'Math.cos',

Is that going to be enough to make this work? e.g.

var mysin = Math.sin;
var Imports = {
  sin: mysin,
}

sbc100 added a commit to sbc100/emscripten that referenced this issue Dec 13, 2024
Unlike the current EM_JS implementations, the `em_math.h` functions map
directly to `Math.xxx` without a wrapper function.

The other advantage of doing it this way is that we avoid duplicate
implementations of all these functions.

Fixes: emscripten-core#19284
@sbc100 sbc100 closed this as completed in 5efa98e Dec 13, 2024
@danleh

danleh commented Dec 13, 2024

Nice, thanks!

Is that going to be enough to make this work? e.g.

var mysin = Math.sin;
var Imports = {
  sin: mysin,
}

Yes, that should work fine; I just tried it out locally.
