[Perf -1,517%] System.Numerics.Tests.Perf_Vector2.DistanceBenchmark #50939

Open
Tracked by #47244
DrewScoggins opened this issue Apr 8, 2021 · 14 comments · Fixed by #51731
Labels: arch-x64, area-System.Numerics, os-linux, os-windows, tenet-performance, tenet-performance-benchmarks

@DrewScoggins
Member

Run Information

Architecture x64
OS ubuntu 18.04
Changes diff

Regressions in System.Numerics.Tests.Perf_Vector2

| Benchmark | Baseline | Test | Test/Base | Modality | Baseline Outlier |
|---|---|---|---|---|---|
| DistanceBenchmark | 0.61 ns | 9.83 ns | 16.17 | | True |
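
As a rough check on the headline number, the -1,517% in the title appears to follow from the Test/Base ratio of 16.17 as (ratio - 1) x 100. The sketch below assumes that is how the report derives it; `RegressionMath` and `PercentSlower` are hypothetical names, not part of the reporting system.

```csharp
using System;

// Hypothetical helper; the formula is an assumption inferred from the numbers in this report.
public static class RegressionMath
{
    // Converts a Test/Base time ratio into "percent slower than baseline",
    // rounded to the nearest whole percent.
    public static double PercentSlower(double testOverBase)
    {
        return Math.Round((testOverBase - 1.0) * 100.0);
    }
}
```

Calling `RegressionMath.PercentSlower(16.17)` gives 1517, which matches the title. Dividing the rounded times shown above (9.83 / 0.61) gives roughly 16.11, so the reported 16.17 is presumably computed from unrounded measurements.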

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Numerics.Tests.Perf_Vector2*'

Histogram

System.Numerics.Tests.Perf_Vector2.DistanceBenchmark

[-0.526 ;  1.545) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[ 1.545 ;  3.549) | 
[ 3.549 ;  5.554) | 
[ 5.554 ;  7.559) | 
[ 7.559 ;  8.828) | 
[ 8.828 ; 10.852) | @@@@@@@@@@@@@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins added the os-linux, os-windows, tenet-performance, tenet-performance-benchmarks, and arch-x64 labels Apr 8, 2021
@dotnet-issue-labeler (bot) added the area-System.Numerics and untriaged labels Apr 8, 2021
@ghost

ghost commented Apr 8, 2021

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

@DrewScoggins, do we have disassembly easily accessible?

@DrewScoggins
Member Author

Thanks to @adamsitnik for noticing this. It looks related to PR #41898.

@DrewScoggins
Member Author

DrewScoggins commented Apr 8, 2021

No. This is old enough that all the artifacts, if we had them, would have been purged, and it predates the addition of disassembly to the report.

@adamsitnik
Member

> do we have disassembly easily accessible?

In this case we are lucky: the regression is reproducible on Windows. The disassembly:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Numerics.Tests.Perf_Vector2.DistanceBenchmark --bdn-arguments "--disasm true"

.NET 5.0.5 (5.0.521.16609), X64 RyuJIT

; System.Numerics.Tests.Perf_Vector2.DistanceBenchmark()
       vzeroupper
       mov       rax,16AD17D7C78
       mov       rax,[rax]
       vmovsd    xmm0,qword ptr [rax+8]
       mov       rax,16AD17D7C80
       mov       rax,[rax]
       vmovsd    xmm1,qword ptr [rax+8]
       vsubps    xmm0,xmm0,xmm1
       vdpps     xmm0,xmm0,xmm0,31
       vsqrtss   xmm0,xmm0,xmm0
       ret
; Total bytes of code 54

.NET 6.0.0 (6.0.21.20503), X64 RyuJIT

; System.Numerics.Tests.Perf_Vector2.DistanceBenchmark()
       sub       rsp,18
       vzeroupper
       mov       rax,13834F694A8
       mov       rax,[rax]
       add       rax,8
       vmovss    xmm0,dword ptr [rax]
       vmovss    dword ptr [rsp+10],xmm0
       vmovss    xmm0,dword ptr [rax+4]
       vmovss    dword ptr [rsp+14],xmm0
       mov       rax,13834F694B0
       mov       rax,[rax]
       add       rax,8
       vmovss    xmm0,dword ptr [rax]
       vmovss    dword ptr [rsp+8],xmm0
       vmovss    xmm0,dword ptr [rax+4]
       vmovss    dword ptr [rsp+0C],xmm0
       vmovsd    xmm0,qword ptr [rsp+10]
       vmovsd    xmm1,qword ptr [rsp+8]
       vsubps    xmm0,xmm0,xmm1
       vdpps     xmm0,xmm0,xmm0,31
       vsqrtss   xmm0,xmm0,xmm0
       add       rsp,18
       ret
; Total bytes of code 114

@DrewScoggins
Member Author

Run Information

Architecture x64
OS ubuntu 18.04
Changes diff

Regressions in System.Numerics.Tests.Perf_Matrix3x2

| Benchmark | Baseline | Test | Test/Base | Modality | Baseline Outlier |
|---|---|---|---|---|---|
| IsIdentityBenchmark | 3.87 ns | 8.17 ns | 2.11 | Bimodal | True |
| CreateScaleFromScalarXYWithCenterBenchmark | 5.26 ns | 6.73 ns | 1.28 | Bimodal | False |

Related Issue on x64 Windows

[Perf 16%] System.Numerics.Tests.Perf_Matrix3x2.EqualsBenchmark

Related Issue on x86 Windows

[Perf 37%] System.Numerics.Tests.Perf_Matrix3x2 (2)

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Numerics.Tests.Perf_Matrix3x2*'

Histogram

System.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark

[3.447 ; 4.154) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[4.154 ; 4.860) | 
[4.860 ; 5.566) | 
[5.566 ; 6.309) | 
[6.309 ; 7.015) | @@@@@@@@@@@@@@@
[7.015 ; 7.867) | @@@@@@
[7.867 ; 8.528) | @

System.Numerics.Tests.Perf_Matrix3x2.CreateScaleFromScalarXYWithCenterBenchmark

[5.141 ; 5.395) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[5.395 ; 5.559) | 
[5.559 ; 5.813) | @@@@@@@@@@@@@@@@@@@@@@@@@@@
[5.813 ; 6.170) | @@
[6.170 ; 6.321) | @
[6.321 ; 6.575) | @@@@@@@@@@@@@@@@@@@
[6.575 ; 6.856) | @@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins
Member Author

These tests also regressed over the same commit range.

@DrewScoggins
Member Author

.NET 5.0.5 (5.0.521.16609), X64 RyuJIT

; System.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark()
       sub       rsp,38
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+20],xmm4
       xor       eax,eax
       mov       [rsp+30],rax
       lea       rcx,[rsp+20]
       call      System.Numerics.Matrix3x2.get_Identity()
       lea       rcx,[rsp+20]
       call      System.Numerics.Matrix3x2.get_IsIdentity()
       nop
       add       rsp,38
       ret
; Total bytes of code 47
; System.Numerics.Matrix3x2.get_Identity()
       vzeroupper
       mov       rax,1ECEDC01420
       mov       rax,[rax]
       vmovdqu   xmm0,xmmword ptr [rax+8]
       vmovdqu   xmmword ptr [rcx],xmm0
       mov       rdx,[rax+18]
       mov       [rcx+10],rdx
       mov       rax,rcx
       ret
; Total bytes of code 37
; System.Numerics.Matrix3x2.get_IsIdentity()
       vzeroupper
       vmovss    xmm0,dword ptr [rcx]
       vucomiss  xmm0,dword ptr [7FFEC58B2528]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+0C]
       vucomiss  xmm0,dword ptr [7FFEC58B252C]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+4]
       vxorps    xmm1,xmm1,xmm1
       vucomiss  xmm0,xmm1
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+8]
       vxorps    xmm1,xmm1,xmm1
       vucomiss  xmm0,xmm1
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+10]
       vxorps    xmm1,xmm1,xmm1
       vucomiss  xmm0,xmm1
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+14]
       vxorps    xmm1,xmm1,xmm1
       vucomiss  xmm0,xmm1
       setnp     al
       jp        short M02_L00
       sete      al
M02_L00:
       movzx     eax,al
       ret
M02_L01:
       xor       eax,eax
       ret
; Total bytes of code 115

.NET 6.0.0 (6.0.21.20503), X64 RyuJIT

; System.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark()
       sub       rsp,98
       vzeroupper
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+80],xmm4
       xor       eax,eax
       mov       [rsp+90],rax
       lea       rcx,[rsp+80]
       call      System.Numerics.Matrix3x2.get_Identity()
       vmovdqu   xmm0,xmmword ptr [rsp+80]
       vmovdqu   xmmword ptr [rsp+68],xmm0
       mov       rcx,[rsp+90]
       mov       [rsp+78],rcx
       lea       rcx,[rsp+50]
       call      System.Numerics.Matrix3x2.get_Identity()
       vmovdqu   xmm0,xmmword ptr [rsp+68]
       vmovdqu   xmmword ptr [rsp+38],xmm0
       mov       rcx,[rsp+78]
       mov       [rsp+48],rcx
       vmovdqu   xmm0,xmmword ptr [rsp+50]
       vmovdqu   xmmword ptr [rsp+20],xmm0
       mov       rcx,[rsp+60]
       mov       [rsp+30],rcx
       lea       rcx,[rsp+38]
       lea       rdx,[rsp+20]
       call      System.Numerics.Matrix3x2.op_Equality(System.Numerics.Matrix3x2, System.Numerics.Matrix3x2)
       movzx     eax,al
       add       rsp,98
       ret
; Total bytes of code 154
; System.Numerics.Matrix3x2.get_Identity()
       vzeroupper
       mov       rax,1DE240E1428
       mov       rax,[rax]
       vmovdqu   xmm0,xmmword ptr [rax+8]
       vmovdqu   xmmword ptr [rcx],xmm0
       mov       rdx,[rax+18]
       mov       [rcx+10],rdx
       mov       rax,rcx
       ret
; Total bytes of code 37
; System.Numerics.Matrix3x2.op_Equality(System.Numerics.Matrix3x2, System.Numerics.Matrix3x2)
       vzeroupper
       vmovss    xmm0,dword ptr [rcx]
       vucomiss  xmm0,dword ptr [rdx]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+0C]
       vucomiss  xmm0,dword ptr [rdx+0C]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+4]
       vucomiss  xmm0,dword ptr [rdx+4]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+8]
       vucomiss  xmm0,dword ptr [rdx+8]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+10]
       vucomiss  xmm0,dword ptr [rdx+10]
       jp        short M02_L01
       jne       short M02_L01
       vmovss    xmm0,dword ptr [rcx+14]
       vucomiss  xmm0,dword ptr [rdx+14]
       setnp     al
       jp        short M02_L00
       sete      al
M02_L00:
       movzx     eax,al
       ret
M02_L01:
       xor       eax,eax
       ret
; Total bytes of code 96

@DrewScoggins
Member Author

.NET 5.0.5 (5.0.521.16609), X64 RyuJIT

; System.Numerics.Tests.Perf_Matrix3x2.CreateScaleFromScalarXYWithCenterBenchmark()
       vzeroupper
       vmovss    xmm2,dword ptr [7FFDC56928A8]
       vmovss    xmm1,dword ptr [7FFDC56928AC]
       vxorps    xmm0,xmm0,xmm0
       vmovq     r9,xmm0
       mov       rcx,rdx
       jmp       near ptr 00007FFDC5692750
; Total bytes of code 36

.NET 6.0.0 (6.0.21.20503), X64 RyuJIT

; System.Numerics.Tests.Perf_Matrix3x2.CreateScaleFromScalarXYWithCenterBenchmark()
       vzeroupper
       vmovss    xmm2,dword ptr [7FFF0816AEB8]
       vmovss    xmm1,dword ptr [7FFF0816AEBC]
       vxorps    xmm0,xmm0,xmm0
       vmovq     r9,xmm0
       mov       rcx,rdx
       jmp       near ptr System.Numerics.Matrix3x2.CreateScale(Single, Single, System.Numerics.Vector2)
; Total bytes of code 36
; System.Numerics.Matrix3x2.CreateScale(Single, Single, System.Numerics.Vector2)
       push      rsi
       sub       rsp,40
       vzeroupper
       mov       [rsp+68],r9
       mov       rsi,rcx
       vmovss    dword ptr [rsp+58],xmm1
       vmovss    dword ptr [rsp+60],xmm2
       lea       rcx,[rsp+28]
       call      System.Numerics.Matrix3x2.get_Identity()
       vmovss    xmm0,dword ptr [7FFF0816AF70]
       vmovss    xmm1,dword ptr [rsp+58]
       vsubss    xmm0,xmm0,xmm1
       vmulss    xmm0,xmm0,dword ptr [rsp+68]
       vmovss    xmm2,dword ptr [7FFF0816AF70]
       vmovss    xmm3,dword ptr [rsp+60]
       vsubss    xmm2,xmm2,xmm3
       vmulss    xmm2,xmm2,dword ptr [rsp+6C]
       vmovss    dword ptr [rsp+28],xmm1
       vmovss    dword ptr [rsp+34],xmm3
       vmovss    dword ptr [rsp+38],xmm0
       vmovss    dword ptr [rsp+3C],xmm2
       vmovdqu   xmm0,xmmword ptr [rsp+28]
       vmovdqu   xmmword ptr [rsi],xmm0
       mov       rax,[rsp+38]
       mov       [rsi+10],rax
       mov       rax,rsi
       add       rsp,40
       pop       rsi
       ret
; Total bytes of code 138

@tannergooding
Member

For Distance in particular, the change was from:

Vector2 difference = value1 - value2;
float ls = Vector2.Dot(difference, difference);
return MathF.Sqrt(ls);

to

float distanceSquared = DistanceSquared(value1, value2);
return MathF.Sqrt(distanceSquared);

DistanceSquared itself does:

Vector2 difference = value1 - value2;
return Dot(difference, difference);

So with inlining you would expect zero differences, but they obviously exist 😄
I will get a JIT dump and try to root-cause it.
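
To make the comparison concrete, the two shapes quoted above can be placed side by side in one compilable file. This is an illustrative re-creation, not the actual runtime source; `DistanceShapes`, `DistanceBefore`, and `DistanceAfter` are hypothetical names, and the private `DistanceSquared` here only mirrors the snippet quoted in this comment.

```csharp
using System;
using System.Numerics;

// Illustrative re-creation of the two shapes described above; not the runtime's own code.
public static class DistanceShapes
{
    // Shape before the cleanup: subtraction, dot product, and square root in one method body.
    public static float DistanceBefore(Vector2 value1, Vector2 value2)
    {
        Vector2 difference = value1 - value2;
        float ls = Vector2.Dot(difference, difference);
        return MathF.Sqrt(ls);
    }

    // Shape after the cleanup: the squared distance comes from a helper
    // that receives both vectors by value.
    public static float DistanceAfter(Vector2 value1, Vector2 value2)
    {
        float distanceSquared = DistanceSquared(value1, value2);
        return MathF.Sqrt(distanceSquared);
    }

    private static float DistanceSquared(Vector2 value1, Vector2 value2)
    {
        Vector2 difference = value1 - value2;
        return Vector2.Dot(difference, difference);
    }
}
```

Both methods return the same value for the same inputs, for example 5 for the points (0, 0) and (3, 4). Any difference in generated code therefore comes from how the JIT handles the extra by-value call, not from the arithmetic.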

@ghost added the in-pr label Apr 23, 2021
@ghost removed the in-pr label Apr 27, 2021
@ghost locked as resolved and limited conversation to collaborators May 27, 2021
@adamsitnik
Member

@tannergooding the System.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark regression reported by @DrewScoggins in #50939 (comment) does not seem to be resolved:

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmaster_x64_ubuntu%2018.04%2fSystem.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark.html

Could you please take a look?

@adamsitnik
Member

adamsitnik commented Sep 14, 2021

Compared to .NET 5, we still have this regression:

System.Numerics.Tests.Perf_Matrix3x2.IsIdentityBenchmark

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|---|---|---|---|---|---|---|---|---|---|---|
| Same | 3.22 | 5.03 | 0.64 | +0 | | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.27 | 6.65 | 0.64 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.25 | 7.01 | 0.61 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 3.98 | 7.51 | 0.53 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 6.10 | 9.17 | 0.66 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401 |
| Slower | 4.47 | 7.47 | 0.60 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.04 | 6.13 | 0.66 | +0 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 6.44 | 10.17 | 0.63 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 3.99 | 14.21 | 0.28 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.33 | 6.84 | 0.63 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 8.71 | 22.70 | 0.38 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701 |
| Slower | 4.46 | 6.81 | 0.66 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.35 | 7.06 | 0.62 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.50 | 7.37 | 0.61 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.66 | 6.83 | 0.68 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.62 | 6.76 | 0.68 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 4.13 | 5.65 | 0.73 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.27 | 6.46 | 0.66 | +0 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 13.09 | 26.19 | 0.50 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701 |
| Slower | 6.50 | 10.03 | 0.65 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 6.94 | 9.92 | 0.70 | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 3.67 | 6.58 | 0.56 | +0 | | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.91 | 18.32 | 0.27 | +0 | bimodal | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 11.12 | 12.96 | 0.86 | +0 | | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 6.05 | 8.19 | 0.74 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 4.72 | 6.86 | 0.69 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 5.41 | 7.94 | 0.68 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |

@tannergooding could you PTAL?
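
For readers comparing this table with the earlier report: the Ratio column here appears to be Base divided by Diff, so values below 1 mean .NET 6 is slower, which is the inverse of the Test/Base column in the original report. A minimal sketch under that assumption; `RatioCheck` is a hypothetical name and the two-decimal rounding is inferred from the table.

```csharp
using System;

// Hypothetical helper; the Base/Diff definition is an assumption inferred from the rows above.
public static class RatioCheck
{
    // Baseline time divided by the new time, rounded to two decimals.
    // A result below 1.0 means the new runtime took longer.
    public static double Ratio(double baseTime, double diffTime)
    {
        return Math.Round(baseTime / diffTime, 2);
    }
}
```

For the first row, `RatioCheck.Ratio(3.22, 5.03)` gives 0.64, matching the table. Some rows may differ in the last digit because the published ratio is presumably computed from unrounded times.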

@adamsitnik reopened this Sep 14, 2021
@adamsitnik modified the milestones: 6.0.0, 7.0.0 Sep 14, 2021
@adamsitnik
Member

It's worse for x86:

System.Numerics.Tests.Perf_Vector3.DistanceBenchmark

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|---|---|---|---|---|---|---|---|---|---|---|
| Same | 1.08 | 1.08 | 0.99 | +0 | | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.47 | 1.45 | 1.01 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.47 | 1.47 | 1.00 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.32 | 1.14 | 1.15 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.23 | 1.04 | 1.19 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401 |
| Same | 1.23 | 1.04 | 1.18 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.79 | 1.03 | 0.77 | +0 | several? | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.17 | 1.08 | 1.08 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.78 | 0.82 | 0.96 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.83 | 0.91 | 0.91 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 3.10 | 17.56 | 0.18 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701 |
| Same | 1.40 | 1.41 | 0.99 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.40 | 1.40 | 1.00 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.41 | 1.40 | 1.01 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.43 | 1.41 | 1.02 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.40 | 1.42 | 0.98 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.16 | 1.14 | 1.02 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.25 | 0.85 | 1.46 | +0 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 2.31 | 2.31 | 1.00 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701 |
| Same | 0.42 | 0.49 | 0.86 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.77 | 0.78 | 0.99 | +0 | bimodal | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 1.47 | 16.79 | 0.09 | +0 | | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 1.74 | 17.01 | 0.10 | +0 | | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.22 | 1.25 | 0.97 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.35 | 1.35 | 1.00 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.27 | 1.26 | 1.01 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |

System.Numerics.Tests.Perf_Vector2.DistanceBenchmark

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|---|---|---|---|---|---|---|---|---|---|---|
| Same | 0.89 | 0.85 | 1.05 | +0 | | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.17 | 1.14 | 1.02 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.16 | 1.16 | 0.99 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.00 | 0.99 | 1.01 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.91 | 0.92 | 0.99 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401 |
| Same | 0.91 | 0.87 | 1.05 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.73 | 0.69 | 1.06 | +0 | several? | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.76 | 0.69 | 1.09 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.57 | 0.56 | 1.01 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.52 | 0.59 | 0.87 | +0 | several? | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 2.10 | 2.50 | 0.84 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701 |
| Same | 1.09 | 1.10 | 0.99 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.09 | 1.11 | 0.99 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.11 | 1.09 | 1.01 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.13 | 1.11 | 1.02 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.10 | 1.11 | 0.99 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.98 | 1.03 | 0.96 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.55 | 0.65 | 0.85 | +0 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.54 | 1.15 | 1.33 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701 |
| Same | 0.06 | 0.16 | 0.40 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.04 | 0.00 | Infinity | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 1.11 | 16.59 | 0.07 | +0 | | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 1.38 | 16.47 | 0.08 | +0 | | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 1.01 | 1.04 | 0.97 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.85 | 0.88 | 0.96 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 0.58 | 0.52 | 1.11 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --architecture x86 --filter System.Numerics.Tests.Perf_Vector3.DistanceBenchmark --bdn-arguments "--disasm true"

.NET 5.0.9 (5.0.921.35908), X86 RyuJIT

; System.Numerics.Tests.Perf_Vector3.DistanceBenchmark()
       push      eax
       vzeroupper
       mov       eax,ds:[4668]
       lea       eax,[eax+4]
       vmovss    xmm0,dword ptr [eax+8]
       vmovsd    xmm1,qword ptr [eax]
       vshufps   xmm1,xmm1,xmm0,44
       mov       eax,ds:[466C]
       lea       eax,[eax+4]
       vmovss    xmm0,dword ptr [eax+8]
       vmovsd    xmm2,qword ptr [eax]
       vshufps   xmm2,xmm2,xmm0,44
       vsubps    xmm0,xmm1,xmm2
       vdpps     xmm0,xmm0,xmm0,71
       vsqrtss   xmm0,xmm0,xmm0
       vmovss    dword ptr [esp],xmm0
       fld       dword ptr [esp]
       pop       ecx
       ret
; Total bytes of code 72

.NET 6.0.0 (6.0.21.41701), X86 RyuJIT

; System.Numerics.Tests.Perf_Vector3.DistanceBenchmark()
       sub       esp,24
       vzeroupper
       mov       eax,ds:[4668]
       add       eax,4
       vmovss    xmm0,dword ptr [eax]
       vmovss    dword ptr [esp+10],xmm0
       vmovss    xmm0,dword ptr [eax+4]
       vmovss    dword ptr [esp+14],xmm0
       vmovss    xmm0,dword ptr [eax+8]
       vmovss    dword ptr [esp+18],xmm0
       mov       eax,ds:[466C]
       add       eax,4
       vmovss    xmm0,dword ptr [eax]
       vmovss    dword ptr [esp],xmm0
       vmovss    xmm0,dword ptr [eax+4]
       vmovss    dword ptr [esp+4],xmm0
       vmovss    xmm0,dword ptr [eax+8]
       vmovss    dword ptr [esp+8],xmm0
       vmovupd   xmm0,[esp+10]
       vmovupd   xmm1,[esp]
       vsubps    xmm0,xmm0,xmm1
       vdpps     xmm0,xmm0,xmm0,71
       vsqrtss   xmm0,xmm0,xmm0
       vmovss    dword ptr [esp+20],xmm0
       fld       dword ptr [esp+20]
       add       esp,24
       ret
; Total bytes of code 124

@tannergooding
Member

This is due to the same original change, which cleaned up much of the logic in the System.Numerics.* types to improve maintainability.

The JIT is preserving two struct copies introduced by the inlining, and those copies are causing the perf regression:

       vmovss    xmm0,dword ptr [eax]
       vmovss    dword ptr [esp+10],xmm0
       vmovss    xmm0,dword ptr [eax+4]
       vmovss    dword ptr [esp+14],xmm0
       vmovss    xmm0,dword ptr [eax+8]
       vmovss    dword ptr [esp+18],xmm0

and

       vmovss    xmm0,dword ptr [eax]
       vmovss    dword ptr [esp],xmm0
       vmovss    xmm0,dword ptr [eax+4]
       vmovss    dword ptr [esp+4],xmm0
       vmovss    xmm0,dword ptr [eax+8]
       vmovss    dword ptr [esp+8],xmm0

It would be simple enough to revert this particular method, but ideally the JIT would handle and optimize this pattern correctly. CC @dotnet/jit-contrib
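
For callers affected in the meantime, one possible caller-side workaround is to compute the distance in a single method body, mirroring the pre-change shape. This is only a sketch: `Vector3Workaround` and `DistanceSingleBody` are hypothetical names, it is not the fix that went into the runtime, and whether it actually avoids the stack copies on a given runtime would need to be confirmed with the `--disasm true` repro above.

```csharp
using System;
using System.Numerics;

// Hypothetical caller-side helper; not part of System.Numerics and not the runtime fix.
public static class Vector3Workaround
{
    // Subtraction, dot product, and square root stay in one method body,
    // so no Vector3 is handed by value to a nested helper.
    public static float DistanceSingleBody(Vector3 value1, Vector3 value2)
    {
        Vector3 difference = value1 - value2;
        return MathF.Sqrt(Vector3.Dot(difference, difference));
    }
}
```

The helper computes the same quantity as `Vector3.Distance`, so it can be swapped in on a hot path and compared under the same benchmark before deciding whether it is worth keeping.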

@dotnet unlocked this conversation Sep 14, 2021
@jeffschwMSFT removed the untriaged label Sep 14, 2021
@dakersnar modified the milestones: 7.0.0, Future Aug 8, 2022