Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringBuilder: use Span.Fill in Append repeating char #86287

Merged
merged 5 commits into from
Jul 7, 2023

Conversation

yesmey
Copy link
Contributor

@yesmey yesmey commented May 16, 2023

StringBuilder.Append(char value, int repeatCount) currently use a loop to add each character in sequence.
We can instead use Span<T>.Fill that is better optimized for this.

Note:
Append(char value) was calling this method as a fallback to when it needed to allocate. But with the new changes, Append(char value, int repeatCount) might be inlined, therefor I added a specific AppendWithExpansion(char) with MethodImplOptions.NoInlining.

Benchmark code
Benchmark result:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1778/22H2/2022Update/SunValley2)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.6.23276.3
  [Host]     : .NET 8.0.0 (8.0.23.27214), X64 RyuJIT AVX2
  Job-WZCNRH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-RFXJAL : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method Toolchain amount c Mean Ratio
AppendRepeated main 1 a 2.169 ns 1.00
AppendRepeated PR 1 a 2.623 ns 1.21
AppendRepeated main 2 a 4.057 ns 1.00
AppendRepeated PR 2 a 2.577 ns 0.64
AppendRepeated main 4 a 7.857 ns 1.00
AppendRepeated PR 4 a 3.035 ns 0.39
AppendRepeated main 8 a 15.549 ns 1.00
AppendRepeated PR 8 a 3.694 ns 0.24
AppendRepeated main 16 a 31.002 ns 1.00
AppendRepeated PR 16 a 3.979 ns 0.13

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 16, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label May 16, 2023
@vcsjones vcsjones added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 16, 2023
@ghost
Copy link

ghost commented May 16, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

StringBuilder.Append(char value, int repeatCount) currently use a loop to add each character in sequence.
We can instead use Span<T>.Fill that is better optimized for this.

Note:
Append(char value) was calling this method as a fallback to when it needed to allocate. But with the new changes, Append(char value, int repeatCount) might be inlined, therefor I added a specific AppendWithExpansion(char) with MethodImplOptions.NoInlining.

Benchmark code
Benchmark result:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.5.23255.2
  [Host]     : .NET 8.0.0 (8.0.23.25213), X64 RyuJIT AVX2
  Job-FJHCLO : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DXTBEB : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method Toolchain amount c Mean Ratio
AppendRepeatedConstAmount main ? ? 8.393 ns 1.00
AppendRepeatedConstAmount PR ? ? 3.210 ns 0.38
AppendRepeatedConst main 1 ? 3.488 ns 1.00
AppendRepeatedConst PR 1 ? 2.884 ns 0.83
AppendRepeatedConst main 2 ? 5.465 ns 1.00
AppendRepeatedConst PR 2 ? 2.733 ns 0.50
AppendRepeatedConst main 4 ? 8.640 ns 1.00
AppendRepeatedConst PR 4 ? 3.410 ns 0.39
AppendRepeatedConst main 8 ? 15.649 ns 1.00
AppendRepeatedConst PR 8 ? 3.849 ns 0.25
AppendRepeatedConst main 16 ? 30.945 ns 1.00
AppendRepeatedConst PR 16 ? 4.028 ns 0.13
AppendRepeated main 1 a 3.491 ns 1.00
AppendRepeated PR 1 a 4.069 ns 1.17
AppendRepeated main 2 a 5.345 ns 1.00
AppendRepeated PR 2 a 3.730 ns 0.70
AppendRepeated main 4 a 8.662 ns 1.00
AppendRepeated PR 4 a 4.304 ns 0.50
AppendRepeated main 8 a 14.983 ns 1.00
AppendRepeated PR 8 a 4.807 ns 0.32
AppendRepeated main 16 a 29.593 ns 1.00
AppendRepeated PR 16 a 5.170 ns 0.17
Author: yesmey
Assignees: -
Labels:

area-System.Runtime, community-contribution

Milestone: -

int firstLength = chunkChars.Length - chunkLength;
if (firstLength > 0)
{
chunkChars.AsSpan(chunkLength, firstLength).Fill(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert

Debug.Assert(firstLength < repeatCount, "We shouldn't be called if there was enough space for the entire run.");

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @yesmey not just for the contribution but also for providing benchmark numbers and their source code!

Would you be interested in contributing a benchmark for Append(char, int) to /~https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text/Perf.StringBuilder.cs? Currently we have no benchmarks for it and our reporting system won't report any improvements once I merge your PR.

@adamsitnik adamsitnik self-assigned this Jul 7, 2023
@adamsitnik adamsitnik added the tenet-performance Performance related issue label Jul 7, 2023
@adamsitnik adamsitnik merged commit ff57624 into dotnet:main Jul 7, 2023
@adamsitnik adamsitnik added this to the 8.0.0 milestone Jul 7, 2023
@yesmey
Copy link
Contributor Author

yesmey commented Jul 12, 2023

@adamsitnik The regression in #88673 comes from marking the allocating path of Append(char) with MethodImplOptions.NoInlining. That benchmark in particular is allocating fairly often, and it makes sense the slow path is faster when it previously was being inlined more aggressively. This pattern is fairly common across the runtime, but if you prefer I can remove it and see if its better?

@adamsitnik
Copy link
Member

I can remove it and see if its better?

@yesmey could you please give it a try?

I was also thinking about doing sth like this:

public StringBuilder Append(char value)
{
    int nextCharIndex = m_ChunkLength;
    char[] chars = m_ChunkChars;
    
    if ((uint)chars.Length == (uint)nextCharIndex)
    {
        ExpandByABlock(1);
        nextCharIndex = 0;
        chars = m_ChunkChars;
    }
    
    chars[nextCharIndex] = value;
    m_ChunkLength++;

    return this;
}

but I am not sure if array boundaries check would got removed.

@MichalPetryka
Copy link
Contributor

but I am not sure if array boundaries check would got removed.

You can avoid them like this:

Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(chars), (uint)nextCharIndex) = value;

@yesmey
Copy link
Contributor Author

yesmey commented Jul 23, 2023

Sorry about the delay. I'm gonna need some help on this one.

Having trouble reproducing this locally. I think its because most of the benchmark is spent allocating and the Error column is sometimes very high. I tried to turn off as much as possible on my machine while running the benchmarks.

but I am not sure if array boundaries check would got removed.

You can avoid them like this:

Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(chars), (uint)nextCharIndex) = value;

I doubt avoiding an array boundaries check would even be noticeable here since the allocation size doubles every time, so the times that code is executed is gonna be rare, and sacrificing bounds safety for that is likely not worth it

My hypothesis is that the slowdown is coming from the benchmark calling this very frequently in a loop, and the jit is able to inline so the read/write of m_ChunkLength better in the previous version.

Here's how I build and run the benchmark locally

.\build.cmd Clr+Clr.Aot+Libs -c Release -rc Release

and

python3 .\scripts\benchmarks_ci.py -f net8.0 --filter System.Text.Tests.Perf_StringBuilder.Append_Char --corerun "D:\dotnet\runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe" "D:\dotnet\yesmey_runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe" --bdn-artifacts C:\results\after_test

\yesmey_runtime...CoreRun.exe built from ff57624
\runtime...CoreRun.exe built from 5a03596 (parent)

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1992)
AMD Ryzen 7 7800X3D, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.7.23364.32
  [Host]     : .NET 8.0.0 (8.0.23.36403), X64 RyuJIT AVX2
  Job-DWSUBF : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-YNMWPS : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Toolchain length Mean Error StdDev Median Min Max Ratio Gen0 Gen1 Allocated Alloc Ratio
Append_Char Job-DWSUBF \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe 100 90.69 ns 0.659 ns 0.584 ns 90.50 ns 89.93 ns 91.68 ns 1.00 0.0108 - 544 B 1.00
Append_Char Job-YNMWPS \yesmey_runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe 100 97.78 ns 0.240 ns 0.201 ns 97.72 ns 97.61 ns 98.18 ns 1.08 0.0106 - 544 B 1.00
Append_Char Job-DWSUBF \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe 100000 51,838.73 ns 170.468 ns 142.348 ns 51,837.67 ns 51,624.53 ns 52,088.30 ns 1.00 4.1254 3.5066 209968 B 1.00
Append_Char Job-YNMWPS \yesmey_runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\CoreRun.exe 100000 51,924.36 ns 184.830 ns 163.847 ns 51,939.05 ns 51,721.98 ns 52,278.99 ns 1.00 4.1391 3.5182 209968 B 1.00

Here's the asm output:

Before

; System.Text.Tests.Perf_StringBuilder.Append_Char(Int32)
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       mov       ebx,edx
       mov       rcx,offset MT_System.Text.StringBuilder
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       mov       dword ptr [rsi+20],7FFFFFFF
       mov       rcx,offset MT_System.Char[]
       mov       edx,10
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[rsi+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       xor       edi,edi
       test      ebx,ebx
       jle       short M00_L02
M00_L00:
       mov       ecx,[rsi+18]
       mov       edx,ecx
       mov       rax,[rsi+8]
       mov       r8d,[rax+8]
       cmp       r8d,edx
       jbe       short M00_L03
       mov       edx,edx
       mov       word ptr [rax+rdx*2+10],61
       inc       ecx
       mov       [rsi+18],ecx
M00_L01:
       inc       edi
       cmp       edi,ebx
       jl        short M00_L00
M00_L02:
       mov       rax,rsi
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
M00_L03:
       mov       ebp,1
       mov       edx,[rsi+1C]
       lea       edx,[rdx+rcx+1]
       cmp       edx,[rsi+20]
       jg        short M00_L07
       test      edx,edx
       jle       short M00_L07
M00_L04:
       mov       rdx,[rsi+8]
       cmp       [rdx+8],ecx
       jle       short M00_L05
       lea       eax,[rcx+1]
       cmp       ecx,[rdx+8]
       jae       short M00_L08
       mov       ecx,ecx
       mov       word ptr [rdx+rcx*2+10],61
       dec       ebp
       jmp       short M00_L06
M00_L05:
       mov       [rsi+18],ecx
       mov       rcx,rsi
       mov       edx,ebp
       call      qword ptr [7FFE1853D518]; System.Text.StringBuilder.ExpandByABlock(Int32)
       xor       ecx,ecx
       mov       eax,ecx
M00_L06:
       test      ebp,ebp
       mov       ecx,eax
       jg        short M00_L04
       mov       [rsi+18],ecx
       jmp       short M00_L01
M00_L07:
       mov       rcx,offset MT_System.ArgumentOutOfRangeException
       call      CORINFO_HELP_NEWSFAST
       mov       rbx,rax
       mov       ecx,18331
       mov       rdx,7FFE180A4000
       call      CORINFO_HELP_STRCNS
       mov       rsi,rax
       call      qword ptr [7FFE18786790]
       mov       r8,rax
       mov       rdx,rsi
       mov       rcx,rbx
       call      qword ptr [7FFE182B6F28]
       mov       rcx,rbx
       call      CORINFO_HELP_THROW
       int       3
M00_L08:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 280

After

; System.Text.Tests.Perf_StringBuilder.Append_Char(Int32)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       ebx,edx
       mov       rcx,offset MT_System.Text.StringBuilder
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       mov       dword ptr [rsi+20],7FFFFFFF
       mov       rcx,offset MT_System.Char[]
       mov       edx,10
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[rsi+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       xor       edi,edi
       test      ebx,ebx
       jle       short M00_L02
M00_L00:
       mov       ecx,[rsi+18]
       mov       edx,ecx
       mov       rax,[rsi+8]
       mov       r8d,[rax+8]
       cmp       r8d,edx
       jbe       short M00_L03
       mov       edx,edx
       mov       word ptr [rax+rdx*2+10],61
       inc       ecx
       mov       [rsi+18],ecx
M00_L01:
       inc       edi
       cmp       edi,ebx
       jl        short M00_L00
M00_L02:
       mov       rax,rsi
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M00_L03:
       mov       rcx,rsi
       mov       edx,61
       call      qword ptr [7FFE1853C3F0]; System.Text.StringBuilder.AppendWithExpansion(Char)
       jmp       short M00_L01
; Total bytes of code 137
; System.Text.StringBuilder.AppendWithExpansion(Char)
       push      rsi
       push      rbx
       sub       rsp,28
       mov       rbx,rcx
       mov       esi,edx
       mov       rcx,rbx
       mov       edx,1
       call      qword ptr [7FFE1853CB58]; System.Text.StringBuilder.ExpandByABlock(Int32)
       mov       rax,[rbx+8]
       cmp       dword ptr [rax+8],0
       jbe       short M01_L00
       mov       [rax+10],si
       inc       dword ptr [rbx+18]
       add       rsp,28
       pop       rbx
       pop       rsi
       ret
M01_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 55

@ghost ghost locked as resolved and limited conversation to collaborators Aug 22, 2023
@yesmey yesmey deleted the fill-append branch August 24, 2023 20:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants