-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speeding up Random.Shuffle
by batching
#111015
Conversation
Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones |
@dotnet-policy-service agree |
ulong r = xoshiro.NextUInt64(); | ||
|
||
// Correct r to be unbiased. | ||
// Based on /~https://github.com/swiftlang/swift/pull/39143 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to a link to the "original" idea for the algorithm, and potentially a brief description of it.
@@ -363,17 +363,78 @@ public void Shuffle<T>(T[] values) | |||
/// </remarks> | |||
public void Shuffle<T>(Span<T> values) | |||
{ | |||
int n = values.Length; | |||
if (_impl is XoshiroImpl xoshiro) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is not so straight to embed implementation details of XoshiroImpl
in base class. Let's see if others have better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also seeing problems.
This implementation requires NextUInt64()
, so currently XoshiroImpl
is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather ImplBase have a virtual or abstract Shuffle method on it, which this public Shuffle would just delegate to on _impl. Then the XoshiroImpl-specific implementation can be an override.
ref var head = ref MemoryMarshal.GetReference(values); | ||
var t = Unsafe.Add(ref head, m); | ||
Unsafe.Add(ref head, m) = Unsafe.Add(ref head, index); | ||
Unsafe.Add(ref head, index) = t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the bound check be eliminated by JIT with idiomatic access? If not, consider to add assert about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the bound checks are not removed please provide some strong motivation for it to exist as it is not aligned with our strategy to reduce unsafe code in the BCL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried it, the benchmark results were very slightly worse.
When I checked the JIT code implemented with swapping without using Unsafe, I found that the amount of code jumping to CORINFO_HELP_RNGCHKFAIL
was increasing.
Method | Mean | Error | StdDev |
---|---|---|---|
shuffleUnseededInt10_Unsafe | 10.086 ns | 0.0783 ns | 0.0694 ns |
shuffleUnseededInt10_Safe | 10.372 ns | 0.1657 ns | 0.1550 ns |
shuffleUnseededInt1000_Unsafe | 1,437.318 ns | 21.1372 ns | 19.7718 ns |
shuffleUnseededInt1000_Safe | 1,468.840 ns | 18.4242 ns | 17.2340 ns |
JIT asm
// with Unsafe swap
// The reason why CORINFO_HELP_RNGCHKFAIL
is also in the "with Unsafe" code is because the fallback code contains a normal array reference.
.NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
; System.Tests.Perf_Random_Mine.shuffleUnseededInt10()
; [Benchmark] public void shuffleUnseededInt10() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..10));
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7FFC77DD48D0 sub rsp,38
7FFC77DD48D4 xor eax,eax
7FFC77DD48D6 mov [rsp+28],rax
7FFC77DD48DB mov rax,[rcx+8]
7FFC77DD48DF mov rdx,[rcx+18]
7FFC77DD48E3 test rdx,rdx
7FFC77DD48E6 je short M00_L01
7FFC77DD48E8 mov ecx,[rdx+8]
7FFC77DD48EB cmp ecx,0A
7FFC77DD48EE jb short M00_L00
7FFC77DD48F0 add rdx,10
7FFC77DD48F4 mov [rsp+28],rdx
7FFC77DD48F9 mov dword ptr [rsp+30],0A
7FFC77DD4901 lea rdx,[rsp+28]
7FFC77DD4906 mov rcx,rax
7FFC77DD4909 cmp [rcx],ecx
7FFC77DD490B call qword ptr [7FFC78247960]; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
7FFC77DD4911 nop
7FFC77DD4912 add rsp,38
7FFC77DD4916 ret
M00_L00:
7FFC77DD4917 call qword ptr [7FFC7824D830]
7FFC77DD491D int 3
M00_L01:
7FFC77DD491E mov ecx,2
7FFC77DD4923 call qword ptr [7FFC77D1FD38]
7FFC77DD4929 int 3
; Total bytes of code 90
; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
; if (_impl is XoshiroImpl xoshiro)
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; ulong bound = 2432902008176640000; // 20!
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; int nextIndex = Math.Min(20, values.Length);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; for (int i = 1; i < values.Length;)
; ^^^^^^^^^
; ulong r = xoshiro.NextUInt64();
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; ulong rbound = r * bound;
; ^^^^^^^^^^^^^^^^^^^^^^^^^
; if (rbound > 0 - bound)
; ^^^^^^^^^^^^^^^^^^^^^^^
; ulong r2 = xoshiro.NextUInt64();
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; sum = rbound + lohi;
; ^^^^^^^^^^^^^^^^^^^^
; carry = sum < rbound ? 1ul : 0ul;
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; rbound = lolo;
; ^^^^^^^^^^^^^^
; } while (sum == ~0ul);
; ^^^^^^^^^^^^^^^^^^^^
; r += carry;
; ^^^^^^^^^^^
; for (int m = i; m < nextIndex; m++)
; ^^^^^^^^^
; int index = (int)Math.BigMul(r, (ulong)(m + 1), out r);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; ref var head = ref MemoryMarshal.GetReference(values);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; var t = Unsafe.Add(ref head, m);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; Unsafe.Add(ref head, m) = Unsafe.Add(ref head, index);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; Unsafe.Add(ref head, index) = t;
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; i = nextIndex;
; ^^^^^^^^^^^^^^
; bound = (ulong)(i + 1);
; ^^^^^^^^^^^^^^^^^^^^^^^
; for (nextIndex = i + 1; nextIndex < values.Length; nextIndex++)
; ^^^^^^^^^^^^^^^^^
; if (Math.BigMul(bound, (ulong)(nextIndex + 1), out var newbound) == 0)
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; bound = newbound;
; ^^^^^^^^^^^^^^^^^
; int n = values.Length;
; ^^^^^^^^^^^^^^^^^^^^^^
; for (int i = 0; i < n - 1; i++)
; ^^^^^^^^^
; int j = Next(i, n);
; ^^^^^^^^^^^^^^^^^^^
; if (j != i)
; ^^^^^^^^^^^
; T temp = values[i];
; ^^^^^^^^^^^^^^^^^^^
; values[i] = values[j];
; ^^^^^^^^^^^^^^^^^^^^^^
; values[j] = temp;
; ^^^^^^^^^^^^^^^^^
7FFC77DD4AE0 push r15
7FFC77DD4AE2 push r14
7FFC77DD4AE4 push r13
7FFC77DD4AE6 push rdi
7FFC77DD4AE7 push rsi
7FFC77DD4AE8 push rbp
7FFC77DD4AE9 push rbx
7FFC77DD4AEA sub rsp,40
7FFC77DD4AEE mov rbx,rcx
7FFC77DD4AF1 mov rsi,[rdx]
7FFC77DD4AF4 mov edi,[rdx+8]
7FFC77DD4AF7 mov rcx,[rbx+8]
7FFC77DD4AFB test rcx,rcx
7FFC77DD4AFE je near ptr M01_L09
7FFC77DD4B04 mov rdx,offset MT_System.Random+XoshiroImpl
7FFC77DD4B0E cmp [rcx],rdx
7FFC77DD4B11 jne near ptr M01_L09
7FFC77DD4B17 mov rdx,21C3677C82B40000
7FFC77DD4B21 mov r8d,14
7FFC77DD4B27 cmp edi,14
7FFC77DD4B2A cmovl r8d,edi
7FFC77DD4B2E mov eax,1
7FFC77DD4B33 cmp edi,1
7FFC77DD4B36 jle near ptr M01_L06
M01_L00:
7FFC77DD4B3C mov r10,[rcx+8]
7FFC77DD4B40 mov r9,[rcx+10]
7FFC77DD4B44 mov r11,[rcx+18]
7FFC77DD4B48 mov rbx,[rcx+20]
7FFC77DD4B4C mov rbp,r9
7FFC77DD4B4F shl rbp,11
7FFC77DD4B53 xor r11,r10
7FFC77DD4B56 xor rbx,r9
7FFC77DD4B59 lea r14,[r9+r9*4]
7FFC77DD4B5D rol r14,7
7FFC77DD4B61 lea r14,[r14+r14*8]
7FFC77DD4B65 xor r9,r11
7FFC77DD4B68 xor r10,rbx
7FFC77DD4B6B xor r11,rbp
7FFC77DD4B6E rol rbx,2D
7FFC77DD4B72 mov [rcx+8],r10
7FFC77DD4B76 mov [rcx+10],r9
7FFC77DD4B7A mov [rcx+18],r11
7FFC77DD4B7E mov [rcx+20],rbx
7FFC77DD4B82 mov r10,r14
7FFC77DD4B85 imul r10,rdx
7FFC77DD4B89 mov [rsp+38],rdx
7FFC77DD4B8E mov r9,rdx
7FFC77DD4B91 neg r9
7FFC77DD4B94 cmp r9,r10
7FFC77DD4B97 jb short M01_L04
M01_L01:
7FFC77DD4B99 cmp eax,r8d
7FFC77DD4B9C jge short M01_L03
7FFC77DD4B9E xchg ax,ax
M01_L02:
7FFC77DD4BA0 lea r11d,[rax+1]
7FFC77DD4BA4 movsxd r10,r11d
7FFC77DD4BA7 lea r9,[rsp+28]
7FFC77DD4BAC mov rdx,r14
7FFC77DD4BAF mulx rdx,rbx,r10
7FFC77DD4BB4 mov [r9],rbx
7FFC77DD4BB7 mov r14,[rsp+28]
7FFC77DD4BBC cdqe
7FFC77DD4BBE mov r10d,[rsi+rax*4]
7FFC77DD4BC2 movsxd rdx,edx
7FFC77DD4BC5 mov r9d,[rsi+rdx*4]
7FFC77DD4BC9 mov [rsi+rax*4],r9d
7FFC77DD4BCD mov [rsi+rdx*4],r10d
7FFC77DD4BD1 mov eax,r11d
7FFC77DD4BD4 cmp eax,r8d
7FFC77DD4BD7 jl short M01_L02
M01_L03:
7FFC77DD4BD9 mov eax,r8d
7FFC77DD4BDC lea r8d,[rax+1]
7FFC77DD4BE0 movsxd r11,r8d
7FFC77DD4BE3 cmp r8d,edi
7FFC77DD4BE6 jge near ptr M01_L05
7FFC77DD4BEC jmp near ptr M01_L08
M01_L04:
7FFC77DD4BF1 mov r9,[rcx+8]
7FFC77DD4BF5 mov r11,[rcx+10]
7FFC77DD4BF9 mov rbx,[rcx+18]
7FFC77DD4BFD mov rbp,[rcx+20]
7FFC77DD4C01 mov r15,r11
7FFC77DD4C04 shl r15,11
7FFC77DD4C08 xor rbx,r9
7FFC77DD4C0B xor rbp,r11
7FFC77DD4C0E lea r13,[r11+r11*4]
7FFC77DD4C12 rol r13,7
7FFC77DD4C16 lea r13,[r13+r13*8]
7FFC77DD4C1B xor r11,rbx
7FFC77DD4C1E xor r9,rbp
7FFC77DD4C21 xor rbx,r15
7FFC77DD4C24 rol rbp,2D
7FFC77DD4C28 mov [rcx+8],r9
7FFC77DD4C2C mov [rcx+10],r11
7FFC77DD4C30 mov [rcx+18],rbx
7FFC77DD4C34 mov [rcx+20],rbp
7FFC77DD4C38 lea r9,[rsp+30]
7FFC77DD4C3D mov r11,[rsp+38]
7FFC77DD4C42 mov rdx,r13
7FFC77DD4C45 mulx rdx,rbx,r11
7FFC77DD4C4A mov [r9],rbx
7FFC77DD4C4D mov r9,[rsp+30]
7FFC77DD4C52 add rdx,r10
7FFC77DD4C55 cmp rdx,r10
7FFC77DD4C58 setb r10b
7FFC77DD4C5C movzx r10d,r10b
7FFC77DD4C60 cmp rdx,0FFFFFFFFFFFFFFFF
7FFC77DD4C64 mov [rsp+38],r11
7FFC77DD4C69 je short M01_L07
7FFC77DD4C6B add r14,r10
7FFC77DD4C6E jmp near ptr M01_L01
M01_L05:
7FFC77DD4C73 cmp eax,edi
7FFC77DD4C75 mov rdx,r11
7FFC77DD4C78 jl near ptr M01_L00
M01_L06:
7FFC77DD4C7E add rsp,40
7FFC77DD4C82 pop rbx
7FFC77DD4C83 pop rbp
7FFC77DD4C84 pop rsi
7FFC77DD4C85 pop rdi
7FFC77DD4C86 pop r13
7FFC77DD4C88 pop r14
7FFC77DD4C8A pop r15
7FFC77DD4C8C ret
M01_L07:
7FFC77DD4C8D mov r10,r9
7FFC77DD4C90 jmp near ptr M01_L04
M01_L08:
7FFC77DD4C95 lea edx,[r8+1]
7FFC77DD4C99 movsxd r10,edx
7FFC77DD4C9C lea r9,[rsp+20]
7FFC77DD4CA1 mov rdx,r11
7FFC77DD4CA4 mulx rdx,rbx,r10
7FFC77DD4CA9 mov [r9],rbx
7FFC77DD4CAC mov r10,[rsp+20]
7FFC77DD4CB1 test rdx,rdx
7FFC77DD4CB4 jne short M01_L05
7FFC77DD4CB6 mov r11,r10
7FFC77DD4CB9 inc r8d
7FFC77DD4CBC cmp r8d,edi
7FFC77DD4CBF jl short M01_L08
7FFC77DD4CC1 jmp short M01_L05
M01_L09:
7FFC77DD4CC3 mov ebp,edi
7FFC77DD4CC5 xor r14d,r14d
M01_L10:
7FFC77DD4CC8 lea ecx,[rdi-1]
7FFC77DD4CCB cmp r14d,ecx
7FFC77DD4CCE jge short M01_L06
7FFC77DD4CD0 mov rcx,rbx
7FFC77DD4CD3 mov edx,r14d
7FFC77DD4CD6 mov r8d,edi
7FFC77DD4CD9 mov rax,[rbx]
7FFC77DD4CDC mov rax,[rax+40]
7FFC77DD4CE0 call qword ptr [rax+30]
7FFC77DD4CE3 cmp eax,r14d
7FFC77DD4CE6 je short M01_L11
7FFC77DD4CE8 mov ecx,[rsi+r14*4]
7FFC77DD4CEC cmp eax,ebp
7FFC77DD4CEE jae short M01_L12
7FFC77DD4CF0 mov edx,eax
7FFC77DD4CF2 mov edx,[rsi+rdx*4]
7FFC77DD4CF5 mov [rsi+r14*4],edx
7FFC77DD4CF9 mov eax,eax
7FFC77DD4CFB mov [rsi+rax*4],ecx
M01_L11:
7FFC77DD4CFE inc r14d
7FFC77DD4D01 jmp short M01_L10
M01_L12:
7FFC77DD4D03 call CORINFO_HELP_RNGCHKFAIL
7FFC77DD4D08 int 3
; Total bytes of code 553
// without Unsafe swap
// The jump to CORINFO_HELP_RNGCHKFAIL
has increased to three places, indicating that the range check has not been removed by JIT.
.NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
; System.Tests.Perf_Random_Mine.shuffleUnseededInt10()
; [Benchmark] public void shuffleUnseededInt10() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..10));
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7FFC622C48D0 sub rsp,38
7FFC622C48D4 xor eax,eax
7FFC622C48D6 mov [rsp+28],rax
7FFC622C48DB mov rax,[rcx+8]
7FFC622C48DF mov rdx,[rcx+18]
7FFC622C48E3 test rdx,rdx
7FFC622C48E6 je short M00_L01
7FFC622C48E8 mov ecx,[rdx+8]
7FFC622C48EB cmp ecx,0A
7FFC622C48EE jb short M00_L00
7FFC622C48F0 add rdx,10
7FFC622C48F4 mov [rsp+28],rdx
7FFC622C48F9 mov dword ptr [rsp+30],0A
7FFC622C4901 lea rdx,[rsp+28]
7FFC622C4906 mov rcx,rax
7FFC622C4909 cmp [rcx],ecx
7FFC622C490B call qword ptr [7FFC627379C0]; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
7FFC622C4911 nop
7FFC622C4912 add rsp,38
7FFC622C4916 ret
M00_L00:
7FFC622C4917 call qword ptr [7FFC6273D878]
7FFC622C491D int 3
M00_L01:
7FFC622C491E mov ecx,2
7FFC622C4923 call qword ptr [7FFC6220FD38]
7FFC622C4929 int 3
; Total bytes of code 90
; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
; if (_impl is XoshiroImpl xoshiro)
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; ulong bound = 2432902008176640000; // 20!
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; int nextIndex = Math.Min(20, values.Length);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; for (int i = 1; i < values.Length;)
; ^^^^^^^^^
; ulong r = xoshiro.NextUInt64();
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; ulong rbound = r * bound;
; ^^^^^^^^^^^^^^^^^^^^^^^^^
; if (rbound > 0 - bound)
; ^^^^^^^^^^^^^^^^^^^^^^^
; ulong r2 = xoshiro.NextUInt64();
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; sum = rbound + lohi;
; ^^^^^^^^^^^^^^^^^^^^
; carry = sum < rbound ? 1ul : 0ul;
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; rbound = lolo;
; ^^^^^^^^^^^^^^
; } while (sum == ~0ul);
; ^^^^^^^^^^^^^^^^^^^^
; r += carry;
; ^^^^^^^^^^^
; for (int m = i; m < nextIndex; m++)
; ^^^^^^^^^
; int index = (int)Math.BigMul(r, (ulong)(m + 1), out r);
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; var temp = values[m];
; ^^^^^^^^^^^^^^^^^^^^^
; values[m] = values[index];
; ^^^^^^^^^^^^^^^^^^^^^^^^^^
; values[index] = temp;
; ^^^^^^^^^^^^^^^^^^^^^
; i = nextIndex;
; ^^^^^^^^^^^^^^
; bound = (ulong)(i + 1);
; ^^^^^^^^^^^^^^^^^^^^^^^
; for (nextIndex = i + 1; nextIndex < values.Length; nextIndex++)
; ^^^^^^^^^^^^^^^^^
; if (Math.BigMul(bound, (ulong)(nextIndex + 1), out var newbound) == 0)
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
; bound = newbound;
; ^^^^^^^^^^^^^^^^^
; int n = values.Length;
; ^^^^^^^^^^^^^^^^^^^^^^
; for (int i = 0; i < n - 1; i++)
; ^^^^^^^^^
; int j = Next(i, n);
; ^^^^^^^^^^^^^^^^^^^
; if (j != i)
; ^^^^^^^^^^^
; T temp = values[i];
; ^^^^^^^^^^^^^^^^^^^
; values[i] = values[j];
; ^^^^^^^^^^^^^^^^^^^^^^
; values[j] = temp;
; ^^^^^^^^^^^^^^^^^
7FFC622C4AE0 push r15
7FFC622C4AE2 push r14
7FFC622C4AE4 push r13
7FFC622C4AE6 push rdi
7FFC622C4AE7 push rsi
7FFC622C4AE8 push rbp
7FFC622C4AE9 push rbx
7FFC622C4AEA sub rsp,40
7FFC622C4AEE mov rbx,rcx
7FFC622C4AF1 mov rsi,[rdx]
7FFC622C4AF4 mov edi,[rdx+8]
7FFC622C4AF7 mov rcx,[rbx+8]
7FFC622C4AFB test rcx,rcx
7FFC622C4AFE je near ptr M01_L09
7FFC622C4B04 mov rdx,offset MT_System.Random+XoshiroImpl
7FFC622C4B0E cmp [rcx],rdx
7FFC622C4B11 jne near ptr M01_L09
7FFC622C4B17 mov rdx,21C3677C82B40000
7FFC622C4B21 mov r8d,14
7FFC622C4B27 cmp edi,14
7FFC622C4B2A cmovl r8d,edi
7FFC622C4B2E mov eax,1
7FFC622C4B33 cmp edi,1
7FFC622C4B36 jle near ptr M01_L06
M01_L00:
7FFC622C4B3C mov r10,[rcx+8]
7FFC622C4B40 mov r9,[rcx+10]
7FFC622C4B44 mov r11,[rcx+18]
7FFC622C4B48 mov rbx,[rcx+20]
7FFC622C4B4C mov rbp,r9
7FFC622C4B4F shl rbp,11
7FFC622C4B53 xor r11,r10
7FFC622C4B56 xor rbx,r9
7FFC622C4B59 lea r14,[r9+r9*4]
7FFC622C4B5D rol r14,7
7FFC622C4B61 lea r14,[r14+r14*8]
7FFC622C4B65 xor r9,r11
7FFC622C4B68 xor r10,rbx
7FFC622C4B6B xor r11,rbp
7FFC622C4B6E rol rbx,2D
7FFC622C4B72 mov [rcx+8],r10
7FFC622C4B76 mov [rcx+10],r9
7FFC622C4B7A mov [rcx+18],r11
7FFC622C4B7E mov [rcx+20],rbx
7FFC622C4B82 mov r10,r14
7FFC622C4B85 imul r10,rdx
7FFC622C4B89 mov [rsp+38],rdx
7FFC622C4B8E mov r9,rdx
7FFC622C4B91 neg r9
7FFC622C4B94 cmp r9,r10
7FFC622C4B97 jb short M01_L04
M01_L01:
7FFC622C4B99 cmp eax,r8d
7FFC622C4B9C jge short M01_L03
7FFC622C4B9E xchg ax,ax
M01_L02:
7FFC622C4BA0 lea r11d,[rax+1]
7FFC622C4BA4 movsxd r10,r11d
7FFC622C4BA7 lea r9,[rsp+28]
7FFC622C4BAC mov rdx,r14
7FFC622C4BAF mulx rdx,rbx,r10
7FFC622C4BB4 mov [r9],rbx
7FFC622C4BB7 mov r14,[rsp+28]
7FFC622C4BBC cmp eax,edi
7FFC622C4BBE jae near ptr M01_L12
7FFC622C4BC4 mov eax,eax
7FFC622C4BC6 mov r10d,[rsi+rax*4]
7FFC622C4BCA cmp edx,edi
7FFC622C4BCC jae near ptr M01_L12
7FFC622C4BD2 mov edx,edx
7FFC622C4BD4 mov r9d,[rsi+rdx*4]
7FFC622C4BD8 mov [rsi+rax*4],r9d
7FFC622C4BDC mov [rsi+rdx*4],r10d
7FFC622C4BE0 mov eax,r11d
7FFC622C4BE3 cmp eax,r8d
7FFC622C4BE6 jl short M01_L02
M01_L03:
7FFC622C4BE8 mov eax,r8d
7FFC622C4BEB lea r8d,[rax+1]
7FFC622C4BEF movsxd r11,r8d
7FFC622C4BF2 cmp r8d,edi
7FFC622C4BF5 jge near ptr M01_L05
7FFC622C4BFB jmp near ptr M01_L08
M01_L04:
7FFC622C4C00 mov r9,[rcx+8]
7FFC622C4C04 mov r11,[rcx+10]
7FFC622C4C08 mov rbx,[rcx+18]
7FFC622C4C0C mov rbp,[rcx+20]
7FFC622C4C10 mov r15,r11
7FFC622C4C13 shl r15,11
7FFC622C4C17 xor rbx,r9
7FFC622C4C1A xor rbp,r11
7FFC622C4C1D lea r13,[r11+r11*4]
7FFC622C4C21 rol r13,7
7FFC622C4C25 lea r13,[r13+r13*8]
7FFC622C4C2A xor r11,rbx
7FFC622C4C2D xor r9,rbp
7FFC622C4C30 xor rbx,r15
7FFC622C4C33 rol rbp,2D
7FFC622C4C37 mov [rcx+8],r9
7FFC622C4C3B mov [rcx+10],r11
7FFC622C4C3F mov [rcx+18],rbx
7FFC622C4C43 mov [rcx+20],rbp
7FFC622C4C47 lea r9,[rsp+30]
7FFC622C4C4C mov r11,[rsp+38]
7FFC622C4C51 mov rdx,r13
7FFC622C4C54 mulx rdx,rbx,r11
7FFC622C4C59 mov [r9],rbx
7FFC622C4C5C mov r9,[rsp+30]
7FFC622C4C61 add rdx,r10
7FFC622C4C64 cmp rdx,r10
7FFC622C4C67 setb r10b
7FFC622C4C6B movzx r10d,r10b
7FFC622C4C6F cmp rdx,0FFFFFFFFFFFFFFFF
7FFC622C4C73 mov [rsp+38],r11
7FFC622C4C78 je short M01_L07
7FFC622C4C7A add r14,r10
7FFC622C4C7D jmp near ptr M01_L01
M01_L05:
7FFC622C4C82 cmp eax,edi
7FFC622C4C84 mov rdx,r11
7FFC622C4C87 jl near ptr M01_L00
M01_L06:
7FFC622C4C8D add rsp,40
7FFC622C4C91 pop rbx
7FFC622C4C92 pop rbp
7FFC622C4C93 pop rsi
7FFC622C4C94 pop rdi
7FFC622C4C95 pop r13
7FFC622C4C97 pop r14
7FFC622C4C99 pop r15
7FFC622C4C9B ret
M01_L07:
7FFC622C4C9C mov r10,r9
7FFC622C4C9F jmp near ptr M01_L04
M01_L08:
7FFC622C4CA4 lea edx,[r8+1]
7FFC622C4CA8 movsxd r10,edx
7FFC622C4CAB lea r9,[rsp+20]
7FFC622C4CB0 mov rdx,r11
7FFC622C4CB3 mulx rdx,rbx,r10
7FFC622C4CB8 mov [r9],rbx
7FFC622C4CBB mov r10,[rsp+20]
7FFC622C4CC0 test rdx,rdx
7FFC622C4CC3 jne short M01_L05
7FFC622C4CC5 mov r11,r10
7FFC622C4CC8 inc r8d
7FFC622C4CCB cmp r8d,edi
7FFC622C4CCE jl short M01_L08
7FFC622C4CD0 jmp short M01_L05
M01_L09:
7FFC622C4CD2 mov ebp,edi
7FFC622C4CD4 xor r14d,r14d
M01_L10:
7FFC622C4CD7 lea ecx,[rdi-1]
7FFC622C4CDA cmp r14d,ecx
7FFC622C4CDD jge short M01_L06
7FFC622C4CDF mov rcx,rbx
7FFC622C4CE2 mov edx,r14d
7FFC622C4CE5 mov r8d,edi
7FFC622C4CE8 mov rax,[rbx]
7FFC622C4CEB mov rax,[rax+40]
7FFC622C4CEF call qword ptr [rax+30]
7FFC622C4CF2 cmp eax,r14d
7FFC622C4CF5 je short M01_L11
7FFC622C4CF7 mov ecx,[rsi+r14*4]
7FFC622C4CFB cmp eax,ebp
7FFC622C4CFD jae short M01_L12
7FFC622C4CFF mov edx,eax
7FFC622C4D01 mov edx,[rsi+rdx*4]
7FFC622C4D04 mov [rsi+r14*4],edx
7FFC622C4D08 mov eax,eax
7FFC622C4D0A mov [rsi+rax*4],ecx
M01_L11:
7FFC622C4D0D inc r14d
7FFC622C4D10 jmp short M01_L10
M01_L12:
7FFC622C4D12 call CORINFO_HELP_RNGCHKFAIL
7FFC622C4D17 int 3
; Total bytes of code 568
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andanteyk I guess the end results imply that the unsafe code is not worth it
@@ -288,17 +289,60 @@ public static string GetHexString(int stringLength, bool lowercase = false) | |||
/// <typeparam name="T">The type of span.</typeparam> | |||
public static void Shuffle<T>(Span<T> values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes for cryptographic RNG may subject to more careful review for security requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
src/libraries/System.Private.CoreLib/src/System/Random.cs:364
- Ensure that the new shuffling logic introduced in this method is covered by tests.
public void Shuffle<T>(Span<T> values)
A few pieces of feedback, which I can't think of a good "next to some code" place to leave them:
I copied this implementation locally and asked it to repeatedly shuffle arrays of a few chosen lengths, and asked where the original Span<int> values = Enumerable.Range(0, Size).Select(v => (int)v).ToArray();
Span<int> clone = values.ToArray();
int[] positions = new int[Size];
Dictionary<int, int> callCount = new();
Console.WriteLine($"Input: [0, {Size}). Trials: {TrialCount:#,###}");
Console.WriteLine();
for (int i = 0; i < TrialCount; i++)
{
values.CopyTo(clone);
BatchShuffle(clone, out int rngCalls);
positions[clone.IndexOf(0)]++;
ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(callCount, rngCalls, out _);
count++;
} Which gives me an answer like:
"All around 5000" looks uniformish, but basically this function is itself a very expensive PRNG, so doing a statistical analysis of the output would be goodness. Since I was mainly concerned with "is it always in [0, 20) because of the 20!?" I didn't wire in any statistical testing. |
int index = (int)Math.BigMul(r, (ulong)(m + 1), out r); | ||
|
||
// Swap span[m] <-> span[index] | ||
var temp = values[m]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var temp = values[m]; | |
T temp = values[m]; |
Thank you for your review.
That's slightly different from Here I'm certainly doing some preprocessing to make sure
The So the next operation will only process 13 elements instead of 20, because
That's correct because it allows for overflow, since it actually only needs the lowest 64 bits. Mathematically that means
This algorithm is essentially the same as Fisher-Yates. |
I did some simple statistical testing. I shuffled an array of consecutive numbers by I also ran it on .NET 9.0 and found the results to be pretty consistent.
This testing is going to take a very long time and I don't think it's appropriate to include it in the repository. What do you think? using System.Diagnostics;
using System.Security.Cryptography;
Console.WriteLine($"{System.Runtime.InteropServices.RuntimeInformation.FrameworkDescription}");
var bucket = new int[256 * 256];
var source = new int[256];
long length = 100_000_000;
var stopwatch = Stopwatch.StartNew();
for (long i = 0; i < length; i++)
{
for (int k = 0; k < source.Length; k++)
{
source[k] = k;
}
RandomNumberGenerator.Shuffle(source.AsSpan());
for (int k = 0; k < source.Length; k++)
{
bucket[source[k] * 256 + k]++;
}
}
stopwatch.Stop();
int min = int.MaxValue;
int max = 0;
double average = 0;
double stdev = 0;
for (int i = 0; i < bucket.Length; i++)
{
min = Math.Min(min, bucket[i]);
max = Math.Max(max, bucket[i]);
average += bucket[i];
}
average /= bucket.Length;
for (int i = 0; i < bucket.Length; i++)
{
stdev += (bucket[i] - average) * (bucket[i] - average);
}
stdev = Math.Sqrt(stdev / bucket.Length);
Console.WriteLine($"elapsed: {stopwatch.Elapsed}");
Console.WriteLine($"avg = {average}, stdev = {stdev}, min = {min} ({min / average:p}), max = {max} ({max / average:p})"); |
Having it in the test suite as a disabled test at least gives us a basis of "what sort of test should we run when we change Shuffle?". It can be guarded (disabled) by something like Line 8 in 86a7ff9
I'd include the stats numbers from 9 and this change as a comment, so that future run needs would know what to compare to. Tests generally assert something rather than just print, so probably should assert that the stddev is below 700? |
@richlander I'm not sure if the pedigree of this change has any impact on our Third Party Notices. What are your thoughts? |
A quick look suggests that the algorithm is from a paper referenced at swiftlang/swift#39143. We nee to validate the license asserted by that paper. |
From the perspective of the CSPRNG... I think the answer is we don't want to accept this right now. Before I get into the why, I want to say that this has been very interesting, and I have been completely nerd sniped... having done little else since seeing this PR (and letting it significantly infiltrate my "outside of work" time). We've had a lot of "backroom chat" on this, but I think it comes down to
So while we want to say "no", we're not saying that the work should be thrown away... rather we should heavily pivot (and the pivot probably means we should close the PR and come back to it). The most useful piece would be to define some sort of batched random function. One such beast could be public partial class RandomNumberGenerator
{
public static void BatchedRandom(ReadOnlySpan<int> toExclusives, Span<int> destination)
{
if (toExclusives.Length != destination.Length)
throw new SomeKindOfArgumentException();
// if any of toExclusives are <= 0, throw something else.
ReadOnlySpan<uint> exclusiveUints = MemoryMarshal.Cast<int, uint>(toExclusives);
Span<uint> destinationUints = MemoryMarshal.Cast<int, uint>(destination);
while (exclusiveUints.Length > 0)
{
int n = BatchedRandomCore(MemoryMarshal.toExclusives, destination);
exclusiveUints= exclusiveUints.Slice(n);
destinationUints = destinationUints.Slice(n);
}
}
private static int BatchedRandomCore(ReadOnlySpan<uint> toExclusives, Span<uint> destination)
{
// Provides an implementation of "Batched ranged random integer generation"
// by Brackett-Rozinsky and Lemire (https://doi.org/10.1002/spe.3369)
//
// This is inspired by Algorithm 3 (Batched dice rolls (unknown threshold)),
// but it determines the maximum number of elements it can process in one batch,
// and checks the suitability of r before calculating/committing any individual a[i] values.
Debug.Assert(toExclusives.Length == destination.Length);
int n;
ulong bound = 1;
// Determine how many elements we can process this call.
// When this loop exits n is either the input length (process everything),
// or how many items were processed before overflowing ulong (a partial batch).
for (n = 0; n < toExclusives.Length; n++)
{
ulong carry = Math.BigMul(bound, toExclusives[n], out ulong newBound);
if (carry != 0)
{
break;
}
bound = newBound;
}
Debug.Assert(n > 0);
// Choose a random 64-bit value for r. If `r * bound` is suitable, then r is suitable.
// If it isn't, then choose another r and repeat.
ulong r = 0;
FillSpan(MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref r, 1)));
ulong rBoundLower = unchecked(r * bound);
if (rBoundLower < bound)
{
ulong t = (0 - bound) % bound;
while (rBoundLower < t)
{
FillSpan(MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref r, 1)));
rBoundLower = unchecked(r * bound);
}
}
// r and r * bound are suitable. Build up the individual a[i].
for (int i = n - 1; i >= 0; --i)
{
destination[i] = Math.BigMul(r, bounds[i], out r);
}
// We ended up at the same rBoundLower value. Math works, hooray.
Debug.Assert(r == rBoundLower);
return n;
}
} Given that function, we could easily write GetItems and Shuffle in terms of it. Of course, the function itself has questions:
Since we're now talking about introducing public API, it needs to go through the API Review process. Along the way a few other useful functions have suggested themselves:
Maybe those go in the same API proposal. Maybe different. That was all, of course, from the CSPRNG perspective. Since the crypto team is also who added Shuffle and GetItems to System.Random, we might be the de facto owners there... so then similar arguments apply. However, the speed of Xoshiro is so high compared to the CSPRNG that it's not clear that batching as public API is as important. What is important is that I don't believe, at this time, that we're comfortable using the "swift PR" algorithm (even though Chi-Square says it's valid). |
Thank you for your review.
It is also possible to implement it using the same Lemire style as the current // Change the code to correct `r` to the following:
if (rbound < bound)
{
ulong mod = (0 - bound) % bound;
while (rbound < mod)
{
r = NextUInt64();
rbound = r * bound;
}
}
( If your biggest concern is with "Swift PR", that can be resolved.
Faster is better, isn't it? :)
If I were to make (another) PR to speed up |
In security code, simpler is generally better (algorithms may be complicated, but the implementation of the algorithm should start simple and evolve as needed). The fastest of "simple enough" approaches is often what is chosen, rather than necessarily "the fastest", except for things that truly end up being the bottleneck (for example, BitLocker uses AES, so the throughput of AES might actually be the rate limiting factor on reading from a fast drive under moderate CPU load). Shuffle or GetItems having inline batched randomization is not simple, but making them in terms of existing batched random functionality is.
For the CSPRNG, the answer is the same as with Shuffle... if we stop and make batched random a feature in and of itself first, sure. But without evidence that 69ns/item is too slow, there's not really justification for making the code more complex. |
I've started a proposal at #111211. |
Same question would apply to the batched API -- if the underlying implementation for getting a single value was that much cheaper, would we care to expose batched APIs? |
I'm a little weary of that because I don't know if that has compliance implications. Apple's RNG, for example, does have a cache - but only if you ask for 12 bytes or less. They bypass the cache if you ask for more, ostensibly for FIPS reasons. See #106525 (comment) for why. |
Understood. Once again, thank you for your review. |
Summary
This PR aims to speed up
System.Random.Shuffle()
andSystem.Security.Cryptography.RandomNumberGenerator.Shuffle()
.Currently,
Shuffle()
is a naive implementation of the Fisher-Yates shuffle, which means that it generates one random number for each index retrieval.Therefore, I aim to speed up the implementation by splitting the random number.
Benchmark
Details
The idea behind this method comes from the "Batched ranged random number generation".
The procedure is as follows:
n = 2 * 3 * 4 * ... * 20
. This is equal to the maximum factorial number that can be handled within the range ofulong
.r
usingNextUInt64()
.r
so that it can be evenly converted to0 <= r < n
.r
and perform the Fisher-Yates shuffle.n = 21 * 22 * ...
. Repeat the process.In 1,
n
is a constant, so we could pre-calculate it and embed or cache it.However, this was not cost-effective, so this time we pre-calculate only the first
n
.3 is almost the same as generating uniform random numbers in a specific range.
Currently,
Random.NextInt(min, max)
uses the nearly divisionless method proposed by Daniel Lemire, but this time I adopted the method proposed in swiftlang/swift#39143.The reason is that it completely eliminates slow division, and benchmark tests showed that swift method is slightly faster.
Note that this method does not bias the probability like
r % n
.4 is calculated as follows:
To borrow a phrase from the paper, this is expressed in mixed-radix notation.
First the 1! digit (
[0..2]*1
), then the 2! digit ([0..3]*2
), then the 3! digit ([0..4]*6
), and so on.Note that this technique can also be applied to
GetItems()
, although it is out of scope in this PR.This change only applies to implementations that do not specify a seed.
If a seed is specified, the code is switched to work the same as the previous implementation for compatibility reasons.