Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up Random.Shuffle by batching #111015

Closed
wants to merge 5 commits into from

Conversation

andanteyk
Copy link

Summary

This PR aims to speed up System.Random.Shuffle() and System.Security.Cryptography.RandomNumberGenerator.Shuffle().

Currently, Shuffle() is a naive implementation of the Fisher-Yates shuffle, which means that it generates one random number for each index retrieval.
Therefore, I aim to speed up the implementation by splitting the random number.

Benchmark

private Random _randomUnseeded = new Random();
private int[] _destinationInt = new int[1 << 20];

[Benchmark] public void shuffleUnseededInt10() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..10));
[Benchmark] public void shuffleUnseededInt1000() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..1000));

[Benchmark] public void shuffleCryptoInt10() => RandomNumberGenerator.Shuffle(_destinationInt.AsSpan(..10));
[Benchmark] public void shuffleCryptoInt1000() => RandomNumberGenerator.Shuffle(_destinationInt.AsSpan(..1000));
Method Mean Error StdDev
shuffleUnseededInt10_Old 37.197 ns 0.7427 ns 0.7295 ns
shuffleUnseededInt10_New 10.086 ns 0.0783 ns 0.0694 ns
shuffleUnseededInt1000_Old 1,899.355 ns 15.0212 ns 14.0509 ns
shuffleUnseededInt1000_New 1,437.318 ns 21.1372 ns 19.7718 ns
shuffleCryptoInt10_Old 590.400 ns 5.9734 ns 5.5875 ns
shuffleCryptoInt10_New 66.692 ns 0.6127 ns 0.5116 ns
shuffleCryptoInt1000_Old 69,274.924 ns 1,193.3118 ns 1,276.8303 ns
shuffleCryptoInt1000_New 9,377.440 ns 130.9093 ns 122.4527 ns

Details

The idea behind this method comes from the "Batched ranged random number generation".
The procedure is as follows:

  1. Calculate n = 2 * 3 * 4 * ... * 20. This is equal to the maximum factorial number that can be handled within the range of ulong.
  2. Generate a 64-bit random number r using NextUInt64().
  3. Adjust r so that it can be evenly converted to 0 <= r < n.
  4. Split r and perform the Fisher-Yates shuffle.
  5. Calculate next n = 21 * 22 * ​​.... Repeat the process.

In 1, n is a constant, so we could pre-calculate it and embed or cache it.
However, this was not cost-effective, so this time we pre-calculate only the first n.

3 is almost the same as generating uniform random numbers in a specific range.
Currently, Random.NextInt(min, max) uses the nearly divisionless method proposed by Daniel Lemire, but this time I adopted the method proposed in swiftlang/swift#39143.
The reason is that it completely eliminates slow division, and benchmark tests showed that swift method is slightly faster.
Note that this method does not bias the probability like r % n.

// `factorial` is `n`
  
// Generates 64-bit random number 
ulong r = rng.NextUInt64();  
  
ulong rlo = r * factorial;  
  
// Look at the lower 64 bits (rlo) of r * factorial, 
// and if there is a carryover possibility...
// (The maximum value added in subsequent processing is factorial - 1, 
//  so if rlo <= (2^64) - factorial, no carryover occurs.)  
while (rlo > 0ul - factorial)  
{  
    // Generate an additional random number t and check if it carries over  
    //   [rhi] . [rlo]        -> r * factorial; upper rhi / lower rlo
    // +     0 . [thi] [tlo]  -> t * factorial; upper thi / lower tlo  
    // ---------------------  
    //   [carry.  sum] [tlo]  -> rhi + carry is the result  
    ulong thi = Math.BigMul(rng.Next(), factorial, out ulong tlo);  
    ulong sum = rlo + thi;  
    ulong carry = sum < thi ? 1ul : 0ul;  
  
    // If sum == 0xff...ff, there is a possibility of a carry in the future,
    // so calculate it again.
    // If not, there will be no more carry, so add the carry and finish.
    if (sum != ~0ul)  
    {  
        r += carry;  
        break;  
    }  
  
    rlo = tlo;  
}  
  
// Here, rhi is a uniform random number distributed without bias 
// in the range 0 <= rhi < factorial
ulong rhi = Math.BigMul(r, factorial, out _);  

4 is calculated as follows:

int t2 = (int)Math.BigMul(r, 2ul, out r);   // [0, 2)  
int t3 = (int)Math.BigMul(r, 3ul, out r);   // [0, 3)  
// ...  
int t20 = (int)Math.BigMul(r, 20ul, out r);   // [0, 20)  

To borrow a phrase from the paper, this is expressed in mixed-radix notation.
First the 1! digit ([0..2]*1), then the 2! digit ([0..3]*2), then the 3! digit ([0..4]*6), and so on.


Note that this technique can also be applied to GetItems(), although it is out of scope in this PR.

This change only applies to implementations that do not specify a seed.
If a seed is specified, the code is switched to work the same as the previous implementation for compatibility reasons.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 1, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones
See info in area-owners.md if you want to be subscribed.

@andanteyk
Copy link
Author

@dotnet-policy-service agree

ulong r = xoshiro.NextUInt64();

// Correct r to be unbiased.
// Based on /~https://github.com/swiftlang/swift/pull/39143
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to a link to the "original" idea for the algorithm, and potentially a brief description of it.

@@ -363,17 +363,78 @@ public void Shuffle<T>(T[] values)
/// </remarks>
public void Shuffle<T>(Span<T> values)
{
int n = values.Length;
if (_impl is XoshiroImpl xoshiro)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern is not so straight to embed implementation details of XoshiroImpl in base class. Let's see if others have better idea.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also seeing problems.
This implementation requires NextUInt64(), so currently XoshiroImpl is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather ImplBase have a virtual or abstract Shuffle method on it, which this public Shuffle would just delegate to on _impl. Then the XoshiroImpl-specific implementation can be an override.

Comment on lines 398 to 401
ref var head = ref MemoryMarshal.GetReference(values);
var t = Unsafe.Add(ref head, m);
Unsafe.Add(ref head, m) = Unsafe.Add(ref head, index);
Unsafe.Add(ref head, index) = t;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the bound check be eliminated by JIT with idiomatic access? If not, consider to add assert about it.

Copy link
Member

@EgorBo EgorBo Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the bound checks are not removed please provide some strong motivation for it to exist as it is not aligned with our strategy to reduce unsafe code in the BCL.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried it, the benchmark results were very slightly worse.
When I checked the JIT code implemented with swapping without using Unsafe, I found that the amount of code jumping to CORINFO_HELP_RNGCHKFAIL was increasing.

Method Mean Error StdDev
shuffleUnseededInt10_Unsafe 10.086 ns 0.0783 ns 0.0694 ns
shuffleUnseededInt10_Safe 10.372 ns 0.1657 ns 0.1550 ns
shuffleUnseededInt1000_Unsafe 1,437.318 ns 21.1372 ns 19.7718 ns
shuffleUnseededInt1000_Safe 1,468.840 ns 18.4242 ns 17.2340 ns
JIT asm

// with Unsafe swap
// The reason why CORINFO_HELP_RNGCHKFAIL is also in the "with Unsafe" code is because the fallback code contains a normal array reference.

.NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

; System.Tests.Perf_Random_Mine.shuffleUnseededInt10()
;         [Benchmark] public void shuffleUnseededInt10() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..10));
;                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       7FFC77DD48D0 sub       rsp,38
       7FFC77DD48D4 xor       eax,eax
       7FFC77DD48D6 mov       [rsp+28],rax
       7FFC77DD48DB mov       rax,[rcx+8]
       7FFC77DD48DF mov       rdx,[rcx+18]
       7FFC77DD48E3 test      rdx,rdx
       7FFC77DD48E6 je        short M00_L01
       7FFC77DD48E8 mov       ecx,[rdx+8]
       7FFC77DD48EB cmp       ecx,0A
       7FFC77DD48EE jb        short M00_L00
       7FFC77DD48F0 add       rdx,10
       7FFC77DD48F4 mov       [rsp+28],rdx
       7FFC77DD48F9 mov       dword ptr [rsp+30],0A
       7FFC77DD4901 lea       rdx,[rsp+28]
       7FFC77DD4906 mov       rcx,rax
       7FFC77DD4909 cmp       [rcx],ecx
       7FFC77DD490B call      qword ptr [7FFC78247960]; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
       7FFC77DD4911 nop
       7FFC77DD4912 add       rsp,38
       7FFC77DD4916 ret
M00_L00:
       7FFC77DD4917 call      qword ptr [7FFC7824D830]
       7FFC77DD491D int       3
M00_L01:
       7FFC77DD491E mov       ecx,2
       7FFC77DD4923 call      qword ptr [7FFC77D1FD38]
       7FFC77DD4929 int       3
; Total bytes of code 90
; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
;             if (_impl is XoshiroImpl xoshiro)
;             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 ulong bound = 2432902008176640000;        // 20!
;                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 int nextIndex = Math.Min(20, values.Length);
;                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 for (int i = 1; i < values.Length;)
;                      ^^^^^^^^^
;                     ulong r = xoshiro.NextUInt64();
;                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                     ulong rbound = r * bound;
;                     ^^^^^^^^^^^^^^^^^^^^^^^^^
;                     if (rbound > 0 - bound)
;                     ^^^^^^^^^^^^^^^^^^^^^^^
;                             ulong r2 = xoshiro.NextUInt64();
;                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             sum = rbound + lohi;
;                             ^^^^^^^^^^^^^^^^^^^^
;                             carry = sum < rbound ? 1ul : 0ul;
;                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             rbound = lolo;
;                             ^^^^^^^^^^^^^^
;                         } while (sum == ~0ul);
;                           ^^^^^^^^^^^^^^^^^^^^
;                         r += carry;
;                         ^^^^^^^^^^^
;                     for (int m = i; m < nextIndex; m++)
;                          ^^^^^^^^^
;                         int index = (int)Math.BigMul(r, (ulong)(m + 1), out r);
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         ref var head = ref MemoryMarshal.GetReference(values);
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         var t = Unsafe.Add(ref head, m);
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         Unsafe.Add(ref head, m) = Unsafe.Add(ref head, index);
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         Unsafe.Add(ref head, index) = t;
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                     i = nextIndex;
;                     ^^^^^^^^^^^^^^
;                     bound = (ulong)(i + 1);
;                     ^^^^^^^^^^^^^^^^^^^^^^^
;                     for (nextIndex = i + 1; nextIndex < values.Length; nextIndex++)
;                          ^^^^^^^^^^^^^^^^^
;                         if (Math.BigMul(bound, (ulong)(nextIndex + 1), out var newbound) == 0)
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             bound = newbound;
;                             ^^^^^^^^^^^^^^^^^
;                 int n = values.Length;
;                 ^^^^^^^^^^^^^^^^^^^^^^
;                 for (int i = 0; i < n - 1; i++)
;                      ^^^^^^^^^
;                     int j = Next(i, n);
;                     ^^^^^^^^^^^^^^^^^^^
;                     if (j != i)
;                     ^^^^^^^^^^^
;                         T temp = values[i];
;                         ^^^^^^^^^^^^^^^^^^^
;                         values[i] = values[j];
;                         ^^^^^^^^^^^^^^^^^^^^^^
;                         values[j] = temp;
;                         ^^^^^^^^^^^^^^^^^
       7FFC77DD4AE0 push      r15
       7FFC77DD4AE2 push      r14
       7FFC77DD4AE4 push      r13
       7FFC77DD4AE6 push      rdi
       7FFC77DD4AE7 push      rsi
       7FFC77DD4AE8 push      rbp
       7FFC77DD4AE9 push      rbx
       7FFC77DD4AEA sub       rsp,40
       7FFC77DD4AEE mov       rbx,rcx
       7FFC77DD4AF1 mov       rsi,[rdx]
       7FFC77DD4AF4 mov       edi,[rdx+8]
       7FFC77DD4AF7 mov       rcx,[rbx+8]
       7FFC77DD4AFB test      rcx,rcx
       7FFC77DD4AFE je        near ptr M01_L09
       7FFC77DD4B04 mov       rdx,offset MT_System.Random+XoshiroImpl
       7FFC77DD4B0E cmp       [rcx],rdx
       7FFC77DD4B11 jne       near ptr M01_L09
       7FFC77DD4B17 mov       rdx,21C3677C82B40000
       7FFC77DD4B21 mov       r8d,14
       7FFC77DD4B27 cmp       edi,14
       7FFC77DD4B2A cmovl     r8d,edi
       7FFC77DD4B2E mov       eax,1
       7FFC77DD4B33 cmp       edi,1
       7FFC77DD4B36 jle       near ptr M01_L06
M01_L00:
       7FFC77DD4B3C mov       r10,[rcx+8]
       7FFC77DD4B40 mov       r9,[rcx+10]
       7FFC77DD4B44 mov       r11,[rcx+18]
       7FFC77DD4B48 mov       rbx,[rcx+20]
       7FFC77DD4B4C mov       rbp,r9
       7FFC77DD4B4F shl       rbp,11
       7FFC77DD4B53 xor       r11,r10
       7FFC77DD4B56 xor       rbx,r9
       7FFC77DD4B59 lea       r14,[r9+r9*4]
       7FFC77DD4B5D rol       r14,7
       7FFC77DD4B61 lea       r14,[r14+r14*8]
       7FFC77DD4B65 xor       r9,r11
       7FFC77DD4B68 xor       r10,rbx
       7FFC77DD4B6B xor       r11,rbp
       7FFC77DD4B6E rol       rbx,2D
       7FFC77DD4B72 mov       [rcx+8],r10
       7FFC77DD4B76 mov       [rcx+10],r9
       7FFC77DD4B7A mov       [rcx+18],r11
       7FFC77DD4B7E mov       [rcx+20],rbx
       7FFC77DD4B82 mov       r10,r14
       7FFC77DD4B85 imul      r10,rdx
       7FFC77DD4B89 mov       [rsp+38],rdx
       7FFC77DD4B8E mov       r9,rdx
       7FFC77DD4B91 neg       r9
       7FFC77DD4B94 cmp       r9,r10
       7FFC77DD4B97 jb        short M01_L04
M01_L01:
       7FFC77DD4B99 cmp       eax,r8d
       7FFC77DD4B9C jge       short M01_L03
       7FFC77DD4B9E xchg      ax,ax
M01_L02:
       7FFC77DD4BA0 lea       r11d,[rax+1]
       7FFC77DD4BA4 movsxd    r10,r11d
       7FFC77DD4BA7 lea       r9,[rsp+28]
       7FFC77DD4BAC mov       rdx,r14
       7FFC77DD4BAF mulx      rdx,rbx,r10
       7FFC77DD4BB4 mov       [r9],rbx
       7FFC77DD4BB7 mov       r14,[rsp+28]
       7FFC77DD4BBC cdqe
       7FFC77DD4BBE mov       r10d,[rsi+rax*4]
       7FFC77DD4BC2 movsxd    rdx,edx
       7FFC77DD4BC5 mov       r9d,[rsi+rdx*4]
       7FFC77DD4BC9 mov       [rsi+rax*4],r9d
       7FFC77DD4BCD mov       [rsi+rdx*4],r10d
       7FFC77DD4BD1 mov       eax,r11d
       7FFC77DD4BD4 cmp       eax,r8d
       7FFC77DD4BD7 jl        short M01_L02
M01_L03:
       7FFC77DD4BD9 mov       eax,r8d
       7FFC77DD4BDC lea       r8d,[rax+1]
       7FFC77DD4BE0 movsxd    r11,r8d
       7FFC77DD4BE3 cmp       r8d,edi
       7FFC77DD4BE6 jge       near ptr M01_L05
       7FFC77DD4BEC jmp       near ptr M01_L08
M01_L04:
       7FFC77DD4BF1 mov       r9,[rcx+8]
       7FFC77DD4BF5 mov       r11,[rcx+10]
       7FFC77DD4BF9 mov       rbx,[rcx+18]
       7FFC77DD4BFD mov       rbp,[rcx+20]
       7FFC77DD4C01 mov       r15,r11
       7FFC77DD4C04 shl       r15,11
       7FFC77DD4C08 xor       rbx,r9
       7FFC77DD4C0B xor       rbp,r11
       7FFC77DD4C0E lea       r13,[r11+r11*4]
       7FFC77DD4C12 rol       r13,7
       7FFC77DD4C16 lea       r13,[r13+r13*8]
       7FFC77DD4C1B xor       r11,rbx
       7FFC77DD4C1E xor       r9,rbp
       7FFC77DD4C21 xor       rbx,r15
       7FFC77DD4C24 rol       rbp,2D
       7FFC77DD4C28 mov       [rcx+8],r9
       7FFC77DD4C2C mov       [rcx+10],r11
       7FFC77DD4C30 mov       [rcx+18],rbx
       7FFC77DD4C34 mov       [rcx+20],rbp
       7FFC77DD4C38 lea       r9,[rsp+30]
       7FFC77DD4C3D mov       r11,[rsp+38]
       7FFC77DD4C42 mov       rdx,r13
       7FFC77DD4C45 mulx      rdx,rbx,r11
       7FFC77DD4C4A mov       [r9],rbx
       7FFC77DD4C4D mov       r9,[rsp+30]
       7FFC77DD4C52 add       rdx,r10
       7FFC77DD4C55 cmp       rdx,r10
       7FFC77DD4C58 setb      r10b
       7FFC77DD4C5C movzx     r10d,r10b
       7FFC77DD4C60 cmp       rdx,0FFFFFFFFFFFFFFFF
       7FFC77DD4C64 mov       [rsp+38],r11
       7FFC77DD4C69 je        short M01_L07
       7FFC77DD4C6B add       r14,r10
       7FFC77DD4C6E jmp       near ptr M01_L01
M01_L05:
       7FFC77DD4C73 cmp       eax,edi
       7FFC77DD4C75 mov       rdx,r11
       7FFC77DD4C78 jl        near ptr M01_L00
M01_L06:
       7FFC77DD4C7E add       rsp,40
       7FFC77DD4C82 pop       rbx
       7FFC77DD4C83 pop       rbp
       7FFC77DD4C84 pop       rsi
       7FFC77DD4C85 pop       rdi
       7FFC77DD4C86 pop       r13
       7FFC77DD4C88 pop       r14
       7FFC77DD4C8A pop       r15
       7FFC77DD4C8C ret
M01_L07:
       7FFC77DD4C8D mov       r10,r9
       7FFC77DD4C90 jmp       near ptr M01_L04
M01_L08:
       7FFC77DD4C95 lea       edx,[r8+1]
       7FFC77DD4C99 movsxd    r10,edx
       7FFC77DD4C9C lea       r9,[rsp+20]
       7FFC77DD4CA1 mov       rdx,r11
       7FFC77DD4CA4 mulx      rdx,rbx,r10
       7FFC77DD4CA9 mov       [r9],rbx
       7FFC77DD4CAC mov       r10,[rsp+20]
       7FFC77DD4CB1 test      rdx,rdx
       7FFC77DD4CB4 jne       short M01_L05
       7FFC77DD4CB6 mov       r11,r10
       7FFC77DD4CB9 inc       r8d
       7FFC77DD4CBC cmp       r8d,edi
       7FFC77DD4CBF jl        short M01_L08
       7FFC77DD4CC1 jmp       short M01_L05
M01_L09:
       7FFC77DD4CC3 mov       ebp,edi
       7FFC77DD4CC5 xor       r14d,r14d
M01_L10:
       7FFC77DD4CC8 lea       ecx,[rdi-1]
       7FFC77DD4CCB cmp       r14d,ecx
       7FFC77DD4CCE jge       short M01_L06
       7FFC77DD4CD0 mov       rcx,rbx
       7FFC77DD4CD3 mov       edx,r14d
       7FFC77DD4CD6 mov       r8d,edi
       7FFC77DD4CD9 mov       rax,[rbx]
       7FFC77DD4CDC mov       rax,[rax+40]
       7FFC77DD4CE0 call      qword ptr [rax+30]
       7FFC77DD4CE3 cmp       eax,r14d
       7FFC77DD4CE6 je        short M01_L11
       7FFC77DD4CE8 mov       ecx,[rsi+r14*4]
       7FFC77DD4CEC cmp       eax,ebp
       7FFC77DD4CEE jae       short M01_L12
       7FFC77DD4CF0 mov       edx,eax
       7FFC77DD4CF2 mov       edx,[rsi+rdx*4]
       7FFC77DD4CF5 mov       [rsi+r14*4],edx
       7FFC77DD4CF9 mov       eax,eax
       7FFC77DD4CFB mov       [rsi+rax*4],ecx
M01_L11:
       7FFC77DD4CFE inc       r14d
       7FFC77DD4D01 jmp       short M01_L10
M01_L12:
       7FFC77DD4D03 call      CORINFO_HELP_RNGCHKFAIL
       7FFC77DD4D08 int       3
; Total bytes of code 553

// without Unsafe swap
// The jump to CORINFO_HELP_RNGCHKFAIL has increased to three places, indicating that the range check has not been removed by JIT.

.NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

; System.Tests.Perf_Random_Mine.shuffleUnseededInt10()
;         [Benchmark] public void shuffleUnseededInt10() => _randomUnseeded.Shuffle(_destinationInt.AsSpan(..10));
;                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       7FFC622C48D0 sub       rsp,38
       7FFC622C48D4 xor       eax,eax
       7FFC622C48D6 mov       [rsp+28],rax
       7FFC622C48DB mov       rax,[rcx+8]
       7FFC622C48DF mov       rdx,[rcx+18]
       7FFC622C48E3 test      rdx,rdx
       7FFC622C48E6 je        short M00_L01
       7FFC622C48E8 mov       ecx,[rdx+8]
       7FFC622C48EB cmp       ecx,0A
       7FFC622C48EE jb        short M00_L00
       7FFC622C48F0 add       rdx,10
       7FFC622C48F4 mov       [rsp+28],rdx
       7FFC622C48F9 mov       dword ptr [rsp+30],0A
       7FFC622C4901 lea       rdx,[rsp+28]
       7FFC622C4906 mov       rcx,rax
       7FFC622C4909 cmp       [rcx],ecx
       7FFC622C490B call      qword ptr [7FFC627379C0]; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
       7FFC622C4911 nop
       7FFC622C4912 add       rsp,38
       7FFC622C4916 ret
M00_L00:
       7FFC622C4917 call      qword ptr [7FFC6273D878]
       7FFC622C491D int       3
M00_L01:
       7FFC622C491E mov       ecx,2
       7FFC622C4923 call      qword ptr [7FFC6220FD38]
       7FFC622C4929 int       3
; Total bytes of code 90
; System.Random.Shuffle[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>)
;             if (_impl is XoshiroImpl xoshiro)
;             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 ulong bound = 2432902008176640000;        // 20!
;                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 int nextIndex = Math.Min(20, values.Length);
;                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                 for (int i = 1; i < values.Length;)
;                      ^^^^^^^^^
;                     ulong r = xoshiro.NextUInt64();
;                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                     ulong rbound = r * bound;
;                     ^^^^^^^^^^^^^^^^^^^^^^^^^
;                     if (rbound > 0 - bound)
;                     ^^^^^^^^^^^^^^^^^^^^^^^
;                             ulong r2 = xoshiro.NextUInt64();
;                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             sum = rbound + lohi;
;                             ^^^^^^^^^^^^^^^^^^^^
;                             carry = sum < rbound ? 1ul : 0ul;
;                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             rbound = lolo;
;                             ^^^^^^^^^^^^^^
;                         } while (sum == ~0ul);
;                           ^^^^^^^^^^^^^^^^^^^^
;                         r += carry;
;                         ^^^^^^^^^^^
;                     for (int m = i; m < nextIndex; m++)
;                          ^^^^^^^^^
;                         int index = (int)Math.BigMul(r, (ulong)(m + 1), out r);
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         var temp = values[m];
;                         ^^^^^^^^^^^^^^^^^^^^^
;                         values[m] = values[index];
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
;                         values[index] = temp;
;                         ^^^^^^^^^^^^^^^^^^^^^
;                     i = nextIndex;
;                     ^^^^^^^^^^^^^^
;                     bound = (ulong)(i + 1);
;                     ^^^^^^^^^^^^^^^^^^^^^^^
;                     for (nextIndex = i + 1; nextIndex < values.Length; nextIndex++)
;                          ^^^^^^^^^^^^^^^^^
;                         if (Math.BigMul(bound, (ulong)(nextIndex + 1), out var newbound) == 0)
;                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
;                             bound = newbound;
;                             ^^^^^^^^^^^^^^^^^
;                 int n = values.Length;
;                 ^^^^^^^^^^^^^^^^^^^^^^
;                 for (int i = 0; i < n - 1; i++)
;                      ^^^^^^^^^
;                     int j = Next(i, n);
;                     ^^^^^^^^^^^^^^^^^^^
;                     if (j != i)
;                     ^^^^^^^^^^^
;                         T temp = values[i];
;                         ^^^^^^^^^^^^^^^^^^^
;                         values[i] = values[j];
;                         ^^^^^^^^^^^^^^^^^^^^^^
;                         values[j] = temp;
;                         ^^^^^^^^^^^^^^^^^
       7FFC622C4AE0 push      r15
       7FFC622C4AE2 push      r14
       7FFC622C4AE4 push      r13
       7FFC622C4AE6 push      rdi
       7FFC622C4AE7 push      rsi
       7FFC622C4AE8 push      rbp
       7FFC622C4AE9 push      rbx
       7FFC622C4AEA sub       rsp,40
       7FFC622C4AEE mov       rbx,rcx
       7FFC622C4AF1 mov       rsi,[rdx]
       7FFC622C4AF4 mov       edi,[rdx+8]
       7FFC622C4AF7 mov       rcx,[rbx+8]
       7FFC622C4AFB test      rcx,rcx
       7FFC622C4AFE je        near ptr M01_L09
       7FFC622C4B04 mov       rdx,offset MT_System.Random+XoshiroImpl
       7FFC622C4B0E cmp       [rcx],rdx
       7FFC622C4B11 jne       near ptr M01_L09
       7FFC622C4B17 mov       rdx,21C3677C82B40000
       7FFC622C4B21 mov       r8d,14
       7FFC622C4B27 cmp       edi,14
       7FFC622C4B2A cmovl     r8d,edi
       7FFC622C4B2E mov       eax,1
       7FFC622C4B33 cmp       edi,1
       7FFC622C4B36 jle       near ptr M01_L06
M01_L00:
       7FFC622C4B3C mov       r10,[rcx+8]
       7FFC622C4B40 mov       r9,[rcx+10]
       7FFC622C4B44 mov       r11,[rcx+18]
       7FFC622C4B48 mov       rbx,[rcx+20]
       7FFC622C4B4C mov       rbp,r9
       7FFC622C4B4F shl       rbp,11
       7FFC622C4B53 xor       r11,r10
       7FFC622C4B56 xor       rbx,r9
       7FFC622C4B59 lea       r14,[r9+r9*4]
       7FFC622C4B5D rol       r14,7
       7FFC622C4B61 lea       r14,[r14+r14*8]
       7FFC622C4B65 xor       r9,r11
       7FFC622C4B68 xor       r10,rbx
       7FFC622C4B6B xor       r11,rbp
       7FFC622C4B6E rol       rbx,2D
       7FFC622C4B72 mov       [rcx+8],r10
       7FFC622C4B76 mov       [rcx+10],r9
       7FFC622C4B7A mov       [rcx+18],r11
       7FFC622C4B7E mov       [rcx+20],rbx
       7FFC622C4B82 mov       r10,r14
       7FFC622C4B85 imul      r10,rdx
       7FFC622C4B89 mov       [rsp+38],rdx
       7FFC622C4B8E mov       r9,rdx
       7FFC622C4B91 neg       r9
       7FFC622C4B94 cmp       r9,r10
       7FFC622C4B97 jb        short M01_L04
M01_L01:
       7FFC622C4B99 cmp       eax,r8d
       7FFC622C4B9C jge       short M01_L03
       7FFC622C4B9E xchg      ax,ax
M01_L02:
       7FFC622C4BA0 lea       r11d,[rax+1]
       7FFC622C4BA4 movsxd    r10,r11d
       7FFC622C4BA7 lea       r9,[rsp+28]
       7FFC622C4BAC mov       rdx,r14
       7FFC622C4BAF mulx      rdx,rbx,r10
       7FFC622C4BB4 mov       [r9],rbx
       7FFC622C4BB7 mov       r14,[rsp+28]
       7FFC622C4BBC cmp       eax,edi
       7FFC622C4BBE jae       near ptr M01_L12
       7FFC622C4BC4 mov       eax,eax
       7FFC622C4BC6 mov       r10d,[rsi+rax*4]
       7FFC622C4BCA cmp       edx,edi
       7FFC622C4BCC jae       near ptr M01_L12
       7FFC622C4BD2 mov       edx,edx
       7FFC622C4BD4 mov       r9d,[rsi+rdx*4]
       7FFC622C4BD8 mov       [rsi+rax*4],r9d
       7FFC622C4BDC mov       [rsi+rdx*4],r10d
       7FFC622C4BE0 mov       eax,r11d
       7FFC622C4BE3 cmp       eax,r8d
       7FFC622C4BE6 jl        short M01_L02
M01_L03:
       7FFC622C4BE8 mov       eax,r8d
       7FFC622C4BEB lea       r8d,[rax+1]
       7FFC622C4BEF movsxd    r11,r8d
       7FFC622C4BF2 cmp       r8d,edi
       7FFC622C4BF5 jge       near ptr M01_L05
       7FFC622C4BFB jmp       near ptr M01_L08
M01_L04:
       7FFC622C4C00 mov       r9,[rcx+8]
       7FFC622C4C04 mov       r11,[rcx+10]
       7FFC622C4C08 mov       rbx,[rcx+18]
       7FFC622C4C0C mov       rbp,[rcx+20]
       7FFC622C4C10 mov       r15,r11
       7FFC622C4C13 shl       r15,11
       7FFC622C4C17 xor       rbx,r9
       7FFC622C4C1A xor       rbp,r11
       7FFC622C4C1D lea       r13,[r11+r11*4]
       7FFC622C4C21 rol       r13,7
       7FFC622C4C25 lea       r13,[r13+r13*8]
       7FFC622C4C2A xor       r11,rbx
       7FFC622C4C2D xor       r9,rbp
       7FFC622C4C30 xor       rbx,r15
       7FFC622C4C33 rol       rbp,2D
       7FFC622C4C37 mov       [rcx+8],r9
       7FFC622C4C3B mov       [rcx+10],r11
       7FFC622C4C3F mov       [rcx+18],rbx
       7FFC622C4C43 mov       [rcx+20],rbp
       7FFC622C4C47 lea       r9,[rsp+30]
       7FFC622C4C4C mov       r11,[rsp+38]
       7FFC622C4C51 mov       rdx,r13
       7FFC622C4C54 mulx      rdx,rbx,r11
       7FFC622C4C59 mov       [r9],rbx
       7FFC622C4C5C mov       r9,[rsp+30]
       7FFC622C4C61 add       rdx,r10
       7FFC622C4C64 cmp       rdx,r10
       7FFC622C4C67 setb      r10b
       7FFC622C4C6B movzx     r10d,r10b
       7FFC622C4C6F cmp       rdx,0FFFFFFFFFFFFFFFF
       7FFC622C4C73 mov       [rsp+38],r11
       7FFC622C4C78 je        short M01_L07
       7FFC622C4C7A add       r14,r10
       7FFC622C4C7D jmp       near ptr M01_L01
M01_L05:
       7FFC622C4C82 cmp       eax,edi
       7FFC622C4C84 mov       rdx,r11
       7FFC622C4C87 jl        near ptr M01_L00
M01_L06:
       7FFC622C4C8D add       rsp,40
       7FFC622C4C91 pop       rbx
       7FFC622C4C92 pop       rbp
       7FFC622C4C93 pop       rsi
       7FFC622C4C94 pop       rdi
       7FFC622C4C95 pop       r13
       7FFC622C4C97 pop       r14
       7FFC622C4C99 pop       r15
       7FFC622C4C9B ret
M01_L07:
       7FFC622C4C9C mov       r10,r9
       7FFC622C4C9F jmp       near ptr M01_L04
M01_L08:
       7FFC622C4CA4 lea       edx,[r8+1]
       7FFC622C4CA8 movsxd    r10,edx
       7FFC622C4CAB lea       r9,[rsp+20]
       7FFC622C4CB0 mov       rdx,r11
       7FFC622C4CB3 mulx      rdx,rbx,r10
       7FFC622C4CB8 mov       [r9],rbx
       7FFC622C4CBB mov       r10,[rsp+20]
       7FFC622C4CC0 test      rdx,rdx
       7FFC622C4CC3 jne       short M01_L05
       7FFC622C4CC5 mov       r11,r10
       7FFC622C4CC8 inc       r8d
       7FFC622C4CCB cmp       r8d,edi
       7FFC622C4CCE jl        short M01_L08
       7FFC622C4CD0 jmp       short M01_L05
M01_L09:
       7FFC622C4CD2 mov       ebp,edi
       7FFC622C4CD4 xor       r14d,r14d
M01_L10:
       7FFC622C4CD7 lea       ecx,[rdi-1]
       7FFC622C4CDA cmp       r14d,ecx
       7FFC622C4CDD jge       short M01_L06
       7FFC622C4CDF mov       rcx,rbx
       7FFC622C4CE2 mov       edx,r14d
       7FFC622C4CE5 mov       r8d,edi
       7FFC622C4CE8 mov       rax,[rbx]
       7FFC622C4CEB mov       rax,[rax+40]
       7FFC622C4CEF call      qword ptr [rax+30]
       7FFC622C4CF2 cmp       eax,r14d
       7FFC622C4CF5 je        short M01_L11
       7FFC622C4CF7 mov       ecx,[rsi+r14*4]
       7FFC622C4CFB cmp       eax,ebp
       7FFC622C4CFD jae       short M01_L12
       7FFC622C4CFF mov       edx,eax
       7FFC622C4D01 mov       edx,[rsi+rdx*4]
       7FFC622C4D04 mov       [rsi+r14*4],edx
       7FFC622C4D08 mov       eax,eax
       7FFC622C4D0A mov       [rsi+rax*4],ecx
M01_L11:
       7FFC622C4D0D inc       r14d
       7FFC622C4D10 jmp       short M01_L10
M01_L12:
       7FFC622C4D12 call      CORINFO_HELP_RNGCHKFAIL
       7FFC622C4D17 int       3
; Total bytes of code 568

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andanteyk I guess the end results imply that the unsafe code is not worth it

@@ -288,17 +289,60 @@ public static string GetHexString(int stringLength, bool lowercase = false)
/// <typeparam name="T">The type of span.</typeparam>
public static void Shuffle<T>(Span<T> values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes for cryptographic RNG may subject to more careful review for security requirements.

@davidfowl davidfowl requested a review from Copilot January 1, 2025 15:30

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/libraries/System.Private.CoreLib/src/System/Random.cs:364

  • Ensure that the new shuffling logic introduced in this method is covered by tests.
public void Shuffle<T>(Span<T> values)
@bartonjs
Copy link
Member

bartonjs commented Jan 3, 2025

A few pieces of feedback, which I can't think of a good "next to some code" place to leave them:

  • The comments as to how the function works need to be in the code, not a pointer back to the PR.
  • If the if (rbound > 0 - bound) block is an alternative implementation of NextInt(min, max) (or NextInt(0, max)) then it should be pulled out into a separate function. Possibly replacing the implementation of the existing function that does that.
    • Even if it slows it down a smidgeon, the benefit to understanding the algorithm over "here's a pile of code that's fast" will be worth it... I can't imagine Shuffle is anyone's critical path, or that the overhead is all that high.
  • I don't understand where 20! comes into play. I thought that meant it was rolling 20d20 and only ever shuffling the first 20 elements, then shuffling the next 20 elements, etc.
  • I think if ((r * bound) > 0 - bound) doesn't care about (r * bound) overflowing, because it's trying to do r % bound without DIV, but since overflow is extremely possible here it should be addressed by a comment.
  • I think we kind of just trusted Fisher-Yates to be a good shuffle. For the CSPRNG based shuffle, if we want to adopt this (or any other "clever" method) we probably want to add some statistical sampling tests.

I copied this implementation locally and asked it to repeatedly shuffle arrays of a few chosen lengths, and asked where the original [0] element ended up, e.g.:

Span<int> values = Enumerable.Range(0, Size).Select(v => (int)v).ToArray();
Span<int> clone = values.ToArray();
int[] positions = new int[Size];
Dictionary<int, int> callCount = new();

Console.WriteLine($"Input: [0, {Size}). Trials: {TrialCount:#,###}");
Console.WriteLine();

for (int i = 0; i < TrialCount; i++)
{
    values.CopyTo(clone);
    BatchShuffle(clone, out int rngCalls);
    positions[clone.IndexOf(0)]++;

    ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(callCount, rngCalls, out _);
    count++;
}

Which gives me an answer like:

Input: [0, 190). Trials: 1,000,000

5312 5194 5245 5255 5222 5243 5125 5423 5244 5244 5091 5203 5357 5282 5110 5274
5272 5312 5198 5181 5180 5393 5305 5299 5229 5331 5249 5272 5276 5295 5374 5174
5266 5270 5220 5267 5340 5225 5235 5184 5212 5304 5272 5346 5298 5416 5226 5289
5205 5234 5269 5177 5275 5206 5271 5325 5281 5356 5322 5328 5213 5293 5233 5267
5201 5295 5315 5211 5359 5325 5140 5300 5387 5290 5244 5181 5229 5273 5303 5284
5415 5209 5273 5333 5260 5085 5271 5243 5210 5310 5264 5297 5200 5258 5261 5234
5261 5238 5303 5279 5134 5388 5346 5196 5257 5239 5200 5216 5383 5296 5283 5305
5261 5263 5187 5308 5110 5317 5228 5293 5115 5437 5127 5350 5319 5326 5294 5199
5407 5439 5227 5221 5249 5188 5310 5271 5314 5397 5324 5264 5308 5215 5243 5261
5325 5238 5251 5355 5336 5266 5251 5126 5262 5306 5240 5243 5211 5184 5350 5280
5347 5240 5259 5411 5257 5449 5156 5249 5245 5395 5132 5269 5342 5327 5289 5090
5092 5238 5273 5229 5182 5246 5164 5156 5284 5306 5195 5088 5219 5302

RNG Call Count Frequency:
20: 1103
21: 14815
22: 78180
23: 200330
24: 279408
25: 236334
26: 128415
27: 47144
28: 11941
29: 2031
30: 273
31: 24
32: 2

"All around 5000" looks uniformish, but basically this function is itself a very expensive PRNG, so doing a statistical analysis of the output would be goodness.

Since I was mainly concerned with "is it always in [0, 20) because of the 20!?" I didn't wire in any statistical testing.

int index = (int)Math.BigMul(r, (ulong)(m + 1), out r);

// Swap span[m] <-> span[index]
var temp = values[m];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var temp = values[m];
T temp = values[m];

@andanteyk
Copy link
Author

Thank you for your review.
I'll add comments to the source code later.

If the if (rbound > 0 - bound) block is an alternative implementation of NextInt(min, max) (or NextInt(0, max)) then it should be pulled out into a separate function.

That's slightly different from NextInt().

Here I'm certainly doing some preprocessing to make sure 0 <= rhi < bound, but it doesn't need the value of rhi itself; it actually needs r.

I don't understand where 20! comes into play.

The 20! comes from the fact that 20! < 2^64 < 21!.
It's there to shuffle the first 20 elements indeed. To shuffle n elements you need n! random numbers.

So the next operation will only process 13 elements instead of 20, because 21*22*...*33 = 33!/20! < 2^64.
Only the first 20! is embedded as a constant, but anything after that is calculated dynamically.

I think if ((r * bound) > 0 - bound) doesn't care about (r * bound) overflowing

That's correct because it allows for overflow, since it actually only needs the lowest 64 bits.

Mathematically that means $(r * bound) \bmod 2^{64} &gt; 2^{64} - bound$.
More details can be found in the Details section of this PR.

I think we kind of just trusted Fisher-Yates to be a good shuffle.

This algorithm is essentially the same as Fisher-Yates.
The improvement is that it allows random numbers to be generated for multiple indexes at once, rather than consuming one per index.

@andanteyk
Copy link
Author

I did some simple statistical testing.

I shuffled an array of consecutive numbers by RandomNumberGenerator.Shuffle() and counted where they moved.
bucket contains the count as [originalIndex * 256 + shuffledIndex]++;.
If the shuffle is performed evenly, the count in bucket should be pretty even.
(This is almost the same as the method suggested by bartonjs, but it takes statistics for all indexes.)

I also ran it on .NET 9.0 and found the results to be pretty consistent.
There were no major outliers.

.NET 9.0.0
elapsed: 00:43:40.3377860
avg = 390625, stdev = 624.923991447496, min = 387750 (99.26%), max = 393364 (100.70%)

.NET 10.0.0-alpha.1.24610.2
elapsed: 00:30:59.0007549
avg = 390625, stdev = 624.9221927154149, min = 388102 (99.35%), max = 393109 (100.64%)

This testing is going to take a very long time and I don't think it's appropriate to include it in the repository. What do you think?

using System.Diagnostics;
using System.Security.Cryptography;

Console.WriteLine($"{System.Runtime.InteropServices.RuntimeInformation.FrameworkDescription}");

var bucket = new int[256 * 256];
var source = new int[256];
long length = 100_000_000;
var stopwatch = Stopwatch.StartNew();

for (long i = 0; i < length; i++)
{
    for (int k = 0; k < source.Length; k++)
    {
        source[k] = k;
    }

    RandomNumberGenerator.Shuffle(source.AsSpan());

    for (int k = 0; k < source.Length; k++)
    {
        bucket[source[k] * 256 + k]++;
    }
}

stopwatch.Stop();
int min = int.MaxValue;
int max = 0;
double average = 0;
double stdev = 0;
for (int i = 0; i < bucket.Length; i++)
{
    min = Math.Min(min, bucket[i]);
    max = Math.Max(max, bucket[i]);
    average += bucket[i];
}

average /= bucket.Length;
for (int i = 0; i < bucket.Length; i++)
{
    stdev += (bucket[i] - average) * (bucket[i] - average);
}
stdev = Math.Sqrt(stdev / bucket.Length);

Console.WriteLine($"elapsed: {stopwatch.Elapsed}");
Console.WriteLine($"avg = {average}, stdev = {stdev}, min = {min} ({min / average:p}), max = {max} ({max / average:p})");

@bartonjs bartonjs added the cryptographic-docs-impact Issues impacting cryptographic docs. Cleared and reused after documentation is updated each release. label Jan 3, 2025
@bartonjs
Copy link
Member

bartonjs commented Jan 3, 2025

This testing is going to take a very long time and I don't think it's appropriate to include it in the repository. What do you think?

Having it in the test suite as a disabled test at least gives us a basis of "what sort of test should we run when we change Shuffle?".

It can be guarded (disabled) by something like

. It's already in the right assembly, even.

I'd include the stats numbers from 9 and this change as a comment, so that future run needs would know what to compare to.

Tests generally assert something rather than just print, so probably should assert that the stddev is below 700?

@bartonjs
Copy link
Member

bartonjs commented Jan 3, 2025

@richlander I'm not sure if the pedigree of this change has any impact on our Third Party Notices. What are your thoughts?

@richlander
Copy link
Member

A quick look suggests that the algorithm is from a paper referenced at swiftlang/swift#39143. We nee to validate the license asserted by that paper.

@bartonjs
Copy link
Member

bartonjs commented Jan 7, 2025

From the perspective of the CSPRNG... I think the answer is we don't want to accept this right now.

Before I get into the why, I want to say that this has been very interesting, and I have been completely nerd sniped... having done little else since seeing this PR (and letting it significantly infiltrate my "outside of work" time).

We've had a lot of "backroom chat" on this, but I think it comes down to

  • The algorithm from the Swift PR has unknown pedigree, which makes it hard to track in our paperwork.
  • While this does make Shuffle faster, it's not clear that there's anyone who actually has Shuffle as a performance bottleneck.
  • It's a lot of code for Shuffle
  • The benefit of batched random feels significantly higher for GetItems, but it wasn't proposed at the same time.
    • Of course, it would have been even more code, so even scarier...
  • Batched random itself feels like a useful feature.

So while we want to say "no", we're not saying that the work should be thrown away... rather we should heavily pivot (and the pivot probably means we should close the PR and come back to it).

The most useful piece would be to define some sort of batched random function. One such beast could be

public partial class RandomNumberGenerator
{
    public static void BatchedRandom(ReadOnlySpan<int> toExclusives, Span<int> destination)
    {
        if (toExclusives.Length != destination.Length)
            throw new SomeKindOfArgumentException();

        // if any of toExclusives are <= 0, throw something else.

        ReadOnlySpan<uint> exclusiveUints = MemoryMarshal.Cast<int, uint>(toExclusives);
        Span<uint> destinationUints = MemoryMarshal.Cast<int, uint>(destination);

        while (exclusiveUints.Length > 0)
        {
            int n = BatchedRandomCore(MemoryMarshal.toExclusives, destination);
            exclusiveUints= exclusiveUints.Slice(n);
            destinationUints = destinationUints.Slice(n);
        }
    }

    private static int BatchedRandomCore(ReadOnlySpan<uint> toExclusives, Span<uint> destination)
    {
        // Provides an implementation of "Batched ranged random integer generation"
        // by Brackett-Rozinsky and Lemire (https://doi.org/10.1002/spe.3369)
        //
        // This is inspired by Algorithm 3 (Batched dice rolls (unknown threshold)),
        // but it determines the maximum number of elements it can process in one batch,
        // and checks the suitability of r before calculating/committing any individual a[i] values.

        Debug.Assert(toExclusives.Length == destination.Length);
        int n;
        ulong bound = 1;

        // Determine how many elements we can process this call.
        // When this loop exits n is either the input length (process everything),
        // or how many items were processed before overflowing ulong (a partial batch).
        for (n = 0; n < toExclusives.Length; n++)
        {
            ulong carry = Math.BigMul(bound, toExclusives[n], out ulong newBound);

            if (carry != 0)
            {
                break;
            }

            bound = newBound;
        }

        Debug.Assert(n > 0);

        // Choose a random 64-bit value for r.  If `r * bound` is suitable, then r is suitable.
        // If it isn't, then choose another r and repeat.

        ulong r = 0;
        FillSpan(MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref r, 1)));
        ulong rBoundLower = unchecked(r * bound);

        if (rBoundLower < bound)
        {
            ulong t = (0 - bound) % bound;

            while (rBoundLower < t)
            {
                FillSpan(MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref r, 1)));
                rBoundLower = unchecked(r * bound);
            }
        }

        // r and r * bound are suitable.  Build up the individual a[i].

        for (int i = n - 1; i >= 0; --i)
        {
            destination[i] = Math.BigMul(r, bounds[i], out r);
        }

        // We ended up at the same rBoundLower value.  Math works, hooray.
        Debug.Assert(r == rBoundLower);
        return n;
    }
}

Given that function, we could easily write GetItems and Shuffle in terms of it.

Of course, the function itself has questions:

  • Do we want to do the paper's Multiply and reject, or the more easily understood algorithm of "choose a bounded integer, now do repeated DivRem"? In my tests the DivRem approach took twice as long as the paper's multiply-and-reject; but was still 7 times faster than calling GetInt32 for each call (on [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]).
    • Given that the paper exists, we can probably go with the multiply approach.
  • Shuffle is basically the only algorithm that wants heterogeneous toExclusive, so there's probably a desire to have a simpler overload: public static void BatchedRandom(Span<int> destination, int toExclusive) (and another overload specifying lower bounds?)
    • GetItems would definitely prefer to call BatchedRandom(indices, choices.Length)
    • Some might argue that if Shuffle is the only need for heterogeneous that we shouldn't bother making it public API... and also shouldn't bother changing Shuffle to use batching.
  • The worker functions are easier when written in terms of unsigned numbers. Do we want to overload for both int and uint?
  • Should the functions write to all destination slots, or should we expose instead the driver that says how much it did in one batch?
  • What do we want to call the function? (BatchedRandom? FillInt32s? FillInt32?)
  • Do we want it to do something different on a 32-bit CPU?

Since we're now talking about introducing public API, it needs to go through the API Review process.

Along the way a few other useful functions have suggested themselves:

  • GetUInt64() // replace the FillSpan(MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref r, 1))) with something cleaner.
  • GetUInt64(ulong toExclusive) // If we went with the DivRem approach to batching this would be useful.
  • Probably 32-bit versions of the above.
  • Maybe signed-64, since we'd then have Unsigned 64, Unsigned 32, and Signed 32... and 3 out of 4 suggests squaring the circle.

Maybe those go in the same API proposal. Maybe different.

That was all, of course, from the CSPRNG perspective. Since the crypto team is also who added Shuffle and GetItems to System.Random, we might be the de facto owners there... so then similar arguments apply. However, the speed of Xoshiro is so high compared to the CSPRNG that it's not clear that batching as public API is as important. What is important is that I don't believe, at this time, that we're comfortable using the "swift PR" algorithm (even though Chi-Square says it's valid).

@andanteyk
Copy link
Author

Thank you for your review.

The algorithm from the Swift PR has unknown pedigree, which makes it hard to track in our paperwork.

It is also possible to implement it using the same Lemire style as the current Next(max) instead of the "Swift PR" style.
Performance may change slightly, but it should still be faster than a naive implementation.

// Change the code to correct `r` to the following:
if (rbound < bound)
{
    ulong mod = (0 - bound) % bound;

    while (rbound < mod)
    {
        r = NextUInt64();
        rbound = r * bound;
    }
}
Method Mean Error StdDev
shuffleUnseededInt10_Old 35.923 ns 0.4722 ns 0.4417 ns
shuffleUnseededInt10_New 13.135 ns 0.1213 ns 0.1135 ns
shuffleUnseededInt1000_Old 1,892.877 ns 15.9320 ns 14.1233 ns
shuffleUnseededInt1000_New 1,575.102 ns 13.2296 ns 11.7277 ns
shuffleCryptoInt10_Old 591.240 ns 5.2520 ns 4.9127 ns
shuffleCryptoInt10_New 64.785 ns 0.2776 ns 0.2461 ns
shuffleCryptoInt1000_Old 69,950.397 ns 552.8761 ns 461.6766 ns
shuffleCryptoInt1000_New 9,048.014 ns 68.4062 ns 63.9872 ns

(Old is the current .NET implementation, New is the implementation after it was changed to Lemire style.)

If your biggest concern is with "Swift PR", that can be resolved.

While this does make Shuffle faster, it's not clear that there's anyone who actually has Shuffle as a performance bottleneck.

Faster is better, isn't it? :)

The benefit of batched random feels significantly higher for GetItems, but it wasn't proposed at the same time.

If I were to make (another) PR to speed up GetItems using the same technique, would it be likely to be accepted?

@bartonjs
Copy link
Member

bartonjs commented Jan 8, 2025

Faster is better, isn't it? :)

In security code, simpler is generally better (algorithms may be complicated, but the implementation of the algorithm should start simple and evolve as needed). The fastest of "simple enough" approaches is often what is chosen, rather than necessarily "the fastest", except for things that truly end up being the bottleneck (for example, BitLocker uses AES, so the throughput of AES might actually be the rate limiting factor on reading from a fast drive under moderate CPU load).

Shuffle or GetItems having inline batched randomization is not simple, but making them in terms of existing batched random functionality is.

If I were to make (another) PR to speed up GetItems using the same technique, would it be likely to be accepted?

For the CSPRNG, the answer is the same as with Shuffle... if we stop and make batched random a feature in and of itself first, sure. But without evidence that 69ns/item is too slow, there's not really justification for making the code more complex.

@bartonjs
Copy link
Member

bartonjs commented Jan 8, 2025

I've started a proposal at #111211.

@MihaZupan
Copy link
Member

MihaZupan commented Jan 8, 2025

RandomNumberGenerator tends to have (relatively) very high per-call overhead.
Making up numbers here, but if we were to add e.g. TLS caching that fetched a KB of random data from the underlying generator at a time, making calls to RNG.Fill(8 bytes)/RNG.GetInt32 10x cheaper, would we still care to use smarter algorithms like these? (discussed in #13628 as well)

Same question would apply to the batched API -- if the underlying implementation for getting a single value was that much cheaper, would we care to expose batched APIs?

@vcsjones
Copy link
Member

vcsjones commented Jan 8, 2025

but if we were to add e.g. TLS caching

I'm a little weary of that because I don't know if that has compliance implications. Apple's RNG, for example, does have a cache - but only if you ask for 12 bytes or less. They bypass the cache if you ask for more, ostensibly for FIPS reasons. See #106525 (comment) for why.

@andanteyk
Copy link
Author

Understood. Once again, thank you for your review.

@andanteyk andanteyk closed this Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Security community-contribution Indicates that the PR has been added by a community member cryptographic-docs-impact Issues impacting cryptographic docs. Cleared and reused after documentation is updated each release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants