-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eliminate ActorPath.ToSerializationFormat
UID allocations
#6195
Conversation
Used some more `Span<char>` magic to avoid additional allocations when string-ifying `ActorPath` components.
src/core/Akka/Util/SpanHacks.cs
Outdated
i = Math.Abs(i); | ||
|
||
// count sig figs | ||
while (i > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would an if or Jump based table based on length potentially be better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might very well be - I haven't benchmarked these methods themselves, only in aggregate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, looking at .NET, one thing that we -can- do here, is something like:
while (i > 100)
{
i = i / 100;
startLen +=2;
}
if (i > 10)
{
startLen++;
}
return startLen +1; //Could alternatively 'preload' the first character in the negative ternary.
The advantage is that division happens half as often, even if the ASM -may- be a little bit larger... but we can elide elsewhere. :)
src/core/Akka/Util/SpanHacks.cs
Outdated
|
||
while (i > 0) | ||
{ | ||
span[startPos + targetLength - index++ - 1] = Numbers[i % 10]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we need startPos
here? i.e. would it be better to have callers .Slice()
the input span to the correct starting position?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Question: would it be better to do = (i%10)+'0'
here? I guess it depends on whether you'd rather throw (slower but potentially safer) or overflow (faster but potentially unsafer.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a dedicated benchmark for this function so we can measure it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we need startPos here? i.e. would it be better to have callers .Slice() the input span to the correct starting position?
I don't think it hurts anything - I just didn't wanted all changes to be made on the same input Span
@to11mtm think this is worth picking back up again? It shows up in my profiler for sure but doesn't make much of an impact on the benchmarks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some food for thought here.
Also, here is a sharplab exampling some of the ASM improvements possible with minimal lift.
|
||
// UID calculation | ||
var uidSizeHint = 0; | ||
if (uid != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it makes more sense to do something where we can avoid the branch in AppendUidSpan
.
Ideas that come to mind:
- Pad the allocated span by uint maxvalue+1 (21), that would let us just do the padding call within the if block.
- Move this alloc stuff into a function that returns the span of proper length, either with UID filled or unfilled (and do the UID calc there)
src/core/Akka/Util/SpanHacks.cs
Outdated
i = Math.Abs(i); | ||
|
||
// count sig figs | ||
while (i > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, looking at .NET, one thing that we -can- do here, is something like:
while (i > 100)
{
i = i / 100;
startLen +=2;
}
if (i > 10)
{
startLen++;
}
return startLen +1; //Could alternatively 'preload' the first character in the negative ternary.
The advantage is that division happens half as often, even if the ASM -may- be a little bit larger... but we can elide elsewhere. :)
I think it's worth picking back up, but it needs to get done well enough to show a benefit. The PR comments I added -should- help to some extent (especially the sharplab examples, they are less branchy/loopy) The other major hurdle to perf, is the way we are setting in the Span, is resulting in extra bounds checks and throws when we ('may') get out of bounds. I'm not 100% certain of the best way to handle that, but there -should- be a way to do so. It's mostly tricky because we are going backwards on top of potentially getting an span of arbitrary size (That may be either too small, or larger.) |
@to11mtm ok, I'll get to work on this |
Benchmarks for the BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
|
Using some simplified math: public static int Int64SizeInCharacters(long i)
{
// still need 1 char to represent '0'
if(i == 0) return 1;
// account for negative characters
int padding = 1;
if (i < 0)
{
i *= -1;
padding = 2;
}
// count sig figs
return (int)Math.Log10(i) + padding;
} BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
|
Implemented a jump table and I'm pretty happy with the results @to11mtm - the largest UID today is /// <summary>
/// How many characters do we need to represent this int as a string?
/// </summary>
/// <param name="i">The int.</param>
/// <returns>Character length.</returns>
public static int Int64SizeInCharacters(long i)
{
// account for negative characters
var padding = 0;
if (i < 0)
{
i *= -1;
padding = 1;
}
switch (i)
{
case 0:
return 1;
case < 10:
return 2 + padding;
case < 100:
return 3 + padding;
case < 1000:
return 4 + padding;
case < 10000:
return 5 + padding;
case < 100000:
return 6 + padding;
case < 1_000_000:
return 7 + padding;
case < 10_000_000:
return 8 + padding;
case < 100_000_000:
return 9 + padding;
case < 1_000_000_000:
return 10 + padding;
case < 10_000_000_000:
return 11 + padding;
case 100_000_000_000:
return 12 + padding;
default:
return (int)Math.Log10(i) + 1 + padding;
}
} This required upgrading Akka.NET to support C#9, FYI BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
|
You could probably make this even more efficient doing a binary-search style operation but it wouldn't save you much IMHO. |
Pretty sure my last changes broke the UID function (I think it's an N+1 error), so I'll fix that. In the meantime - added some benchies for the formatting function itself: BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
|
Implemented some of @to11mtm 's suggestions and sped up the formatting function itself: BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
|
|
Method | Mean | Error | StdDev | Gen 0 | Allocated |
---|---|---|---|---|---|
ActorPath_Parse | 224.320 ns | 1.3475 ns | 1.1946 ns | 0.1109 | 464 B |
ActorPath_Concat | 39.405 ns | 0.6358 ns | 0.5636 ns | 0.0268 | 112 B |
ActorPath_Equals | 4.267 ns | 0.0182 ns | 0.0161 ns | - | - |
ActorPath_ToString | 49.135 ns | 0.6010 ns | 0.5622 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 57.184 ns | 0.6982 ns | 0.6531 ns | 0.0267 | 112 B |
ActorPath_ToSerializationFormatWithAddress | 64.348 ns | 0.2643 ns | 0.2472 ns | 0.0267 | 112 B |
v1.4
Branch
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
Method | Mean | Error | StdDev | Gen 0 | Allocated |
---|---|---|---|---|---|
ActorPath_Parse | 227.059 ns | 4.3677 ns | 4.0856 ns | 0.1106 | 464 B |
ActorPath_Concat | 39.632 ns | 0.6738 ns | 0.6617 ns | 0.0268 | 112 B |
ActorPath_Equals | 4.318 ns | 0.0544 ns | 0.0425 ns | - | - |
ActorPath_ToString | 48.681 ns | 1.0389 ns | 2.1456 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 60.018 ns | 1.1744 ns | 1.0985 ns | 0.0267 | 112 B |
ActorPath_ToSerializationFormatWithAddress | 57.864 ns | 0.7122 ns | 0.6314 ns | 0.0267 | 112 B |
Looks like the current implementation is winning out on perf - or it's all in the margin of error
While working on akkadotnet#6195 I realized that none of those `ActorPath`s actually have a set UID, thus we're missing that entire facet from both the parsing and serialization benchmarks.
While working on #6195 I realized that none of those `ActorPath`s actually have a set UID, thus we're missing that entire facet from both the parsing and serialization benchmarks.
|
Method | Uid | Mean | Error | StdDev | Gen 0 | Allocated |
---|---|---|---|---|---|---|
ActorPath_Parse | 1 | 286.090 ns | 5.4953 ns | 5.8799 ns | 0.0992 | 416 B |
ActorPath_Concat | 1 | 39.046 ns | 0.8300 ns | 0.8151 ns | 0.0268 | 112 B |
ActorPath_Equals | 1 | 4.276 ns | 0.0173 ns | 0.0135 ns | - | - |
ActorPath_ToString | 1 | 56.383 ns | 0.7110 ns | 0.6650 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 1 | 68.930 ns | 0.9070 ns | 0.8040 ns | 0.0286 | 120 B |
ActorPath_ToSerializationFormatWithAddress | 1 | 75.431 ns | 0.8508 ns | 0.7958 ns | 0.0286 | 120 B |
ActorPath_Parse | 100000 | 281.287 ns | 2.4783 ns | 2.0695 ns | 0.0992 | 416 B |
ActorPath_Concat | 100000 | 38.671 ns | 0.1395 ns | 0.1089 ns | 0.0268 | 112 B |
ActorPath_Equals | 100000 | 4.267 ns | 0.0162 ns | 0.0126 ns | - | - |
ActorPath_ToString | 100000 | 57.425 ns | 0.7558 ns | 0.6700 ns | 0.0267 | 112 B |
ActorPath_ToSerializationFormat | 100000 | 87.786 ns | 0.2636 ns | 0.2058 ns | 0.0305 | 128 B |
ActorPath_ToSerializationFormatWithAddress | 100000 | 84.566 ns | 0.7803 ns | 0.7299 ns | 0.0305 | 128 B |
ActorPath_Parse | 2147483647 | 295.088 ns | 2.3138 ns | 1.9321 ns | 0.0992 | 416 B |
ActorPath_Concat | 2147483647 | 39.235 ns | 0.5909 ns | 0.5527 ns | 0.0268 | 112 B |
ActorPath_Equals | 2147483647 | 4.290 ns | 0.0326 ns | 0.0305 ns | - | - |
ActorPath_ToString | 2147483647 | 56.219 ns | 0.9611 ns | 0.8990 ns | 0.0267 | 112 B |
ActorPath_ToSerializationFormat | 2147483647 | 108.536 ns | 2.2346 ns | 3.2049 ns | 0.0324 | 136 B |
ActorPath_ToSerializationFormatWithAddress | 2147483647 | 93.275 ns | 1.7111 ns | 1.6006 ns | 0.0324 | 136 B |
v1.4
Branch
Latest v1.4
Benchmarks
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
Method | Uid | Mean | Error | StdDev | Gen 0 | Allocated |
---|---|---|---|---|---|---|
ActorPath_Parse | 1 | 268.316 ns | 1.5227 ns | 1.2715 ns | 0.0992 | 416 B |
ActorPath_Concat | 1 | 38.400 ns | 0.1241 ns | 0.1036 ns | 0.0268 | 112 B |
ActorPath_Equals | 1 | 4.377 ns | 0.0361 ns | 0.0320 ns | - | - |
ActorPath_ToString | 1 | 50.872 ns | 0.1519 ns | 0.1269 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 1 | 168.876 ns | 1.6593 ns | 1.4709 ns | 0.0610 | 256 B |
ActorPath_ToSerializationFormatWithAddress | 1 | 168.242 ns | 2.5289 ns | 2.1117 ns | 0.0610 | 256 B |
ActorPath_Parse | 100000 | 281.081 ns | 1.4482 ns | 1.2838 ns | 0.0992 | 416 B |
ActorPath_Concat | 100000 | 38.480 ns | 0.1676 ns | 0.1309 ns | 0.0268 | 112 B |
ActorPath_Equals | 100000 | 4.377 ns | 0.0479 ns | 0.0425 ns | - | - |
ActorPath_ToString | 100000 | 50.507 ns | 0.1441 ns | 0.1203 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 100000 | 180.087 ns | 0.3269 ns | 0.2552 ns | 0.0629 | 264 B |
ActorPath_ToSerializationFormatWithAddress | 100000 | 185.135 ns | 3.4828 ns | 3.0874 ns | 0.0629 | 264 B |
ActorPath_Parse | 2147483647 | 300.098 ns | 3.0413 ns | 2.6960 ns | 0.0992 | 416 B |
ActorPath_Concat | 2147483647 | 39.325 ns | 0.8076 ns | 0.7159 ns | 0.0268 | 112 B |
ActorPath_Equals | 2147483647 | 4.360 ns | 0.0337 ns | 0.0315 ns | - | - |
ActorPath_ToString | 2147483647 | 51.842 ns | 1.0444 ns | 1.0725 ns | 0.0268 | 112 B |
ActorPath_ToSerializationFormat | 2147483647 | 193.560 ns | 2.6633 ns | 2.2240 ns | 0.0648 | 272 B |
ActorPath_ToSerializationFormatWithAddress | 2147483647 | 194.647 ns | 3.9673 ns | 6.1767 ns | 0.0648 | 272 B |
This looks like success to me - going to merge it into v1.4. |
|
While working on akkadotnet#6195 I realized that none of those `ActorPath`s actually have a set UID, thus we're missing that entire facet from both the parsing and serialization benchmarks.
…et#6195) * eliminate `ActorPath.ToSerializationFormat` UID allocations Used some more `Span<char>` magic to avoid additional allocations when string-ifying `ActorPath` components. * adding `SpanHacks` benchmarks * sped up `Int64SizeInCharacters` * added `TryFormat` benchmarks * fixed n+1 error in jump table * cleaned up `TryFormat` inside `SpanHacks` * fixed `SpanHacks` index calculation * removed BDN results * Update SpanHacks.cs
* cleaned up duplicate System.Collections.Immutable package reference (#6264) also standardized all System.* packages on a common version * converted build system to .NET 7.0 (#6263) * converted build system to .NET 7.0 * upgrade to Incrementalist.Cmd v0.8.0 * upgraded MNTR to support .NET 7.0 * fixed build system to target .NET 7.0 * upgrade to latest version of DocFx * add .NET 6 SDK back to build system * fixed HyperionConfigTests * Akka.Streams: `ReuseLatest` stage to repeatedly emit the most recent value until a newer one is pushed (#6262) * code cleanup in Akka.Streams `Attributes` * added `RepeatPrevious{T}` stage * WIP - debugging `RepeatPreviousSpecs` * fixed tests and added documentation * fixed documentation * API approvals * fixed markdown linting * removed `SwapPrevious<T>` delegate. * renamed stage from `RepeatPrevious` to `ReuseLatest` * remove BDN results * added real UID to `ActorPathBenchmarks` (#6276) While working on #6195 I realized that none of those `ActorPath`s actually have a set UID, thus we're missing that entire facet from both the parsing and serialization benchmarks. * Enable dynamic PGO for RemotePingPong and PingPong (#6277) * eliminate `ActorPath.ToSerializationFormat` UID allocations (#6195) * eliminate `ActorPath.ToSerializationFormat` UID allocations Used some more `Span<char>` magic to avoid additional allocations when string-ifying `ActorPath` components. * adding `SpanHacks` benchmarks * sped up `Int64SizeInCharacters` * added `TryFormat` benchmarks * fixed n+1 error in jump table * cleaned up `TryFormat` inside `SpanHacks` * fixed `SpanHacks` index calculation * removed BDN results * Update SpanHacks.cs * compilation fixes and V1.5 api approval
Changes
Used some more
Span<char>
magic to avoid additional allocations when string-ifyingActorPath
components. Large source of allocations during outbound serialization.Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
Latest
dev
BenchmarksUsing RemotePingPong.
NET Core 3.1
OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112
Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 160129, 1249.60, 112, 134
5, 1000000, 315160, 3173.45, 144, 160
10, 2000000, 330088, 6059.34, 168, 168
15, 3000000, 326727, 9182.70, 176, 159
20, 4000000, 329327, 12146.22, 168, 153
25, 5000000, 331544, 15081.96, 161, 142
30, 6000000, 320582, 18716.53, 150, 137
NET 6.0
OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112
Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 169349, 1181.20, 112, 138
5, 1000000, 338181, 2957.13, 148, 161
10, 2000000, 357016, 5602.10, 169, 169
15, 3000000, 357484, 8392.14, 177, 169
20, 4000000, 282307, 14169.69, 177, 152
25, 5000000, 252922, 19769.79, 161, 139
30, 6000000, 332153, 18064.79, 149, 140
This PR's Benchmarks
NET Core 3.1
OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112
Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 167365, 1195.77, 112, 134
5, 1000000, 314565, 3179.28, 149, 166
10, 2000000, 328031, 6097.24, 174, 174
15, 3000000, 320890, 9349.82, 182, 165
20, 4000000, 322763, 12393.38, 174, 157
25, 5000000, 316196, 15813.78, 166, 141
30, 6000000, 313677, 19128.38, 150, 136
NET 6.0
OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112
Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 187091, 1069.74, 112, 137
5, 1000000, 345424, 2895.43, 146, 166
10, 2000000, 351371, 5692.77, 174, 174
15, 3000000, 345265, 8689.21, 182, 173
20, 4000000, 260095, 15379.95, 182, 147
25, 5000000, 280521, 17824.88, 155, 143
30, 6000000, 282526, 21237.83, 151, 140
Benchmarks (
ActorPath
in Benchmark.NET)These are all on .NET 6
dev
This PR
The allocations definitely disappeared from my profiler in Rider, but the total benchmark impact is negligible. Probably going to push this change to v1.4.45.