-
-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduced intermediate allocations #2415
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes for review.
@@ -1470,7 +1474,7 @@ private int ReadImageHeaders(BufferedReadStream stream, out bool inverted, out b | |||
{ | |||
// Usually the color palette is 1024 byte (256 colors * 4), but the documentation does not mention a size limit. | |||
// Make sure, that we will not read pass the bitmap offset (starting position of image data). | |||
if ((stream.Position + colorMapSizeBytes) > this.fileHeader.Offset) | |||
if (stream.Position > this.fileHeader.Offset - colorMapSizeBytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just because I'm paranoid of overflow 😉.
@@ -22,7 +22,7 @@ internal sealed class GifDecoderCore : IImageDecoderInternals | |||
/// <summary> | |||
/// The temp buffer used to reduce allocations. | |||
/// </summary> | |||
private readonly byte[] buffer = new byte[16]; | |||
private ScratchBuffer buffer; // mutable struct, don't make readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ScratchBuffer
also has 16 bytes and is embedded into this class. It saves the additional object the GC needs to track + the overhead of object / array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we use scratch buffers in a few places.
private unsafe struct ScratchBuffer | ||
{ | ||
private fixed byte scratch[16]; | ||
|
||
public Span<byte> Span => MemoryMarshal.CreateSpan(ref this.scratch[0], 16); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can see what the C# compiler generates.
ref this.scratch[0]
gets translated to ref scratch.FixedElementField
so a normal reference to a field.
buffer[1] = GifConstants.CommentLabel; | ||
buffer[0] = GifConstants.ExtensionIntroducer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll write below a comment on why these are flipped.
|
||
this.outputStream.Write(buffer, 0, 4); | ||
|
||
// We write the highest index first, to have only one bound check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sharplab showcases the difference between "normal" (0-13) or the order used here.
Note the bound check before every access in Foo
, and only one bound check in Bar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's a few places we've done that. 👍
@@ -61,16 +61,13 @@ public override void Decode(ReadOnlySpan<byte> data, Buffer2D<TPixel> pixels, in | |||
{ | |||
for (int x = 0; x < pixelRow.Length; x++) | |||
{ | |||
data.Slice(offset, 4).CopyTo(buffer); | |||
float r = BitConverter.ToSingle(buffer, 0); | |||
float r = BitConverter.ToSingle(data.Slice(offset, 4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intermediate copy into the buffer can be avoided here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh gosh yeah! That was a bad one.
@@ -23,7 +24,7 @@ internal abstract class BitWriterBase | |||
/// <summary> | |||
/// A scratch buffer to reduce allocations. | |||
/// </summary> | |||
private readonly byte[] scratchBuffer = new byte[4]; | |||
private ScratchBuffer scratchBuffer; // mutable struct, don't make readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried hard to avoid these fields, and use solely stackalloc.
But sometimes it's very hard, and passing the buffer into every method, potentially across types, is cumbersome. Thus these ScratchBuffer
-types are used.
huffTree[i] = default; | ||
} | ||
Span<ushort> histogramSymbols = histogramImageXySize <= 64 ? stackalloc ushort[histogramImageXySize] : new ushort[histogramImageXySize]; | ||
Span<HuffmanTree> huffTree = stackalloc HuffmanTree[3 * WebpConstants.CodeLengthCodes]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a rather big stack allocation, but it should be fine.
That's why the unit-tests were run with only 256 kB stack size locally.
{ | ||
codeLengths[i] = 0; | ||
} | ||
codeLengths.AsSpan(0, alphabetSize).Clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Span<T>.Clear
is a method call, so I wasn't sure if this is a win over the manual (inline) loop.
Thus a benchmark run with .NET 6 yields:
| Method | AlphabetSize | Mean | Error | StdDev | Ratio | RatioSD |
|----------- |------------- |----------:|---------:|---------:|------:|--------:|
| Default | 40 | 21.36 ns | 0.457 ns | 0.764 ns | 1.00 | 0.00 |
| Span_Clear | 40 | 15.73 ns | 0.338 ns | 0.474 ns | 0.73 | 0.03 |
| | | | | | | |
| Default | 256 | 118.73 ns | 2.409 ns | 4.642 ns | 1.00 | 0.00 |
| Span_Clear | 256 | 26.19 ns | 0.532 ns | 0.711 ns | 0.22 | 0.01 |
Benchmark code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Bench>();
public class Bench
{
private readonly int[] _codeLengths = new int[296];
[Params(40, 256)]
public int AlphabetSize { get; set; }
[Benchmark(Baseline = true)]
public void Default()
{
int alphabetSize = this.AlphabetSize;
int[] codeLengths = _codeLengths;
for (int i = 0; i < alphabetSize; ++i)
{
codeLengths[i] = 0;
}
}
[Benchmark]
public void Span_Clear()
{
_codeLengths.AsSpan(0, this.AlphabetSize).Clear();
}
}
protected ulong ReadUInt64() => | ||
this.TryReadSpan(this.buf8) | ||
? this.ConvertToUInt64(this.buf8) | ||
protected ulong ReadUInt64() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method won't inline, so no problem to have the stackalloc
here.
In recent JITs sometimes stackalloc won't hinder inlining anyway.
Please wait a moment with review -- I'd like to validate / improve something. |
It's only one value for the fixed buffer size and the creation of the span -- so less error prone once the value needs to be updated.
Everything is fine. My fear was that the scratch-buffer, backed by the fixed buffer, e.g.
TestA passed the _buffer as value, so copying needs to be done (in the dasm the first call is the call to "memory copy" native routine). If the buffer needs to passed around, then it should be done via ref as shown in TestB .
Both don't apply here (also checked other uses of fixed buffers across the code-base). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of learnings here.
It's really great to see in many instances a modernization of the codebase.
Prerequisites
Description
stackalloc
ated memoryTests are run locally also with only 256 kB of stack-size (like (older) IIS) has -- via
DOTNET_DefaultStackSize=0x3e800
.