Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduced intermediate allocations #2415

Merged
merged 12 commits into from
Mar 29, 2023
Merged

Conversation

gfoidl
Copy link
Contributor

@gfoidl gfoidl commented Mar 25, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

  • intermediate allocations are reduced -- mostly by using stackallocated memory
  • scratch buffers, that were fields of a class, are either removed or substituted by fixed sized buffers, thus reducing the amount of bytes allocated a bit (no additional object headers for the array), and reducing the total number of object the GC needs to track

Tests are run locally also with only 256 kB of stack-size (like (older) IIS) has -- via DOTNET_DefaultStackSize=0x3e800.

Copy link
Contributor Author

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes for review.

@@ -1470,7 +1474,7 @@ private int ReadImageHeaders(BufferedReadStream stream, out bool inverted, out b
{
// Usually the color palette is 1024 byte (256 colors * 4), but the documentation does not mention a size limit.
// Make sure, that we will not read pass the bitmap offset (starting position of image data).
if ((stream.Position + colorMapSizeBytes) > this.fileHeader.Offset)
if (stream.Position > this.fileHeader.Offset - colorMapSizeBytes)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just because I'm paranoid of overflow 😉.

@@ -22,7 +22,7 @@ internal sealed class GifDecoderCore : IImageDecoderInternals
/// <summary>
/// The temp buffer used to reduce allocations.
/// </summary>
private readonly byte[] buffer = new byte[16];
private ScratchBuffer buffer; // mutable struct, don't make readonly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ScratchBuffer also has 16 bytes and is embedded into this class. It saves the additional object the GC needs to track + the overhead of object / array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use scratch buffers in a few places.

Comment on lines 766 to 771
private unsafe struct ScratchBuffer
{
private fixed byte scratch[16];

public Span<byte> Span => MemoryMarshal.CreateSpan(ref this.scratch[0], 16);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can see what the C# compiler generates.

ref this.scratch[0] gets translated to ref scratch.FixedElementField so a normal reference to a field.

Comment on lines +369 to +370
buffer[1] = GifConstants.CommentLabel;
buffer[0] = GifConstants.ExtensionIntroducer;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write below a comment on why these are flipped.


this.outputStream.Write(buffer, 0, 4);

// We write the highest index first, to have only one bound check.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sharplab showcases the difference between "normal" (0-13) or the order used here.

Note the bound check before every access in Foo, and only one bound check in Bar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's a few places we've done that. 👍

@@ -61,16 +61,13 @@ public override void Decode(ReadOnlySpan<byte> data, Buffer2D<TPixel> pixels, in
{
for (int x = 0; x < pixelRow.Length; x++)
{
data.Slice(offset, 4).CopyTo(buffer);
float r = BitConverter.ToSingle(buffer, 0);
float r = BitConverter.ToSingle(data.Slice(offset, 4));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intermediate copy into the buffer can be avoided here too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh gosh yeah! That was a bad one.

@@ -23,7 +24,7 @@ internal abstract class BitWriterBase
/// <summary>
/// A scratch buffer to reduce allocations.
/// </summary>
private readonly byte[] scratchBuffer = new byte[4];
private ScratchBuffer scratchBuffer; // mutable struct, don't make readonly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried hard to avoid these fields, and use solely stackalloc.
But sometimes it's very hard, and passing the buffer into every method, potentially across types, is cumbersome. Thus these ScratchBuffer-types are used.

huffTree[i] = default;
}
Span<ushort> histogramSymbols = histogramImageXySize <= 64 ? stackalloc ushort[histogramImageXySize] : new ushort[histogramImageXySize];
Span<HuffmanTree> huffTree = stackalloc HuffmanTree[3 * WebpConstants.CodeLengthCodes];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rather big stack allocation, but it should be fine.
That's why the unit-tests were run with only 256 kB stack size locally.

{
codeLengths[i] = 0;
}
codeLengths.AsSpan(0, alphabetSize).Clear();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Span<T>.Clear is a method call, so I wasn't sure if this is a win over the manual (inline) loop.
Thus a benchmark run with .NET 6 yields:

|     Method | AlphabetSize |      Mean |    Error |   StdDev | Ratio | RatioSD |
|----------- |------------- |----------:|---------:|---------:|------:|--------:|
|    Default |           40 |  21.36 ns | 0.457 ns | 0.764 ns |  1.00 |    0.00 |
| Span_Clear |           40 |  15.73 ns | 0.338 ns | 0.474 ns |  0.73 |    0.03 |
|            |              |           |          |          |       |         |
|    Default |          256 | 118.73 ns | 2.409 ns | 4.642 ns |  1.00 |    0.00 |
| Span_Clear |          256 |  26.19 ns | 0.532 ns | 0.711 ns |  0.22 |    0.01 |
Benchmark code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>();

public class Bench
{
    private readonly int[] _codeLengths = new int[296];

    [Params(40, 256)]
    public int AlphabetSize { get; set; }

    [Benchmark(Baseline = true)]
    public void Default()
    {
        int alphabetSize = this.AlphabetSize;
        int[] codeLengths = _codeLengths;

        for (int i = 0; i < alphabetSize; ++i)
        {
            codeLengths[i] = 0;
        }
    }

    [Benchmark]
    public void Span_Clear()
    {
        _codeLengths.AsSpan(0, this.AlphabetSize).Clear();
    }
}

protected ulong ReadUInt64() =>
this.TryReadSpan(this.buf8)
? this.ConvertToUInt64(this.buf8)
protected ulong ReadUInt64()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method won't inline, so no problem to have the stackalloc here.
In recent JITs sometimes stackalloc won't hinder inlining anyway.

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 26, 2023

Please wait a moment with review -- I'd like to validate / improve something.

It's only one value for the fixed buffer size and the creation of the span -- so less error prone once the value needs to be updated.
@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 26, 2023

to validate

Everything is fine. My fear was that the scratch-buffer, backed by the fixed buffer, e.g.

private ScratchBuffer buffer; // mutable struct, don't make readonly
gets passed into a method directly. As it's a struct, so copied by value, there could be a perf-penalty due excessive copying. This example showcases what I mean. Method TestA passed the _buffer as value, so copying needs to be done (in the dasm the first call is the call to "memory copy" native routine). If the buffer needs to passed around, then it should be done via ref as shown in TestB.

Both don't apply here (also checked other uses of fixed buffers across the code-base).
Here we only pass the span around (as in TestC), so no problem.

@JimBobSquarePants JimBobSquarePants added this to the v3.1.0 milestone Mar 27, 2023
Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of learnings here.

It's really great to see in many instances a modernization of the codebase.

@JimBobSquarePants JimBobSquarePants merged commit 3d40c91 into SixLabors:main Mar 29, 2023
@gfoidl gfoidl deleted the allocations branch March 29, 2023 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants