Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experiment] Inline "is on heap?" checks for checked write barriers #111561

Closed
wants to merge 8 commits into from

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jan 18, 2025

Let's see if it works out

MyLargeStruct Largestruct(MyLargeStruct ms) => ms;

public struct MyLargeStruct
{
    public object Fld1;
    public object Fld2;
    public object Fld3;
    public object Fld4;
    public object Fld5;
    public object Fld6;
    public object Fld7;
    public object Fld8;
}

Current codegen:

; Method Benchmark:Largestruct(MyLargeStruct):MyLargeStruct (FullOpts)
G_M19878_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
            mov     x1, x0
G_M19878_IG02:
            movz    x0, #0x5758      ;; code for CORINFO_HELP_BULK_WRITEBARRIER
            movk    x0, #0x6530 LSL #16
            movk    x0, #0x7FFA LSL #32
            ldr     x3, [x0]
            ldrsb   wzr, [x8]
            mov     x0, x8
            mov     x2, #64
            blr     x3 ;; CORINFO_HELP_BULK_WRITEBARRIER
G_M19878_IG03:
            ldp     fp, lr, [sp], #0x10
            ret     lr
; Total bytes of code: 52

New codegen:

; Method Benchmark:Largestruct(MyLargeStruct):MyLargeStruct (FullOpts)
G_M19878_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
            mov     x1, x0
G_M19878_IG02:
            movz    x0, #0x5788
            movk    x0, #0xC993 LSL #16
            movk    x0, #0x7FFA LSL #32
            ldp     x2, x0, [x0]
            cmp     x8, x2
            ccmp    x8, x0, z, ge    ;;; is retBuf on GC heap? 
            blt     G_M19878_IG05
G_M19878_IG03:
            ldp     q16, q17, [x1]  ;;; it is on stack - do copy with 4 SIMD stores (stp)
            stp     q16, q17, [x8]
            ldp     q16, q17, [x1, #0x20]
            stp     q16, q17, [x8, #0x20]
G_M19878_IG04:
            ldp     fp, lr, [sp], #0x10
            ret     lr
G_M19878_IG05:  ;; is on heap
            movz    x0, #0x5758 ;; code for CORINFO_HELP_BULK_WRITEBARRIER
            movk    x0, #0x6A63 LSL #16
            movk    x0, #0x7FFA LSL #32
            ldr     x3, [x0]
            ldrsb   wzr, [x8]
            mov     x0, x8
            mov     x2, #64
            blr     x3 ;; CORINFO_HELP_BULK_WRITEBARRIER  ;; we might want a new helper here
            b       G_M19878_IG04
; Total bytes of code: 100

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 18, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member Author

EgorBo commented Jan 18, 2025

@EgorBot -azure_ampere -azure_cobalt100 -profiler

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmark
{
    private MyStruct _heap1;
    private MyLargeStruct _heap2;

    [Benchmark]
    public void ReturnBufferWB_Smallstruct_stack() => Smallstruct(default);

    [Benchmark]
    public void ReturnBufferWB_Smallstruct_heap() => _heap1 = Smallstruct(default);

    [Benchmark]
    public void ReturnBufferWB_Largestruct_stack() => Largestruct(default);

    [Benchmark]
    public void ReturnBufferWB_Largestruct_heap() => _heap2 = Largestruct(default);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static MyStruct Smallstruct(MyStruct ms) => ms;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static MyLargeStruct Largestruct(MyLargeStruct ms) => ms;
}

public struct MyStruct
{
    public object Fld1;
    public object Fld2;
    public object Fld3;
}

public struct MyLargeStruct
{
    public object Fld1;
    public object Fld2;
    public object Fld3;
    public object Fld4;
    public object Fld5;
    public object Fld6;
    public object Fld7;
    public object Fld8;
}

@jkotas
Copy link
Member

jkotas commented Jan 18, 2025

            ldp     x2, x0, [x0]
            cmp     x8, x2
            ccmp    x8, x0, z, ge    ;;; is retBuf on GC heap? 

If GC is triggered right after ldp x2, x0, [x0], expands the heap and moves the object into newly expanded part of the heap, we will compare the reference against a stale GC heap limit and things will go wrong there. The entire expansion may need to be marked as no-gc region to avoid this problem.

@NinoFloris
Copy link
Contributor

Size diffs won't be nice, but preliminary arm64 perf results look very encouraging?

When I brought up return barrier cost I had assumed keeping returns unmanaged would help but it's more than I expected. (and there's still a heap check in here!)

Flipping the trade-off would help with the code size issue, assuming the copy mostly trades out with the removal of the write barrier call. Though I understand if such a broad change is just too involved or risky. This PR is nice in any case, as I have also seen things like try patterns show up in my profiling.

Thanks for picking this up @EgorBo

@EgorBo
Copy link
Member Author

EgorBo commented Jan 18, 2025

If GC is triggered right after ldp x2, x0, [x0], expands the heap and moves the object into newly expanded part of the heap, we will compare the reference against a stale GC heap limit and things will go wrong there. The entire expansion may need to be marked as no-gc region to avoid this problem.

Yep, I had it in mind, this expansion happens late enough to be able to arrange nogc, but I wonder if it's possible for the GC boundaries to be just constant? Like GC reserves a huge memory region and the boundaries represent reserved, not comitted? thus, we can bake them as constant

When I brought up return barrier cost I had assumed keeping returns unmanaged would help but it's more than I expected.

It's important to keep in mind that struct copies for large blocks aren't cheap even without write barriers, so an extra struct copy might ruin the idea

Overall, I am not planning to continue for now, I was mostly interested in perf numbers. I think we need something realistic to measure impact of it on. Also, @AndyAyersMS mentioned that at some point we might start emitting more (if not all) checked barriers due to IPO escape analsysis, so this might help there as well

@EgorBo EgorBo closed this Jan 18, 2025
@jkotas
Copy link
Member

jkotas commented Jan 18, 2025

I wonder if it's possible for the GC boundaries to be just constant?

GC does that in some modes today. We would want to add information about this to WriteBarrierParameters.

@NinoFloris
Copy link
Contributor

It's important to keep in mind that struct copies for large blocks aren't cheap even without write barriers, so an extra struct copy might ruin the idea

Yes larger structs may get hit by this, though I'm not sure how common those are, they're advised against either way. The perf numbers show most gains for the smaller structs, this is aligns with my own investigations.

Just to reiterate, it's frustrating to have managed struct hot paths today, if not *everything* can be inlined perf reduces significantly. It's not what you expect defining something as simple as (unmanaged header, ROSeq buffer). Using a pooled class to avoid these return costs is not always possible, and can easily end up being more expensive due to the actual heap barriers you now incur to update the fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants