
JIT: Unspill normalize-on-load variables using exact small type #90318

Merged · 8 commits · Aug 14, 2023

Conversation

@jakobbotsch (Member) commented Aug 10, 2023:

The existing logic would sometimes unspill using the type of the local
that is being unspilled. This type is often wider than the exact small
type in the LclVarDsc, since NOL locals are normally expanded as
CAST(<small type>, LCL_VAR<int>).

This causes problems since morph will in some cases avoid inserting
normalization for NOL locals when it has a subrange assertion available.
This optimization relies on the backend always ensuring that the local
will be normalized as part of unspilling and args homing.

Size-wise regressions are expected on xarch since the encoding of the
normalizing load is larger. However, as we have seen before, wide loads
can cause significant store-forwarding stalls with a large negative
impact on performance, so overall the small loads are expected to be a
perf benefit (in addition to the correctness fix).

Fix #90219

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 10, 2023
@ghost ghost assigned jakobbotsch Aug 10, 2023
@ghost commented Aug 10, 2023:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch (Member, Author):

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines:

Azure Pipelines successfully started running 3 pipeline(s).

@sultrifork left a comment:

LGTM

@jakobbotsch (Member, Author):

jitstress failures are #90144. I've opened #90376 for the libraries-jitstress failures, but there were some that look like infrastructure, so I will rerun it. I've opened #90377 for the Fuzzlyn failure.

@jakobbotsch (Member, Author):

/azp run runtime-coreclr libraries-jitstress

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch (Member, Author):

/azp run runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines:

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch jakobbotsch marked this pull request as ready for review August 11, 2023 19:44
@jakobbotsch (Member, Author):

cc @dotnet/jit-contrib PTAL @BruceForstall

Diffs. Not quite as bad as the ones I displayed during the meeting (after fixing my bug), but still somewhat bad. As mentioned, though, this is expected to be a perf improvement rather than a regression (but let's see in the perf lab).

We could probably fix the size regression by not REX-prefixing movzx unnecessarily, but I don't think it's something we need to do for .NET 8. I will open a separate issue for it.

@BruceForstall (Member) commented Aug 11, 2023:

I was surprised to see a few examples in arm64 diffs of removing a register from byref GC reporting, but no codegen diff. The cases seemed benign since we don't need to report register pointers to the stack, but is it expected that this PR changes GC reporting?

Example:

@@ -436,14 +436,12 @@ G_M8490_IG31:        ; bbWeight=0, gcrefRegs=D800000 {x23 x24 x26 x27}, byrefReg
             blr     x2
             ; gcr arg pop 0
             ldr     x2, [fp, #0x18]	// [V22 tmp14]
-            ; byrRegs +[x2]
             ldr     w0, [x2, #0x04]
             cmp     w0, #24
             mov     x3, x2
             bls     G_M8490_IG05
 						;; size=68 bbWeight=0 PerfScore 0.00
 G_M8490_IG32:        ; bbWeight=0, gcrefRegs=D800000 {x23 x24 x26 x27}, byrefRegs=100000 {x20}, byref
-            ; byrRegs -[x2]
             mov     x0, x27
             ; gcrRegs +[x0]
             mov     w1, wzr

actually, this change exists on win-x64 as well.

@jakobbotsch (Member, Author):

The cases seemed benign since we don't need to report register pointers to the stack, but is it expected that this PR changes GC reporting?

Hmm no -- it happens because GetStackSlotHomeType uses the type from the LclVarDsc, while we used to use the type from the GenTreeLclVarCommon that we are unspilling because of. I don't think it will be a problem, but I don't think we need to take the risk for .NET 8, so let me special case this for byrefs/GC pointers to use the original behavior.

@BruceForstall (Member) commented Aug 11, 2023:

On win-x64, the (non-GC) diffs seem to be converting from mov DWORD to movzx BYTE or WORD. But on linux-arm, it seems like the opposite:

-            ldrb    r10, [sp+0x14]	// [V06 tmp3]
+            ldr     r10, [sp+0x14]	// [V06 tmp3]

Would that be expected?

(the diffs are not all this way; there are some ldr to ldrb or ldrh cases as well)

[Edit] looks like you enabled the new unspill code for TARGET_ARM where that target previously had different behavior.

// general, the type of 'lcl' does not have any relation to the
// type of 'varDsc':
//
// * For NOL locals it is wider under normal circumstances, where
A reviewer (Member) commented:

nit: spell out NOL the first time

Suggested change
// * For NOL locals it is wider under normal circumstances, where
// * For normalize-on-load (NOL) locals it is wider under normal circumstances, where

//
// * For NOL locals it is wider under normal circumstances, where
// morph has added a cast on top
// * For all locals it can be narrower narrower in some cases, e.g.
A reviewer (Member) commented:

Suggested change
// * For all locals it can be narrower narrower in some cases, e.g.
// * For all locals it can be narrower in some cases, e.g.

@jakobbotsch (Member, Author) commented Aug 12, 2023:

On win-x64, the (non-GC) diffs seem to be converting from mov DWORD to movzx BYTE or WORD. But on linux-arm, it seems like the opposite:

-            ldrb    r10, [sp+0x14]	// [V06 tmp3]
+            ldr     r10, [sp+0x14]	// [V06 tmp3]

Would that be expected?

(the diffs are not all this way; there are some ldr to ldrb or ldrh cases as well)

[Edit] looks like you enabled the new unspill code for TARGET_ARM where that target previously had different behavior.

It happens when unspilling normalize-on-store locals (the LCL_VAR will be the small type, but the stack slot is wide and normalized). I unified the logic because I do not think ARM32 needs or should be special cased here. But let me do that change for .NET 9 instead, and scope this change down to only normalize-on-load locals for the other architectures in the interest of reducing risk.

@jakobbotsch (Member, Author):

/azp run runtime-coreclr superpmi-diffs, runtime-coreclr superpmi-replay

@azure-pipelines:

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch jakobbotsch added this to the 8.0.0 milestone Aug 14, 2023
@BruceForstall (Member) left a comment:

LGTM

@jakobbotsch jakobbotsch merged commit 6d0e3e5 into dotnet:main Aug 14, 2023
@jakobbotsch jakobbotsch deleted the fix-90219 branch August 14, 2023 17:20
shushanhf added a commit to shushanhf/runtime that referenced this pull request Aug 17, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Sep 13, 2023
Successfully merging this pull request may close these issues.

JIT: Bad codegen on win-x86
3 participants