Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Sve.DuplicateSelectedScalarToVector() #103228

Merged

Conversation

SwapnilGaikwad
Copy link
Contributor

Contribute towards #99957.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

1 similar comment
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 10, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@SwapnilGaikwad
Copy link
Contributor Author

@a74nh @kunalspathak @dotnet/arm64-contrib @arch-arm64-sve

@SwapnilGaikwad
Copy link
Contributor Author

SwapnilGaikwad commented Jun 10, 2024

There are a few assertion failures, related to STR. Taking a further look.

Stress test results
===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_float() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_double() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_sbyte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_short() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_int() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_long() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_byte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_ushort() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_uint() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_ulong() : 17
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
Errors:

Assert failure(PID 2894704 [0x002c2b70], Thread: 2894704 [0x2c2b70]): Assertion failed 'isValidSimm<9>(emitGetInsSC(id))' in 'JIT.HardwareIntrinsics.Arm._Sve.Program:SveDuplicateSelectedScalarToVector_double()' during 'Generate code' (IL size 115; hash 0x749c6f16; FullOpts)

    File: /home/user/dotnet/runtime/src/coreclr/jit/emitarm64sve.cpp:14192
    Image: /home/user/dotnet/runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun


------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
Errors:

Assert failure(PID 2894838 [0x002c2bf6], Thread: 2894838 [0x2c2bf6]): Assertion failed 'isValidSimm<9>(emitGetInsSC(id))' in 'JIT.HardwareIntrinsics.Arm._Sve.Program:SveDuplicateSelectedScalarToVector_double()' during 'Generate code' (IL size 115; hash 0x749c6f16; FullOpts)

    File: /home/user/dotnet/runtime/src/coreclr/jit/emitarm64sve.cpp:14192
    Image: /home/user/dotnet/runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun


------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@kunalspathak
Copy link
Member

related to STR.

I have seen them in the past, but yes, if you can find out what is the cause, that will be good.

@SwapnilGaikwad
Copy link
Contributor Author

All stress tests are passing now.

Stress test results
===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_float() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_double() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_sbyte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_short() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_int() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_long() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_byte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_ushort() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_uint() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveDuplicateSelectedScalarToVector_ulong() : 17
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@SwapnilGaikwad SwapnilGaikwad marked this pull request as ready for review June 11, 2024 14:07
Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few things to check.

@@ -421,6 +421,10 @@ void HWIntrinsicInfo::lookupImmBounds(
immUpperBound = (int)SVE_PATTERN_ALL;
break;

case NI_Sve_DuplicateSelectedScalarToVector:
immUpperBound = (512 / (genTypeSize(baseType) * BITS_PER_BYTE)) - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
immUpperBound = (512 / (genTypeSize(baseType) * BITS_PER_BYTE)) - 1;
immUpperBound = Compiler::getSIMDVectorLength(simdSize, baseType) - 1;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not able able to use the getSIMDVectorLength() here as the imm for DUP seems special [1].

Is the immediate index, in the range 0 to one less than the number of elements in 512 bits, encoded in "imm2:tsz".

I didn't find a better helper method for this so did it explicitly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually little confused with that description and then this one:

The immediate element index is in the range of 0 to 63 (bytes), 31 (halfwords), 15 (words), 7 (doublewords) or 3 (quadwords).

With ^ description, it sounded me like getSIMDVectorLength(), no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing. Unlike other cases, the index here does not depend on the vector length. A valid index range is fixed based on the element type, e.g., 0 to 63 for vectors of type byte.

ValidateResult(_dataTable.inArrayPtr, {Imm}, _dataTable.outArrayPtr);
}

public void RunBasicScenario_UnsafeRead_InvalidImm()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm what is the behavior here?

From https://docsmirror.github.io/A64/2023-06/dup_z_zi.html, is it validating this portion:

 Selecting an element beyond the accessible vector length causes the destination vector to be set to zero.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid index refers to a value that is permitted but out of bounds on the current system. E.g., bytes permitted index values are between 0 to 63 but on a 256bit SVE length machine, valid indices can be between 256 / 8 - 1 = 0 to 31. Thus, the indices between 32 to 63 would set the destination value to zero.

src/coreclr/jit/emitarm64.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiclistarm64sve.h Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiccodegenarm64.cpp Outdated Show resolved Hide resolved
@SwapnilGaikwad
Copy link
Contributor Author

The tests with short are failing with the following assertion error. Not sure if it's related to the patch.
Do you have any suggestions on how to fix them?

Beginning scenario: RunBasicScenario_UnsafeRead

Assert failure(PID 1069122 [0x00105042], Thread: 1069122 [0x105042]): Assertion failed '(candidates & allRegs(regType)) != RBM_NONE' in 'System.Runtime.Intrinsics.Arm.Sve:DuplicateSelectedScalarToVector(System.Numerics.Vector`1[short],ubyte):System.Numerics.Vector`1[short]' during 'LSRA build intervals' (IL size 8; hash 0x3d7d92ae; Tier0)

    File: /home/user/dotnet/runtime/src/coreclr/jit/lsra.cpp:2913
    Image: /home/user/dotnet/runtime/dup-artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun

Aborted (core dumped)

@SwapnilGaikwad
Copy link
Contributor Author

JIT dump for a failing method.

@kunalspathak
Copy link
Member

JIT dump for a failing method.

@SwapnilGaikwad - Are you sure this one is the right dump? I see the compilation succeed in this dump.

@kunalspathak
Copy link
Member

Do you have any suggestions on how to fix them?

This is to do with that a node is of type TYP_FLOAT and we assigned it candidates of TYP_MASK or vice-versa. It might not be directly related to your PR, but something that your PR exposed specially after the LSRA work of predicate register is complete. If you share the correct dump, I can quickly tell for which node it is getting wrong, and you can then track in lsraarm64.cpp why that is happening.

@SwapnilGaikwad
Copy link
Contributor Author

Hi Kunal, apologies. It was dump of the parent test method that emitted a call to Sve.DuplicateSelectedScalarToVector.
I have updated the gist with the correct dump now.

@kunalspathak
Copy link
Member

Hi Kunal, apologies. It was dump of the parent test method that emitted a call to Sve.DuplicateSelectedScalarToVector. I have updated the gist with the correct dump now.

DefList: { N004.t0. LCL_VAR; N014.t1. LCL_VAR }
N016 (  4,  4) [000009] -----------                         *  CAST      int <- ubyte <- int REG NA
<RefPosition #7   @16  RefTypeUse <Ivl:3> BB01 regmask=[x0-xip0 x19-x28] minReg=1 last wt=100.00>
Interval  4: int RefPositions {} physReg:NA Preferences=[x0-xip0 x19-x28] Aversions=[]
<RefPosition #8   @17  RefTypeDef <Ivl:4> CAST BB01 regmask=[x0-xip0 x19-x28] minReg=1 wt=400.00>

DefList: { N004.t0. LCL_VAR; N016.t9. CAST }
N018 ( 17, 20) [000006] ---X-------                         *  HWINTRINSIC simd16 short DuplicateSelectedScalarToVector REG NA
Interval  5: int RefPositions {} physReg:NA Preferences=[x0-xip0 x19-x28] Aversions=[]
<RefPosition #9   @18  RefTypeDef <Ivl:5 internal> HWINTRINSIC BB01 regmask=[x0-xip0 x19-x28] minReg=1 wt=400.00>
<RefPosition #10  @18  RefTypeUse <Ivl:0> BB01 regmask=[allFloat] minReg=1 last wt=100.00>
<RefPosition #11  @18  RefTypeUse <Ivl:4> BB01 regmask=[d0-d15] minReg=1 last wt=100.00>
<RefPosition #12  @18  RefTypeUse <Ivl:5 internal> HWINTRINSIC BB01 regmask=[x0-xip0 x19-x28] minReg=1 last wt=400.00>
Interval  6: simd16 RefPositions {} physReg:NA Preferences=[allFloat] Aversions=[]
<RefPosition #13  @19  RefTypeDef <Ivl:6> HWINTRINSIC BB01 regmask=[allFloat] minReg=1 wt=400.00>

I think the problem is in <Ivl:4>. For def RefPosition #8, the mask created is GPR, but the use RefPosition #11, the mask if d0-d15 which is GPR, which seems wrong. So, you might want to check by putting breakpoint in newRefPositionRaw() for newRP->rpNum == 11 why the mask is not gpr.

@kunalspathak
Copy link
Member

@SwapnilGaikwad - I assume you don't need anything from me for this PR?

@SwapnilGaikwad
Copy link
Contributor Author

@SwapnilGaikwad - I assume you don't need anything from me for this PR?

Hi @kunalspathak , I haven't exhausted all the debugging possibilities yet. I might come back with questions 🙂 .

@SwapnilGaikwad
Copy link
Contributor Author

SwapnilGaikwad commented Jun 26, 2024

why the mask is not gpr

Hi @kunalspathak , your suggestion was very useful to identify the issue. The mask was hardcoded to simd registers for op2.

All stress tests pass
===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_float() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_double() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_sbyte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_short() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_int() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_long() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_byte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_ushort() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_uint() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_DuplicateSelectedScalarToVector_ulong() : 17
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Have a question about the change in codegenarm64.cpp.

Also, how are we exposing DUP (immediate) and DUP (scalar) ?

@@ -8150,9 +8153,11 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
{
useRegForImm = true;
regNumber rsvdReg = codeGen->rsGetRsvdReg();
codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm);
// For larger imm values (> 9 bits), calculate base + imm in a reserved register first.
codeGen->instGen_Set_Reg_To_Base_Plus_Imm(EA_PTRSIZE, rsvdReg, reg2, imm);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to the DUP? Don't think so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might actually need it for /~https://github.com/dotnet/runtime/pull/104065/files#diff-2b2c8b9011607926410624d6f81613fad7b74c6e0516d578675a8b792998fe4fR7893-R7896, but I am curious if you found a repro, for which you had to add change here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to DUP directly. However, tests trigger a scenario where we end up in having a store with a larger immediate.

@kunalspathak
Copy link
Member

Also, how are we exposing DUP (immediate) and DUP (scalar) ?

Spoke offline that we will use these in future when we intrinsify Vector.Create or similar scenario for sve.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kunalspathak kunalspathak merged commit 5b962c3 into dotnet:main Jun 27, 2024
152 of 165 checks passed
@SwapnilGaikwad SwapnilGaikwad deleted the github-sve-duplicateScalarToVector branch June 27, 2024 16:18
@github-actions github-actions bot locked and limited conversation to collaborators Jul 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime.Intrinsics arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants