Implement DivRem intrinsic for X86 #66551

huoyaoyuan · 2022-03-13T09:50:14Z

Some of the code was taken from #37928 and #64864. I don't fully understand all of the concepts yet, so I need a JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

This fixes an error that occurs when crossgen2 compiles Utf8Formatter.TryFormat(TimeSpan).

dotnet-issue-labeler · 2022-03-13T09:50:21Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2022-03-13T09:50:24Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `new-api-needs-documentation`
Milestone:	-

ghost · 2022-03-13T09:50:52Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.


src/coreclr/jit/gentree.h

huoyaoyuan · 2022-03-13T09:54:50Z

src/coreclr/jit/hwintrinsicxarch.cpp

        case InstructionSet_Vector128:
-        case InstructionSet_X86Base:
            return impBaseIntrinsic(intrinsic, clsHnd, method, sig, simdBaseJitType, retType, simdSize);
+        case InstructionSet_X86Base:
+        case InstructionSet_X86Base_X64:
+            return impX86BaseIntrinsic(intrinsic, method, sig, simdBaseJitType);


This had an interesting side effect: if FeatureSIMD is turned off, X86Base.Pause will also be turned off.

huoyaoyuan · 2022-03-13T09:56:18Z

src/coreclr/jit/lower.cpp

+    // For local stores on XARCH we can't handle another lclVar source.
+    // If the source was another lclVar similarly promoted, we would
    // have broken it into multiple stores.
-    if (lclNode->OperIs(GT_STORE_LCL_VAR) && !lclNode->gtGetOp1()->OperIs(GT_CALL))
+    if (lclNode->OperIs(GT_STORE_LCL_VAR) && lclNode->gtGetOp1()->OperIs(GT_LCL_VAR))


I don't understand what this is doing, but without it crossgen2 fails for Utf8Formatter.TryFormat(DateTime).

Did you try to find the root cause of this?

I didn't have time for this. I brought it from another PR mentioned earlier and it just works. It's better to ask the original author.

@CarolEidt can you help explain this? Omitting it no longer causes failures, but the code looks necessary.

@huoyaoyuan - Carol doesn't work on this code base anymore. As the comment suggests, "If the source was another lclVar similarly promoted, we would have broken it into multiple stores." So I think you should revert this change, given that it works now.

Yes, that was just a means to try which nodes are caught if op1 != GT_CALL. So the reason it is failing (as expected) is because the tree now contains the op1 == GT_HWINTRINSIC.

                   ┌──▌  t10    int
                   ├──▌  t13    int
                   ├──▌  t58    int
    N007 (  8,  9) [000014] -----------  t14 = ▌  HWINTRINSIC struct int DivRem $140
                   ┌──▌  t14    struct
    N009 ( 18, 16) [000017] MA---------        ▌  STORE_LCL_VAR struct<System.ValueTuple`2[System.Int32, System.Int32], 8>(P) V05 tmp3
                                               ▌    int    V05.Item1 (offs=0x00) -> V13 tmp11
                                               ▌    int    V05.Item2 (offs=0x04) -> V14 tmp12

We do want to enregister in such cases; not doing so (retaining op1 != GT_CALL) spills the result to the stack.

https://www.diffchecker.com/TUY5sGjV

Isn't this the same code on xarch?

Yes and no. My suggestion includes the (significant) addition that we will DNER all locals, not just promoted ones. It is to prevent the case I described above from arising: struct retyped as long = multi-reg-node and other assigns codegen doesn't support. Alternatively, we could of course support them in codegen, but that seems out of scope for this change.

Also, for the below code, do you mean to say use the current check of independent promotion too?

Yes, the independent promotion check (and the field count check as well) must stay. In fact, I think they should be expanded. Consider struct { promoted field<double> } = HWI(int, int) and other mismatched cases. Right now they would pass, but we need to DNER them, by passing the register count to the method instead of the return type descriptor.

There are now new failures and I don't have enough time to investigate in more depth.

I will take a look hopefully next week.

I updated the code as per @SingleAccretion's suggestion.

src/coreclr/jit/lowerxarch.cpp

huoyaoyuan · 2022-03-13T09:59:15Z

src/coreclr/jit/lsraxarch.cpp

+                // DIV implicitly put op1(lower) to EAX and op2(upper) to EDX
+                srcCount += BuildOperandUses(op1, RBM_EAX);
+                srcCount += BuildOperandUses(op2, RBM_EDX);
+                srcCount += BuildOperandUses(op3);


What exactly does RMW mean? Something like ADD src, dst, where one register is both an operand and the result?
I tried BuildDelayFreeUses, but it causes a register conflict in optimized code.

> What exactly does RMW mean? Something like ADD src, dst, where one register is both an operand and the result?

Correct.

> I tried BuildDelayFreeUses, but it causes a register conflict in optimized code.

Do you have an example?

> Do you have an example?

I had added a false assert: when op3 is spilled onto the stack, its regNum remains the register it was spilled from. This shouldn't be an issue now.

huoyaoyuan · 2022-03-13T10:01:12Z

src/libraries/System.Private.CoreLib/src/System/Math.cs

@@ -317,13 +317,25 @@ public static int DivRem(int a, int b, out int result)
            // Restore to using % and / when the JIT is able to eliminate one of the idivs.
            // In the meantime, a * and - is measurably faster than an extra /.

+            if (X86Base.IsSupported)
+            {
+                (int quitient, result) = X86Base.DivRem((uint)a, a >> 31, b);


There is no way to express the CDQ/CQO instruction directly. Is there a better option than >> 31?
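To illustrate the question above, here is a minimal C++ sketch (not code from this PR) of what `a >> 31` computes: the value CDQ would place in EDX when sign-extending EAX. The function names are hypothetical, and the sketch assumes an arithmetic right shift on signed integers, which C++20 guarantees and mainstream compilers provide earlier.

```cpp
#include <cassert>
#include <cstdint>

// Upper 32 bits of the sign-extended 64-bit dividend, computed the way
// the managed code in the diff does it: an arithmetic shift by 31.
// Assumption: >> on a negative int32_t is an arithmetic shift.
inline int32_t upper_half_via_shift(int32_t a) {
    return a >> 31;
}

// Reference definition of what CDQ leaves in EDX:
// all ones for a negative EAX, all zeros otherwise.
inline int32_t upper_half_reference(int32_t a) {
    return a < 0 ? -1 : 0;
}
```

Both helpers agree for every input, so the shift is a faithful stand-in for CDQ; whether the JIT folds it back into a single CDQ is a separate codegen question.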

huoyaoyuan · 2022-03-13T10:03:15Z

src/tests/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx

@@ -1261,6 +1261,20 @@ private static readonly (string templateFileName, Dictionary<string, string> tem
    ("ScalarTernOpBinResTest.template", new Dictionary<string, string> { ["Isa"] = "Bmi2.X64", ["Method"] = "MultiplyNoFlags",     ["RetBaseType"] = "UInt64", ["Op1BaseType"] = "UInt64", ["Op2BaseType"] = "UInt64", ["Op3BaseType"] = "UInt64",        ["NextValueOp1"] = "UInt64.MaxValue",                   ["NextValueOp2"] = "UInt64.MaxValue",                     ["NextValueOp3"] = "0",  ["ValidateResult"] = "ulong expectedHigher = 18446744073709551614, expectedLower = 1; isUnexpectedResult = (expectedHigher != higher) || (expectedLower != lower);" }),
 };

+private static readonly (string templateFileName, Dictionary<string, string> templateData)[] X86BaseInputs = new []
+{
+    ("ScalarTernOpTupleBinRetTest.template", new Dictionary<string, string> { ["Isa"] = "X86Base", ["Method"] = "DivRem", ["RetBaseType"] = "Int32",  ["Op1BaseType"] = "UInt32", ["Op2BaseType"] = "Int32",  ["Op3BaseType"] = "Int32",  ["NextValueOp1"] = "UInt32.MaxValue", ["NextValueOp2"] = "-2", ["NextValueOp3"] = "-0x10001",   ["ValidateResult"] = " int expectedQuotient = 0xFFFF;  int expectedReminder = -2; isUnexpectedResult = (expectedQuotient != ret1) || (expectedReminder != ret2);" }),


The test values were picked to verify that the correct signedness is used.
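The arithmetic behind those test values can be checked with a small C++ model of the 32-bit divide. This is only a sketch under stated assumptions: `idiv32_model` and `div32_unsigned_quotient` are hypothetical helpers, the model ignores the #DE fault for a zero divisor or an out-of-range quotient, and it relies on the usual two's-complement conversion from `uint64_t` to `int64_t`.

```cpp
#include <cassert>
#include <cstdint>

struct DivRemResult {
    int32_t quotient;
    int32_t remainder;
};

// Signed model: treat upper:lower as one signed 64-bit dividend and
// divide by a signed 32-bit divisor, truncating toward zero.
inline DivRemResult idiv32_model(uint32_t lower, int32_t upper, int32_t divisor) {
    uint64_t bits = (static_cast<uint64_t>(static_cast<uint32_t>(upper)) << 32) | lower;
    int64_t dividend = static_cast<int64_t>(bits);
    return {static_cast<int32_t>(dividend / divisor),
            static_cast<int32_t>(dividend % divisor)};
}

// Unsigned model of the same bit patterns, returning the full-width
// quotient so that an overflow of the 32-bit result stays visible.
inline uint64_t div32_unsigned_quotient(uint32_t lower, int32_t upper, int32_t divisor) {
    uint64_t dividend = (static_cast<uint64_t>(static_cast<uint32_t>(upper)) << 32) | lower;
    return dividend / static_cast<uint32_t>(divisor);
}
```

With lower = UInt32.MaxValue, upper = -2 and divisor = -0x10001, the signed dividend is -4294967297, which gives quotient 0xFFFF and remainder -2, matching the ValidateResult expression in the template. Read as unsigned, the same bits produce a quotient that does not fit in 32 bits, so an accidental DIV in place of IDIV could not pass the test.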

huoyaoyuan · 2022-03-13T10:11:28Z

Code gen for sample method:

        public static ulong XL(ulong a, ulong b)
        {
            (ulong q, ulong r) = Math.DivRem(a, b);
            return q + r;
        }

G_M52626_IG01:
       mov      qword ptr [rsp+10H], rdx
						;; bbWeight=1    PerfScore 1.00

G_M52626_IG02:
       xor      edx, edx
       mov      rax, rcx
       div      rdx:rax, qword ptr [rsp+10H]
       add      rax, rdx
						;; bbWeight=1    PerfScore 61.75

G_M52626_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 19

In one commit it used R8 instead of the stack, but that failed with a register conflict in optimized code.

It can be worse if RDX is used by the calling convention. For the following method:

        public long DivRemInt64(long a, long b)
        {
            (long q, long r) = Math.DivRem(a, b);
            return q + r;
        }

It saves a copy to stack:

       mov       [rsp+10],rdx
       sar       rdx,3F
       mov       rax,[rsp+10]
       idiv      r8
       add       rax,rdx
       ret
; Total bytes of code 21

This causes microbenchmark regressions, although real-world code can differ depending on register assignment.

huoyaoyuan · 2022-03-13T10:29:07Z

Do I need to implement for Mono in this PR or create an issue to track it? This is an already implemented ISA and used in corelib. Lack of implementation would cause existing tests to fail. /cc @vargaz

lambdageek · 2023-02-14T17:04:52Z

When running the aot compiler, the MONO_PATH env var needs to be set in such a way that the input corlib assembly is the same assembly as the one found in the directories in MONO_PATH. Not sure why this only fails for this one test.

I do see that this path is set up correctly.

2023-02-13T18:51:19.8150781Z   aot-compile: compiling /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r/System.Private.CoreLib.dll; MONO_PATH: /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r:/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/Tests/Core_Root
2023-02-13T18:51:19.8580478Z EXEC : error : Loaded assembly '/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/Tests/Core_Root/System.Private.CoreLib.dll' doesn't match original file name '/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r/System.Private.CoreLib.dll'. Set MONO_PATH to the assembly's location. [/__w/1/s/src/mono/msbuild/aot-compile.proj]

@lambdageek

The issue is that there are two different libraries in the MONO_PATH, both called CoreLib.dll: the "real" one and the fake one that the test suite adds in order to allow the tests to call APIs that are not in System.Runtime.

That is very likely to confuse the AOT compiler. It would be much better if the fake one didn't end up in the directory for the test case.

I think it's worth investigating if the fake one can be a reference assembly, instead. But I'm not sure even that would be enough.

I'll try to build this PR (or #80297, which uses the same trick) locally to see if there's some way to make the AOT compiler happy

lambdageek · 2023-02-14T21:44:49Z

Left a comment on the other PR that does the same fake corelib trick with one solution: #80297 (comment)

Unfortunately that solution just exposed an issue where the new intrinsic created some condition that the AOT compiler did not expect to see.

But maybe it'll work on this PR. Here's the commit (for the other PR) that shows the approach lambdageek@347d943

Make the DivRem tests reference the DivRem fake CoreLib as reference assembly; and ensure that it is not included as a ProjectReference by the toplevel HardwareIntrinsics merged test runners. The upshot is that the DivRem tests can call the extra APIs via a direct reference to CoreLib (instead of through System.Runtime), but the fake library is not copied into any test artifact directories, and the Mono AOT compiler never sees it.

src/coreclr/jit/lsrabuild.cpp

tannergooding · 2023-02-15T16:39:34Z

src/coreclr/jit/lsraxarch.cpp

+
+                    RefPosition* op3RefPosition;
+                    srcCount += BuildDelayFreeUses(op3, op1, RBM_NONE, &op3RefPosition);
+                    if ((op3RefPosition != nullptr) && !op3RefPosition->delayRegFree)


We don't need to also set it as delay free for op3 because we don't track delayFree with regards to a specific operand, is that right?

Correct. We just want to make sure that op3 is marked as delayFree so neither of op1 or op2's registers conflict with that of op3.

tannergooding

Would be good to have tracking issues covering the mono work and any work required to use this from Math.DivRem before merging.

kunalspathak · 2023-02-15T19:53:02Z

> Would be good to have tracking issues covering the mono work and any work required to use this from Math.DivRem before merging.

Created:

kunalspathak · 2023-02-15T21:10:08Z

The failure is a known issue: #81123

jkotas · 2023-02-16T00:23:33Z

src/libraries/System.Runtime.Intrinsics/src/CompatibilitySuppressions.xml

@@ -0,0 +1,76 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!-- https://learn.microsoft.com/en-us/dotnet/fundamentals/package-validation/diagnostic-ids -->
+<Suppressions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">


Why do we have all these compatibility suppressions instead of adding these APIs to the ref assemblies?

jkotas · 2023-02-16T00:24:56Z

src/tests/JIT/HardwareIntrinsics/X86/X86Base/DivRem.RefOnly.csproj

+    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
+    <OutputType>Library</OutputType>
+    <CLRTestKind>SharedLibrary</CLRTestKind>
+    <AssemblyName>System.Private.CoreLib</AssemblyName>


Referencing CoreLib from the tests is an anti-pattern. Tests should only reference the public surface. If the tests need to hack on internal implementation details, they should use private reflection. Why are we doing this?

jkotas · 2023-02-16T00:27:16Z

The structure introduced by this PR is very non-standard and full of anti-patterns. I am tempted to revert it so that it can be done properly. @ViktorHofer @akoeplinger Thoughts?

jkotas · 2023-02-16T00:39:01Z

> We don't want the new APIs to get consumed publicly just yet,

Why not?

If you are worried that the Mono implementation is not in place yet, it would be fine to add a temporary non-intrinsic implementation under a MONO ifdef. It would be a very local change, and it would avoid spreading the CoreLib hacks through many places.

kunalspathak · 2023-02-16T00:42:04Z

I agree, and I do not like these hacks either. Unfortunately, we had to go this route because there are a few things we need to optimize before it can be consumed in Math.DivRem (related: #66551 (comment)). We wanted to get this work in without exposing it just yet, so @tannergooding, @pentp and I agreed to go this route. This is not only about Mono at this point; it also applies to CoreCLR. If you think we should revert it, we can revert it now, fix the optimizations that would make DivRem ready and consumable (e.g. from Math.DivRem), and then re-merge this one?

kunalspathak · 2023-02-16T00:48:50Z

We had also planned to do a similar thing in #80297, since the new APIs that I am adding are not yet approved, but I decided to wait for API approval rather than merge such anti-patterns of consuming them in test projects.

jkotas · 2023-02-16T00:54:35Z

It is common to add new public APIs with a simpler implementation and less-than-ideal performance first, and then optimize the implementation in subsequent PRs. We have done that many times. For example, #77799 added the initial implementation of FrozenCollections, and then a number of subsequent PRs optimized it further, and I am sure we will get some more optimizations before .NET 8 ships. We should follow the same pattern for codegen intrinsics.

If you are really worried about the API being used accidentally before it is ready for prime time, you can slap RequiresPreviewFeaturesAttribute on it, like how it is done for AvxVnni Intrinsics currently.

jkotas · 2023-02-16T01:02:34Z

> We had also planned to do a similar thing in #80297, since the new APIs that I am adding are not yet approved, but I decided to wait for API approval rather than merge such anti-patterns of consuming them in test projects.

Yes, please do wait for the API shape to be approved first before merging the implementation.

kunalspathak · 2023-02-16T01:25:02Z

> If you are really worried about the API being used accidentally before it is ready for prime time, you can slap RequiresPreviewFeaturesAttribute on it, like how it is done for AvxVnni Intrinsics currently.

I see. Ok, I will submit a PR to revert part of the hacks and add RequiresPreviewFeaturesAttribute.

tannergooding · 2023-02-16T01:31:04Z

Just noting these are hardware intrinsics and providing a software fallback is itself problematic and against the general design goal of the APIs and can lead users into a pit of failure. If one is provided, it needs to very explicitly still throw where the underlying platform is not x86/x64 and when X86Base.IsSupported reports false.

That being said, exposing a platform specific intrinsic API publicly without it being usable for its intended purpose is likewise problematic and leads users into a different pit of failure. This is why I originally pushed for the general support to be completed and for the API to be used from the primary consumption sites (such as Math.DivRem) before it got merged.

I only relented on that with the premise that these would not be public until after the other work was completed. So if we are reverting this, then any new PR should likely just complete the additional work so we don't risk shipping something that isn't actually functioning the way users expect.

kunalspathak · 2023-02-16T01:37:02Z

> So if we are reverting this

I was just planning to do the following:

- add back the DivRem methods in ref
- add RequiresPreviewFeaturesAttribute on the new APIs
- revert the fake library workaround added to make test projects work

Do we agree that this is the best course, or should we completely revert this PR, address the codegen issues, and likely close #82194? I am fine with either solution.

jkotas · 2023-02-16T01:40:39Z

Yes, I think that it is the best course of action.

kunalspathak · 2023-02-16T18:07:44Z

> Yes, I think that it is the best course of action.

#82221

huoyaoyuan · 2023-02-17T02:26:46Z

Glad to see this finally get merged!

huoyaoyuan added 11 commits March 7, 2022 18:55

Add managed api for divrem

00da1f4

Add NI definition of DivRem

d679cca

Fix DivRem to be static

4eba5c5

Implement DivRem in clrjit

6f8fbbf

Add tests for DivRem

997d8a5

Adjust lsra and RMW

c0c3dd6

Use DivRem intrinsic in Math

c08cee7

Bring lower change from coreclr#37928

1804350

This fixes error while crossgen2 compiling Utf8Formatter.TryFormat(TimeSpan).

Fix signedness of DIV

46c3f78

Revert RMW change and fix reg allocation

2dbd2e9

Fix import of X64 intrinsic

1b6d09b

ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 13, 2022

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation and removed community-contribution Indicates that the PR has been added by a community member labels Mar 13, 2022

huoyaoyuan added area-System.Runtime.Intrinsics and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 13, 2022

huoyaoyuan commented Mar 13, 2022

huoyaoyuan added 2 commits March 13, 2022 18:16

Fix static in PNSE version

3ef3600

Fix accidential indent change

acaf211

huoyaoyuan requested a review from tannergooding March 13, 2022 10:20

huoyaoyuan mentioned this pull request Mar 13, 2022

Enable multi-register intrinsics support for Arm64 #64921

Closed

13 tasks

huoyaoyuan marked this pull request as draft March 13, 2022 13:34

huoyaoyuan added 2 commits March 13, 2022 21:38

Apply format patch

728440d

op3 candidate should be different from op1 and op2

4eace5d

tannergooding reviewed Feb 15, 2023

tannergooding reviewed Feb 15, 2023

Unify AddDelayFreeUses

b5e3dd3

tannergooding approved these changes Feb 15, 2023

This was referenced Feb 15, 2023

Consume DivRem intrinsics from Math.DivRem #82194

Open

Support DivRem intrinsincs in mono #82195

Closed

kunalspathak merged commit 45c314f into dotnet:main Feb 15, 2023

jkotas reviewed Feb 16, 2023

huoyaoyuan deleted the divrem branch February 16, 2023 17:05

jeffhandley mentioned this pull request Feb 17, 2023

[IGNORE] Testing issue-labeler #82278

Closed

kunalspathak mentioned this pull request Feb 21, 2023

Let GenTreeCopyOrReload handle scenarios when FEATURE_MULTIREG_RET is disabled #82451

Merged

ghost locked as resolved and limited conversation to collaborators Mar 19, 2023
