-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement DivRem intrinsic for X86 #66551
Conversation
This fixes error while crossgen2 compiling Utf8Formatter.TryFormat(TimeSpan).
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsCloses #27292. Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them. Not implementing for Mono because I can't understand the code, and this one has special register constraints.
|
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics Issue DetailsCloses #27292. Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them. Not implementing for Mono because I can't understand the code, and this one has special register constraints.
|
src/coreclr/jit/hwintrinsicxarch.cpp
Outdated
case InstructionSet_Vector128: | ||
case InstructionSet_X86Base: | ||
return impBaseIntrinsic(intrinsic, clsHnd, method, sig, simdBaseJitType, retType, simdSize); | ||
case InstructionSet_X86Base: | ||
case InstructionSet_X86Base_X64: | ||
return impX86BaseIntrinsic(intrinsic, method, sig, simdBaseJitType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This had an interesting side effect: if FeatureSIMD is turned off, X86Base.Pause will also be turned off.
src/coreclr/jit/lower.cpp
Outdated
// For local stores on XARCH we can't handle another lclVar source. | ||
// If the source was another lclVar similarly promoted, we would | ||
// have broken it into multiple stores. | ||
if (lclNode->OperIs(GT_STORE_LCL_VAR) && !lclNode->gtGetOp1()->OperIs(GT_CALL)) | ||
if (lclNode->OperIs(GT_STORE_LCL_VAR) && lclNode->gtGetOp1()->OperIs(GT_LCL_VAR)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is doing, but not doing this will cause crossgen2 failing for Utf8Formatter.TryFormat(DateTime)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try to find the root cause of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have time for this. I brought it from another PR mentioned earlier and it just works. It's better to ask the original author.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CarolEidt can you help explaining this? Not doing this isn't causing failures now, but the code looks necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huoyaoyuan - Carol doesn't work on this code base anymore. As the comment suggests, " If the source was another lclVar similarly promoted, we would have broken it into multiple stores.". So, I think you should revert this change, given that it works now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was just a means to try which nodes are caught if op1 != GT_CALL
. So the reason it is failing (as expected) is because the tree now contains the op1 == GT_HWINTRINSIC
.
┌──▌ t10 int
├──▌ t13 int
├──▌ t58 int
N007 ( 8, 9) [000014] ----------- t14 = ▌ HWINTRINSIC struct int DivRem $140
┌──▌ t14 struct
N009 ( 18, 16) [000017] MA--------- ▌ STORE_LCL_VAR struct<System.ValueTuple`2[System.Int32, System.Int32], 8>(P) V05 tmp3
▌ int V05.Item1 (offs=0x00) -> V13 tmp11
▌ int V05.Item2 (offs=0x04) -> V14 tmp12
We do want to enregister in such cases and not doing it (retaining op1 != GT_CALL
) spills the result on stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same code on xarch?
Yes and no. My suggestion includes the (significant) addition that we will DNER all locals, not just promoted ones. It is to prevent the case I described above from arising: struct retyped as long = multi-reg-node
and other assigns codegen doesn't support. Alternatively, we could of course support them in codegen, but that seems out of scope for this change.
Also, for the below code, do you mean to say use the current check of independent promotion too?
Yes, the independent promotion check (and the field count check as well) must stay. In fact, I think they should be expanded. Consider struct { promoted field<double> } = HWI(int, int)
and other mismatched cases. Right now they would pass, but we need to DNER them, by passing the register count to the method instead of the return type descriptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's now new failures and I don't have enough time to investigate in more depth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look hopefully next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code as per @SingleAccretion suggestion.
src/coreclr/jit/lsraxarch.cpp
Outdated
// DIV implicitly put op1(lower) to EAX and op2(upper) to EDX | ||
srcCount += BuildOperandUses(op1, RBM_EAX); | ||
srcCount += BuildOperandUses(op2, RBM_EDX); | ||
srcCount += BuildOperandUses(op3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly does RMW mean? Something like ADD src, dst
that src
is 1 register represented by both operand and result?
I tried BuildDelayFreeUses
but causes register conflict in optimized code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly does RMW mean? Something like ADD src, dst that src is 1 register represented by both operand and result?
Correct.
I tried
BuildDelayFreeUses
but causes register conflict in optimized code.
Do you have an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example?
I did false assert when op3 is spilled onto stack, the regNum was remained as the register it was spilled from. This shouldn't be an issue now.
@@ -317,13 +317,25 @@ public static int DivRem(int a, int b, out int result) | |||
// Restore to using % and / when the JIT is able to eliminate one of the idivs. | |||
// In the meantime, a * and - is measurably faster than an extra /. | |||
|
|||
if (X86Base.IsSupported) | |||
{ | |||
(int quitient, result) = X86Base.DivRem((uint)a, a >> 31, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There lacks a way to represent the CDQ
/CQO
instruction. Is there any way better than >> 31
?
@@ -1261,6 +1261,20 @@ private static readonly (string templateFileName, Dictionary<string, string> tem | |||
("ScalarTernOpBinResTest.template", new Dictionary<string, string> { ["Isa"] = "Bmi2.X64", ["Method"] = "MultiplyNoFlags", ["RetBaseType"] = "UInt64", ["Op1BaseType"] = "UInt64", ["Op2BaseType"] = "UInt64", ["Op3BaseType"] = "UInt64", ["NextValueOp1"] = "UInt64.MaxValue", ["NextValueOp2"] = "UInt64.MaxValue", ["NextValueOp3"] = "0", ["ValidateResult"] = "ulong expectedHigher = 18446744073709551614, expectedLower = 1; isUnexpectedResult = (expectedHigher != higher) || (expectedLower != lower);" }), | |||
}; | |||
|
|||
private static readonly (string templateFileName, Dictionary<string, string> templateData)[] X86BaseInputs = new [] | |||
{ | |||
("ScalarTernOpTupleBinRetTest.template", new Dictionary<string, string> { ["Isa"] = "X86Base", ["Method"] = "DivRem", ["RetBaseType"] = "Int32", ["Op1BaseType"] = "UInt32", ["Op2BaseType"] = "Int32", ["Op3BaseType"] = "Int32", ["NextValueOp1"] = "UInt32.MaxValue", ["NextValueOp2"] = "-2", ["NextValueOp3"] = "-0x10001", ["ValidateResult"] = " int expectedQuotient = 0xFFFF; int expectedReminder = -2; isUnexpectedResult = (expectedQuotient != ret1) || (expectedReminder != ret2);" }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests were picked to verify correct signed-ness is used.
Code gen for sample method: public static ulong XL(ulong a, ulong b)
{
(ulong q, ulong r) = Math.DivRem(a, b);
return q + r;
} G_M52626_IG01:
mov qword ptr [rsp+10H], rdx
;; bbWeight=1 PerfScore 1.00
G_M52626_IG02:
xor edx, edx
mov rax, rcx
div rdx:rax, qword ptr [rsp+10H]
add rax, rdx
;; bbWeight=1 PerfScore 61.75
G_M52626_IG03:
ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code: 19 In one commit it uses R8 instead of stack, but fails with register confliction in optimized code. It can be worse if public long DivRemInt64(long a, long b)
{
(long q, long r) = Math.DivRem(a, b);
return q + r;
} It saves a copy to stack: mov [rsp+10],rdx
sar rdx,3F
mov rax,[rsp+10]
idiv r8
add rax,rdx
ret
; Total bytes of code 21 And causes micro benchmark regresses. But real world sample can differ by register assignation. |
Do I need to implement for Mono in this PR or create an issue to track it? This is an already implemented ISA and used in corelib. Lack of implementation would cause existing tests to fail. /cc @vargaz |
The issue is there's two different libraries in the MONO_PATH both called CoreLib.dll - the "real" one and the fake one that the testsuite adds in order to allow the tests to call APIs that are nto in System.Runtime. That is very likely to confuse the AOT compiler. It would be much better if the fake one didn't end up in the directory for the test case. I think it's worth investigating if the fake one can be a reference assembly, instead. But I'm not sure even that would be enough. I'll try to build this PR (or #80297, which uses the same trick) locally to see if there's some way to make the AOT compiler happy |
Left a comment on the other PR that does the same fake corelib trick with one solution: #80297 (comment) Unfortunately that solution just exposed an issue where the new intrinsic created some condition that the AOT compiler did not expect to see. But maybe it'll work on this PR. Here's the commit (for the other PR) that shows the approach lambdageek@347d943 |
Make the DivRem tests reference the DivRem fake CoreLib as reference assembly; and ensure that it is not included as a ProjectReference by the toplevel HardwareIntrinsics merged test runners. The upshot is that the DivRem tests can call the extra APIs via a direct reference to CoreLib (instead of through System.Runtime), but the fake library is not copied into any test artifact directories, and the Mono AOT compiler never sees it.
|
||
RefPosition* op3RefPosition; | ||
srcCount += BuildDelayFreeUses(op3, op1, RBM_NONE, &op3RefPosition); | ||
if ((op3RefPosition != nullptr) && !op3RefPosition->delayRegFree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to also set it as delay free for op3
because we don't track delayFree
with regards to a specific operand, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. We just want to make sure that op3
is marked as delayFree
so neither of op1
or op2
's registers conflict with that of op3
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have tracking issues covering the mono work and any work required to use this from Math.DivRem before merging.
Created: |
Failure is known #81123 |
@@ -0,0 +1,76 @@ | |||
<?xml version="1.0" encoding="utf-8"?> | |||
<!-- https://learn.microsoft.com/en-us/dotnet/fundamentals/package-validation/diagnostic-ids --> | |||
<Suppressions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have all these compatibility suppressions instead of adding these API to ref assemblies?
<AllowUnsafeBlocks>true</AllowUnsafeBlocks> | ||
<OutputType>Library</OutputType> | ||
<CLRTestKind>SharedLibrary</CLRTestKind> | ||
<AssemblyName>System.Private.CoreLib</AssemblyName> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referencing CoreLib from the tests is anti-pattern. Tests should only reference public surface. If the tests need to hack on internal implementation details, they should use private reflection. Why are we doing this?
The structure introduce by this PR is very non-standard and full of anti-patterns. I am tempted to revert it so that it can be done properly. @ViktorHofer @akoeplinger Thoughts? |
Why not? If you are worried that the Mono implementation is not in place yet, it would be fine to add a temporary non-intrinsic implementation under MONO ifdef. It would be very local change and it would avoid spreading the CoreLib hacks through many places. |
I agree and I too do not like these hacks. Unfortunately, we had to go this route because there are few things, we need to optimize in order to consume it in |
We also had planned to do similar thing in #80297 since the new APIs that I am adding are not yet approved, but I decided to rather wait for API approval than to merge such anti-patterns of consuming them in test projects. |
It is common to add new public APIs with simpler implementation with less-than-ideal performance first, and then optimize the implementation in subsequent PRs. We have done that many times. For example, #77799 added initial implementation of FrozenColllections and then number of subsequent PRs optimized it further, and I am sure we will get some more optimizations before .NET 8 ships. We should follow the same pattern for codegen intrinsics. If you are really worried about the API being used accidentally before it is ready for prime time, you can slap |
Yes, please do wait for the API shape to be approved first before merging the implementation. |
I see. Ok, I will submit a PR to revert part of the hacks and add |
Just noting these are hardware intrinsics and providing a software fallback is itself problematic and against the general design goal of the APIs and can lead users into a pit of failure. If one is provided, it needs to very explicitly still throw where the underlying platform is not x86/x64 and when That being said, exposing a platform specific intrinsic API publicly without it being usable for its intended purpose is likewise problematic and leads users into a different pit of failure. This is why I originally pushed for the general support to be completed and for the API to be used from the primary consumption sites (such as I only relented on that with the premise that these would not be public until after the other work was completed. So if we are reverting this, then any new PR should likely just complete the additional work so we don't risk shipping something that isn't actually functioning the way users expect. |
I was just planning to do the following:
Do we agree that it is the best course or should we completely revert this PR, address the codegen issues and likely close #82194? I am fine with either solution. |
Yes, I think that it is the best course of action. |
|
Glad to see this finally get merged! |
Closes #27292.
Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.
Not implementing for Mono because I can't understand the code, and this one has special register constraints.