Optimized conversions between `Half` and `Single`. #81632
Tagging subscribers to this area: @dotnet/area-system-numerics
|
Thanks for the contribution!
This really needs perf numbers to be provided, as well as comments elaborating what the code is doing. SIMD code is not necessarily self-explanatory, and any future readers of the code will not want to spend 20-30 minutes deciphering each step ;)
I also expect this can be simplified down quite a bit. The vast majority of what's being done is basic arithmetic and potentially something we can recognize some patterns around and simply improve the JIT, so that the general case is faster and we don't need specialized intrinsic paths. Providing a disassembly diff between the SIMD and fallback paths would help in seeing where the gaps exist.
Added some comments describing algorithm.
…plicit operator Half(float value)`
Can you also post current/updated numbers to this PR. It is important we have up to date information with regards to the benefit vs complexity. Ideally this would be provided in identically laid out There is quite a bit of extra complexity here, with the general code being much harder to read/reason about and some of the up front cost could be removed in other ways (e.g. using If there are some simple bottlenecks in the existing code, it would be preferred to try to optimize those in the JIT as well and with the new |
Can you provide more information as to the failures you're seeing and the process you're following? The general workflow docs for building and testing are here: https://github.com/dotnet/runtime/tree/main/docs/workflow
@MineCake147E Do docs in the dotnet/performance repo help? |
@danmoseley Thank you for your information! |
The lack of fast (ideally hardware-accelerated) FP16 (Half) conversions has been one of the largest limitations in .NET in recent years for my machine learning work. Neural net weight files often consist of hundreds of millions or billions of parameters in FP32 that need to be converted to execute in FP16 on the GPU. It's slow (bandwidth limited) to upload in FP32 and then convert on the GPU, hence the need to do this quickly in .NET. As I understand from Tanner, there is no hardware support planned for the near future. Two reasons to consider accepting this PR:
|
Yes, because you need to compare "before" vs "after" and the "after" only exists in a local build of the dotnet/runtime (e.g. includes this PR).
Yes. We generally want most perf driven changes to have a corresponding benchmark so performance can be tracked over time.
Yes, the speedup is large but it comes at the cost of a lot of complexity and if we actually care about performance then we'd likely want to implement this using This software path would then remain for hardware without My biggest concern is then the complexity around it and I'm very much interested in whether the same "tricks" can be handled another way. The current perf difference is largely due to the branching the current implementation has and, particularly with the |
* Turned some magic numbers into constants
…-conversion-half-float
If there is any need for exhaustive tests, I'll run them.
We do not need "exhaustive" tests covering every possible conversion. We only need to test a few common cases and the known interesting scenarios. We should have tests covering most of those already. |
Here are the latest numbers, measured with the benchmarks proposed in dotnet/performance#2950.
Comparison
summary: No Slower results for the provided threshold = 1% and noise filter = 0.3 ns.
Before
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-IQKHZZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
After
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19045.2965)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-UTBGNX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
WarmupCount=1
|
…-conversion-half-float
I am not going to lie, I am not an expert in numerics. But since both Half and Single are relatively small, I've written a small app that emits all possible inputs for casting and compares the old vs new result. It's available here. Note: I know it could be optimized, I just did not want to introduce a bug.
The tests have passed, the new code produces exactly the same output for the same input.
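The "enumerate every input and compare two implementations" idea above can be sketched in a few lines. This is a minimal illustration in Python, not the reviewer's app and not the PR's C# code: `half_bits_to_single_bits` is a hypothetical hand-written decoder, and Python's `struct` module (format `e`, IEEE 754 binary16) stands in as the trusted reference. NaN payloads are not compared bit for bit, only that NaN stays NaN, which is an assumption of this sketch.

```python
import math
import struct


def half_bits_to_single_bits(h: int) -> int:
    """Hypothetical hand-written decoder: binary16 bits -> binary32 bits."""
    sign = (h & 0x8000) << 16
    exp = (h >> 10) & 0x1F
    frac = h & 0x03FF
    if exp == 0x1F:                      # infinity or NaN
        return sign | 0x7F800000 | (frac << 13)
    if exp == 0:
        if frac == 0:                    # signed zero
            return sign
        # Subnormal half: shift the leading 1 up to the implicit-bit position.
        shift = 11 - frac.bit_length()
        frac = (frac << shift) & 0x03FF
        exp = 1 - shift
    # Rebias the exponent (127 - 15 = 112) and widen the fraction by 13 bits.
    return sign | ((exp + 112) << 23) | (frac << 13)


def reference_half_to_float(h: int) -> float:
    """Reference decode using the struct module's binary16 support."""
    return struct.unpack("<e", struct.pack("<H", h))[0]


def float_to_single_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_single_nan(bits: int) -> bool:
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def count_mismatches() -> int:
    """Compare both decoders on all 65,536 possible Half bit patterns."""
    mismatches = 0
    for h in range(1 << 16):
        expected = reference_half_to_float(h)
        actual = half_bits_to_single_bits(h)
        if math.isnan(expected):
            ok = is_single_nan(actual)   # payload bits are not compared
        else:
            ok = actual == float_to_single_bits(expected)
        if not ok:
            mismatches += 1
    return mismatches
```

Covering the Half to Single direction needs only 65,536 iterations. The Single to Half direction has 2^32 inputs, which is still feasible in compiled code but slow in an interpreted sketch like this one.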
Moreover, I've locally synced your branch and removed the aggressive inlining attribute (please see my comment) and benchmarked it by using the benchmarks from dotnet/performance#2950
For all cases except `NaN`, the performance has improved. For NaNs the regression is acceptable (it's very small and I would not expect NaN to be a common input).
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
LaunchCount=3
Method | Job | value | Mean | Ratio |
---|---|---|---|---|
HalfToSingle | PR | 12344 | 1.8440 ns | 1.01 |
HalfToSingle | main | 12344 | 1.8201 ns | 1.00 |
SingleToHalf | PR | NaN | 1.3870 ns | 1.51 |
SingleToHalf | main | NaN | 0.9199 ns | 1.00 |
SingleToHalf | PR | 6.097555E-05 | 1.3737 ns | 0.37 |
SingleToHalf | main | 6.097555E-05 | 3.7275 ns | 1.00 |
SingleToHalf | PR | 12345 | 1.3651 ns | 0.53 |
SingleToHalf | main | 12345 | 2.6000 ns | 1.00 |
HalfToSingle | PR | 6.1E-05 | 1.7831 ns | 0.79 |
HalfToSingle | main | 6.1E-05 | 2.2691 ns | 1.00 |
SingleToHalf | PR | 65520 | 1.3917 ns | 0.67 |
SingleToHalf | main | 65520 | 2.0886 ns | 1.00 |
HalfToSingle | PR | NaN | 1.8133 ns | 1.14 |
HalfToSingle | main | NaN | 1.5873 ns | 1.00 |
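As a reading aid for the table above, the Ratio column can be approximated as the PR mean divided by the main (baseline) mean, so values below 1.00 mean the PR is faster. The sketch below copies the means from the table and recomputes the ratios. BenchmarkDotNet derives its Ratio from the measured distributions, so exact agreement is not guaranteed in general; for these rows the simple quotient matches to two decimals.

```python
# (method, input value) -> (PR mean ns, main mean ns, reported Ratio),
# copied from the table above.
rows = {
    ("HalfToSingle", "12344"): (1.8440, 1.8201, 1.01),
    ("SingleToHalf", "NaN"): (1.3870, 0.9199, 1.51),
    ("SingleToHalf", "6.097555E-05"): (1.3737, 3.7275, 0.37),
    ("SingleToHalf", "12345"): (1.3651, 2.6000, 0.53),
    ("HalfToSingle", "6.1E-05"): (1.7831, 2.2691, 0.79),
    ("SingleToHalf", "65520"): (1.3917, 2.0886, 0.67),
    ("HalfToSingle", "NaN"): (1.8133, 1.5873, 1.14),
}


def approximate_ratio(pr_ns: float, main_ns: float) -> float:
    """PR mean over baseline mean, rounded like the Ratio column."""
    return round(pr_ns / main_ns, 2)


for (method, value), (pr_ns, main_ns, reported) in rows.items():
    computed = approximate_ratio(pr_ns, main_ns)
    print(f"{method}({value}): computed {computed:.2f}, reported {reported:.2f}")
```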
@MineCake147E thank you for your contribution! As soon as you apply my suggestion I am going to approve and merge the PR.
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
LGTM, again thank you for your contribution @MineCake147E !
I am wondering why the |
I don't know and I am sorry because I currently don't have the free cycles to run a dedicated investigation for that. |
Replaced both `public static explicit operator Half(float value)` and `public static explicit operator float(Half value)` with a new algorithm.
I have no idea how to properly test these codes on my PC. Build always fails, saying that multiple files are missing.
I was wrong. CMake hasn't been PATHed properly, and it was guided to a wrong path of `cl.exe`.
It passed the test `System.Tests.HalfTests`.
If merged, closes #69667.
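For readers unfamiliar with what a Single to Half conversion has to handle, here is a minimal scalar sketch in Python. It is not the algorithm from this PR, which is not reproduced on this page; it is a plain branching version that assumes IEEE 754 round-to-nearest-even, overflow to infinity, and NaNs quieted with their top payload bits kept. The names `single_bits_to_half_bits` and `float_to_single_bits` are hypothetical helpers introduced for illustration.

```python
import struct


def float_to_single_bits(x: float) -> int:
    """Helper for the examples: binary32 bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def single_bits_to_half_bits(f: int) -> int:
    """Sketch: binary32 bits -> binary16 bits, round-to-nearest-even."""
    sign = (f >> 16) & 0x8000
    exp = (f >> 23) & 0xFF
    frac = f & 0x007FFFFF
    if exp == 0xFF:                      # infinity or NaN
        if frac == 0:
            return sign | 0x7C00
        return sign | 0x7E00 | (frac >> 13)   # assumption: quiet the NaN
    e = exp - 127 + 15                   # rebias the exponent for Half
    if e >= 0x1F:                        # too large: overflow to infinity
        return sign | 0x7C00
    if e <= 0:                           # result is subnormal or zero
        if e < -10:                      # below half the smallest subnormal
            return sign
        mant = frac | 0x00800000         # restore the implicit leading 1
        shift = 14 - e                   # 13 fraction bits plus denormal shift
        half = mant >> shift
        rem = mant & ((1 << shift) - 1)
        halfway = 1 << (shift - 1)
        if rem > halfway or (rem == halfway and (half & 1)):
            half += 1                    # may carry into the smallest normal
        return sign | half
    half = (e << 10) | (frac >> 13)
    rem = frac & 0x1FFF                  # the 13 bits being dropped
    if rem > 0x1000 or (rem == 0x1000 and (half & 1)):
        half += 1                        # a carry may reach infinity (0x7C00)
    return sign | half


# 1.0f maps to the Half bit pattern 0x3C00.
assert single_bits_to_half_bits(float_to_single_bits(1.0)) == 0x3C00
```

Each branch in this sketch corresponds to one of the input classes used in the benchmarks above: normal values, values that land in the subnormal range, values that overflow, and NaN.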