CUDA 11.3+ Register Usage #560
Comments
CUDA 11.4 also uses 170 reg/thread.
The experimental shared memory curve implementation (
11.3 and 11.2 both use 65 reg/thread.
CUDA 11.5 with seatbelts=OFF uses 168 reg/thread for SM 70; with seatbelts=ON, SM 70 uses 218 reg/thread. The shared memory curve will still be required to improve performance.
CUDA 11.7 with seatbelts=OFF uses 145 reg/thread for SM_86 and 160 reg/thread for SM_70, so register usage is still poor.
Register usage for CUDA 11.3 appears to be significantly higher than in previous CUDA versions, especially for the iteration kernel in boids bruteforce.
This is probably worth promoting to an nvbug report.
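To put the register counts in this thread in context, the following is a rough sketch of how registers per thread limit the number of resident warps per SM. It is not taken from the project: the function names are hypothetical, the hardware limits are the assumed values for compute capability 7.0 (65536 registers and 64 warps per SM, registers allocated per warp in units of 256), and block size, shared memory and sub-partition granularity are all ignored, so treat the output as an upper bound rather than a measured occupancy.

```cpp
// Simplified register-limited occupancy estimate (assumed SM 70 limits).
// All names here are illustrative, not part of any CUDA or project API.
constexpr int kRegistersPerSM = 65536;  // assumed: 32-bit registers per SM (CC 7.0)
constexpr int kMaxWarpsPerSM  = 64;     // assumed: max resident warps per SM (CC 7.0)
constexpr int kWarpSize       = 32;
constexpr int kRegAllocUnit   = 256;    // assumed: per-warp register allocation granularity

// Registers reserved for one warp, rounded up to the allocation unit.
constexpr int registersPerWarp(int regsPerThread) {
    return ((regsPerThread * kWarpSize + kRegAllocUnit - 1) / kRegAllocUnit) * kRegAllocUnit;
}

// Upper bound on resident warps per SM when registers are the only limiter.
constexpr int registerLimitedWarps(int regsPerThread) {
    return (kRegistersPerSM / registersPerWarp(regsPerThread)) < kMaxWarpsPerSM
               ? (kRegistersPerSM / registersPerWarp(regsPerThread))
               : kMaxWarpsPerSM;
}

// The same bound expressed as a fraction of the maximum resident warps.
constexpr double registerLimitedOccupancy(int regsPerThread) {
    return static_cast<double>(registerLimitedWarps(regsPerThread)) / kMaxWarpsPerSM;
}
```

Under these assumptions, `registerLimitedWarps(65)` gives 28 warps per SM (about 44% of the maximum), while `registerLimitedWarps(170)` gives 11 (about 17%).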
Steps to reproduce
Or to generate a profile:
The above results are from builds for SM 70 as of 70c2e17, although the results should be the same when using c3524e6, which only adds verbose ptxas output so that profiling is not required.
Enabling LTO brings it down a little, but not significantly (~156).
When built for SM 61 instead, 162 registers are used.
CUDA 11.3 introduces a way to dump the device callgraph at link time (the following cmake). This doesn't provide any useful information, just showing that the kernel is using 170 reg/thread (its 2 sub-calls both use < 30 reg, so it's not a sub-call issue).
By commenting out sections of the inputdata method in examples/boids_bruteforce/src/main.cu, some more insight can be gained into why this kernel has higher register use, and where the main source of difference comes from.
An experimental build of the boids_spatial3D model in the rdc_off branch, which builds without relocatable device code, uses 157 registers/thread rather than 170, so RDC has an impact but the register use regression is not RDC specific.
This uses the following to enable compiler output of register use