CUDA 11.3+ Register Usage #560

Closed
ptheywood opened this issue Jun 17, 2021 · 4 comments · Fixed by #846
Comments

ptheywood commented Jun 17, 2021

Register usage with CUDA 11.3 appears to be significantly higher than with previous CUDA versions, especially for the iteration kernel in boids_bruteforce.

This is probably worth promoting to an nvbug report.

Steps to reproduce

git clone git@github.com:FLAMEGPU/FLAMEGPU2.git
cd FLAMEGPU2
git checkout c3524e6
# Ensure the correct CUDA version to check is on the path / use module load
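# Configure from a separate build directory so that `cmake ..` finds the source tree
mkdir -p build && cd build
cmake .. -DCUDA_ARCH=70 -DSEATBELTS=OFF -DUSE_NVTX=ON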
make -j 8 flamegpu2 boids_bruteforce
# Inspect the values generated for `Z22agent_function_wrapperI14inputdata_impl...` 
# e.g. ptxas info    : Used 170 registers, 408 bytes cmem[0], 4 bytes cmem[2]

Or to generate a profile:

ncu --set=full -f -o 11-x ./bin/linux-x64/Release/boids_bruteforce -t -s 1

| CUDA Version | Reg/thread |
|---|---|
| 11.0 | 60 |
| 11.1 | 60 |
| 11.2 | 70 |
| 11.3 | 170 |

The results above are from builds targeting SM 70 at commit 70c2e17. Commit c3524e6 should give the same results; it only adds verbose ptxas output, so that profiling is not required.

Enabling LTO brings it down a little, but not significantly (to roughly 156).

When built for SM 61 instead, 162 registers are used.

CUDA 11.3 introduces a way to dump the device callgraph at link time, enabled with the CMake snippet below. This doesn't provide any useful extra information: it just shows that the kernel uses 170 reg/thread. Its 2 sub-calls both use < 30 registers, so it's not a sub-call issue.

    add_link_options("$<DEVICE_LINK:SHELL:-Xnvlink -dump-callgraph>")

By commenting out sections of the inputdata method in examples/boids_bruteforce/src/main.cu, some more insight can be gained into why this kernel has higher register use, and where the main source of the difference comes from.

| Commenting state | CUDA 11.0 (reg/thread) | CUDA 11.3 (reg/thread) |
|---|---|---|
| Just a return statement | 8 | 8 |
| Message loop commented out | 46 | 46 |
| Message loop with no body | 56 | 112 |
| Message loop with just getVariable | 40 | 142 |
| Perceived count being updated | 50 | 160 |
| Global velocity being updated in loop | 50 | 160 |
| Fully enabled (collision update/check) | 60 | 170 |

An experimental build of the boids_spatial3D model in the rdc_off branch, which builds without relocatable device code (RDC), uses 157 registers/thread rather than 170. RDC therefore has an impact, but the register use regression is not RDC-specific.

This uses the following to enable compiler output of register use:

add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:SHELL:-Xptxas -v>")
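
With verbose ptxas output enabled, the register count can be pulled out of a build log rather than read by eye. The following is a minimal sketch and not part of FLAMEGPU2: the helper name `extract_registers` is hypothetical, and the sample input is the `ptxas info` line quoted earlier in this issue.

```shell
# Hypothetical helper (not part of FLAMEGPU2): read ptxas -v output on stdin
# and print the register count from every "Used N registers" line.
extract_registers() {
  sed -n 's/.*Used \([0-9][0-9]*\) registers.*/\1/p'
}

# Sample input: the ptxas line quoted in this issue for CUDA 11.3.
sample='ptxas info    : Used 170 registers, 408 bytes cmem[0], 4 bytes cmem[2]'
regs=$(printf '%s\n' "$sample" | extract_registers)
echo "registers per thread: $regs"
```

In a real build, the same function could be fed from the build output, for example `make boids_bruteforce 2>&1 | extract_registers`, assuming the ptxas lines reach one of those streams. It prints one number per kernel, so the kernel of interest still has to be identified from the surrounding log.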
@ptheywood

CUDA 11.4 also uses 170 reg/thread.

ptheywood commented Jun 30, 2021

The experimental shared memory curve implementation (branch cineca-experimental-smcurve) appears to bring register usage back down to much saner levels.
On that branch, boids_bruteforce uses only 64 reg/thread with CUDA 11.4 (unsure why cmem is still being reported):

ptxas info    : Used 64 registers, 440 bytes cmem[0]

CUDA 11.3 and 11.2 both use 65 reg/thread on that branch.

@ptheywood changed the title from "CUDA 11.3 Register Usage" to "CUDA 11.3+ Register Usage" on Jul 28, 2021
@ptheywood added this to the v2.0.0-alpha.N milestone on Aug 11, 2021

ptheywood commented Oct 21, 2021

CUDA 11.5 with SEATBELTS=OFF uses 168 reg/thread for SM 70.

With SEATBELTS=ON, SM 70 uses 218 reg/thread.
With SEATBELTS=ON, SM 61 uses 175 reg/thread.

The shared memory curve implementation will still be required to improve performance.
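
To give a rough sense of why these register counts matter for performance, the sketch below estimates an upper bound on resident threads per SM from register pressure alone. It assumes a register file of 65536 32-bit registers and a limit of 2048 resident threads per SM (the documented figures for compute capability 7.0), and it ignores register allocation granularity, block size and shared memory, so real occupancy will be lower. The function name `max_threads_by_regs` is hypothetical.

```shell
# Assumed hardware limits for compute capability 7.0 (SM 70).
REGS_PER_SM=65536
MAX_THREADS_PER_SM=2048

# Hypothetical helper: upper bound on resident threads per SM for a given
# register count per thread, considering the register file size only.
max_threads_by_regs() {
  regs_per_thread=$1
  limit=$(( REGS_PER_SM / regs_per_thread ))
  # The hardware thread limit caps the result for low register counts.
  if [ "$limit" -gt "$MAX_THREADS_PER_SM" ]; then
    limit=$MAX_THREADS_PER_SM
  fi
  echo "$limit"
}

# Register counts reported in this issue for SM 70.
for regs in 60 168 218; do
  echo "$regs reg/thread -> at most $(max_threads_by_regs "$regs") threads/SM"
done
```

Under these assumptions, going from 60 to 168 reg/thread cuts the bound from 1092 to 390 threads per SM, which is why lowering register use is expected to help.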

@ptheywood

CUDA 11.7 with SEATBELTS=OFF uses 145 reg/thread for SM 86 and 160 reg/thread for SM 70, so still poor.

@Robadob mentioned this issue on May 18, 2022