[GraphBolt] Add optimized unique_and_compact_batched
#7239
Conversation
To trigger regression tests:
@TristonC is my dispatch mechanism correct, so that I can run one code path for compute capability >= 70 and another otherwise?
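A minimal sketch of such runtime dispatch on compute capability, assuming a hypothetical helper `ComputeCapabilityAtLeast70` that is not part of GraphBolt; the actual PR may structure the dispatch differently:

```cpp
// Illustrative sketch only, not the PR's code.
#include <cuda_runtime.h>
#include <cstdio>

// Returns true if the current device is Volta (sm_70) or newer.
static bool ComputeCapabilityAtLeast70() {
  int device = 0;
  cudaGetDevice(&device);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  return prop.major >= 7;
}

int main() {
  if (ComputeCapabilityAtLeast70()) {
    // Hash-table (map) based batched kernel, only enabled on sm_70 and newer.
    std::printf("dispatch: map-based unique_and_compact_batched\n");
  } else {
    // Sort-based batched fallback for older GPUs.
    std::printf("dispatch: sort-based unique_and_compact_batched\n");
  }
  return 0;
}
```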
force-pushed the unique_and_compact_batched branch from 6fea3d2 to 3fd7efd
@Rhett-Ying CI failure:
If we need to dispatch another kernel based on the GPU's compute capability in the future, similar to what we did here, we can refactor the logic in the code. Let's keep that in mind.
Merging this PR. I will monitor the regression tests to see how much it helped. Feel free to comment and suggest improvements here; I will make another PR to address them.
Description
Unique and compact has GPU synchronizations. When we call it separately for each etype, it slows down a lot. I made it batched so that the synchronizations are shared across etypes. I also added a truly batched algorithm based on hash tables (it should produce exactly the same output as the CPU version). However, the map-based code can only run on CUDA compute capability >= 70. That is why we keep the sort-based batched algorithm and enable the map-based code only on newer GPUs. To compile the newly added map code only for new GPU architectures, we create a CUDA extension library that we link to GraphBolt.
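A minimal CPU-side sketch of the hash-table idea, assuming the semantics described above (unique IDs seeded with the destination IDs first, then every ID remapped to its compact index); this is illustrative and is not the GraphBolt implementation or its API:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
  // One relation's IDs; the batched version would process all etypes together.
  std::vector<int64_t> unique_dst = {10, 42};
  std::vector<int64_t> src = {42, 7, 10, 7};

  // Map each original node ID to its compact index, destinations first.
  std::unordered_map<int64_t, int64_t> to_compact;
  std::vector<int64_t> unique_ids;
  auto insert = [&](int64_t id) {
    if (to_compact.emplace(id, unique_ids.size()).second) unique_ids.push_back(id);
  };
  for (auto id : unique_dst) insert(id);
  for (auto id : src) insert(id);

  // Remap the source IDs into the compact ID space.
  std::vector<int64_t> compact_src;
  for (auto id : src) compact_src.push_back(to_compact.at(id));

  for (auto id : unique_ids) std::cout << id << ' ';   // 10 42 7
  std::cout << '\n';
  for (auto id : compact_src) std::cout << id << ' ';  // 1 2 0 2
  std::cout << '\n';
  return 0;
}
```

In the batched version, this compaction is performed for all edge types in one pass, so the device synchronizations are paid once rather than once per etype.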
With #7264 and this PR, we should be officially faster than DGL for every use case, whether it is pure GPU, UVA, etc.
Checklist
Please feel free to remove inapplicable items for your PR.
Changes