Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Accelerate DGL csr neighbor sampling #13588

Merged
merged 24 commits into from
Dec 19, 2018

Conversation

BullDemonKing
Copy link
Contributor

Description

The DGL csr neighbor sampling has many hashtable lookups. Hashtable lookups turn out to be expensive operations. This PR tries to accelerate sampling by reducing hashtable lookups.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@BullDemonKing
Copy link
Contributor Author

@aksnzhy could you please show the benchmark result?

@aksnzhy
Copy link
Contributor

aksnzhy commented Dec 9, 2018

The benchmark result are as follows:


-------------------------------------------------------Uniform Benchmark---------------------------------------------------------
time: 0.10378909111 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 324203
time: 0.0584669113159 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 651667
time: 0.0640590190887 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1302931
time: 0.110497951508 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2599568
time: 0.19756603241 (s) thread: 32 seed size: 16000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5204015
time: 0.202709913254 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5319358
time: 0.373208999634 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10514977
time: 0.776125907898 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20600452
time: 1.31858420372 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39975706
time: 1.91078400612 (s) thread: 32 seed size: 16000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 61360893
-----------------------------------------------------Non-Uniform Benchmark-------------------------------------------------------
time: 0.0534448623657 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 322616
time: 0.0302369594574 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 646296
time: 0.0457458496094 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1301642
time: 0.107753992081 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2607545
time: 0.265137910843 (s) thread: 32 seed size: 16000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5199199
time: 0.300200939178 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5334001
time: 0.521604061127 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10505232
time: 0.759475946426 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20589664
time: 1.37040305138 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39789737
time: 2.38633799553 (s) thread: 32 seed size: 16000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 68132821

@BullDemonKing
Copy link
Contributor Author

BullDemonKing commented Dec 9, 2018

The improvement in this PR has about 2-3 times speedup over the implementation in the master branch.
The speed in the master branch:

-------------------------------------------------------Uniform Bechmark---------------------------------------------------------
time: 0.0651424407959 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 325274
time: 0.07034907341 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 647972
time: 0.130498409271 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1300821
time: 0.251270961761 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2598921
time: 0.453542613983 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5276791
time: 0.940950870514 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10515177
time: 1.6341050148 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20668389
time: 2.81761116982 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39889110
-----------------------------------------------------Non-Uniform Bechmark-------------------------------------------------------
time: 0.0410866260529 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 323537
time: 0.0525182723999 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 649116
time: 0.108578252792 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1301567
time: 0.210351467133 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2599414
time: 0.483125591278 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5334587
time: 0.998479986191 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10499253
time: 1.81847443581 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20587578
time: 3.07297439575 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39759114

The speed in this PR:

-------------------------------------------------------Uniform Bechmark---------------------------------------------------------
time: 0.0255964756012 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 323513
time: 0.0243901729584 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 652299
time: 0.0369716644287 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1299192
time: 0.0984917640686 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2597010
time: 0.172056150436 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5318903
time: 0.357382392883 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10521593
time: 0.554566431046 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20684741
time: 1.08307261467 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39896653
-----------------------------------------------------Non-Uniform Bechmark-------------------------------------------------------
time: 0.0144423961639 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 325661
time: 0.0177308082581 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 651627
time: 0.0373013973236 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1299527
time: 0.107610464096 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2601329
time: 0.231380796432 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5303039
time: 0.430892515182 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10528228
time: 0.779708433151 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20552607
time: 1.36538419724 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39731365

@roywei
Copy link
Member

roywei commented Dec 12, 2018

@BullDemonKing Thanks for the contribution! could you take a look at failed tests?
@zheng-da @eric-haibin-lin could you take a look?

@roywei
Copy link
Member

roywei commented Dec 12, 2018

@mxnet-label-bot add[Operator, pr-awaiting-review]

@marcoabreu marcoabreu added Operator pr-awaiting-review PR is waiting for code review labels Dec 12, 2018
std::queue<ver_node> node_queue;
std::unordered_set<dgl_id_t> sub_ver_mp;
std::vector<std::pair<dgl_id_t, dgl_id_t> > sub_vers;
sub_vers.reserve(num_seeds * 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 10 a general good constant for sampling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number is used for reserving memory, so it doesn't have to be very accurate. The goal here is to balance memory reallocation and memory consumption in std::vector. If num_hop is 1, 10 might be good enough; if num_hop is 2, 10 will be too small. But I don't feel that we need to make it overcomplex. After all, we only want to reduce the overhead of memory reallocation in std::vector.

@aksnzhy
Copy link
Contributor

aksnzhy commented Dec 16, 2018

It looks really good for this pr.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments about documentation, which I think is ok to improve in later PR.

}
for (dgl_id_t i = num_vertices+1; i <= max_num_vertices; ++i) {
for (size_t i = num_vertices+1; i <= max_num_vertices; ++i) {
indptr_out[i] = indptr_out[i-1];
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind also enhancing the documentation for the operator? For example, the number of outputs and what each output is for. For example, -1 will be filled for vertex id not sampled.
Also, printing the output of a.asnumpy() will be helpful

[[ 0  1  2  3  4]
 [ 5  0  6  7  8]
 [ 9 10  0 11 12]
 [13 14 15  0 16]
 [17 18 19 20  0]]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the example usage:

out = mx.nd.contrib.dgl_csr_neighbor_uniform_sample(a, seed, num_args=2, num_hops=1, num_neighbor=2, max_num_vertices=5)

I don't think num_args has to be set by the user. It should be automatically set /~https://github.com/apache/incubator-mxnet/blob/779bdc5e7ee3abd6a2d23e8bd97d47fc08ae6bc5/src/imperative/imperative_utils.h#L311-L313

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the current example seed doesn't result in -1 in the output. Maybe we want to pick some input which is more representative

// Let's check if there is a vertex that we haven't sampled its neighbors.
for (; idx < sub_vers.size(); idx++) {
if (sub_vers[idx].second < num_hops) {
LOG(WARNING)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is LOG(WARNING) working? I was not aware that this is functional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works. i can see warning messages after the number of sampled vertices exceeds the maximal number.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement

@@ -718,20 +691,37 @@ static void SampleSubgraph(const NDArray &csr,
dgl_id_t* indptr_out = sub_csr.aux_data(0).dptr<dgl_id_t>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fix indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the problem of the indent here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm. Incorrect rendering by github. looks good

@eric-haibin-lin eric-haibin-lin merged commit defa614 into apache:master Dec 19, 2018
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Speedup and fix bug in dgl_csr_sampling op

* Update dgl_graph.cc

* simplify functions.

* avoid adding nodes in the last level in the queue.

* remove a hashtable lookup in neigh_pos.

* reduce a hashtable lookup in sub_ver_mp.

* merge copying vids and layers.

* reduce hashtable lookup when writing to output csr.

* fix a bug.

* limit the number of sampled vertices.

* fix lint.

* fix a compile error.

* fix compile error.

* fix compile.

* remove one hashtable lookup per vertex and hashtable iteration.

* remove queue.

* use vector for neigh_pos.

* fix lint

* avoid init output arrays.

* fix tests.

* fix tests.

* update docs.

* retrigger

* retrigger
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Operator pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants