Accelerate DGL csr neighbor sampling #13588

BullDemonKing · 2018-12-08T14:29:55Z

Description

The DGL csr neighbor sampling has many hashtable lookups. Hashtable lookups turn out to be expensive operations. This PR tries to accelerate sampling by reducing hashtable lookups.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

BullDemonKing · 2018-12-08T14:31:43Z

@aksnzhy could you please show the benchmark result?

aksnzhy · 2018-12-09T03:30:06Z

The benchmark result are as follows:


-------------------------------------------------------Uniform Benchmark---------------------------------------------------------
time: 0.10378909111 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 324203
time: 0.0584669113159 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 651667
time: 0.0640590190887 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1302931
time: 0.110497951508 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2599568
time: 0.19756603241 (s) thread: 32 seed size: 16000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5204015
time: 0.202709913254 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5319358
time: 0.373208999634 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10514977
time: 0.776125907898 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20600452
time: 1.31858420372 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39975706
time: 1.91078400612 (s) thread: 32 seed size: 16000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 61360893
-----------------------------------------------------Non-Uniform Benchmark-------------------------------------------------------
time: 0.0534448623657 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 322616
time: 0.0302369594574 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 646296
time: 0.0457458496094 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1301642
time: 0.107753992081 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2607545
time: 0.265137910843 (s) thread: 32 seed size: 16000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5199199
time: 0.300200939178 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5334001
time: 0.521604061127 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10505232
time: 0.759475946426 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20589664
time: 1.37040305138 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39789737
time: 2.38633799553 (s) thread: 32 seed size: 16000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 68132821

BullDemonKing · 2018-12-09T09:09:49Z

The improvement in this PR has about 2-3 times speedup over the implementation in the master branch.
The speed in the master branch:

-------------------------------------------------------Uniform Bechmark---------------------------------------------------------
time: 0.0651424407959 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 325274
time: 0.07034907341 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 647972
time: 0.130498409271 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1300821
time: 0.251270961761 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2598921
time: 0.453542613983 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5276791
time: 0.940950870514 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10515177
time: 1.6341050148 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20668389
time: 2.81761116982 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39889110
-----------------------------------------------------Non-Uniform Bechmark-------------------------------------------------------
time: 0.0410866260529 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 323537
time: 0.0525182723999 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 649116
time: 0.108578252792 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1301567
time: 0.210351467133 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2599414
time: 0.483125591278 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5334587
time: 0.998479986191 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10499253
time: 1.81847443581 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20587578
time: 3.07297439575 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39759114

The speed in this PR:

-------------------------------------------------------Uniform Bechmark---------------------------------------------------------
time: 0.0255964756012 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 323513
time: 0.0243901729584 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 652299
time: 0.0369716644287 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1299192
time: 0.0984917640686 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2597010
time: 0.172056150436 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5318903
time: 0.357382392883 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10521593
time: 0.554566431046 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20684741
time: 1.08307261467 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39896653
-----------------------------------------------------Non-Uniform Bechmark-------------------------------------------------------
time: 0.0144423961639 (s) thread: 32 seed size: 1000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 325661
time: 0.0177308082581 (s) thread: 32 seed size: 2000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 651627
time: 0.0373013973236 (s) thread: 32 seed size: 4000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 1299527
time: 0.107610464096 (s) thread: 32 seed size: 8000 num_hops: 1 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 2601329
time: 0.231380796432 (s) thread: 32 seed size: 1000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 5303039
time: 0.430892515182 (s) thread: 32 seed size: 2000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 10528228
time: 0.779708433151 (s) thread: 32 seed size: 4000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 20552607
time: 1.36538419724 (s) thread: 32 seed size: 8000 num_hops: 2 num neighbors: 16 max vertices: 1000000 sample_num_ver: 32000032 sample_num_edge: 39731365

roywei · 2018-12-12T00:43:38Z

@BullDemonKing Thanks for the contribution! could you take a look at failed tests?
@zheng-da @eric-haibin-lin could you take a look?

roywei · 2018-12-12T00:44:24Z

@mxnet-label-bot add[Operator, pr-awaiting-review]

aksnzhy · 2018-12-16T12:50:40Z

src/operator/contrib/dgl_graph.cc

-  std::queue<ver_node> node_queue;
+  std::unordered_set<dgl_id_t> sub_ver_mp;
+  std::vector<std::pair<dgl_id_t, dgl_id_t> > sub_vers;
+  sub_vers.reserve(num_seeds * 10);


Is 10 a general good constant for sampling?

This number is used for reserving memory, so it doesn't have to be very accurate. The goal here is to balance memory reallocation and memory consumption in std::vector. If num_hop is 1, 10 might be good enough; if num_hop is 2, 10 will be too small. But I don't feel that we need to make it overcomplex. After all, we only want to reduce the overhead of memory reallocation in std::vector.

aksnzhy · 2018-12-16T12:56:02Z

It looks really good for this pr.

eric-haibin-lin

Just some minor comments about documentation, which I think is ok to improve in later PR.

eric-haibin-lin · 2018-12-17T19:19:14Z

src/operator/contrib/dgl_graph.cc

  }
-  for (dgl_id_t i = num_vertices+1; i <= max_num_vertices; ++i) {
+  for (size_t i = num_vertices+1; i <= max_num_vertices; ++i) {
    indptr_out[i] = indptr_out[i-1];
  }
 }


Do you mind also enhancing the documentation for the operator? For example, the number of outputs and what each output is for. For example, -1 will be filled for vertex id not sampled.
Also, printing the output of a.asnumpy() will be helpful

[[ 0 1 2 3 4] [ 5 0 6 7 8] [ 9 10 0 11 12] [13 14 15 0 16] [17 18 19 20 0]]

Regarding the example usage:

out = mx.nd.contrib.dgl_csr_neighbor_uniform_sample(a, seed, num_args=2, num_hops=1, num_neighbor=2, max_num_vertices=5)

I don't think num_args has to be set by the user. It should be automatically set /~https://github.com/apache/incubator-mxnet/blob/779bdc5e7ee3abd6a2d23e8bd97d47fc08ae6bc5/src/imperative/imperative_utils.h#L311-L313

Also, the current example seed doesn't result in -1 in the output. Maybe we want to pick some input which is more representative

eric-haibin-lin · 2018-12-18T00:10:30Z

src/operator/contrib/dgl_graph.cc

+  // Let's check if there is a vertex that we haven't sampled its neighbors.
+  for (; idx < sub_vers.size(); idx++) {
+    if (sub_vers[idx].second < num_hops) {
+      LOG(WARNING)


Is LOG(WARNING) working? I was not aware that this is functional.

it works. i can see warning messages after the number of sampled vertices exceeds the maximal number.

eric-haibin-lin

Great improvement

eric-haibin-lin · 2018-12-18T05:03:34Z

src/operator/contrib/dgl_graph.cc

@@ -718,20 +691,37 @@ static void SampleSubgraph(const NDArray &csr,
  dgl_id_t* indptr_out = sub_csr.aux_data(0).dptr<dgl_id_t>();


nit: fix indentation

what is the problem of the indent here?

nvm. Incorrect rendering by github. looks good

* Speedup and fix bug in dgl_csr_sampling op * Update dgl_graph.cc * simplify functions. * avoid adding nodes in the last level in the queue. * remove a hashtable lookup in neigh_pos. * reduce a hashtable lookup in sub_ver_mp. * merge copying vids and layers. * reduce hashtable lookup when writing to output csr. * fix a bug. * limit the number of sampled vertices. * fix lint. * fix a compile error. * fix compile error. * fix compile. * remove one hashtable lookup per vertex and hashtable iteration. * remove queue. * use vector for neigh_pos. * fix lint * avoid init output arrays. * fix tests. * fix tests. * update docs. * retrigger * retrigger

aksnzhy and others added 10 commits December 8, 2018 13:18

Speedup and fix bug in dgl_csr_sampling op

a6cfce2

Update dgl_graph.cc

a33b6e0

simplify functions.

30415e9

avoid adding nodes in the last level in the queue.

3999533

remove a hashtable lookup in neigh_pos.

742a8b4

reduce a hashtable lookup in sub_ver_mp.

ffa4097

merge copying vids and layers.

e013536

fix a bug.

9cfb11c

reduce hashtable lookup when writing to output csr.

c2f99c5

limit the number of sampled vertices.

05257e6

BullDemonKing requested a review from anirudh2290 as a code owner December 8, 2018 14:29

BullDemonKing added 4 commits December 8, 2018 22:39

fix lint.

f8e4bbb

fix a compile error.

87b0039

fix compile error.

7304217

fix compile.

3c9110a

BullDemonKing added 3 commits December 9, 2018 15:47

remove one hashtable lookup per vertex and hashtable iteration.

6b4fa25

remove queue.

b85a015

use vector for neigh_pos.

9f3adf4

BullDemonKing added 2 commits December 9, 2018 17:11

fix lint

fc5f3e2

avoid init output arrays.

043c1b4

zheng-da force-pushed the accel_sampling branch from 3ee7bec to 043c1b4 Compare December 9, 2018 10:04

marcoabreu added Operator pr-awaiting-review PR is waiting for code review labels Dec 12, 2018

BullDemonKing added 2 commits December 12, 2018 17:24

fix tests.

834f56e

fix tests.

be88caa

aksnzhy reviewed Dec 16, 2018

View reviewed changes

eric-haibin-lin reviewed Dec 18, 2018

View reviewed changes

eric-haibin-lin approved these changes Dec 18, 2018

View reviewed changes

BullDemonKing added 3 commits December 18, 2018 16:34

update docs.

fa910ef

retrigger

3bed4af

retrigger

259c0f7

eric-haibin-lin merged commit defa614 into apache:master Dec 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerate DGL csr neighbor sampling #13588

Accelerate DGL csr neighbor sampling #13588

BullDemonKing commented Dec 8, 2018

BullDemonKing commented Dec 8, 2018

aksnzhy commented Dec 9, 2018 •

edited

Loading

BullDemonKing commented Dec 9, 2018 •

edited

Loading

roywei commented Dec 12, 2018

roywei commented Dec 12, 2018

aksnzhy Dec 16, 2018

BullDemonKing Dec 17, 2018

aksnzhy commented Dec 16, 2018

eric-haibin-lin left a comment

eric-haibin-lin Dec 17, 2018

eric-haibin-lin Dec 17, 2018

eric-haibin-lin Dec 17, 2018

eric-haibin-lin Dec 18, 2018

BullDemonKing Dec 18, 2018

eric-haibin-lin left a comment

eric-haibin-lin Dec 18, 2018

BullDemonKing Dec 18, 2018

eric-haibin-lin Dec 18, 2018

		@@ -718,20 +691,37 @@ static void SampleSubgraph(const NDArray &csr,
		dgl_id_t* indptr_out = sub_csr.aux_data(0).dptr<dgl_id_t>();

Accelerate DGL csr neighbor sampling #13588

Accelerate DGL csr neighbor sampling #13588

Conversation

BullDemonKing commented Dec 8, 2018

Description

Checklist

Essentials

Changes

Comments

BullDemonKing commented Dec 8, 2018

aksnzhy commented Dec 9, 2018 • edited Loading

BullDemonKing commented Dec 9, 2018 • edited Loading

roywei commented Dec 12, 2018

roywei commented Dec 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aksnzhy commented Dec 16, 2018

eric-haibin-lin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-haibin-lin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aksnzhy commented Dec 9, 2018 •

edited

Loading

BullDemonKing commented Dec 9, 2018 •

edited

Loading