Non-blocking row_sparse_pull. Fix incorrect indices generated by device kvstore.row_sparse_pull #9887

eric-haibin-lin · 2018-02-26T07:20:59Z

Description

This PR adds async execution support for kv.row_sparse_pull.
The operation was blocking because it requires unique row_ids, whose shape cannot be inferred ahead of time. This PR stores the unique row_ids in a row_sparse NDArray, whose data shape can be changed at run time when executed asynchronously.
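
To illustrate why the shape of the unique row_ids cannot be known before execution, here is a minimal pure-Python sketch. It is not the kvstore code; the helper name unique_row_ids is hypothetical. Two inputs of the same length produce outputs of different lengths, so the output size depends on the values themselves.

```python
def unique_row_ids(row_ids):
    """Return the sorted distinct row ids; output length depends on the values."""
    return sorted(set(row_ids))


# Both inputs have four elements, but the number of distinct ids differs.
a = unique_row_ids([3, 1, 3, 1])
b = unique_row_ids([0, 1, 2, 3])
print(len(a), len(b))
```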

Checklist

Essentials

Passed code style checking (make lint)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Removed use_copy param in BroadcastRowSparse - this is essentially the same as calling Broadcast
Removed CopyRetainedRowsToGPU and always use SparseRetain because CopyRetainedRowsToGPU has high invocation overhead and SparseRetain has improved performance
Revised test cases to test against shapes/dtypes commonly used by users
Tested example/sparse/linear_classification with dist_sync kvstore
Tested tests/nightly/dist_sync_kvstore.py
Tested tests/python/gpu/test_kvstore.py
Tested tests/python/unittest/test_kvstore.py

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here


rahul003

Looks good. I just wanted to ask whether it is safe to use local variables for NDArrays. I see that in many places; the one I marked below is just one example. Is there a chance they can go out of scope, be freed, and cause segfaults?

rahul003 · 2018-02-28T17:56:56Z

src/kvstore/kvstore_dist.h

      }
+      CHECK_EQ(num_vals, 1) << "RowSparsePull with multiple values is not supported yet";
+      NDArray& indices = target_val_rowids[0].second;


Is it safe to use local variable ndarray?

It should be fine; they're passed by value when pushed to the engine.

eric-haibin-lin · 2018-03-01T08:06:31Z

@reminisce could you help review?

rahul003 · 2018-03-01T09:56:09Z

src/kvstore/kvstore_utils.cc

+    IType *dptr = out.data().dptr<IType>();
+    common::ParallelSort(dptr, dptr + num_elements,
+                         engine::OpenMP::Get()->GetRecommendedOMPThreadCount());
+    const size_t num_selected_out = std::unique(dptr, dptr + num_elements) - dptr;


std::unique only removes consecutive duplicates. Is that what's expected?

Yes, because the inputs are sorted first.
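
The sort-then-unique point can be sketched in pure Python. This is only an illustration of the behaviour discussed above, not the code in kvstore_utils.cc; unique_consecutive is a hypothetical helper that mimics what std::unique does to a range.

```python
def unique_consecutive(values):
    """Drop every element equal to its immediate predecessor, like std::unique."""
    out = []
    for v in values:
        if not out or out[-1] != v:
            out.append(v)
    return out


row_ids = [2, 0, 2, 1, 0]

# Without sorting, no two equal ids are adjacent, so nothing is removed.
unsorted_result = unique_consecutive(row_ids)

# After sorting, equal ids are adjacent and collapse to one each.
sorted_result = unique_consecutive(sorted(row_ids))
```

Sorting first guarantees that all copies of an id sit next to each other, which is the precondition the consecutive-duplicate removal relies on.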

reminisce

LGTM.

reminisce · 2018-03-04T06:02:41Z

tests/python/unittest/test_kvstore.py

@@ -76,7 +76,7 @@ def check_row_sparse_pull(kv, count):
        for i in range(count):
            vals.append(mx.nd.zeros(shape).tostype('row_sparse'))
            row_id = np.random.randint(num_rows, size=num_rows)
-            row_ids.append(mx.nd.array(row_id))
+            row_ids.append(mx.nd.array(row_id).reshape((2,2)))


Why reshape?
One suggestion: (2, 2) is too hard-coded. If shape is changed at the beginning of the test, the test would fail.

You are right. I made it less hard-coded by calculating the dimensions from the total number of elements.
I'm testing 2-D rowids explicitly because that seems to be a common use case: many users forget to reshape them to 1-D before passing them to kvstore. This is now supported with this PR.
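
One way to derive a 2-D shape from the element count, rather than hard-coding (2, 2), is sketched below in pure Python. The exact computation used in the updated test is not shown in this thread, so two_d_shape, reshape_2d and flatten are hypothetical helpers, and nested lists stand in for NDArrays.

```python
import math


def two_d_shape(num_elements):
    """Pick (rows, cols) with rows * cols == num_elements, rows near sqrt."""
    rows = max(math.isqrt(num_elements), 1)
    while rows > 1 and num_elements % rows != 0:
        rows -= 1
    return rows, num_elements // rows


def reshape_2d(flat, shape):
    """Split a flat list into `rows` lists of `cols` elements each."""
    rows, cols = shape
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def flatten(nested):
    """Collapse 2-D row ids back into the 1-D form."""
    return [v for row in nested for v in row]


row_id = [3, 1, 3, 0, 2, 2]
shape = two_d_shape(len(row_id))
row_id_2d = reshape_2d(row_id, shape)
round_trip = flatten(row_id_2d)
```

Because the shape is computed from len(row_id), changing the number of rows in the test no longer breaks the reshape.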


eric-haibin-lin · 2018-03-05T03:58:37Z

This PR also fixed a bug in the original GPU "Unique" implementation, where the same pointer is passed as both the input/output to cub::unique. This results in incorrect output when the number of rowids is large. Fixed by storing the input/output in separate buffers.
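
The hazard of sharing one buffer for input and output can be shown with a toy simulation in pure Python. This is an assumption-laden sketch, not CUB's algorithm and not the kernel in this PR: scatter_unique is a hypothetical two-phase unique, and iterating in reverse merely stands in for GPU threads completing in an arbitrary order.

```python
def scatter_unique(src, dst):
    """Toy two-phase unique over sorted `src`, writing results into `dst`.

    Phase 1 only reads: it flags the first element of each run and
    computes each flagged element's output offset.
    Phase 2 writes: elements are visited in reverse order to mimic
    threads that finish in an arbitrary order.
    Returns the number of unique elements.
    """
    n = len(src)
    flags = [i == 0 or src[i] != src[i - 1] for i in range(n)]
    offsets, count = [], 0
    for keep in flags:
        offsets.append(count)
        count += keep
    for i in reversed(range(n)):
        if flags[i]:
            dst[offsets[i]] = src[i]
    return count


sorted_ids = [0, 0, 1, 2, 2, 3]

# Separate buffers: reads never observe a value written by another step.
separate = [None] * len(sorted_ids)
k = scatter_unique(sorted_ids, separate)
separate_result = separate[:k]

# Same buffer as input and output: later reads see earlier writes.
aliased = list(sorted_ids)
k_alias = scatter_unique(aliased, aliased)
aliased_result = aliased[:k_alias]
```

In this toy model the aliased run overwrites ids before they have been read, so its output differs from the separate-buffer run even though the unique count is the same.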

eric-haibin-lin · 2018-03-05T04:01:31Z

@ZiyueHuang please view the changes for gpu unique

ZiyueHuang · 2018-03-05T06:34:07Z

@eric-haibin-lin Thanks for the fix!

Looks good.


eric-haibin-lin and others added 8 commits February 21, 2018 09:49

nonblocking Kvstore (#195)

846a804

* draft
* rm use_copy. fix dist kvstore. TODO: fix dtype
* fix dtype, shape
* remove reshape
* cleanup

fix compilation

4f772ea

rsp draft

eee5fa2

update param name

9a38690

doc update and small refactoring

89bdac1

minor updates

698e793

enhance test case with 2-D rowids

398e365

update gpu tests

2d42f76

eric-haibin-lin requested review from cjolivier01 and szha as code owners February 26, 2018 07:20

ZiyueHuang approved these changes Feb 27, 2018


rahul003 reviewed Feb 28, 2018


rahul003 reviewed Mar 1, 2018


reminisce reviewed Mar 4, 2018


ZiyueHuang and others added 6 commits March 5, 2018 02:43

rewrite gpu unique kernels

acee7a4

Merge branch 'kvstore-pr' of github.com:eric-haibin-lin/mxnet into kvstore-pr

299d85e

update gpu tests

af22be9

update reshape test/

489eb8b

fix lint

17c22a1

update test for py3

baea0eb

eric-haibin-lin added Bug Performance labels Mar 5, 2018

eric-haibin-lin changed the title ~~Non-blocking row_sparse_pull~~ Non-blocking row_sparse_pull. Fix incorrect indices generated by device kvstore.row_sparse_pull Mar 5, 2018

Merge remote-tracking branch 'upstream/master' into kvstore-pr

3ddfe47

eric-haibin-lin merged commit 02dd89a into apache:master Mar 5, 2018

CoinCheung added a commit to CoinCheung/incubator-mxnet that referenced this pull request Mar 18, 2018

Revert "Non-blocking row_sparse_pull. Fix incorrect indices generated by device kvstore.row_sparse_pull (apache#9887)"

6ba09ad

This reverts commit 02dd89a.

eric-haibin-lin deleted the kvstore-pr branch September 18, 2018 23:33
