Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput #15124

Merged
merged 1 commit into from
Feb 2, 2020

Conversation

anandj91
Copy link
Contributor

@anandj91 anandj91 commented Jun 1, 2019

Description

https://issues.apache.org/jira/browse/MXNET-1294

During data parallel training , the parameters of each layer is synchronized through parameter server after slicing them into smaller slices and synchronizing them independent to each other based on their priority. Currently priority of a layer is set as its negative index.

For more details, refer https://anandj.in/wp-content/uploads/sysml.pdf.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@piyushghai
Copy link
Contributor

Thanks for your contribution @anandj91. Can you look into the CI failures ?
@mxnet-label-bot Add [pr-awaiting-review, Backend].

@marcoabreu marcoabreu added Backend Issues related to the backend of MXNet pr-awaiting-review PR is waiting for code review labels Jun 4, 2019
@anandj91 anandj91 force-pushed the prio branch 5 times, most recently from 07f9818 to d85967f Compare June 6, 2019 03:21
@anandj91
Copy link
Contributor Author

anandj91 commented Jun 6, 2019

Is there a way to rerun the test cases without doing a git push?

@piyushghai
Copy link
Contributor

piyushghai commented Jun 6, 2019

Is there a way to rerun the test cases without doing a git push?

You could push an empty commit.
git commit -m "Empty commit" --allow-empty

That should trigger the CI on your PR again.

src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
@vandanavk
Copy link
Contributor

@pinaraws @anirudhacharya for review

@anandj91 anandj91 force-pushed the prio branch 2 times, most recently from 7ef001a to c64be49 Compare June 19, 2019 23:12
@anandj91
Copy link
Contributor Author

Modified the code to address the review comments. Sorry for the delay.
I have added a new flag in the config.mk to enable priority based update (USE_PRIORITY_UPDATE). the flag is disabled by default. @eric-haibin-lin Can you please tell me on how to add a new test case to build the code with this flag turned on and run the dist_kvstore test cases?

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anandj91 thanks for the update and sorry for the late reply.

Thanks!

@roywei
Copy link
Member

roywei commented Jul 8, 2019

@anandj91 gentle ping, any update on this PR?

@anandj91
Copy link
Contributor Author

anandj91 commented Jul 8, 2019

@roywei The current implementation uses multiple ThreadVar to specify dependency between push and pull between slices. After some benchmarking on large models like VGG-19, I found that this causes large overhead and the training performance reduce to 50%.

Instead I'm planning to introduce a new API for pushpull which combines push and pull of one slice. I had an offline discussion with @eric-haibin-lin and he is fine with this approach.

@anandj91 anandj91 mentioned this pull request Jul 16, 2019
6 tasks
@anandj91
Copy link
Contributor Author

This PR is waiting on #15559 for the PushPull API in KVStore.

@anandj91
Copy link
Contributor Author

I'm facing some design level challenges to properly implement Priority based update (P3) on top of PushPull API. MXNet does a simple load balancing before pushing or pulling key-values by splitting NDArrays equally to the parameter servers. P3 requires a round-robin style parameter distribution which means slicing a large NDArray into thousands of smaller ones. Much more granular than current default distribution strategy and each PS would get more than one slice.

With the way mxnet and ps-lite designed right now, ps-lite assumes a single ZPush/ZPull/ZPushPull belongs to a single layer/NDArray. It also assumes that one slice only belong to one PS. These assumption need to be broken for implementing P3. What I have done right now is to add round-robin (RR) distribution strategy along with the default one and use a boolean flag to switch between these two. When user chooses to use RR, KVStore consider each slice as separate key-value pair. Otherwise fallback to the default mode.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on readability.

What would be the user experience for an end-user if he is interested in P3?

.gitmodules Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
docs/static_site/src/pages/api/faq/env_var.md Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
docs/static_site/src/pages/api/faq/env_var.md Outdated Show resolved Hide resolved
python/mxnet/kvstore/base.py Show resolved Hide resolved
python/mxnet/kvstore/kvstore.py Outdated Show resolved Hide resolved
src/kvstore/p3store.h Outdated Show resolved Hide resolved
src/kvstore/p3store.h Outdated Show resolved Hide resolved
src/kvstore/p3store.h Outdated Show resolved Hide resolved
include/mxnet/c_api.h Show resolved Hide resolved
python/mxnet/kvstore/base.py Outdated Show resolved Hide resolved
python/mxnet/kvstore/kvstore.py Show resolved Hide resolved
python/mxnet/kvstore/kvstore.py Show resolved Hide resolved
@@ -67,6 +68,7 @@ def __init__(self, handle):
self._updater = None
self._updater_func = None
self._str_updater_func = None
self._is_p3 = (os.getenv('DMLC_PS_VAN_TYPE', '') == 'p3')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will u also change launch.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I just wanted to confirm if we are planning to only add --p3 to launch.py now and not --horovod or --byteps because of the reasons I mentioned in the previous comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a bit more on this? According to this RFC (#16795), p3 will be treated as another kvstore just like byteps/horovod. Why would it be different?

Copy link
Member

@eric-haibin-lin eric-haibin-lin Jan 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anandj91 and I had a recent discussion. Having separate python kvstore class for p3 leads to lots of duplicated code because it also supports both device and local mode. Its source code lives inside mxnet, and sharing the same python KVStore class is reasonable in terms of complexity and maintenance efforts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apeforest We do have a separate class for P3 in the backend though.

python/mxnet/model.py Show resolved Hide resolved
src/kvstore/kvstore_dist.h Outdated Show resolved Hide resolved
@@ -349,6 +363,7 @@ def pushpull(self, key, value, out=None, priority=0):

out: NDArray or list of NDArray or list of list of NDArray
Values corresponding to the keys.
`out` should have length as `value`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there such a restriction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is grouping of keys and corresponding ndarrays happening at the backend based on their positions in the list before performing kv operations. Since we are using same key list for both value and out, their layouts should be identical. length check is just for a sanity.

this is the reason why I had both vkeys and okeys separate in the previous implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. However, if there's one key, having value = a list of 4 arrays, and out = a list of one array is also valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case we need to bring back separate key list for value and out.

python/mxnet/model.py Outdated Show resolved Hide resolved
@anandj91 anandj91 force-pushed the prio branch 3 times, most recently from 0358357 to 876744b Compare January 31, 2020 17:37
P3StoreDist extends KVStoreDist class to perform
Push, Pull and PushPull based on the priority of the packets

* Use --p3 flag in launch.py script to use P3
  * DMLC_PS_VAN_TYPE - to turn on P3 with distributed kvstore
  * DMLC_PS_WATER_MARK - to set the maximum in-flight packets
    in the network queue

Note: Currently P3 does not support:
1. operations on sparse ndarrays
2. gradient compression
3. update on kvstore server
@eric-haibin-lin
Copy link
Member

@anandj91 great work!

zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
P3StoreDist extends KVStoreDist class to perform
Push, Pull and PushPull based on the priority of the packets

* Use --p3 flag in launch.py script to use P3
  * DMLC_PS_VAN_TYPE - to turn on P3 with distributed kvstore
  * DMLC_PS_WATER_MARK - to set the maximum in-flight packets
    in the network queue

Note: Currently P3 does not support:
1. operations on sparse ndarrays
2. gradient compression
3. update on kvstore server
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backend Issues related to the backend of MXNet pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants