This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[BERT] Multi-GPU Training with Tree reduce #520

Open
eric-haibin-lin opened this issue Jan 4, 2019 · 4 comments

@eric-haibin-lin
Member

eric-haibin-lin commented Jan 4, 2019

BERT_BASE and BERT_LARGE contain 110M and 340M parameters, respectively. Multi-GPU scaling is currently poor for this model, and the results show a large overhead from cross-GPU ndarray copies.

The default kvstore push/pull does not take advantage of the communication topology of the machine (e.g. an AWS p3 instance). It would be great to use the experimental tree-reduce push/pull introduced by @ctcyang.

However, the following error occurs when running with MXNET_KVSTORE_LOGTREE=1 MXNET_KVSTORE_USETREE=1:

[07:02:29] src/kvstore/./././gpu_topology.h:60: Weight:
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 2 0 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 2 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 0 0 0 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 0 0 0 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 0 2 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 2 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 2 0 2 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 2 0 2 0 0

[06:50:46] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[06:50:46] src/kvstore/./././gpu_topology.h:1030: No valid binary tree found from root 0, try backtracking
Traceback (most recent call last):
  File "run_pretraining.py", line 257, in <module>
    train()
  File "run_pretraining.py", line 228, in train
    trainer.step(1)
  File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 290, in step
    self._allreduce_grads()
  File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 320, in _allreduce_grads
    self._kvstore.push(i, param.list_grad(), priority=-i)
  File "/home/ubuntu/mxnet/python/mxnet/kvstore.py", line 237, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/home/ubuntu/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:50:46] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 0 using backtracking
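
For anyone trying to read the logged weight matrix, here is a minimal sketch in plain Python (not MXNet code; the helper names `parse_matrix`, `neighbors`, `is_symmetric`, and `reachable_from` are made up for illustration). It assumes that a non-zero entry at row i, column j means GPU i has a direct link to GPU j, and it only summarizes the matrix. It does not reproduce the Kernighan-Lin tree search.

```python
# Link-weight matrix copied from the gpu_topology.h log lines above.
LOGGED_WEIGHTS = """
0 2 2 0 0 0 0 0
2 0 0 2 0 0 0 0
2 0 0 0 0 0 2 0
0 2 0 0 0 0 0 2
0 0 0 0 0 2 2 0
0 0 0 0 2 0 0 2
0 0 2 0 2 0 0 0
0 0 0 2 0 2 0 0
"""


def parse_matrix(text):
    """Turn the whitespace-separated log rows into a list of int rows."""
    return [[int(tok) for tok in line.split()]
            for line in text.strip().splitlines()]


def neighbors(matrix):
    """Map each GPU index to the peers it has a non-zero link weight to."""
    return {i: [j for j, w in enumerate(row) if w > 0]
            for i, row in enumerate(matrix)}


def is_symmetric(matrix):
    """Check that every link is reported identically in both directions."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(n))


def reachable_from(adj, start=0):
    """Depth-first search: the set of GPUs reachable from `start`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


weights = parse_matrix(LOGGED_WEIGHTS)
adj = neighbors(weights)
for gpu, peers in adj.items():
    print(f"GPU {gpu}: linked to {peers}")
print("symmetric:", is_symmetric(weights))
print("all GPUs reachable from 0:", len(reachable_from(adj)) == len(weights))
```

Under that assumption, the matrix is symmetric, every GPU is reachable from GPU 0, and each GPU lists exactly two linked peers. Whether that sparsity is what makes the tree search fail is not established in this thread.
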
@szha
Member

szha commented Feb 18, 2019

@eric-haibin-lin looks like this was resolved?

@eric-haibin-lin
Member Author

We have a fallback patch rather than a complete fix.

@szha
Member

szha commented Feb 18, 2019

This seems to be better tracked in MXNet.

@kaonashi-tyc

Any update on the status?
