This repository has been archived by the owner on Jan 15, 2024. It is now read-only.
BERT_BASE and BERT_LARGE contain 110M and 340M parameters, respectively. Multi-GPU scaling is currently poor for this model, and the results show a large overhead from cross-GPU ndarray copies.
The default kvstore push/pull does not take advantage of the communication pattern on the machine (e.g. an AWS p3 instance). It would be great to use the experimental tree reduce push/pull introduced by @ctcyang.
However, the following error occurs when running with MXNET_KVSTORE_LOGTREE=1 MXNET_KVSTORE_USETREE=1:
[07:02:29] src/kvstore/./././gpu_topology.h:60: Weight:
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 2 0 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 2 0 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 2 0 0 0 0 0 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 2 0 0 0 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 0 2 2 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 0 2 0 0 2
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 2 0 2 0 0 0
[07:02:29] src/kvstore/./././gpu_topology.h:67: 0 0 0 2 0 2 0 0
[06:50:46] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[06:50:46] src/kvstore/./././gpu_topology.h:1030: No valid binary tree found from root 0, try backtracking
Traceback (most recent call last):
  File "run_pretraining.py", line 257, in <module>
    train()
  File "run_pretraining.py", line 228, in train
    trainer.step(1)
  File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 290, in step
    self._allreduce_grads()
  File "/home/ubuntu/mxnet/python/mxnet/gluon/trainer.py", line 320, in _allreduce_grads
    self._kvstore.push(i, param.list_grad(), priority=-i)
  File "/home/ubuntu/mxnet/python/mxnet/kvstore.py", line 237, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/home/ubuntu/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:50:46] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 0 using backtracking
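For anyone trying to read the logged weight matrix, here is a minimal sketch that parses it and lists each GPU's peers. It assumes that a nonzero entry in row i, column j means GPUs i and j are directly linked; that reading is an assumption, not something the log states. The helper names (parse_matrix, neighbours, reachable_from) are made up for illustration and are not part of MXNet. Under that assumption, every GPU has exactly two peers and all 8 GPUs are reachable from GPU 0. This only describes the logged matrix and does not by itself explain the backtracking failure.

```python
# Weight matrix copied verbatim from the gpu_topology.h log lines above.
WEIGHTS = """
0 2 2 0 0 0 0 0
2 0 0 2 0 0 0 0
2 0 0 0 0 0 2 0
0 2 0 0 0 0 0 2
0 0 0 0 0 2 2 0
0 0 0 0 2 0 0 2
0 0 2 0 2 0 0 0
0 0 0 2 0 2 0 0
"""


def parse_matrix(text):
    """Turn the whitespace-separated log rows into a list of int lists."""
    return [[int(x) for x in line.split()] for line in text.strip().splitlines()]


def neighbours(matrix):
    """Map each GPU index to the GPUs it has a nonzero weight to.

    Assumption: a nonzero weight means a direct link between the two GPUs.
    """
    return {i: [j for j, w in enumerate(row) if w > 0] for i, row in enumerate(matrix)}


def is_symmetric(matrix):
    """Check that the weight from i to j equals the weight from j to i."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def reachable_from(adj, root=0):
    """Return the set of GPUs reachable from root via depth-first search."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for peer in adj[node]:
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen


matrix = parse_matrix(WEIGHTS)
adj = neighbours(matrix)

for gpu, peers in adj.items():
    print(f"GPU {gpu}: linked to {peers}")
print("symmetric:", is_symmetric(matrix))
print("reachable from GPU 0:", sorted(reachable_from(adj)))
```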