Increase performance of BulkAppend and BulkFlush #14067
Conversation
@mxnet-label-bot add [pr-awaiting-review, Performance]
LGTM. So we should be able to see a bigger performance boost when training with Horovod, where a process handles a single GPU?
@mxnet-label-bot update [pr-awaiting-merge, Performance]
@yuxihu It is actually the other way around: a single process per GPU alleviates some of those issues, because each GPU is handled independently. This helps most in the single-process, multi-GPU case (where a single Python thread needs to launch everything on all GPUs) and in small batch size scenarios, where you do not have much time to launch your work.
@ptrendx Thanks for the explanation. Anyway, it is a good improvement.
Why a shared_ptr rather than unique_ptr?
@junrushao1994 2 reasons:
But I agree unique_ptr would be the ultimate solution there.
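As general C++ background for the exchange above (and not necessarily the reasons given in the comment), one constraint that pushes such designs toward shared_ptr is that std::function requires its stored callable to be copy-constructible: a lambda capturing a std::unique_ptr cannot be placed into a std::function, while a shared_ptr capture can. A minimal sketch:

```cpp
#include <functional>
#include <memory>
#include <vector>

int main() {
  // A move-only resource a callback might want to own.
  auto uvec = std::make_unique<std::vector<int>>(std::vector<int>{1, 2, 3});

  // Does NOT compile: capturing the unique_ptr makes the lambda move-only,
  // but std::function requires a copy-constructible callable.
  // std::function<void()> f = [v = std::move(uvec)]() { v->push_back(4); };
  (void)uvec;  // uvec is only referenced in the commented-out line above

  // Compiles: a shared_ptr capture keeps the lambda copyable, so it can be
  // stored in (and copied around as) a std::function.
  auto svec = std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3});
  std::function<void()> g = [svec]() { svec->push_back(4); };
  g();
  return 0;
}
```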
I see. Thanks!
LGTM
* Better bulkappend * Fix lint
Description
Increase the performance of the BulkAppend and BulkFlush methods used in Gluon hybridized models with static_alloc=True, static_shape=False.

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
* Store the bulked operations as a vector of functions inside BulkStatus and populate that vector in BulkAppend.
* This sped up BulkAppend, but BulkFlush still needed to perform a copy of the entire vector of lambdas, including all of their captured environment. This is alleviated by passing a shared_ptr to the vector instead of passing the vector by value, increasing the performance of the BulkFlush function by ~3.5x, from 70us to ~20us (a simplified sketch of this pattern follows below).
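The following is a minimal, illustrative sketch of the pattern described above, not the actual MXNet code: BulkState, Append, Flush, and PushToEngine are hypothetical stand-ins for BulkStatus, BulkAppend, BulkFlush, and the engine push, under the assumption that the engine accepts a std::function by value.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for BulkStatus: accumulates small functions.
struct BulkState {
  std::vector<std::function<void()>> fns;  // populated by Append()
};

// Stand-in for the engine push: takes the combined function by value,
// so whatever the lambda captures is copied along with it.
void PushToEngine(std::function<void()> fn) { fn(); }

void Append(BulkState* bulk, std::function<void()> fn) {
  // Cheap: just push_back, no re-wrapping of previously appended lambdas.
  bulk->fns.push_back(std::move(fn));
}

void Flush(BulkState* bulk) {
  // Move the accumulated functions into a shared_ptr and capture that
  // pointer. Copying the resulting std::function now copies only the
  // shared_ptr (one refcount bump) rather than every lambda and its
  // captured environment.
  auto fns = std::make_shared<std::vector<std::function<void()>>>(
      std::move(bulk->fns));
  bulk->fns.clear();
  PushToEngine([fns]() {
    for (auto& f : *fns) {
      f();
    }
  });
}
```

The design point is the capture in Flush: the flushed lambda owns only a shared_ptr, so handing it to the engine no longer duplicates the entire vector of bulked functions.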
Comments
@eric-haibin-lin