Add pin_device_id option to Gluon DataLoader #14136
@apeforest @eric-haibin-lin @zhreshold Please review. Thanks.
@mxnet-label-bot add [Gluon, pr-awaiting-review]
Thanks for the contribution. I recently had some issues with Gluon DataLoader and found pin_memory useful. Could you share how to use the pin_memory and pin_device_id options together? Thanks!
@yuxihu Does it make any difference when you specify device_id for the cpu_pinned context?
@roywei If your script runs as a single process, you can just set pin_memory=True without worrying about pin_device_id, whose default value is 0 as of now. If you have multiple processes, you'd better set pin_device_id so that each process uses a different device, which avoids an out-of-memory error. One such use case is distributed training using MXNet with Horovod. You can set pin_device_id=hvd.local_rank(), similar to the usage of ImageRecordIter here.
@zhreshold In the Horovod case, each training process is attached to a GPU. If we do not specify device_id for the cpu_pinned context, all processes will use the memory in GPU 0 (because the default device_id for the cpu_pinned context is 0) and cause an out-of-memory error. I made a similar enhancement for ImageRecordIter.
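For readers following the thread, the usage described above could look roughly like the sketch below. It is illustrative only: the helper names pinned_loader_kwargs and build_loader are hypothetical and not part of MXNet or this PR, and build_loader assumes an MXNet build that includes the pin_device_id option plus a Horovod install with MXNet support.

```python
def pinned_loader_kwargs(local_rank, use_pinned=True):
    """Build DataLoader keyword arguments for one training process.

    Hypothetical helper (not part of MXNet). `local_rank` is the index of
    the process on its host, for example the value of hvd.local_rank().
    """
    if not use_pinned:
        return {"pin_memory": False}
    # Each process pins memory against its own device instead of device 0.
    return {"pin_memory": True, "pin_device_id": local_rank}


def build_loader(dataset, batch_size):
    """Create a DataLoader whose pinned memory follows the Horovod local rank.

    Hypothetical helper. MXNet and Horovod are imported lazily so that
    pinned_loader_kwargs stays usable without either package installed.
    """
    import mxnet as mx
    import horovod.mxnet as hvd

    hvd.init()
    kwargs = pinned_loader_kwargs(hvd.local_rank())
    return mx.gluon.data.DataLoader(dataset, batch_size=batch_size, **kwargs)
```

In a single-process script, passing pin_memory=True alone is enough, since pin_device_id defaults to 0.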
Can we have a basic unit test to check the output context?
@yuxihu Thanks for the clarification and for the link to the other merged feature. LGTM
@eric-haibin-lin Let me try to add one.
Thanks for your contribution.
LGTM once the unit test is added and all CI tests pass.
Also, thanks for the explanation of how to use pin_device_id. I wonder whether this can be documented somewhere for easy reference.
@marcoabreu Looks like the windows-gpu run passed but hung. Do I have to retrigger the CI?
@eric-haibin-lin Please help review and merge. Thanks.
@mxnet-label-bot update [Gluon, pr-awaiting-merge]
lgtm
LGTM
* add pin_device_id option to DataLoader
* add unit test to check output context
* trigger CI
This PR adds a new option, pin_device_id, to Gluon DataLoader. When pin_memory is True, pin_device_id selects the device used to allocate pinned memory. This option is needed to use pinned memory in DataLoader for distributed training with MXNet and Horovod. Otherwise, multiple training processes allocate memory on a single device, which causes an out-of-memory error. The default value of pin_device_id is 0, which matches the current behavior.