Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add pin_device_id option to Gluon DataLoader #14136

Merged
merged 3 commits into from
Feb 13, 2019

Conversation

yuxihu
Copy link
Member

@yuxihu yuxihu commented Feb 12, 2019

This PR adds a new option pin_device_id to Gluon DataLoader. The pin_device_id will be used to allocate pinned memory if pin_memory is True. This option is needed if we want to use pinned memory in DataLoader for distributed training with MXNet and Horovod. Otherwise, multiple training processes will allocate memory on a single device and then cause out of memory error. The default value for pin_device_id is 0 which is the same with the current behavior.

@yuxihu yuxihu requested a review from szha as a code owner February 12, 2019 21:32
@yuxihu
Copy link
Member Author

yuxihu commented Feb 12, 2019

@apeforest @eric-haibin-lin @zhreshold Please review. Thanks.

@yuxihu
Copy link
Member Author

yuxihu commented Feb 12, 2019

@mxnet-label-bot add [Gluon, pr-awaiting-review]

@marcoabreu marcoabreu added Gluon pr-awaiting-review PR is waiting for code review labels Feb 12, 2019
@szha szha requested a review from zhreshold February 12, 2019 21:53
Copy link
Member

@roywei roywei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I recently had some issue with Gluon DataLoader and found pin_memory useful. Could you share how to use pin_memory and pin_device_id option together? Thanks!

@zhreshold
Copy link
Member

@yuxihu Does it make any difference when you specify device_id for cpu_pinned context?

@yuxihu
Copy link
Member Author

yuxihu commented Feb 12, 2019

@roywei If your script runs with a single process, you can just set pin_memory=True in your script without worrying about the pin_device_id, which default value is 0, as of now. If you have multiple processes, you'd better set pin_device_id to use different devices for each process to avoid out of memory error. One such use case is distributed training using MXNet with Horovod. You can set pin_device_id=hvd.local_rank(), similar to the usage of ImageRecordIter here.

@yuxihu
Copy link
Member Author

yuxihu commented Feb 12, 2019

@zhreshold In Horovod case, each training process is attached to a GPU. If we do not specify device_id for the cpu_pinned context, all processes will use the memory in GPU 0 (because the default device_id for cpu_pinned context is 0) and cause out of memory error. I had a similar enhancement for ImageRecordIter.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a basic unit test to check the output context?

@zhreshold
Copy link
Member

@yuxihu Thanks for the clarification and refer link to the other merged feature. LGTM

@yuxihu
Copy link
Member Author

yuxihu commented Feb 12, 2019

@eric-haibin-lin Let me try to add one.

Copy link
Contributor

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution.
After adding the unit-test subject to it passing all CI tests, LGTM!.
Also thanks for the explanation on how to use pin_device_id, wondering if this can be documented somewhere for easy reference.

@yuxihu
Copy link
Member Author

yuxihu commented Feb 13, 2019

@marcoabreu Looks like the windows-gpu run passed but hang. Do I have to retrigger the CI?

@yuxihu
Copy link
Member Author

yuxihu commented Feb 13, 2019

@eric-haibin-lin please help review and merge. thanks.

@yuxihu
Copy link
Member Author

yuxihu commented Feb 13, 2019

@mxnet-label-bot update [Gluon, pr-awaiting-merge]

@marcoabreu marcoabreu added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Feb 13, 2019
@eric-haibin-lin
Copy link
Member

lgtm

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhreshold zhreshold merged commit 0b1761f into apache:master Feb 13, 2019
@yuxihu yuxihu deleted the data_loader branch February 13, 2019 21:56
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 19, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Mar 4, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
yuxihu added a commit to yuxihu/incubator-mxnet that referenced this pull request Apr 22, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
szha pushed a commit that referenced this pull request Apr 23, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* add pin_device_id option to DataLoader

* add unit test to check output context

* trigger CI
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Gluon pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants