Add a xmap decorator into reader module for optimizing performance #2242
python/paddle/v2/dataset/flowers.py
Outdated
# See the License for the specific language governing permissions and
# limitations under the License.
"""
CIFAR dataset.
I think we already have a cifar.py.
Sorry, I forgot to delete this line. Actually, this is the flowers dataset, which has more class dimensions.
python/paddle/v2/dataset/flowers.py
Outdated
SETID_MD5 = 'a5357ecc9cb78c4bef273ce3793fc85c'


def extract_file(tarFile):
Let's see whether we can read the data without untarring the tarball. This is important because we will run these demos on Paddle Cloud, and distributed filesystems like CephFS handle a few big files much better than many small files. This determines the efficiency of disk I/O.
A good example that doesn't extract all files is here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py#L56
Another good one is this: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/cifar.py#L60
Please notice that
- tarfile.extractall extracts all files in a tarball to disk (into the current working directory by default), whereas
- [tarfile.extractfile](https://docs.python.org/2/library/tarfile.html#tarfile.TarFile.extractfile) doesn't write anything to disk, but returns a file-like object for reading the member.
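The difference between the two calls can be illustrated with a minimal sketch. It is not the code from this PR: the helper names `read_member_bytes` and `make_demo_tar` are hypothetical, and the example is written for Python 3 even though the codebase under review targets Python 2.

```python
import io
import tarfile


def read_member_bytes(tar_path, member_name):
    """Return the raw bytes of one member without unpacking the archive."""
    with tarfile.open(tar_path, mode="r") as tar:
        # extractfile() returns a file-like object, or None for members
        # that are not regular files (e.g. directories). Nothing is
        # written to disk.
        f = tar.extractfile(member_name)
        if f is None:
            raise ValueError("%s is not a regular file" % member_name)
        return f.read()


def make_demo_tar(tar_path, files):
    """Build a small tarball from a {name: bytes} dict (demo helper only)."""
    with tarfile.open(tar_path, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```

With this approach the directory holding the tarball still contains a single large file after reading, which is the property the comment above asks for.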
Got it. Thanks for the important suggestion. I will optimize my code.
'''
map image bytes data to type needed by model input layer
'''
img, label = sample
This module seems to read many images from the tarball. If so, it would be great if we could call tarfile.next(), which returns the next member as a TarInfo object that can then be passed to tarfile.extractfile. Because tarfile.next() walks the files in the tarball one by one, in archive order, it reduces the number of disk seeks, that is, the number of moves of the disk's magnetic head.
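A minimal sketch of that sequential access pattern follows. The function name `iter_tar_files` is hypothetical and the code is written for Python 3; it only relies on `TarFile.next()`, `TarInfo.isfile()` and `TarFile.extractfile()` from the standard library.

```python
import tarfile


def iter_tar_files(tar_path):
    """Yield (name, bytes) for each regular file, in archive order."""
    with tarfile.open(tar_path, mode="r") as tar:
        # next() returns a TarInfo for the next member, or None at the end.
        member = tar.next()
        while member is not None:
            if member.isfile():
                # Read the member's content straight from the archive.
                yield member.name, tar.extractfile(member).read()
            member = tar.next()
```

Because members are visited in the order they are stored, the archive is read front to back rather than by jumping between offsets.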
python/paddle/v2/reader/decorator.py
Outdated
def xmap(mapper, reader, process_num, buffer_size):
    """
    Use multiprocess to map samples from reader by a mapper defined by user.
I vaguely remember that @helinwang had a function which uses multiprocess to accelerate loading. Could @helinwang please confirm?
Yes, it's here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/decorator.py#L162
The buffered decorator uses a background thread to fetch the data. If you want a multi-threaded map to speed up reading, you can put a map decorator on top of the buffered decorator.
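The composition described here can be sketched as follows. This is an illustration of the idea only, not Paddle's implementation: `buffered_sketch` and `mapped` are hypothetical names, and the only assumption taken from the thread is that a reader is a zero-argument callable returning an iterable of samples.

```python
import queue
import threading


def buffered_sketch(reader, size):
    """Wrap `reader` so a background thread pre-fetches up to `size` samples."""
    end = object()  # sentinel marking the end of the stream

    def fill(q):
        for sample in reader():
            q.put(sample)
        q.put(end)

    def data_reader():
        q = queue.Queue(maxsize=size)
        t = threading.Thread(target=fill, args=(q,))
        t.daemon = True
        t.start()
        while True:
            sample = q.get()
            if sample is end:
                break
            yield sample

    return data_reader


def mapped(func, reader):
    """Apply `func` lazily to every sample produced by `reader`."""
    def data_reader():
        for sample in reader():
            yield func(sample)

    return data_reader
```

Here the mapping itself still runs in the consuming thread; only the fetching is moved to the background.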
Hi @helinwang:
- The buffered decorator is not a thread-safe data provider, so I can't put a multi-threaded map decorator directly on top of the buffered decorator.
- To guarantee that the worker threads read data from the reader safely, the multi-threaded map decorator holds a queue within it, so there is no need to use the buffered decorator.
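A queue-based, multi-threaded map of the kind described in this reply might look like the sketch below. It is an assumption-laden illustration rather than the code in this PR: `xmap_threads` is a hypothetical name, output order is not preserved, and an exception raised inside `mapper` is not handled.

```python
import queue
import threading


def xmap_threads(mapper, reader, thread_num, buffer_size):
    """Map samples from `reader` with `thread_num` worker threads."""
    end = object()  # sentinel marking the end of the stream

    def feed(in_q):
        # A single thread drains the reader, so the reader itself
        # does not need to be thread-safe.
        for sample in reader():
            in_q.put(sample)
        for _ in range(thread_num):
            in_q.put(end)  # one sentinel per worker

    def work(in_q, out_q):
        while True:
            sample = in_q.get()
            if sample is end:
                out_q.put(end)
                return
            out_q.put(mapper(sample))

    def data_reader():
        in_q = queue.Queue(buffer_size)
        out_q = queue.Queue(buffer_size)
        threads = [threading.Thread(target=feed, args=(in_q,))]
        threads += [
            threading.Thread(target=work, args=(in_q, out_q))
            for _ in range(thread_num)
        ]
        for t in threads:
            t.daemon = True
            t.start()
        finished = 0
        while finished < thread_num:
            item = out_q.get()
            if item is end:
                finished += 1
            else:
                yield item

    return data_reader
```

The bounded input queue plays the role that a separate buffering decorator would otherwise play, which matches the point made in the second bullet.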
python/paddle/v2/reader/decorator.py
Outdated
def xmap(mapper, reader, process_num, buffer_size):
    """
    Use multiprocess to map samples from reader by a mapper defined by user.
I think it's better to change multiprocess to multithread.
Yes, you're right!
I used multiple processes to take advantage of multiple CPUs, but synchronization between processes cost too much time.
Experiments indicate that multithreading is faster than multiprocessing in my application.
python/paddle/v2/dataset/flowers.py
Outdated
return paddle.reader.xmap(mapper, reader, cpu_count(), 1024 * 8)


def create_batch(data_dir,
This is a common function used to make batched data for images. I think it can be moved to v2/image.py.
OK, I will rewrite this function to read images from the tar file directly.
python/paddle/v2/dataset/flowers.py
Outdated
data = []
labellist = []
for index in indexes[start:end]:
    img_name = "%s/jpg/image_%05d.jpg" % (data_dir, index)
If this function is moved to v2/image.py, the img_name should be modified for more general use.
Got it. Thanks.
python/paddle/v2/image.py
Outdated
.. code-block:: python
with open('cat.jpg') as f:
    im = load_image(f.read())
The example usage is not correct.
Sorry. It's my fault.
369ee7b to 2800239
python/paddle/v2/image.py
Outdated
except ImportError:
    cv2 = None

from cv2 import resize
Let's remove this line and call cv2.resize explicitly below. That way, import paddle.v2 as paddle won't raise an error even when cv2 is not installed.
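The pattern being suggested can be sketched like this. The wrapper name `resize_image` is hypothetical and not part of the PR; the only library call used is `cv2.resize(src, dsize)`, where `dsize` is `(width, height)`.

```python
try:
    import cv2
except ImportError:
    # Importing this module must still succeed without OpenCV installed.
    cv2 = None


def resize_image(im, width, height):
    """Resize `im` with OpenCV, failing only when the function is called."""
    if cv2 is None:
        raise ImportError("OpenCV (cv2) is required for resize_image")
    # Reference the function through the module instead of importing
    # `resize` at module level, which would fail at import time.
    return cv2.resize(im, (width, height))
```

The failure is deferred from import time to call time, so users who never resize images are not forced to install OpenCV.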
Got it. Thanks.
python/paddle/v2/reader/decorator.py
Outdated
pass


def xmap(mapper, reader, process_num, buffer_size):
How about renaming xmap to xmap_readers? The name is more descriptive.
OK, I have renamed it.
images reader: read the data without untarring the tarball file. image.py: move batch function from reader to image.py
1a8fffa to 990b7d7
LGTM.
Add flowers dataset reader for image classification model.
Add a xmap decorator into reader module for optimizing performance of image data reader.
Fix #2241