Add a xmap decorator into reader module for optimizing performance #2242

Merged
merged 3 commits into from
Jun 6, 2017

Conversation

wanghaoshuang
Contributor

Add flowers dataset reader for image classification model.
Add a xmap decorator into reader module for optimizing performance of image data reader.
Fix #2241

@wanghaoshuang wanghaoshuang requested a review from qingqing01 May 23, 2017 16:29
# See the License for the specific language governing permissions and
# limitations under the License.
"""
CIFAR dataset.
Collaborator

I think we already have a cifar.py.

Contributor Author

Sorry, I forgot to delete this line. It is actually the flowers dataset, which has more classes.

SETID_MD5 = 'a5357ecc9cb78c4bef273ce3793fc85c'


def extract_file(tarFile):
Collaborator

Let's see whether we can read the data without untarring the tarball. This matters because we will run these demos on Paddle Cloud, and distributed filesystems like CephFS handle a few big files much better than many small ones, which determines the efficiency of disk I/O.

A good example that doesn't extract all files is here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py#L56

Another good one is this: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/cifar.py#L60

Please note that

  • tarfile.extractall extracts all files in a tarball into the current working directory (by default), whereas
  • tarfile.extractfile (https://docs.python.org/2/library/tarfile.html#tarfile.TarFile.extractfile) doesn't extract files to disk, but returns a file-like object for reading the member.
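
As a rough illustration of the difference described above, the following sketch builds a small tarball in memory and reads one member through `extractfile` without writing anything to disk. The helper names, member names, and contents are made up for the example and are not taken from the PR.

```python
import io
import tarfile


def make_demo_tarball():
    """Build a tiny in-memory tarball; names and contents are made up."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("jpg/a.jpg", b"AAAA"), ("jpg/b.jpg", b"BB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_member(fileobj, name):
    """Return the bytes of one member without extracting it to disk."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        # extractfile returns a file-like object; no file is created on disk.
        member = tar.extractfile(name)
        return member.read()


data = read_member(make_demo_tarball(), "jpg/b.jpg")
```

With a real dataset, `fileobj=` would be replaced by the path of the downloaded tarball, which stays on disk as a single large file.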

Contributor Author

Got it. Thanks for the important suggestion. I will optimize my code.

'''
map image bytes data to type needed by model input layer
'''
img, label = sample
Collaborator

This module seems to read many images from the tarball. If so, it might be better to call tarfile.next(), which returns the next member of the archive as a TarInfo object that can then be passed to tarfile.extractfile. Because tarfile.next() walks the tarball member by member in archive order, the reads are sequential, which reduces disk seeks (that is, movements of the disk's magnetic head).
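
A minimal sketch of that sequential pattern, assuming a hypothetical helper `iter_tar_members` and a made-up in-memory archive (neither is part of the PR). It opens the tarball as a forward-only stream and pulls members one at a time with `next()`:

```python
import io
import tarfile


def make_demo_tarball():
    """Build a tiny in-memory tarball; names and contents are made up."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("jpg/a.jpg", b"AAAA"), ("jpg/b.jpg", b"BB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_tar_members(fileobj):
    """Yield (name, bytes) for each regular file, in archive order."""
    # Mode "r|" treats the archive as a forward-only stream, so members
    # are read strictly in the order they are stored.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        member = tar.next()  # returns a TarInfo, or None at the end
        while member is not None:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
            member = tar.next()


members = list(iter_tar_members(make_demo_tarball()))
```

Each member has to be consumed before moving on to the next one, which is exactly the access pattern a reader that walks the whole dataset needs.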


def xmap(mapper, reader, process_num, buffer_size):
    """
    Use multiprocess to map samples from reader by a mapper defined by user.
Collaborator

I vaguely remember that @helinwang had a function which uses multiprocess to accelerate loading. Could @helinwang please confirm?

Contributor

@helinwang helinwang May 23, 2017

Yes, it's here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/decorator.py#L162

The buffered decorator uses a background thread to fetch the data. If you want a multi-threaded map to speed up reading, you can put a map decorator on top of the buffered decorator.

Contributor Author

@wanghaoshuang wanghaoshuang May 24, 2017

Hi @helinwang:

  1. The buffered decorator is not a thread-safe data provider, so I can't put a multi-threaded map decorator directly on top of it.
  2. To guarantee that worker threads read from the reader safely, the multi-threaded map decorator holds a queue internally, so there is no need for the buffered decorator.
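
The queue-based design described in these two points could look roughly like the sketch below. It is not the code merged in this PR: the name `threaded_map`, the sentinel, and the exact thread layout are assumptions made for illustration. One feeder thread drains the reader into an input queue, so the reader itself never has to be thread-safe, and the worker threads only touch the queues.

```python
import threading
from queue import Queue

_END = object()  # sentinel marking the end of the stream


def threaded_map(mapper, reader, thread_num, buffer_size):
    """Hypothetical sketch: map samples from `reader` with worker threads.

    Output order is not guaranteed to match input order.
    """

    def decorated():
        in_q = Queue(buffer_size)
        out_q = Queue(buffer_size)

        def feed():
            # Only this thread iterates the reader.
            for sample in reader():
                in_q.put(sample)
            for _ in range(thread_num):
                in_q.put(_END)  # one stop signal per worker

        def work():
            while True:
                sample = in_q.get()
                if sample is _END:
                    out_q.put(_END)
                    return
                out_q.put(mapper(sample))

        threads = [threading.Thread(target=feed)]
        threads += [threading.Thread(target=work) for _ in range(thread_num)]
        for t in threads:
            t.daemon = True
            t.start()

        finished = 0
        while finished < thread_num:
            item = out_q.get()
            if item is _END:
                finished += 1
            else:
                yield item

    return decorated


squares = sorted(threaded_map(lambda x: x * x, lambda: iter(range(10)), 4, 8)())
```

The bounded queues play the role of the buffer, which is why a separate buffered decorator is not needed in this layout. Error handling (for example, a mapper that raises) is left out of the sketch.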


def xmap(mapper, reader, process_num, buffer_size):
    """
    Use multiprocess to map samples from reader by a mapper defined by user.
Contributor

@helinwang helinwang May 23, 2017

I think it's better to change multiprocess to multithread.

Contributor Author

Yes, you're right!
I used multiprocess to take advantage of multiple CPUs, but synchronization between processes cost too much time.
My experiments indicate that multithread performs better than multiprocess in this application.
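
One way to try this trade-off with the standard library alone: `multiprocessing.pool.ThreadPool` offers the same `imap` interface as `multiprocessing.Pool`, so switching between the two is a small change. The `decode` mapper and the sample data below are stand-ins invented for the example, not code from the PR.

```python
from multiprocessing.pool import ThreadPool


def decode(sample):
    """Stand-in mapper; a real one would decode and resize an image."""
    raw, label = sample
    return len(raw), label


samples = [(b"x" * n, n % 3) for n in range(1, 6)]

# Threads share memory with the caller, so samples and results are not
# pickled and copied between processes as they would be with
# multiprocessing.Pool.
pool = ThreadPool(4)
try:
    results = list(pool.imap(decode, samples))  # imap preserves input order
finally:
    pool.close()
    pool.join()
```

Whether threads actually win depends on the mapper: work that releases the GIL (I/O, many native image routines) tends to benefit, while pure-Python CPU-bound work may not.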

return paddle.reader.xmap(mapper, reader, cpu_count(), 1024 * 8)


def create_batch(data_dir,
Contributor

This is a common function used to make batched data for images. I think it can be moved to v2/image.py.

Contributor Author

OK, I will rewrite this function to read images from the tar file directly.

data = []
labellist = []
for index in indexes[start:end]:
    img_name = "%s/jpg/image_%05d.jpg" % (data_dir, index)
Contributor

If this function is moved to v2/image.py, the img_name construction should be modified for more general use.
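
One possible way to generalize the naming, shown only as a sketch: pass the naming rule in as a function instead of hard-coding the flowers layout. The names `make_name_batches` and `flowers_name` are hypothetical and do not appear in the PR.

```python
def make_name_batches(indexes, batch_size, name_fn):
    """Yield lists of image names, `batch_size` names at a time.

    `name_fn` maps an index to a file name, so the dataset-specific
    layout stays outside the shared helper.
    """
    for start in range(0, len(indexes), batch_size):
        yield [name_fn(i) for i in indexes[start:start + batch_size]]


def flowers_name(index):
    """Naming rule taken from the diff above, minus the data_dir prefix."""
    return "jpg/image_%05d.jpg" % index


batches = list(make_name_batches([1, 2, 3], 2, flowers_name))
```

A different dataset would supply its own `name_fn` and reuse the batching logic unchanged.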

Contributor Author

Got it. Thanks.


.. code-block:: python

    with open('cat.jpg') as f:
        im = load_image(f.read())
Contributor

The example usage is not correct.

Contributor Author

Sorry, it's my fault.

@wanghaoshuang wanghaoshuang force-pushed the flowers_reader branch 2 times, most recently from 369ee7b to 2800239 Compare June 2, 2017 02:52
except ImportError:
    cv2 = None

from cv2 import resize
Contributor

Please remove this line and call cv2.resize explicitly below. That way, `import paddle.v2 as paddle` won't raise an error when cv2 is not installed.
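
The optional-import pattern being asked for can be sketched as follows. The helper `resize_image` is a hypothetical name used for illustration; only `cv2.resize` itself is a real OpenCV call. The point is that the module imports cleanly without OpenCV, and the failure is deferred until a function that needs it is called.

```python
try:
    import cv2
except ImportError:
    cv2 = None  # the module still imports; only image functions need cv2


def resize_image(im, width, height):
    """Hypothetical helper: resize `im` to (width, height) using cv2."""
    if cv2 is None:
        raise ImportError("resize_image requires OpenCV (cv2) to be installed")
    # cv2.resize takes the target size as (width, height).
    return cv2.resize(im, (width, height))
```

A module-level `from cv2 import resize` would defeat the guard, because it runs at import time regardless of the `try`/`except` above it.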

Contributor Author

Got it. Thanks.

pass


def xmap(mapper, reader, process_num, buffer_size):
Contributor

How about renaming xmap to xmap_readers? The name is more descriptive.

Contributor Author

OK, I have renamed it.

wanghaoshuang@baidu.com and others added 3 commits June 5, 2017 16:34
images reader: read the data without untarring the tarball file.
image.py: move batch function from reader to image.py
@qingqing01
Contributor

LGTM.

@wanghaoshuang wanghaoshuang merged commit 3d7a613 into PaddlePaddle:develop Jun 6, 2017
@wanghaoshuang wanghaoshuang deleted the flowers_reader branch June 6, 2017 03:46
@wanghaoshuang wanghaoshuang changed the title Add flowers dataset for image classification model Add a xmap decorator into reader module for optimizing performance Aug 10, 2017
4 participants