Dataprovider #1395

Closed
studyPaddle opened this issue Feb 21, 2017 · 4 comments

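@studyPaddle

Hello, I'm new to PaddlePaddle. The PaddlePaddle documentation doesn't explain what to do when the DataProvider's process function needs to receive multiple data files. If train.list contains many lines, how do I pass in multiple data files?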

    from paddle.trainer.PyDataProvider2 import *

    # Define a py data provider
    @provider(input_types=[dense_vector(28 * 28), integer_value(10)])
    def process(settings, filename):  # settings is not used currently.
        f = open(filename, 'r')  # open one of the training files
        for line in f:  # read each line
            label, pixel = line.split(';')
            # get features and label
            pixels_str = pixel.split(' ')
            pixels_float = []
            for each_pixel_str in pixels_str:
                pixels_float.append(float(each_pixel_str))
            # give data to paddle.
            yield pixels_float, int(label)
        f.close()  # close file

The example in the documentation only covers passing in a single data file. How do I pass in multiple data files?

@helinwang
Contributor

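I'm not very familiar with this code. Try adding print filename inside process and check whether process gets called multiple times when train.list has multiple lines (one call per line, that is, one call per file).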

@Z-TAO
Contributor

Z-TAO commented Feb 23, 2017

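@studyPaddle Once test.list/train.list are specified in trainer_config, the dataprovider will:

  1. Shuffle all the "file names" in the list once.
  2. Each time Paddle internally calls get_batch (or a similarly named method), it automatically invokes the process function. When the file list contains multiple files, the filename argument passed to process differs from call to call. Files are picked in random order, but the multi-file reading logic is internal and transparent to you. Inside process you only need to handle the contents of the one file you are given.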
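
To make the two steps above concrete, here is a minimal sketch in plain Python that imitates the described behavior without importing Paddle. It is not Paddle's actual implementation: iterate_file_list and make_demo_list are hypothetical helpers, the @provider decorator is omitted so the snippet runs on its own, and the tiny two-pixel data files are made up for illustration.

```python
import os
import random
import tempfile


def process(settings, filename):
    """Stand-in for the user's provider function: handles exactly one file."""
    print("process called with: %s" % filename)
    with open(filename, "r") as f:
        for line in f:
            label, pixel = line.strip().split(";")
            yield [float(p) for p in pixel.split(" ")], int(label)


def iterate_file_list(list_path, seed=None):
    """Hypothetical driver: shuffle the file names once, then call process per file."""
    with open(list_path, "r") as f:
        filenames = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(filenames)  # step 1: shuffle the file names
    for filename in filenames:  # step 2: one process() call per file
        for sample in process(None, filename):
            yield sample


def make_demo_list(directory):
    """Write two tiny data files and a train.list naming them, one path per line."""
    paths = []
    for i, rows in enumerate((["3;0.0 0.5", "7;1.0 0.25"], ["1;0.75 0.0"])):
        path = os.path.join(directory, "part-%d.txt" % i)
        with open(path, "w") as f:
            f.write("\n".join(rows) + "\n")
        paths.append(path)
    list_path = os.path.join(directory, "train.list")
    with open(list_path, "w") as f:
        f.write("\n".join(paths) + "\n")
    return list_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    for pixels, label in iterate_file_list(make_demo_list(demo_dir), seed=0):
        print(label, pixels)
```

Running it prints one "process called with" line per entry in train.list, which is the same check suggested earlier in the thread.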

@studyPaddle
Author

Does that mean a .list file cannot contain multiple files??? @Z-TAO

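@helinwang
Contributor

@studyPaddle I think what Z-TAO means is that a .list file can contain multiple lines, with one file name per line. See this part of his reply:

Shuffle all the "file names" in the list once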
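
As an illustration of a list file with one file name per line, the following sketch generates such a file from a directory of data files. The helper name write_file_list, the *.txt pattern, and the demo file names are assumptions made for this example, not part of Paddle.

```python
import glob
import os
import tempfile


def write_file_list(data_dir, list_path, pattern="*.txt"):
    """Write every matching file path to list_path, one path per line."""
    paths = sorted(glob.glob(os.path.join(data_dir, pattern)))
    with open(list_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths


if __name__ == "__main__":
    data_dir = tempfile.mkdtemp()
    # Create three empty placeholder data files (hypothetical names).
    for name in ("part-0.txt", "part-1.txt", "part-2.txt"):
        open(os.path.join(data_dir, name), "w").close()
    list_path = os.path.join(data_dir, "train.list")
    write_file_list(data_dir, list_path)
    with open(list_path) as f:
        print(f.read())
```

Each line of the resulting train.list is then what process would receive as its filename argument, one file per call.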

wangxicoding pushed a commit to wangxicoding/Paddle that referenced this issue Dec 9, 2021