[Datasets] Improve Covost 2 #3281

patrickvonplaten · 2021-11-16T15:32:19Z

The manual data download instructions for Covost are currently quite confusing and not very user-friendly.

Currently the user has to:

1. Go to the Common Voice website
2. Find the correct dataset, which is not mentioned in the error message
3. Download it
4. Untar it
5. Create a language ID folder (why? this folder does not exist in the downloaded .tar file)
6. Pass the folder containing the created language ID folder

This PR simplifies this to:

1. Go to the Common Voice website
2. Find the correct dataset, which is now mentioned in the error message
3. Download it
4. Untar it
5. Pass the untarred folder
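
The last step can be sketched as below. This is a minimal sketch, not the loader's own code: `missing_entries` and `load_covost2` are hypothetical helpers, and the assumption that an untarred Common Voice folder contains `validated.tsv` and a `clips` directory at its top level is not stated in this thread. Only the `load_dataset('covost2', <config>, data_dir=...)` call shape is taken from the conversation.

```python
import os


def missing_entries(data_dir, expected=("validated.tsv", "clips")):
    """Return the expected entries that are absent from data_dir.

    The default `expected` names are an assumption about the layout of an
    untarred Common Voice download, not something the loader guarantees.
    """
    root = os.path.abspath(os.path.expanduser(data_dir))
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]


def load_covost2(config, data_dir):
    """Check the folder layout, then pass the untarred folder to load_dataset."""
    missing = missing_entries(data_dir)
    if missing:
        raise FileNotFoundError(f"{data_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the layout check also works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("covost2", config, data_dir=data_dir)
```

For example, `load_covost2("en_de", "path/to/untarred/en")` would fail early with a readable message if the path does not point at the untarred folder itself; the path here is a placeholder.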

Note: This PR is not at all time-critical

lhoestq

Cool thanks !

I removed some leftover dummy data files. I also tried to remove the file you added in the .hypothesis directory, but for some reason GitHub doesn't let me delete it in the browser.

Feel free to delete this file and merge :)


shaikmoeed · 2022-01-26T07:18:18Z

I am trying to use load_dataset with the French dataset (Common Voice Corpus 1), downloaded from the Common Voice site, with English as the target language (using Colab).

Steps I have followed:

1. untar:
!tar xvzf fr.tar -C data_dir

2. load data:
load_dataset('covost2', 'fr_en', data_dir="/content/data_dir")

Every split loads with 0 rows, as shown below:

Using custom data configuration fr_en-data_dir=%2Fcontent%2Fdata_dir
Reusing dataset covost2 (/root/.cache/huggingface/datasets/covost2/fr_en-data_dir=%2Fcontent%2Fdata_dir/1.0.0/bba950aae1ffa5a14b876b7e09c17b44de2c3cf60e7bd5d459640beffc78e35b)
100% 3/3 [00:00<00:00, 54.98it/s]
DatasetDict({
    train: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
    validation: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
    test: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
})

Can you please provide a sample working example code to load the dataset?

lhoestq · 2022-01-26T16:17:05Z

Hi ! I think it only works with the subsets of Common Voice Corpus 4, not Common Voice Corpus 1
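
One way to check whether a downloaded corpus lines up with the CoVoST 2 translations is to count the clip names the two files share. The sketch below is only a diagnostic under stated assumptions: it assumes both the Common Voice `validated.tsv` and the CoVoST 2 translation TSV carry a `path` column holding the clip file name, which this thread does not confirm, and `read_paths` and `count_shared_clips` are hypothetical helper names. If the count is 0, a corpus version mismatch like the one described above would be a plausible cause of empty splits.

```python
import csv


def read_paths(tsv_path):
    """Collect the values of the `path` column from a tab-separated file."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps stray quote characters in sentences from
        # swallowing the tab separators.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["path"] for row in reader}


def count_shared_clips(validated_tsv, covost_tsv):
    """Count clip names that appear in both files (assumed `path` column)."""
    return len(read_paths(validated_tsv) & read_paths(covost_tsv))
```

Calling `count_shared_clips` on the `validated.tsv` from the untarred folder and on the translation TSV for the chosen language pair gives a quick sanity check before calling load_dataset again.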

[Datasets] Improve Covost 2

c2b22e5

patrickvonplaten requested review from patil-suraj, albertvillanova and lhoestq November 16, 2021 15:35

patrickvonplaten and others added 3 commits November 16, 2021 16:13

up

29e50a8

Delete validated.tsv

51b3211

Delete covost_v2.en_de.tsv

032c3d3

lhoestq approved these changes Nov 17, 2021


patrickvonplaten added 2 commits November 18, 2021 10:22

finish

831bc0a

Merge branch 'improve_covost' of https://github.com/patrickvonplaten/…

8f24dff

…datasets-1 into improve_covost

patrickvonplaten merged commit e598a00 into huggingface:master Nov 18, 2021

patrickvonplaten deleted the improve_covost branch November 18, 2021 10:44
