[Datasets] Improve Covost 2 #3281

Merged
merged 6 commits into from
Nov 18, 2021

Conversation

Contributor

@patrickvonplaten patrickvonplaten commented Nov 16, 2021

The manual data download instructions for Covost are currently quite confusing and not very user-friendly.

Currently the user has to:

  1. Go to the Common Voice website
  2. Find the correct dataset, which is not mentioned in the error message
  3. Download it
  4. Untar it
  5. Create a language id folder (why? this folder does not exist in the downloaded .tar file)
  6. Pass the folder containing the created language id folder

This PR improves this to:

  1. Go to the Common Voice website
  2. Find the correct dataset, which is mentioned in the error message
  3. Download it
  4. Untar it
  5. Pass the untarred folder
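
Steps 4 and 5 of the improved flow can be sketched roughly as below. This is only an illustration, not the code from this PR: the helper names `untar` and `load_covost2` are hypothetical, and the `fr_en` config and archive name are taken from the example later in this conversation. The extraction part uses only the standard library; the loading part assumes the `datasets` package is installed.

```python
import tarfile
from pathlib import Path


def untar(archive_path, dest_dir):
    """Extract a downloaded Common Voice archive and return the target folder.

    Hypothetical helper, not part of the `datasets` library.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Default mode "r" detects the compression (gzip, bz2, ...) automatically.
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safe extraction filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


def load_covost2(data_dir, config="fr_en"):
    """Pass the untarred folder directly as data_dir (hypothetical wrapper)."""
    # Imported lazily so the extraction helper works without `datasets`.
    from datasets import load_dataset

    return load_dataset("covost2", config, data_dir=str(data_dir))
```

A call would then look like `load_covost2(untar("fr.tar", "data_dir"))`, with no manually created language id folder in between.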

Note: This PR is not at all time-critical

Member

@lhoestq lhoestq left a comment

Cool thanks !

I removed some leftover dummy data files. I also tried to remove the file you added in the .hypothesis directory, but for some reason GitHub doesn't allow me to remove it in the browser.

Feel free to delete this file and merge :)

@patrickvonplaten patrickvonplaten merged commit e598a00 into huggingface:master Nov 18, 2021
@patrickvonplaten patrickvonplaten deleted the improve_covost branch November 18, 2021 10:44
shaikmoeed commented Jan 26, 2022

I am trying to use load_dataset with the French dataset (Common Voice Corpus 1), downloaded from the Common Voice site, with English as the target language (using Colab).

Steps I have followed:

1. untar:
!tar xvzf fr.tar -C data_dir

2. load data:
load_dataset('covost2', 'fr_en', data_dir="/content/data_dir")

Every split loads with 0 rows, as shown below:

Using custom data configuration fr_en-data_dir=%2Fcontent%2Fdata_dir
Reusing dataset covost2 (/root/.cache/huggingface/datasets/covost2/fr_en-data_dir=%2Fcontent%2Fdata_dir/1.0.0/bba950aae1ffa5a14b876b7e09c17b44de2c3cf60e7bd5d459640beffc78e35b)
100%
3/3 [00:00<00:00, 54.98it/s]
DatasetDict({
    train: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
    validation: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
    test: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 0
    })
})

Can you please provide a working code example for loading the dataset?

Member

lhoestq commented Jan 26, 2022

Hi ! I think it only works with the subsets of Common Voice Corpus 4, not Common Voice Corpus 1
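
Since a mismatched corpus version seems to show up only as silently empty splits, a small sanity check after loading may help catch it early. The sketch below is an assumption-laden illustration: `empty_splits` and `check_loaded` are hypothetical helpers, not part of `datasets`, and they work on a plain mapping of split name to row count.

```python
def empty_splits(split_sizes):
    """Return the names of splits that contain zero rows."""
    return [name for name, num_rows in split_sizes.items() if num_rows == 0]


def check_loaded(split_sizes):
    """Raise if any split is empty (hypothetical helper)."""
    empty = empty_splits(split_sizes)
    if empty:
        raise ValueError(
            "Empty splits: " + ", ".join(empty)
            + ". Check that data_dir points to the Common Voice version "
            "the dataset script expects."
        )
```

With a loaded `DatasetDict` named `ds`, the sizes can be collected with `{name: split.num_rows for name, split in ds.items()}` and passed to `check_loaded`.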
