Add British Library books dataset #3603
Nice! Thanks a lot for adding this dataset :)
The dataset card is awesome! And you did well regarding iter_archive and the configurations, feel free to continue in this direction!
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Awesome, thank you!
The only thing missing is the dummy data we use to test the dataset script regularly. You can use the datasets-cli dummy_data ./datasets/blbooks command to get the instructions to generate it.
Since the dataset has a very specific structure it might not be that easy, so feel free to ping me if you have questions or if I can help!
Thanks for all the help and suggestions
I did get a little stuck here! So far I have created directories for each config i.e:
I have then added two examples of the Since def _generate_examples(self, data_dirs): takes as input
I think I managed to create the dummy data :) I think everything is good now, if you don't have other changes to do, please mark your PR as "ready for review" and ping me!
Thanks so much for that!
Think it is ready to merge from my end @lhoestq.
Nice thanks :)
The CI failure on Windows is unrelated to your PR and fixed on
This pull request adds a dataset of text from digitised (primarily 19th Century) books from the British Library. This collection has previously been used for training language models, e.g. https://github.com/dbmdz/clef-hipe/blob/main/hlms.md. It would be nice to make this dataset more accessible for others to use through datasets.
This is still a WIP, but I wanted to get some initial feedback. In particular, I wanted to check that I am using iter_archive correctly. I intend to ensure that dl_manager.download gets the complete list of URLs to download upfront, so the progress bar knows how much is left to download, and then to pass a list of downloaded zip archives, each wrapped in iter_archive, through gen_kwargs. I am unsure if there is a more elegant approach for this?
If there are other glaring omissions or mistakes, I'd be happy to hear them. If this approach seems sensible in general, I will finish all the remaining TODOs, generate dummy_data, etc.
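For readers following along, the pattern described above can be sketched roughly as follows. This is only an illustration using the standard library, not the actual blbooks loading script: fake_download and iter_zip_members are hypothetical stand-ins for dl_manager.download and dl_manager.iter_archive, and the example fields ("path", "text") are assumptions. Only the data_dirs argument name comes from the conversation.

```python
import zipfile
from typing import BinaryIO, Dict, Iterable, Iterator, List, Tuple


def fake_download(urls: List[str]) -> List[str]:
    """Hypothetical stand-in for dl_manager.download.

    It receives the complete list upfront (so a real downloader could show
    overall progress). Here the "URLs" are assumed to be local paths already.
    """
    return list(urls)


def iter_zip_members(archive_path: str) -> Iterator[Tuple[str, BinaryIO]]:
    """Hypothetical stand-in for dl_manager.iter_archive.

    Lazily yields (path inside archive, open file object) pairs without
    extracting the archive to disk.
    """
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                yield info.filename, member


def split_gen_kwargs(urls: List[str]) -> Dict[str, list]:
    """Download everything first, then wrap each archive in a lazy iterator."""
    local_paths = fake_download(urls)
    return {"data_dirs": [iter_zip_members(p) for p in local_paths]}


def generate_examples(data_dirs: Iterable[Iterator[Tuple[str, BinaryIO]]]):
    """Yield (key, example) pairs across all archives, with a running key."""
    id_ = 0
    for data_dir in data_dirs:
        for path, file in data_dir:
            # Assumption: each member is a UTF-8 text file.
            text = file.read().decode("utf-8")
            yield id_, {"path": path, "text": text}
            id_ += 1
```

The point of the design is that the list of archives is resolved in one call, while the members of each archive are only read when the generator is consumed.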