From fe1f9e241f4df70af2de54214730cac1fb611be2 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 19 Jan 2022 17:36:23 +0000 Subject: [PATCH 01/21] loading script draft --- datasets/blbooks/README.md | 535 ++++++++++++++++++++++++++++++++++++ datasets/blbooks/blbooks.py | 229 +++++++++++++++ 2 files changed, 764 insertions(+) create mode 100644 datasets/blbooks/README.md create mode 100644 datasets/blbooks/blbooks.py diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md new file mode 100644 index 00000000000..4916b8f6ba7 --- /dev/null +++ b/datasets/blbooks/README.md @@ -0,0 +1,535 @@ +--- +annotations_creators: +- no-annotation + language_creators: +- machine-generated + languages: +- en +- fr +- de +- es +- it +- nl + licenses: +- cc0-1.0 + multilinguality: +- multilingual + pretty_name: British Library Books + size_categories: +- unknown + source_datasets: +- original + task_categories: +- sequence-modeling +- other + task_ids: +- language-modeling +- other-other-digital-humanities-research +--- + +# Dataset Card for British Library Books + +## Table of Contents + +- [Dataset Card for British Library Books](#dataset-card-for-British-Library-Books) + - [Table of Contents](#table-of-contents) + - [Dataset Description](#dataset-description) + - [Dataset Summary](#dataset-summary) + - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) + - [Language model training](#language-model-training) + - [Supervised tasks](#supervised-tasks) + - [Languages](#languages) + - [Language change](#language-change) + - [Optical Character Recognition](#optical-character-recognition) + - [OCR word confidence](#ocr-word-confidence) + - [Dataset Structure](#dataset-structure) + - [Data Instances](#data-instances) + - [Data Fields](#data-fields) + - [Data Splits](#data-splits) + - [Dataset Creation](#dataset-creation) + - [Curation Rationale](#curation-rationale) + - [Source Data](#source-data) + - [Initial Data Collection and 
Normalization](#initial-data-collection-and-normalization) + - [Date normalization](#date-normalization) + - [Metadata included](#metadata-included) + - [Who are the source language producers?](#who-are-the-source-language-producers) + - [Annotations](#annotations) + - [Annotation process](#annotation-process) + - [Who are the annotators?](#who-are-the-annotators) + - [Personal and Sensitive Information](#personal-and-sensitive-information) + - [Considerations for Using the Data](#considerations-for-using-the-data) + - [Social Impact of Dataset](#social-impact-of-dataset) + - [Discussion of Biases](#discussion-of-biases) + - [Colonialism](#colonialism) + - [Other Known Limitations](#other-known-limitations) + - [Additional Information](#additional-information) + - [Dataset Curators](#dataset-curators) + - [Licensing Information](#licensing-information) + - [Citation Information](#citation-information) + - [Contributions](#contributions) + +## Dataset Description + +- **Homepage:** +- **Repository:** +- **Paper:** +- **Leaderboard:** +- **Point of Contact:** + +### Dataset Summary + +This dataset consists of books digitised by the British Library in partnership with Microsoft. The dataset includes ~25 million pages of out of copyright texts. The majority of the texts were published in the 18th and 19th Century, but the collection also consists of a smaller number of books from earlier periods. Items within this collection cover a wide range of subject areas, including geography, philosophy, history, poetry and literature and are published in various languages. + +While the books are predominately from the 18th and 19th Centuries, there are fewer books from earlier periods. 
The number of pages in the corpus by decade: + +| decade | page count | +| ---- | ---------- | +| 1510 | 94 | +| 1520 | 32 | +| 1540 | 184 | +| 1550 | 16 | +| 1580 | 276 | +| 1590 | 540 | +| 1600 | 1117 | +| 1610 | 1132 | +| 1620 | 1856 | +| 1630 | 9274 | +| 1640 | 4232 | +| 1650 | 2944 | +| 1660 | 5858 | +| 1670 | 11415 | +| 1680 | 8348 | +| 1690 | 13756 | +| 1700 | 10160 | +| 1710 | 9556 | +| 1720 | 10314 | +| 1730 | 13282 | +| 1740 | 10778 | +| 1750 | 12001 | +| 1760 | 21415 | +| 1770 | 28490 | +| 1780 | 32676 | +| 1790 | 50014 | +| 1800 | 307806 | +| 1810 | 478008 | +| 1820 | 589419 | +| 1830 | 681212 | +| 1840 | 1113473 | +| 1850 | 1726108 | +| 1860 | 1725407 | +| 1870 | 2069089 | +| 1880 | 2585159 | +| 1890 | 3365031 | + +[More Information Needed] + +### Supported Tasks and Leaderboards + +Since its publication, this collection has been used in various digital history and humanities projects. + +The dataset consists of text and a range of metadata associated with this text. This metadata includes: + +- date of publication +- place of publication +- country of publication +- language +- OCR quality +- physical description of the original physical item + +#### Language model training + +As a relatively large dataset, `blbooks` provides a source dataset for training language models. The presence of this metadata also offers interesting opportunities to use this dataset as a source for training language models based on: + +- specific time-periods +- specific languages +- certain OCR quality thresholds + +The above is not an exhaustive list but offers some suggestions of how the dataset can be used to explore topics such as the impact of OCR quality on language models, the ‘transferability’ of language models across time or the impact of training multilingual language models on historical languages.
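
The selection criteria listed above can be sketched as a simple filter over page records. This is only a minimal illustration: it assumes each page is a plain dictionary with the `date` (integer year), `Language_1` and `mean_wc_ocr` fields described in this card. The `select_pages` helper, its thresholds and the sample records are hypothetical and are not part of the loading script.

```python
from typing import Dict, Iterable, Iterator, Optional


def select_pages(
    pages: Iterable[Dict],
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
    language: Optional[str] = None,
    min_mean_wc_ocr: float = 0.0,
) -> Iterator[Dict]:
    """Yield pages matching a time period, a language and an OCR quality threshold."""
    for page in pages:
        year = page.get("date")
        # Pages without a date cannot be placed in a time period.
        if start_year is not None and (year is None or year < start_year):
            continue
        if end_year is not None and (year is None or year > end_year):
            continue
        if language is not None and page.get("Language_1") != language:
            continue
        # A missing confidence score is treated as 0.0 (an assumption).
        if (page.get("mean_wc_ocr") or 0.0) < min_mean_wc_ocr:
            continue
        yield page


# Hypothetical sample records, for illustration only.
sample_pages = [
    {"date": 1689, "Language_1": "English", "mean_wc_ocr": 0.45, "text": "..."},
    {"date": 1851, "Language_1": "English", "mean_wc_ocr": 0.82, "text": "..."},
    {"date": 1872, "Language_1": "German", "mean_wc_ocr": 0.35, "text": "..."},
]

selected = list(
    select_pages(
        sample_pages,
        start_year=1800,
        end_year=1899,
        language="English",
        min_mean_wc_ocr=0.7,
    )
)
```

The same conditions could presumably be expressed as a predicate for whatever filtering mechanism is used when the data is loaded; the threshold of 0.7 is arbitrary and would need to be chosen for the task at hand.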
+ +#### Supervised tasks + +Whilst this dataset does not have annotations for a specific NLP task, such as Named Entity Recognition, it does include a wide variety of metadata. This metadata could be used for training and/or evaluating a variety of supervised tasks that predict it. + +### Languages + +This dataset consists of books published in several languages. The breakdown of the languages included (at the page level) is: + +| Language | Pages | +| --------------------- | -------- | +| English | 10039463 | +| French | 1442929 | +| German | 1172793 | +| Spanish | 286778 | +| Italian | 214255 | +| Dutch | 204759 | +| Russian | 193347 | +| Danish | 93366 | +| Hungarian | 88094 | +| Swedish | 76225 | +| Polish | 58901 | +| Greek, Modern (1453-) | 26104 | +| Latin | 25611 | +| Portuguese | 25410 | +| Czech | 20160 | +| Bulgarian | 7891 | +| Finnish | 5677 | +| Irish | 2743 | +| Serbian | 1975 | +| Romanian | 1544 | +| Norwegian Nynorsk | 1398 | +| Croatian | 1306 | +| Norwegian | 1227 | +| Icelandic | 902 | +| Slovak | 840 | +| Lithuanian | 714 | +| Welsh | 580 | +| Slovenian | 545 | +| Indonesian | 418 | +| Cornish | 223 | + +This breakdown was derived from the first language in the associated metadata field. Some books include multiple languages. Some of the language codes for this data were also derived using computational methods. Therefore, the language fields in the dataset should be treated with some caution (discussed in more detail below). + +#### Language change + +The publication dates of books in the data cover a broad period of time (1500-1900). For languages in the dataset with broad temporal coverage, significant [language change](https://en.wikipedia.org/wiki/Language_change) might be found. The ability to study this change by taking reasonably large samples of languages covering different time periods is one of the opportunities offered by this dataset.
The fact that the text in this dataset was produced via Optical Character Recognition (OCR) poses some challenges for this type of research (see below). + +#### Optical Character Recognition + +The digitised books in this collection were transformed into machine-readable text using Optical Character Recognition (OCR) software. The text produced via OCR software will usually include some errors. These errors range from mistakes at the character level (for example, an `i` mistaken for an `l`) to errors at the word level or across significant passages of text. + +The books in this dataset can pose some additional challenges for OCR software. OCR errors can stem from: + +- the quality of the original printing: printing technology was a developing technology during the time period covered by this corpus; some of the original book text will include misprints, blurred or faded ink that is hard to read +- damage to the page: some of the books will have become damaged over time, which can obscure all or part of the text on a page +- poor quality scans: scanning books can be challenging; for example, if the book has tight bindings, it can be hard to capture text that has fallen into the [gutter](https://www.abaa.org/glossary/entry/gutter) of the book. +- the language used in the books may differ from the languages OCR software is predominantly trained to recognise. + +##### OCR word confidence + +Many OCR engines produce some form of confidence score alongside the predicted text. These confidence scores are usually at the character or word level. A word confidence score was given for each word in the original ALTO XML versions of the text in this dataset. The OCR confidence scores should be treated with some scepticism. For historical text or text in a lower-resource language, for example, a low confidence score may be more likely for words that are not in a modern dictionary, even though these may be accurate transcriptions of the original text.
With that said, the confidence scores do give some sense of the OCR quality. + +An example of text with a high mean word confidence score (over 90%): + +``` +8 direction to the Conduit, round which is a wide open space, and a good broad pavement called the Parade. It commands a pleasant peep of the slopes and terrace throughout its entire length. The street continuing from the Conduit, in the same general direction, was known anciently as Lodborne Lane, and is now named South Street. From the Conduit two other streets, at right angles to these, are Long Street, leading Eastwards, and Half-Moon Street (formerly Lodborne), leading to Westbury, Trendle Street, and the Horsecastles Road. +``` + +An example of text with a score below 40%: + +``` +Hannover. Schrift und Druck von Fr. CultniTmn,', + "LeMNs'utluirui.", + 'ü 8u«llim» M^äalßwi 01de!lop 1»**Kmm lie« !»^2!M kleine lii!* ttünee!<»e^ v»n tndzt Lievclum, 1872, +``` + +The quality of OCR - as measured by mean OCR confidence for a page - across the dataset correlates with other features. Grouping pages by publication decade gives the following mean word confidence: + +| decade | mean_wc_ocr | +| ------ | ----------- | +| 1510 | 0.499151 | +| 1520 | 0.544818 | +| 1540 | 0.511589 | +| 1550 | 0.4505 | +| 1580 | 0.321858 | +| 1590 | 0.461282 | +| 1600 | 0.467318 | +| 1610 | 0.495895 | +| 1620 | 0.501257 | +| 1630 | 0.49766 | +| 1640 | 0.512095 | +| 1650 | 0.528534 | +| 1660 | 0.521014 | +| 1670 | 0.592575 | +| 1680 | 0.583901 | +| 1690 | 0.567202 | +| 1700 | 0.575175 | +| 1710 | 0.61436 | +| 1720 | 0.627725 | +| 1730 | 0.658534 | +| 1740 | 0.64214 | +| 1750 | 0.657357 | +| 1760 | 0.6389 | +| 1770 | 0.651883 | +| 1780 | 0.632326 | +| 1790 | 0.664279 | +| 1800 | 0.682338 | +| 1810 | 0.708915 | +| 1820 | 0.730015 | +| 1830 | 0.730973 | +| 1840 | 0.713886 | +| 1850 | 0.697106 | +| 1860 | 0.696701 | +| 1870 | 0.717233 | +| 1880 | 0.733331 | +| 1890 | 0.762364 | + +As might be expected, the earlier periods have lower mean word confidence scores.
Again, all of this should be treated with some scepticism, especially as the size of the data grows over time. + +As with time, the mean word confidence of the OCR software varies across languages: + +| Language_1 | mean_wc_ocr | +| --------------------- | ----------- | +| Croatian | 0.755565 | +| Welsh | 0.7528 | +| Norwegian Nynorsk | 0.751648 | +| Slovenian | 0.746007 | +| French | 0.740772 | +| Finnish | 0.738032 | +| Czech | 0.737849 | +| Hungarian | 0.736076 | +| Dutch | 0.734977 | +| Cornish | 0.733682 | +| Danish | 0.733106 | +| English | 0.733037 | +| Irish | 0.732658 | +| Portuguese | 0.727746 | +| Spanish | 0.725111 | +| Icelandic | 0.724427 | +| Italian | 0.715839 | +| Swedish | 0.715633 | +| Polish | 0.715133 | +| Lithuanian | 0.700003 | +| Bulgarian | 0.694657 | +| Romanian | 0.692957 | +| Latin | 0.689022 | +| Russian | 0.685847 | +| Serbian | 0.674329 | +| Slovak | 0.66739 | +| Greek, Modern (1453-) | 0.632195 | +| German | 0.631457 | +| Indonesian | 0.6155 | +| Norwegian | 0.597987 | + +Again, these numbers should be treated sceptically since some languages appear very infrequently. For example, the above table suggests the mean word confidence for Welsh is relatively high. However, there isn’t much Welsh in the dataset. Therefore, it is unlikely that this data will be particularly useful for training (historic) Welsh language models. 
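
One way to keep this caveat in view is to report the number of pages next to each mean, so that a high score backed by very few pages is easy to spot. The sketch below assumes page records are dictionaries carrying the `Language_1` and `mean_wc_ocr` fields. The `confidence_by_language` helper and the sample values are hypothetical and do not reproduce how the table above was computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def confidence_by_language(pages: Iterable[Dict]) -> Dict[str, Tuple[int, float]]:
    """Return {language: (page count, mean of mean_wc_ocr)} for the given pages."""
    totals = defaultdict(lambda: [0, 0.0])  # language -> [page count, score sum]
    for page in pages:
        language = page.get("Language_1")
        score = page.get("mean_wc_ocr")
        # Skip pages that lack a language or a confidence score.
        if language is None or score is None:
            continue
        totals[language][0] += 1
        totals[language][1] += score
    return {
        language: (count, total / count)
        for language, (count, total) in totals.items()
    }


# Hypothetical sample records, for illustration only.
sample_pages = [
    {"Language_1": "English", "mean_wc_ocr": 0.70},
    {"Language_1": "English", "mean_wc_ocr": 0.80},
    {"Language_1": "English", "mean_wc_ocr": 0.75},
    {"Language_1": "Welsh", "mean_wc_ocr": 0.90},
    {"Language_1": None, "mean_wc_ocr": 0.50},
]

summary = confidence_by_language(sample_pages)
```

In this toy sample the Welsh mean is higher than the English one, but it rests on a single page, which is the situation the paragraph above warns about.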
+ +[More Information Needed] + +## Dataset Structure + +The dataset has a number of configurations: + +TODO + +- `skip_empty_pages` + +### Data Instances + +An example data instance: + +```python +{'Country of publication 1': 'England', +'Language_1': 'English', +'Language_2': None, +'Language_3': None, +'Language_4': None, +'Physical description': None, +'Publisher': None, +'all Countries of publication': 'England', +'all names': 'Settle, Elkanah [person]', +'date': 1689, +'empty_pg': True, +'mean_wc_ocr': 0.0, +'multi_language': False, +'name': 'Settle, Elkanah', +'pg': 1, +'place': 'London', +'raw_date': '1689', +'record_id': '001876770', +'std_wc_ocr': 0.0, +'text': None, +'title': 'The Female Prelate: being the history and the life and death of Pope Joan. A tragedy [in five acts and in verse] . Written by a Person of Quality [i.e. Elkanah Settle]'} + +``` + +Each instance in the dataset represents a single page from an original digitised book. + +### Data Fields + +Included in this dataset are: + +| Field | Data Type | Description | +| ---------------------------- | --------- | ------------------------------------------------------------------------------------------------------------- | +| record_id | string | British Library ID for the item | +| date | int | parsed/normalised year for the item, e.g. 1850 | +| raw_date | string | the original raw date for an item, e.g. 1850- | +| title | string | title of the book | +| place | string | Place of publication, e.g.
London | +| empty_pg | bool | whether the page is empty (contains no text) | +| text | string | OCR generated text for a page | +| pg | int | page in original book the instance refers to | +| mean_wc_ocr | float | mean word confidence values for the page | +| std_wc_ocr | float | standard deviation of the word confidence values for the page | +| name | string | name associated with the item (usually author) | +| all names | string | all names associated with a publication | +| Publisher | string | publisher of the book | +| Country of publication 1 | string | first country associated with publication | +| all Countries of publication | string | all countries associated with a publication | +| Physical description | string | physical description of the item (size). This requires some normalisation before use and isn’t always present | +| Language_1 | string | first language associated with the book, this is usually present | +| Language_2 | string | second language associated with the book, if present | +| Language_3 | string | third language associated with the book, if present | +| Language_4 | string | fourth language associated with the book, if present | +| multi_language | bool | whether more than one language is associated with the book | + +Some of these fields are frequently unpopulated. You can get some sense of this from this [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) [report](https://davanstrien.github.io/BL-datasets-pandas-profile-reports/pandas_profile_report_MS_digitised_books_2021-01-09.html) + +The majority of these fields relate to metadata about the books. Most of these fields were created by staff working for the British Library. The notable exception is the “Languages” fields that have sometimes been determined using computational methods. This work is reported in more detail in [Automated Language Identification of Bibliographic Resources](https://doi.org/10.1080/01639374.2019.1700201). It is important to note that metadata is neither perfect nor static. The metadata associated with these books was generated from an export from the British Library catalogue in 2021.
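
How sparsely a field is populated can also be estimated directly from the records. The sketch below is a rough illustration rather than a description of how the linked report was produced. It assumes records are dictionaries keyed by the field names in the table above and treats `None` and empty strings as missing. The `missing_proportions` helper and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List


def missing_proportions(records: Iterable[Dict], fields: List[str]) -> Dict[str, float]:
    """Return the fraction of records in which each field is None or an empty string."""
    records = list(records)
    if not records:
        return {}
    proportions = {}
    for field in fields:
        # Note: False and 0 count as populated; only None and "" count as missing.
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        proportions[field] = missing / len(records)
    return proportions


# Hypothetical sample records, for illustration only.
sample_records = [
    {"Publisher": None, "Language_1": "English", "Language_2": None},
    {"Publisher": "J. Murray", "Language_1": "English", "Language_2": None},
    {"Publisher": "", "Language_1": "French", "Language_2": "Latin"},
    {"Publisher": None, "Language_1": "German", "Language_2": None},
]

proportions = missing_proportions(sample_records, ["Publisher", "Language_1", "Language_2"])
```

A summary like this could help decide which fields are populated often enough to be useful for a given task.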
+ +[More Information Needed] + +### Data Splits + +This dataset contains a single split `train`. + +## Dataset Creation + +**Note** this section is a work in progress. + +### Curation Rationale + +The books in this collection were digitised as part of a project partnership between the British Library and Microsoft. [Mass digitisation](https://en.wikipedia.org/wiki/Category:Mass_digitization) projects, i.e. projects intending to quickly digitise large volumes of materials, shape the selection of materials to include in several ways. Some considerations which are often involved in the decision of whether to include items for digitisation include (but are not limited to): + +- copyright status +- preservation needs +- the size of an item, very large and very small items are often hard to digitise quickly + +These criteria can have knock-on effects on the makeup of a collection. For example, systematically excluding large books may result in some types of book content not being digitised. The size of a volume is likely to be correlated with its content to at least some extent, so excluding large volumes from digitisation will mean that this material is underrepresented. Similarly, copyright status is often (but not solely) determined by publication date. This can often lead to a rapid fall in the number of items in a collection after a certain cut-off date. + +All of the above is largely to make clear that this collection was not curated to create a representative sample of the British Library’s holdings. Some material will be over-represented, and some under-represented. Similarly, the collection should not be considered a representative sample of what was published across the period covered by the dataset (nor should the relative proportions of the data for each time period be taken as a proportional sample of publications from that period).
Finally, and this probably does not need stating, the language included in the text should not be considered representative of either written or spoken language(s) from that time period. + +[More Information Needed] + +### Source Data + +The source data (physical items) includes a variety of resources (predominantly monographs) held by the [British Library](https://bl.uk/). The British Library is a [Legal Deposit](https://www.bl.uk/legal-deposit/about-legal-deposit) library. “Legal deposit requires publishers to provide a copy of every work they publish in the UK to the British Library. It’s existed in English law since 1662.” [source](https://www.bl.uk/legal-deposit/about-legal-deposit). + +The source data for this version of the data is derived from the original ALTO XML files and a recent metadata export #TODO add links + +[More Information Needed] + +#### Initial Data Collection and Normalization + +This version of the dataset was created using the original ALTO XML files and, where a match was found, updating the metadata associated with that item with more recent metadata using an export from the British Library catalogue. The process of creating this new dataset is documented here #TODO add link. + +A few decisions made in the above processing steps are worth highlighting: + +##### Date normalization + +The metadata around date of publication for an item is not always exact. It is often represented as a date range, e.g. `1850-1860`. The `date` field normalises this date to a single integer value. In most cases, this is the mean of the values associated with the item. The `raw_date` field includes the unprocessed date string. + +##### Metadata included + +The metadata associated with each item includes most of the fields available via the ALTO XML. However, the data doesn’t include some metadata fields from the metadata export file. These fields were excluded because they are frequently not populated.
A cut-off of 50% was chosen, i.e. metadata fields whose values are missing more than 50% of the time were not included. This is slightly arbitrary, but since the aim of this version of the data was to support computational research using the collection it was felt that these fields with frequent missing values would be less valuable. + +#### Who are the source language producers? + +[More Information Needed] + +### Annotations + +This dataset does not include annotations as usually understood in the context of NLP. The data does include metadata associated with the books. + +#### Annotation process + +[More Information Needed] + +#### Who are the annotators? + +[More Information Needed] + +### Personal and Sensitive Information + +[More Information Needed] + +## Considerations for Using the Data + +There are a range of considerations around using the data. These include the representativeness of the dataset, the OCR quality and the language used. Depending on your use case, these may be more or less important. For example, the impact of OCR quality on downstream tasks will depend on the target task. It may also be possible to mitigate this negative impact from OCR through tokenizer choice, Language Model training objectives, oversampling high-quality OCR, etc. + +[More Information Needed] + +### Social Impact of Dataset + +[More Information Needed] + +### Discussion of Biases + +The text in this collection is derived from historical text. As a result, the text will reflect the social beliefs and attitudes of this time period. The books include both fiction and non-fiction. + +Examples of book titles that appear in the data (these are randomly sampled from all titles): + +- ‘Rhymes and Dreams, Legends of Pendle Forest, and other poems’, +- “Précis of Information concerning the Zulu Country, with a map. Prepared in the Intelligence Branch of the Quarter-Master-General’s Department, Horse Guards, War Office, etc”, +- ‘The fan.
A poem’, +- ‘Grif; a story of Australian Life’, +- ‘Calypso; a masque: in three acts, etc’, +- ‘Tales Uncle told [With illustrative woodcuts.]’, +- 'Questings', +- 'Home Life on an Ostrich Farm. With ... illustrations’, +- ‘Bulgarya i Bulgarowie’, +- 'Εἰς τα βαθη της Ἀφρικης [In darkest Africa.] ... Μεταφρασις Γεωρ. Σ. Βουτσινα, etc', +- ‘The Corsair, a tale’, +- ‘Poems ... With notes [With a portrait.]’, +- ‘Report of the Librarian for the year 1898 (1899, 1901, 1909)’, +- “The World of Thought. A novel. By the author of ‘Before I began to speak.’”, +- 'Amleto; tragedia ... recata in versi italiani da M. Leoni, etc' + +While using titles alone is insufficient to interrogate bias in this collection, it gives some insight into the topics covered by books. Further, the titles highlight some particular types of bias we might find in the collection. This should in no way be considered an exhaustive list. + +#### Colonialism + +Even in the above random sample of titles, we can see examples of colonial attitudes. We can try to interrogate this further by searching for the names of places that were part of the British Empire when many of these books were published. + +Searching for the string `India` in the titles and randomly sampling 10 titles returns: + +- “Travels in India in the Seventeenth Century: by Sir Thomas Roe and Dr. John Fryer. Reprinted from the ‘Calcutta Weekly Englishman.’”, +- ‘A Winter in India and Malaysia among the Methodist Missions’, +- “The Tourist’s Guide to all the principal stations on the railways of Northern India [By W. W.] ... Fifth edition”, +- ‘Records of Sport and Military Life in Western India ... With an introduction by ... G. B. Malleson’, +- "Lakhmi, the Rájpút's Bride.
A tale of Gujarát in Western India [A poem.]”, +- ‘The West India Commonplace Book: compiled from parliamentary and official documents; shewing the interest of Great Britain in its Sugar Colonies’, +- “From Tonkin to India : by the sources of the Irawadi, January’ 95-January ’96”, +- ‘Case of the Ameers of Sinde : speeches of Mr. John Sullivan, and Captain William Eastwick, at a special court held at the India House, ... 26th January, 1844’, +- ‘The Andaman Islands; their colonisation, etc. A correspondence addressed to the India Office’, +- ‘Ancient India as described by Ptolemy; being a translation of the chapters which describe India and Eastern Asia in the treatise on Geography written by Klaudios Ptolemaios ... with introduction, commentary, map of India according to Ptolemy, and ... index, by J. W. McCrindle’ + +Searching for the string `Africa` in the titles and randomly sampling 10 titles returns: + +- 'De Benguella ás Terras de Iácca. Descripção de uma viagem na Africa Central e Occidental ... Expedição organisada nos annos de 1877-1880. Edição illustrada', +- ‘To the New Geographical Society of Edinburgh [An address on Africa by H. M. Stanley.]’, +- ‘Diamonds and Gold in South Africa ... With maps, etc’, +- ‘Missionary Travels and Researches in South Africa ... With notes by F. S. Arnot. With map and illustrations. New edition’, +- ‘A Narrative of a Visit to the Mauritius and South Africa ... Illustrated by two maps, sixteen etchings and twenty-eight wood-cuts’, +- ‘Side Lights on South Africa ... With a map, etc’, +- ‘My Second Journey through Equatorial Africa ... in ... 1886 and 1887 ... Translated ... by M. J. A. Bergmann. With a map ... and ... illustrations, etc’, +- ‘Missionary Travels and Researches in South Africa ... With portrait and fullpage illustrations’, +- ‘[African sketches.] Narrative of a residence in South Africa ... A new edition. To which is prefixed a biographical sketch of the author by J.
Conder’, +- ‘Lake Ngami; or, Explorations and discoveries during four years wandering in the wilds of South Western Africa ... With a map, and numerous illustrations, etc’] + +[More Information Needed] + +### Other Known Limitations + +[More Information Needed] + +## Additional Information + +### Dataset Curators + +[More Information Needed] + +### Licensing Information + +The books are licensed under the [CC Public Domain Mark 1.0](https://creativecommons.org/publicdomain/mark/1.0/) license. + +### Citation Information + +[More Information Needed] + +### Contributions + +Thanks to [@davanstrien](/~https://github.com/davanstrien) for adding this dataset. + +``` + +``` diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py new file mode 100644 index 00000000000..4057db0a0a2 --- /dev/null +++ b/datasets/blbooks/blbooks.py @@ -0,0 +1,229 @@ +import gzip +import json +from typing import List, Union + +import datasets +from datasets.tasks import LanguageModeling + + +"TODO finalize citation" +_CITATION = """\ +@misc{bllabs2021, + author = {British Library Labs}, + title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)}, + year = {2021}, + publisher = {British Library}, + howpublished={https://doi.org/10.23636/r7w6-zy15} +""" + +_DESCRIPTION = """\ +A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900. +The books cover a wide range of subject areas including philosophy, history, poetry and literature. 
+""" + +_BASE_URL = "https://bl.iro.bl.uk/downloads/" + + +_DATA_URLS = { + "1510_1699": _BASE_URL + "61f58234-b370-422f-8591-8f98e46c2757?locale=en", + "1700_1799": _BASE_URL + "78b4a8ec-395e-4383-831c-809faff85ad7?locale=en", + "1800_1809": _BASE_URL + "91ae15cb-e08f-4abf-8396-e4742d9d4e37?locale=en", + "1810_1819": _BASE_URL + "6d1a6e17-f28d-45b9-8f7a-a03cf3a96491?locale=en", + "1820_1829": _BASE_URL + "ec764dbd-1ed4-4fc2-8668-b4df5c8ec451?locale=en", + "1830_1839": _BASE_URL + "eab68022-0418-4df7-a401-78972514ed20?locale=en", + "1840_1849": _BASE_URL + "d16d88b0-aa3f-4dfe-b728-c58d168d7b4d?locale=en", + "1850_1859": _BASE_URL + "a6a44ea8-8d33-4880-8b17-f89c90e3d89a?locale=en", + "1860_1869": _BASE_URL + "2e17f00f-52e6-4259-962c-b88ad60dec23?locale=en", + "1870_1879": _BASE_URL + "899c3719-030c-4517-abd3-b28fdc85eed4?locale=en", + "1880_1889": _BASE_URL + "ec3b8545-775b-47bd-885d-ce895263709e?locale=en", + "1890_1899": _BASE_URL + "54ed2842-089a-439a-b751-2179b3ffba28?locale=en", +} +_ALL = list(_DATA_URLS.values()) +_1800s = [ + _DATA_URLS.get(subset) + for subset in [ + "1800_1809", + "1810_1819", + "1820_1829", + "1830_1839", + "1840_1849", + "1850_1859", + "1860_1869", + "1870_1879", + "1880_1889", + "1890_1899", + ] +] + +_1700s = [_DATA_URLS.get(subset) for subset in ["1700_1799"]] +_1510_1699 = [_DATA_URLS.get(subset) for subset in ["1510_1699"]] + +URL = "https://doi.org/10.23636/r7w6-zy15" + +features = datasets.Features( + { + "record_id": datasets.Value("string"), + "date": datasets.Value("int32"), + "raw_date": datasets.Value("string"), + "title": datasets.Value("string"), + "place": datasets.Value("string"), + "empty_pg": datasets.Value("bool"), + "text": datasets.Value("string"), + "pg": datasets.Value("int32"), + "mean_wc_ocr": datasets.Value("float32"), + "std_wc_ocr": datasets.Value("float64"), + "name": datasets.Value("string"), + "all_names": datasets.Value("string"), + "Publisher": datasets.Value("string"), + "Country of publication 1": 
datasets.Value("string"), + "all Countries of publication": datasets.Value("string"), + "Physical description": datasets.Value("string"), + "Language_1": datasets.Value("string"), + "Language_2": datasets.Value("string"), + "Language_3": datasets.Value("string"), + "Language_4": datasets.Value("string"), + "multi_language": datasets.Value("bool"), + } +) + + +class BritishLibraryBooksConfig(datasets.BuilderConfig): + """BuilderConfig for BritishLibraryBooks.""" + + def __init__(self, data_urls, citation, url, skip_empty=False, **kwargs): + """BuilderConfig for BritishLibraryBooks. + + Args: + data_url: `string`, url to download the zip file from. + citation: `string`, citation for the data set. + url: `string`, url for information about the data set. + skip_empty: `bool`, whether to skip empty pages. + **kwargs: keyword arguments forwarded to super. + """ + + super(BritishLibraryBooksConfig, self).__init__(version=datasets.Version("1.0.2"), **kwargs) + self.data_urls: Union[str, List[str]] = data_urls + self.citation: str = citation + self.skip_empty: bool = skip_empty + + +class BritishLibraryBooks(datasets.GeneratorBasedBuilder): + """The BritishLibraryBooks dataset.""" + + BUILDER_CONFIGS = [ + BritishLibraryBooksConfig( + name="all", + description="All periods of" + _DESCRIPTION, + data_urls=_ALL, + citation=_CITATION, + url="TODO", + skip_empty=True, + ), + BritishLibraryBooksConfig( + name="1800s", + description="A subset covering texts published during the 1800-1899 of" + _DESCRIPTION, + data_urls=_1800s, + citation=_CITATION, + url="TODO", + skip_empty=True, + ), + BritishLibraryBooksConfig( + name="1700s", + description="Subset covering 1700-1799 of" + _DESCRIPTION, + data_urls=_1700s, + citation=_CITATION, + url="TODO", + skip_empty=True, + ), + BritishLibraryBooksConfig( + name="1510_1699", + description="Subset covering 1510-1699 of " + _DESCRIPTION, + data_urls=_1510_1699, + citation=_CITATION, + url="TODO", + skip_empty=True, + ), + ] + + 
DEFAULT_CONFIG_NAME = "all" + + def _info(self): + return datasets.DatasetInfo( + description=_DESCRIPTION, + features=features, + supervised_keys=None, + homepage="https://www.bl.uk/collection-guides/digitised-printed-books", + citation=_CITATION, + task_templates=[LanguageModeling(text_column="text")], + ) + + def _split_generators(self, dl_manager: datasets.DownloadManager): + urls_to_download = self.config.data_urls + downloaded_archives = dl_manager.download(urls_to_download) + downloaded_archives = [dl_manager.iter_archive(archive) for archive in downloaded_archives] + return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"data_dirs": downloaded_archives})] + + def _generate_examples(self, data_dirs): + skip_empty = self.config.skip_empty + id_ = 0 + for data_dir in data_dirs: + for path, file in data_dir: + if not path.endswith(".gz"): + continue + with gzip.open(file) as json_l: + for row in json_l.readlines(): + data = json.loads(row) + empty_pg = data["empty_pg"] + if skip_empty and empty_pg: + continue + record_id = data["record_id"] + date = data["date"] + if not date: + continue + date = date + raw_date = data["raw_date"] + title = data["title"] + place = data["place"] + # if place: + # place = str(place) + text = data["text"] + pg = data["pg"] + mean_wc_ocr = data["mean_wc_ocr"] + mean_wc_ocr = float(mean_wc_ocr) if mean_wc_ocr else 0.0 + std_wc_ocr = data["std_wc_ocr"] + std_wc_ocr = float(data["std_wc_ocr"]) if std_wc_ocr else 0.0 + name = data["Name"] + all_names = data["All names"] + publisher = data["Publisher"] + country_of_publication_1 = data["Country of publication 1"] + all_Countries_of_publication = data["All Countries of publication"] + Physical_description = data["Physical description"] + Language_1 = data["Language_1"] + Language_2 = data["Language_2"] + Language_3 = data["Language_3"] + Language_4 = data["Language_4"] + multi_language = data["multi_language"] + id_ += 1 + yield id_, { + "record_id": record_id, + "date": 
date, + "raw_date": raw_date, + "title": title, + "place": place, + "empty_pg": empty_pg, + "text": text, + "pg": int(pg), + "mean_wc_ocr": mean_wc_ocr, + "std_wc_ocr": std_wc_ocr, + "name": name, + "all_names": all_names, + "Publisher": publisher, + "Country of publication 1": country_of_publication_1, + "all Countries of publication": all_Countries_of_publication, + "Physical description": Physical_description, + "Language_1": Language_1, + "Language_2": Language_2, + "Language_3": Language_3, + "Language_4": Language_4, + "multi_language": multi_language, + } From 73d7786acc125ea14f173778a00938569f667917 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sun, 23 Jan 2022 18:41:27 +0000 Subject: [PATCH 02/21] improve config naming --- datasets/blbooks/blbooks.py | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 4057db0a0a2..1d0d0a6c7a4 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -39,7 +39,7 @@ "1890_1899": _BASE_URL + "54ed2842-089a-439a-b751-2179b3ffba28?locale=en", } _ALL = list(_DATA_URLS.values()) -_1800s = [ +_1800_1899 = [ _DATA_URLS.get(subset) for subset in [ "1800_1809", @@ -55,7 +55,7 @@ ] ] -_1700s = [_DATA_URLS.get(subset) for subset in ["1700_1799"]] +_1700_1799 = [_DATA_URLS.get(subset) for subset in ["1700_1799"]] _1510_1699 = [_DATA_URLS.get(subset) for subset in ["1510_1699"]] URL = "https://doi.org/10.23636/r7w6-zy15" @@ -112,7 +112,7 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): BUILDER_CONFIGS = [ BritishLibraryBooksConfig( - name="all", + name="1500_1899", description="All periods of" + _DESCRIPTION, data_urls=_ALL, citation=_CITATION, @@ -120,17 +120,17 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): skip_empty=True, ), BritishLibraryBooksConfig( - name="1800s", + name="1800_1899", description="A subset covering texts published during the 1800-1899 of" + _DESCRIPTION, - 
data_urls=_1800s, + data_urls=_1800_1899, citation=_CITATION, url="TODO", skip_empty=True, ), BritishLibraryBooksConfig( - name="1700s", + name="1700_1799", description="Subset covering 1700-1799 of" + _DESCRIPTION, - data_urls=_1700s, + data_urls=_1700_1799, citation=_CITATION, url="TODO", skip_empty=True, From 78d2103180d3d2ce1b45602d3bf770fbbb705bfa Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sun, 23 Jan 2022 18:42:30 +0000 Subject: [PATCH 03/21] move parsing code into function --- datasets/blbooks/blbooks.py | 95 +++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 52 deletions(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 1d0d0a6c7a4..8a87cd631c2 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -1,6 +1,8 @@ import gzip import json -from typing import List, Union +from datetime import datetime +from functools import lru_cache +from typing import Dict, List import datasets from datasets.tasks import LanguageModeling @@ -63,7 +65,7 @@ features = datasets.Features( { "record_id": datasets.Value("string"), - "date": datasets.Value("int32"), + "date": datasets.Value("timestamp[s]"), "raw_date": datasets.Value("string"), "title": datasets.Value("string"), "place": datasets.Value("string"), @@ -163,6 +165,43 @@ def _split_generators(self, dl_manager: datasets.DownloadManager): downloaded_archives = [dl_manager.iter_archive(archive) for archive in downloaded_archives] return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"data_dirs": downloaded_archives})] + @lru_cache(maxsize=512) + def _parse_date(self, date): + if date is not None: + date = datetime.strptime(str(date), "%Y") + return date + + def _parse_data(self, data: Dict) -> Dict: + mean_wc_ocr = data["mean_wc_ocr"] + mean_wc_ocr = float(mean_wc_ocr) if mean_wc_ocr else None + std_wc_ocr = data["std_wc_ocr"] + std_wc_ocr = float(data["std_wc_ocr"]) if std_wc_ocr else None + date = data["date"] + if date 
is not None: + date = datetime.strptime(str(date), "%Y") + return { + "record_id": data["record_id"], + "date": date, + "raw_date": data["raw_date"], + "title": data["title"], + "place": data["place"], + "text": data["text"], + "pg": int(data["pg"]), + "mean_wc_ocr": mean_wc_ocr, + "std_wc_ocr": std_wc_ocr, + "name": data["Name"], + "all_names": data["All names"], + "Publisher": data["Publisher"], + "Country of publication 1": data["Country of publication 1"], + "all Countries of publication": data["All Countries of publication"], + "Physical description": data["Physical description"], + "Language_1": data["Language_1"], + "Language_2": data["Language_2"], + "Language_3": data["Language_3"], + "Language_4": data["Language_4"], + "multi_language": data["multi_language"], + } + def _generate_examples(self, data_dirs): skip_empty = self.config.skip_empty id_ = 0 @@ -176,54 +215,6 @@ def _generate_examples(self, data_dirs): empty_pg = data["empty_pg"] if skip_empty and empty_pg: continue - record_id = data["record_id"] - date = data["date"] - if not date: - continue - date = date - raw_date = data["raw_date"] - title = data["title"] - place = data["place"] - # if place: - # place = str(place) - text = data["text"] - pg = data["pg"] - mean_wc_ocr = data["mean_wc_ocr"] - mean_wc_ocr = float(mean_wc_ocr) if mean_wc_ocr else 0.0 - std_wc_ocr = data["std_wc_ocr"] - std_wc_ocr = float(data["std_wc_ocr"]) if std_wc_ocr else 0.0 - name = data["Name"] - all_names = data["All names"] - publisher = data["Publisher"] - country_of_publication_1 = data["Country of publication 1"] - all_Countries_of_publication = data["All Countries of publication"] - Physical_description = data["Physical description"] - Language_1 = data["Language_1"] - Language_2 = data["Language_2"] - Language_3 = data["Language_3"] - Language_4 = data["Language_4"] - multi_language = data["multi_language"] + parsed_data = self._parse_data(data) + yield id_, {**parsed_data, **{"empty_pg": empty_pg}} id_ += 
1 - yield id_, { - "record_id": record_id, - "date": date, - "raw_date": raw_date, - "title": title, - "place": place, - "empty_pg": empty_pg, - "text": text, - "pg": int(pg), - "mean_wc_ocr": mean_wc_ocr, - "std_wc_ocr": std_wc_ocr, - "name": name, - "all_names": all_names, - "Publisher": publisher, - "Country of publication 1": country_of_publication_1, - "all Countries of publication": all_Countries_of_publication, - "Physical description": Physical_description, - "Language_1": Language_1, - "Language_2": Language_2, - "Language_3": Language_3, - "Language_4": Language_4, - "multi_language": multi_language, - } From 2b0735fcf5ab0da994ace0a444c8bd596f1bf004 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sun, 23 Jan 2022 18:43:05 +0000 Subject: [PATCH 04/21] fix type hints --- datasets/blbooks/blbooks.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 8a87cd631c2..ea7ced4a2f3 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -104,7 +104,8 @@ def __init__(self, data_urls, citation, url, skip_empty=False, **kwargs): """ super(BritishLibraryBooksConfig, self).__init__(version=datasets.Version("1.0.2"), **kwargs) - self.data_urls: Union[str, List[str]] = data_urls + self.url: str = url + self.data_urls: List[str] = data_urls self.citation: str = citation self.skip_empty: bool = skip_empty From b8ae9974ed1befcbca93910c84541ddcd02c9c8d Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Mon, 24 Jan 2022 09:44:56 +0000 Subject: [PATCH 05/21] fix default config name --- datasets/blbooks/blbooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index ea7ced4a2f3..52c577df565 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -148,7 +148,7 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): ), ] - DEFAULT_CONFIG_NAME = "all" + 
DEFAULT_CONFIG_NAME = "1500_1899" def _info(self): return datasets.DatasetInfo( From 6d65aea3c5e34a17ee300f174e7e0a8369b41cba Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Mon, 24 Jan 2022 18:04:25 +0000 Subject: [PATCH 06/21] fix typo Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- datasets/blbooks/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md index 4916b8f6ba7..6eb9e7c473d 100644 --- a/datasets/blbooks/README.md +++ b/datasets/blbooks/README.md @@ -195,7 +195,7 @@ The publication dates of books in the data cover a broad period of time (1500-19 #### Optical Character Recognition -The digitised books in this collection were transformed into machine-readable text using Optical Character Recognition (OCR) software. The text produced via OCR software will usually include some errors. These errors include; mistakes at the character level; for example, an `i’ is mistaken for an `l`, at the word level or across significant passages of text. +The digitised books in this collection were transformed into machine-readable text using Optical Character Recognition (OCR) software. The text produced via OCR software will usually include some errors. These errors include mistakes at the character level (for example, an `i` mistaken for an `l`), at the word level, or across significant passages of text. The books in this dataset can pose some additional challenges for OCR software. 
OCR errors can stem from: From 2c2f4d005296be28f3001bce9ead5a33bc82b8fb Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Mon, 24 Jan 2022 18:05:05 +0000 Subject: [PATCH 07/21] add header Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- datasets/blbooks/blbooks.py | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 52c577df565..efa59ab92f3 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -1,3 +1,17 @@ +# coding=utf-8 +# Copyright 2021 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import gzip import json from datetime import datetime From 3b13ac780ff63941357d982d694f1aed466ff763 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Mon, 24 Jan 2022 18:05:39 +0000 Subject: [PATCH 08/21] remove readlines call Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- datasets/blbooks/blbooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index efa59ab92f3..f08b478fc73 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -225,7 +225,7 @@ def _generate_examples(self, data_dirs): if not path.endswith(".gz"): continue with gzip.open(file) as json_l: - for row in json_l.readlines(): + for row in json_l: data = json.loads(row) empty_pg = data["empty_pg"] if skip_empty and empty_pg: From 1f8188d815387e2bfee5bfc9927dee0ffce48e45 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 17:01:34 +0000 Subject: [PATCH 09/21] update copyright date --- datasets/blbooks/blbooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index f08b478fc73..1e1c5ec22ef 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -1,5 +1,5 @@ # coding=utf-8 -# Copyright 2021 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. +# Copyright 2022 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
From 6a1dd85ae453253b29ded30752adf6f1880bf270 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 17:15:30 +0000 Subject: [PATCH 10/21] add citation to README --- datasets/blbooks/README.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md index 6eb9e7c473d..5083509bf8a 100644 --- a/datasets/blbooks/README.md +++ b/datasets/blbooks/README.md @@ -524,7 +524,15 @@ The books are licensed under the [CC Public Domain Mark 1.0](https://creativecom ### Citation Information -[More Information Needed] +```bibtex +@misc{bllabs2021, + author = {British Library Labs}, + title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)}, + year = {2021}, + publisher = {British Library}, + howpublished={https://doi.org/10.23636/r7w6-zy15} +} +``` ### Contributions From 90020bec1b8c660a24ee640667666a1c479a53a9 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 17:34:17 +0000 Subject: [PATCH 11/21] update citation key --- datasets/blbooks/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md index 5083509bf8a..2510720cf78 100644 --- a/datasets/blbooks/README.md +++ b/datasets/blbooks/README.md @@ -525,7 +525,7 @@ The books are licensed under the [CC Public Domain Mark 1.0](https://creativecom ### Citation Information ```bibtex -@misc{bllabs2021, +@misc{BritishLibraryBooks2021, author = {British Library Labs}, title = {Digitised Books. c. 1510 - c. 1900. 
JSONL (OCR derived text + metadata)}, year = {2021}, From ea2d3cdb1f8a5301d74e128cbad004247089a2ef Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 17:34:37 +0000 Subject: [PATCH 12/21] update citation key --- datasets/blbooks/blbooks.py | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 1e1c5ec22ef..cfee72530d9 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -22,9 +22,8 @@ from datasets.tasks import LanguageModeling -"TODO finalize citation" _CITATION = """\ -@misc{bllabs2021, +@misc{BritishLibraryBooks2021, author = {British Library Labs}, title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)}, year = {2021}, From 59ab73702e9803b1b982a2b174f175b3c200ba73 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 18:16:32 +0000 Subject: [PATCH 13/21] add contact details --- datasets/blbooks/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md index 2510720cf78..677ea644d21 100644 --- a/datasets/blbooks/README.md +++ b/datasets/blbooks/README.md @@ -70,11 +70,11 @@ annotations_creators: ## Dataset Description -- **Homepage:** -- **Repository:** +- **Homepage:** https://www.bl.uk/collection-guides/digitised-printed-books +- **Repository:** https://doi.org/10.21250/db14 - **Paper:** - **Leaderboard:** -- **Point of Contact:** +- **Point of Contact:** labs@bl.uk ### Dataset Summary From 38a0eea97ba1a741402b21fc7d4c209298dc53e1 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 18:19:07 +0000 Subject: [PATCH 14/21] add URLs to configs --- datasets/blbooks/blbooks.py | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index cfee72530d9..98651e394f5 100644 --- a/datasets/blbooks/blbooks.py +++ 
b/datasets/blbooks/blbooks.py @@ -132,15 +132,16 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): description="All periods of" + _DESCRIPTION, data_urls=_ALL, citation=_CITATION, - url="TODO", + url=URL, skip_empty=True, ), BritishLibraryBooksConfig( name="1800_1899", - description="A subset covering texts published during the 1800-1899 of" + _DESCRIPTION, + description="A subset covering texts published during the 1800-1899 of" + + _DESCRIPTION, data_urls=_1800_1899, citation=_CITATION, - url="TODO", + url=URL, skip_empty=True, ), BritishLibraryBooksConfig( @@ -148,7 +149,7 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): description="Subset covering 1700-1799 of" + _DESCRIPTION, data_urls=_1700_1799, citation=_CITATION, - url="TODO", + url=URL, skip_empty=True, ), BritishLibraryBooksConfig( From b8d470d5e1fffb3d28e35bbfbff845da1cc25c17 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 18:21:59 +0000 Subject: [PATCH 15/21] add url --- datasets/blbooks/blbooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index 98651e394f5..a91427de591 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -157,7 +157,7 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): description="Subset covering 1510-1699 of " + _DESCRIPTION, data_urls=_1510_1699, citation=_CITATION, - url="TODO", + url=URL, skip_empty=True, ), ] From a3d26528e748b031272681da2b6f773b69056d3e Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 18:23:07 +0000 Subject: [PATCH 16/21] black formatting --- datasets/blbooks/blbooks.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/datasets/blbooks/blbooks.py b/datasets/blbooks/blbooks.py index a91427de591..627dda3023e 100644 --- a/datasets/blbooks/blbooks.py +++ b/datasets/blbooks/blbooks.py @@ -53,6 +53,7 @@ "1880_1889": _BASE_URL + 
"ec3b8545-775b-47bd-885d-ce895263709e?locale=en", "1890_1899": _BASE_URL + "54ed2842-089a-439a-b751-2179b3ffba28?locale=en", } + _ALL = list(_DATA_URLS.values()) _1800_1899 = [ _DATA_URLS.get(subset) @@ -69,7 +70,6 @@ "1890_1899", ] ] - _1700_1799 = [_DATA_URLS.get(subset) for subset in ["1700_1799"]] _1510_1699 = [_DATA_URLS.get(subset) for subset in ["1510_1699"]] @@ -137,8 +137,7 @@ class BritishLibraryBooks(datasets.GeneratorBasedBuilder): ), BritishLibraryBooksConfig( name="1800_1899", - description="A subset covering texts published during the 1800-1899 of" - + _DESCRIPTION, + description="A subset covering texts published during the 1800-1899 of" + _DESCRIPTION, data_urls=_1800_1899, citation=_CITATION, url=URL, From 39f18faf7cffd155614abaa92b495a489a6fdb22 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Tue, 25 Jan 2022 18:38:09 +0000 Subject: [PATCH 17/21] add config options to readme --- datasets/blbooks/README.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md index 677ea644d21..0c049ae9f2b 100644 --- a/datasets/blbooks/README.md +++ b/datasets/blbooks/README.md @@ -307,11 +307,18 @@ Again, these numbers should be treated sceptically since some languages appear v ## Dataset Structure -The dataset has a number of configurations: +The dataset has a number of configurations relating to the different dates of publication in the underlying data: -TODO +- `1500_1899`: this configuration covers all years +- `1800_1899`: this configuration covers the years between 1800 and 1899 +- `1700_1799`: this configuration covers the years between 1700 and 1799 +- `1510_1699`: this configuration covers the years between 1510 and 1699 -- `skip_empty_pages` +### Configuration option + +All of the configurations have an optional keyword argument `skip_empty_pages` which is set to `True` by default. The underlying dataset includes some pages where there is no text. 
This could either be because the underlying book page didn't have any text or the OCR software failed to detect this text. + +For many uses of this dataset it doesn't make sense to include empty pages so these are skipped by default. However, for some uses you may prefer to retain a representation of the data that includes these empty pages. Passing `skip_empty=False` when loading the dataset keeps these empty pages. ### Data Instances From 17b507c9b0567c663005e7723da37dc60134d5b7 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Wed, 26 Jan 2022 12:39:57 +0000 Subject: [PATCH 18/21] generate dataset_infos --- datasets/blbooks/dataset_infos.json | 1 + 1 file changed, 1 insertion(+) create mode 100644 datasets/blbooks/dataset_infos.json diff --git a/datasets/blbooks/dataset_infos.json b/datasets/blbooks/dataset_infos.json new file mode 100644 index 00000000000..ef743a9fb2d --- /dev/null +++ b/datasets/blbooks/dataset_infos.json @@ -0,0 +1 @@ +{"all": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{bllabs2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. 
JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "int32", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "all", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 30394267732, "num_examples": 14011953, "dataset_name": 
"british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/61f58234-b370-422f-8591-8f98e46c2757?locale=en": {"num_bytes": 42320165, "checksum": "6795dd54f6b489f9fcd9146bdd6d5d9abdcb34e1492e7a283b720936292ac599"}, "https://bl.iro.bl.uk/downloads/78b4a8ec-395e-4383-831c-809faff85ad7?locale=en": {"num_bytes": 95137895, "checksum": "da6ca8597b7e65c0be51b239f5aa06462419970332c2c82308b8f2e1fe66b58e"}, "https://bl.iro.bl.uk/downloads/91ae15cb-e08f-4abf-8396-e4742d9d4e37?locale=en": {"num_bytes": 178669204, "checksum": "fb78f4c5b46bb9013c5d261d531771faf5e931a329c3b5f7a07854ce2ba523e0"}, "https://bl.iro.bl.uk/downloads/6d1a6e17-f28d-45b9-8f7a-a03cf3a96491?locale=en": {"num_bytes": 283713235, "checksum": "015c76358b83f868afb72444d7a46583e3f3b002176fe7cd5b393216ce5cd2b8"}, "https://bl.iro.bl.uk/downloads/ec764dbd-1ed4-4fc2-8668-b4df5c8ec451?locale=en": {"num_bytes": 383417283, "checksum": "0e0c27465ac18593ef7132985fc736b6c24d21551015327f9e3662cdf402760c"}, "https://bl.iro.bl.uk/downloads/eab68022-0418-4df7-a401-78972514ed20?locale=en": {"num_bytes": 472114735, "checksum": "82a5d00f0e3115bab44e9223b936d0114dd9299767ee3e8a2f749d0ebb60b3b4"}, "https://bl.iro.bl.uk/downloads/d16d88b0-aa3f-4dfe-b728-c58d168d7b4d?locale=en": {"num_bytes": 896281411, "checksum": "87cc662769eb445f9b429a29d7b7f933568719ff37640d08a7c677ee31f4f206"}, "https://bl.iro.bl.uk/downloads/a6a44ea8-8d33-4880-8b17-f89c90e3d89a?locale=en": {"num_bytes": 1327206960, "checksum": "1091a8316519888a0cf9b8dbc6a1084ef23b9d756b75d2ae14229570aa8f8024"}, "https://bl.iro.bl.uk/downloads/2e17f00f-52e6-4259-962c-b88ad60dec23?locale=en": {"num_bytes": 1297572144, "checksum": "dbe664a178dc52bc60f1600a7ebf339a4e5431a578a03c2e8ad45cde8ac8c71b"}, "https://bl.iro.bl.uk/downloads/899c3719-030c-4517-abd3-b28fdc85eed4?locale=en": {"num_bytes": 1486429823, "checksum": "0fb963b2273d9f1c040f20fd77ebc33d6d8fdaec6b3f39b7f9e97faf856b7b9e"}, 
"https://bl.iro.bl.uk/downloads/ec3b8545-775b-47bd-885d-ce895263709e?locale=en": {"num_bytes": 1890726047, "checksum": "b17d1126462b29d904c60e5269f86a4f321069421d72f14052c83c769ce34c50"}, "https://bl.iro.bl.uk/downloads/54ed2842-089a-439a-b751-2179b3ffba28?locale=en": {"num_bytes": 2132446760, "checksum": "9805bbabe285b90cb1f6a3b79a88918aa1c03b9799432775e16f7fd862443196"}}, "download_size": 10486035662, "post_processing_size": null, "dataset_size": 30394267732, "size_in_bytes": 40880303394}, "1800s": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{bllabs2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "int32", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", 
"id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1800s", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 30020434670, "num_examples": 13781747, "dataset_name": "british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/91ae15cb-e08f-4abf-8396-e4742d9d4e37?locale=en": {"num_bytes": 178669204, "checksum": "fb78f4c5b46bb9013c5d261d531771faf5e931a329c3b5f7a07854ce2ba523e0"}, "https://bl.iro.bl.uk/downloads/6d1a6e17-f28d-45b9-8f7a-a03cf3a96491?locale=en": {"num_bytes": 283713235, "checksum": "015c76358b83f868afb72444d7a46583e3f3b002176fe7cd5b393216ce5cd2b8"}, "https://bl.iro.bl.uk/downloads/ec764dbd-1ed4-4fc2-8668-b4df5c8ec451?locale=en": {"num_bytes": 383417283, "checksum": "0e0c27465ac18593ef7132985fc736b6c24d21551015327f9e3662cdf402760c"}, "https://bl.iro.bl.uk/downloads/eab68022-0418-4df7-a401-78972514ed20?locale=en": {"num_bytes": 472114735, "checksum": "82a5d00f0e3115bab44e9223b936d0114dd9299767ee3e8a2f749d0ebb60b3b4"}, "https://bl.iro.bl.uk/downloads/d16d88b0-aa3f-4dfe-b728-c58d168d7b4d?locale=en": {"num_bytes": 896281411, "checksum": "87cc662769eb445f9b429a29d7b7f933568719ff37640d08a7c677ee31f4f206"}, "https://bl.iro.bl.uk/downloads/a6a44ea8-8d33-4880-8b17-f89c90e3d89a?locale=en": {"num_bytes": 1327206960, 
"checksum": "1091a8316519888a0cf9b8dbc6a1084ef23b9d756b75d2ae14229570aa8f8024"}, "https://bl.iro.bl.uk/downloads/2e17f00f-52e6-4259-962c-b88ad60dec23?locale=en": {"num_bytes": 1297572144, "checksum": "dbe664a178dc52bc60f1600a7ebf339a4e5431a578a03c2e8ad45cde8ac8c71b"}, "https://bl.iro.bl.uk/downloads/899c3719-030c-4517-abd3-b28fdc85eed4?locale=en": {"num_bytes": 1486429823, "checksum": "0fb963b2273d9f1c040f20fd77ebc33d6d8fdaec6b3f39b7f9e97faf856b7b9e"}, "https://bl.iro.bl.uk/downloads/ec3b8545-775b-47bd-885d-ce895263709e?locale=en": {"num_bytes": 1890726047, "checksum": "b17d1126462b29d904c60e5269f86a4f321069421d72f14052c83c769ce34c50"}, "https://bl.iro.bl.uk/downloads/54ed2842-089a-439a-b751-2179b3ffba28?locale=en": {"num_bytes": 2132446760, "checksum": "9805bbabe285b90cb1f6a3b79a88918aa1c03b9799432775e16f7fd862443196"}}, "download_size": 10348577602, "post_processing_size": null, "dataset_size": 30020434670, "size_in_bytes": 40369012272}, "1700s": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{bllabs2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. 
JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "int32", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1700s", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 266382657, "num_examples": 178224, "dataset_name": 
"british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/78b4a8ec-395e-4383-831c-809faff85ad7?locale=en": {"num_bytes": 95137895, "checksum": "da6ca8597b7e65c0be51b239f5aa06462419970332c2c82308b8f2e1fe66b58e"}}, "download_size": 95137895, "post_processing_size": null, "dataset_size": 266382657, "size_in_bytes": 361520552}, "1510_1699": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{BritishLibraryBooks2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "timestamp[s]", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical 
description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1510_1699", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 107667469, "num_examples": 51982, "dataset_name": "british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/61f58234-b370-422f-8591-8f98e46c2757?locale=en": {"num_bytes": 42320165, "checksum": "6795dd54f6b489f9fcd9146bdd6d5d9abdcb34e1492e7a283b720936292ac599"}}, "download_size": 42320165, "post_processing_size": null, "dataset_size": 107667469, "size_in_bytes": 149987634}, "1500_1899": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{BritishLibraryBooks2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. 
JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "timestamp[s]", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1500_1899", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 30452067039, "num_examples": 14011953, 
"dataset_name": "british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/61f58234-b370-422f-8591-8f98e46c2757?locale=en": {"num_bytes": 42320165, "checksum": "6795dd54f6b489f9fcd9146bdd6d5d9abdcb34e1492e7a283b720936292ac599"}, "https://bl.iro.bl.uk/downloads/78b4a8ec-395e-4383-831c-809faff85ad7?locale=en": {"num_bytes": 95137895, "checksum": "da6ca8597b7e65c0be51b239f5aa06462419970332c2c82308b8f2e1fe66b58e"}, "https://bl.iro.bl.uk/downloads/91ae15cb-e08f-4abf-8396-e4742d9d4e37?locale=en": {"num_bytes": 178669204, "checksum": "fb78f4c5b46bb9013c5d261d531771faf5e931a329c3b5f7a07854ce2ba523e0"}, "https://bl.iro.bl.uk/downloads/6d1a6e17-f28d-45b9-8f7a-a03cf3a96491?locale=en": {"num_bytes": 283713235, "checksum": "015c76358b83f868afb72444d7a46583e3f3b002176fe7cd5b393216ce5cd2b8"}, "https://bl.iro.bl.uk/downloads/ec764dbd-1ed4-4fc2-8668-b4df5c8ec451?locale=en": {"num_bytes": 383417283, "checksum": "0e0c27465ac18593ef7132985fc736b6c24d21551015327f9e3662cdf402760c"}, "https://bl.iro.bl.uk/downloads/eab68022-0418-4df7-a401-78972514ed20?locale=en": {"num_bytes": 472114735, "checksum": "82a5d00f0e3115bab44e9223b936d0114dd9299767ee3e8a2f749d0ebb60b3b4"}, "https://bl.iro.bl.uk/downloads/d16d88b0-aa3f-4dfe-b728-c58d168d7b4d?locale=en": {"num_bytes": 896281411, "checksum": "87cc662769eb445f9b429a29d7b7f933568719ff37640d08a7c677ee31f4f206"}, "https://bl.iro.bl.uk/downloads/a6a44ea8-8d33-4880-8b17-f89c90e3d89a?locale=en": {"num_bytes": 1327206960, "checksum": "1091a8316519888a0cf9b8dbc6a1084ef23b9d756b75d2ae14229570aa8f8024"}, "https://bl.iro.bl.uk/downloads/2e17f00f-52e6-4259-962c-b88ad60dec23?locale=en": {"num_bytes": 1297572144, "checksum": "dbe664a178dc52bc60f1600a7ebf339a4e5431a578a03c2e8ad45cde8ac8c71b"}, "https://bl.iro.bl.uk/downloads/899c3719-030c-4517-abd3-b28fdc85eed4?locale=en": {"num_bytes": 1486429823, "checksum": "0fb963b2273d9f1c040f20fd77ebc33d6d8fdaec6b3f39b7f9e97faf856b7b9e"}, 
"https://bl.iro.bl.uk/downloads/ec3b8545-775b-47bd-885d-ce895263709e?locale=en": {"num_bytes": 1890726047, "checksum": "b17d1126462b29d904c60e5269f86a4f321069421d72f14052c83c769ce34c50"}, "https://bl.iro.bl.uk/downloads/54ed2842-089a-439a-b751-2179b3ffba28?locale=en": {"num_bytes": 2132446760, "checksum": "9805bbabe285b90cb1f6a3b79a88918aa1c03b9799432775e16f7fd862443196"}}, "download_size": 10486035662, "post_processing_size": null, "dataset_size": 30452067039, "size_in_bytes": 40938102701}, "1800_1899": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{BritishLibraryBooks2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "timestamp[s]", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 
1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1800_1899", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 30077284377, "num_examples": 13781747, "dataset_name": "british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/91ae15cb-e08f-4abf-8396-e4742d9d4e37?locale=en": {"num_bytes": 178669204, "checksum": "fb78f4c5b46bb9013c5d261d531771faf5e931a329c3b5f7a07854ce2ba523e0"}, "https://bl.iro.bl.uk/downloads/6d1a6e17-f28d-45b9-8f7a-a03cf3a96491?locale=en": {"num_bytes": 283713235, "checksum": "015c76358b83f868afb72444d7a46583e3f3b002176fe7cd5b393216ce5cd2b8"}, "https://bl.iro.bl.uk/downloads/ec764dbd-1ed4-4fc2-8668-b4df5c8ec451?locale=en": {"num_bytes": 383417283, "checksum": "0e0c27465ac18593ef7132985fc736b6c24d21551015327f9e3662cdf402760c"}, "https://bl.iro.bl.uk/downloads/eab68022-0418-4df7-a401-78972514ed20?locale=en": {"num_bytes": 472114735, "checksum": "82a5d00f0e3115bab44e9223b936d0114dd9299767ee3e8a2f749d0ebb60b3b4"}, "https://bl.iro.bl.uk/downloads/d16d88b0-aa3f-4dfe-b728-c58d168d7b4d?locale=en": {"num_bytes": 896281411, "checksum": "87cc662769eb445f9b429a29d7b7f933568719ff37640d08a7c677ee31f4f206"}, "https://bl.iro.bl.uk/downloads/a6a44ea8-8d33-4880-8b17-f89c90e3d89a?locale=en": 
{"num_bytes": 1327206960, "checksum": "1091a8316519888a0cf9b8dbc6a1084ef23b9d756b75d2ae14229570aa8f8024"}, "https://bl.iro.bl.uk/downloads/2e17f00f-52e6-4259-962c-b88ad60dec23?locale=en": {"num_bytes": 1297572144, "checksum": "dbe664a178dc52bc60f1600a7ebf339a4e5431a578a03c2e8ad45cde8ac8c71b"}, "https://bl.iro.bl.uk/downloads/899c3719-030c-4517-abd3-b28fdc85eed4?locale=en": {"num_bytes": 1486429823, "checksum": "0fb963b2273d9f1c040f20fd77ebc33d6d8fdaec6b3f39b7f9e97faf856b7b9e"}, "https://bl.iro.bl.uk/downloads/ec3b8545-775b-47bd-885d-ce895263709e?locale=en": {"num_bytes": 1890726047, "checksum": "b17d1126462b29d904c60e5269f86a4f321069421d72f14052c83c769ce34c50"}, "https://bl.iro.bl.uk/downloads/54ed2842-089a-439a-b751-2179b3ffba28?locale=en": {"num_bytes": 2132446760, "checksum": "9805bbabe285b90cb1f6a3b79a88918aa1c03b9799432775e16f7fd862443196"}}, "download_size": 10348577602, "post_processing_size": null, "dataset_size": 30077284377, "size_in_bytes": 40425861979}, "1700_1799": {"description": "A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900.\nThe books cover a wide range of subject areas including philosophy, history, poetry and literature.\n", "citation": "@misc{BritishLibraryBooks2021,\n author = {British Library Labs},\n title = {Digitised Books. c. 1510 - c. 1900. 
JSONL (OCR derived text + metadata)},\n year = {2021},\n publisher = {British Library},\n howpublished={https://doi.org/10.23636/r7w6-zy15}\n", "homepage": "https://www.bl.uk/collection-guides/digitised-printed-books", "license": "", "features": {"record_id": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "timestamp[s]", "id": null, "_type": "Value"}, "raw_date": {"dtype": "string", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "place": {"dtype": "string", "id": null, "_type": "Value"}, "empty_pg": {"dtype": "bool", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "pg": {"dtype": "int32", "id": null, "_type": "Value"}, "mean_wc_ocr": {"dtype": "float32", "id": null, "_type": "Value"}, "std_wc_ocr": {"dtype": "float64", "id": null, "_type": "Value"}, "name": {"dtype": "string", "id": null, "_type": "Value"}, "all_names": {"dtype": "string", "id": null, "_type": "Value"}, "Publisher": {"dtype": "string", "id": null, "_type": "Value"}, "Country of publication 1": {"dtype": "string", "id": null, "_type": "Value"}, "all Countries of publication": {"dtype": "string", "id": null, "_type": "Value"}, "Physical description": {"dtype": "string", "id": null, "_type": "Value"}, "Language_1": {"dtype": "string", "id": null, "_type": "Value"}, "Language_2": {"dtype": "string", "id": null, "_type": "Value"}, "Language_3": {"dtype": "string", "id": null, "_type": "Value"}, "Language_4": {"dtype": "string", "id": null, "_type": "Value"}, "multi_language": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": [{"task": "language-modeling", "text_column": "text"}], "builder_name": "british_library_books", "config_name": "1700_1799", "version": {"version_str": "1.0.2", "description": null, "major": 1, "minor": 0, "patch": 2}, "splits": {"train": {"name": "train", "num_bytes": 267117831, "num_examples": 178224, 
"dataset_name": "british_library_books"}}, "download_checksums": {"https://bl.iro.bl.uk/downloads/78b4a8ec-395e-4383-831c-809faff85ad7?locale=en": {"num_bytes": 95137895, "checksum": "da6ca8597b7e65c0be51b239f5aa06462419970332c2c82308b8f2e1fe66b58e"}}, "download_size": 95137895, "post_processing_size": null, "dataset_size": 267117831, "size_in_bytes": 362255726}}
\ No newline at end of file

From 9796416f9088c109209617734040658e7dadee76 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Mon, 31 Jan 2022 14:56:35 +0100
Subject: [PATCH 19/21] add dummy data

---
 .../blbooks/dummy/1500_1899/1.0.2/dummy_data.zip | Bin 0 -> 5038 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 datasets/blbooks/dummy/1500_1899/1.0.2/dummy_data.zip

diff --git a/datasets/blbooks/dummy/1500_1899/1.0.2/dummy_data.zip b/datasets/blbooks/dummy/1500_1899/1.0.2/dummy_data.zip
new file mode 100644
index 0000000000000000000000000000000000000000..0faf40eab40eb0342a41b831eb401fd7100da3ff
GIT binary patch
literal 5038
[base85-encoded binary data omitted]

From 93417defa19f92cff914d5c1a82ebf730a181291 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Mon, 31 Jan 2022 14:56:41 +0100
Subject: [PATCH 20/21] fix tags

---
 datasets/blbooks/README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md
index 0c049ae9f2b..4e9bc2cb314 100644
--- a/datasets/blbooks/README.md
+++ b/datasets/blbooks/README.md
@@ -1,28 +1,28 @@
 ---
 annotations_creators:
 - no-annotation
- language_creators:
+language_creators:
 - machine-generated
- languages:
+languages:
 - en
 - fr
 - de
 - es
 - it
 - nl
- licenses:
+licenses:
 - cc0-1.0
- multilinguality:
+multilinguality:
 - multilingual
- pretty_name: British Library Books
- size_categories:
-- unknown
- source_datasets:
+pretty_name: British Library Books
+size_categories:
+- 100K
Date: Mon, 31 Jan 2022 17:51:27 +0100
Subject: [PATCH 21/21] Update datasets/blbooks/README.md

---
 datasets/blbooks/README.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/datasets/blbooks/README.md b/datasets/blbooks/README.md
index 4e9bc2cb314..2542ce04f9a 100644
--- a/datasets/blbooks/README.md
+++ b/datasets/blbooks/README.md
@@ -544,7 +544,3 @@ The books are licensed under the [CC Public Domain Mark 1.0](https://creativecom
 ### Contributions
 
 Thanks to [@davanstrien](https://github.com/davanstrien) for adding this dataset.
-
-```
-
-```
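
The size fields in the `dataset_infos.json` entries above can be sanity-checked by hand. The following is a minimal sketch, not part of the patch: it assumes that `download_size` should equal the sum of the per-file `num_bytes` values and that `size_in_bytes` should equal `download_size + dataset_size`. Both relationships are inferred from the numbers shown in this patch rather than from any documented guarantee, and `size_problems` is a hypothetical helper name.

```python
import json

# Trimmed copy of the "1700_1799" entry from the dataset_infos.json in this patch.
INFO_JSON = """
{"1700_1799": {
  "download_checksums": {
    "https://bl.iro.bl.uk/downloads/78b4a8ec-395e-4383-831c-809faff85ad7?locale=en": {
      "num_bytes": 95137895,
      "checksum": "da6ca8597b7e65c0be51b239f5aa06462419970332c2c82308b8f2e1fe66b58e"}},
  "download_size": 95137895,
  "dataset_size": 267117831,
  "size_in_bytes": 362255726}}
"""


def size_problems(infos):
    """Return human-readable size inconsistencies (hypothetical helper)."""
    problems = []
    for name, info in infos.items():
        # Assumption: download_size is the total of all downloaded files.
        files_total = sum(f["num_bytes"] for f in info["download_checksums"].values())
        if files_total != info["download_size"]:
            problems.append(f"{name}: download_size != sum of file sizes")
        # Assumption: size_in_bytes is download_size plus the prepared dataset size.
        if info["download_size"] + info["dataset_size"] != info["size_in_bytes"]:
            problems.append(f"{name}: size_in_bytes != download_size + dataset_size")
    return problems


infos = json.loads(INFO_JSON)
print(size_problems(infos))  # prints []
```

An empty list means the entry is internally consistent under those two assumptions. The same check could be run over the full file by loading it with `json.load` instead of the inline string.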