From 8b453c5296774fe9a4be63687e83fc0a0d11458e Mon Sep 17 00:00:00 2001 From: Xing Han Lu Date: Mon, 4 Oct 2021 20:12:31 -0400 Subject: [PATCH 1/7] Add further details about the MeDAL dataset --- datasets/medal/README.md | 58 +++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 7 deletions(-) diff --git a/datasets/medal/README.md b/datasets/medal/README.md index 77a885483ef..bb573cabce7 100644 --- a/datasets/medal/README.md +++ b/datasets/medal/README.md @@ -70,11 +70,22 @@ English (en) ## Dataset Structure -[More Information Needed] +Each file is a table consisting of three columns: +* TEXT: The normalized content of an abstract +* LOCATION: The location (index) of each abbreviation that was substituted +* LABEL: The word at that was substituted at the given location + ### Data Instances -[More Information Needed] +TEXT: +> a report is given on the recent discovery of outstanding immunological properties in ba ncyanoethyleneurea having a low molecular mass m experiments in ds CS bearing wistar rats have shown that ba at a dosage of only about percent ld mg kg and negligible lethality percent results in a REC rate of percent without hyperglycemia and in one test of percent with hyperglycemia under otherwise unchanged conditions the REF substance ifosfamide if a further development of cyclophosphamide applied without hyperglycemia in its most efficient dosage of percent ld mg kg brought about a recovery rate of percent at a lethality of percent contrary to ba min hyperglycemia caused no further improvement of the REC rate however this comparison is characterized by the fact that both substances exhibit two quite different complementary mechanisms of action leucocyte counts made T3 application of the said cancerostatics and dosages have shown a pronounced stimulation with ba and with ifosfamide the known suppression in the posttherapeutic interval usually found with standard cancerostatics in combination with the cited PI test for ba blood 
pictures then allow conclusions on the immunity status since if can be taken as one of the most efficient cancerostaticsthere is no other chemotherapeutic known up to now that has a more significant effect on the ds carcinosarcoma in rats these findings are of special importance finally the total amount of leucocytes and lymphocytes as well as their time behaviour was determined from the blood picture of tumourfree rats after iv application of ba the thus obtained numerical values clearly show that further research work on the prophylactic use of this substance seems to be necessary and very promising + +LOCATION: +> 24|49|68|113|137|172 + +LABEL: +> carcinosarcoma|recovery|reference|recovery|after|plaque ### Data Fields @@ -82,7 +93,12 @@ English (en) ### Data Splits -[More Information Needed] +The following files are present: + +* `full_data.csv`: The full dataset with all 14M abstracts. +* `train.csv`: The subset used to train the baseline and proposed models. +* `valid.csv`: The subset used to validate the model during training for hyperparameter selection. +* `test.csv`: The subset used to evaluate the model and report the results in the tables. ## Dataset Creation @@ -93,7 +109,7 @@ English (en) ### Source Data -[More Information Needed] +The original dataset was retrieved and modified from the [NLM website](https://www.nlm.nih.gov/databases/download/pubmed_medline.html). #### Initial Data Collection and Normalization @@ -105,7 +121,7 @@ English (en) ### Annotations -[More Information Needed] +Details on how the abbreviations were created can be found in section 2.2 (Dataset Creation) of the [ACL ClinicalNLP paper](https://aclanthology.org/2020.clinicalnlp-1.15.pdf). #### Annotation process @@ -127,7 +143,7 @@ English (en) ### Discussion of Biases -[More Information Needed] +Since the abstracts are written in English, the data is biased towards anglo-centric medical research. 
If you plan to use a model pre-trained on this dataset for a predominantly non-English community, it is important to verify whether there are negative biases present in your model, and ensure that they are correctly mitigated. For instance, you could fine-tune your model on a multilingual medical disambiguation dataset, or collect a dataset specific to your use case. ### Other Known Limitations @@ -141,7 +157,35 @@ English (en) ### Licensing Information -[More Information Needed] +The ELECTRA model is licensed under [Apache 2.0](https://github.com/google-research/electra/blob/master/LICENSE). The licenses for the libraries used in this project (`transformers`, `pytorch`, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license. + + +The original dataset was retrieved and modified from the [NLM website](https://www.nlm.nih.gov/databases/download/pubmed_medline.html). By using this dataset, you are bound by the [terms and conditions](https://www.nlm.nih.gov/databases/download/terms_and_conditions_pubmed.html) specified by NLM: + +> INTRODUCTION +> +> Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data. +> +> MEDLINE/PUBMED SPECIFIC TERMS +> +> NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright. +> +> GENERAL TERMS AND CONDITIONS +> +> * Users of the data agree to: +> * acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner, +> * properly use registration and/or trademark symbols when referring to NLM products, and +> * not indicate or imply that NLM has endorsed its products/services/applications. 
+> +> * Users who republish or redistribute the data (services, products or raw data) agree to: +> * maintain the most current version of all distributed data, or +> * make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM. +> +> * These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data. +> +> * NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page. +> +> * NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates. ### Citation Information From f993afef2e441bd32148a15f039f6ea76bc7be24 Mon Sep 17 00:00:00 2001 From: Xing Han Lu Date: Mon, 4 Oct 2021 20:15:26 -0400 Subject: [PATCH 2/7] Update the download link to the most recent version (v4) --- datasets/medal/medal.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/datasets/medal/medal.py b/datasets/medal/medal.py index 008951d6b4c..a9e7da34bc4 100644 --- a/datasets/medal/medal.py +++ b/datasets/medal/medal.py @@ -45,7 +45,7 @@ A large medical text dataset (14Go) curated to 4Go for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. 
For example, DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorragic fever or dihydroxyfumarate """ -_URL = "https://zenodo.org/record/4276178/files/" +_URL = "https://zenodo.org/record/4482922/files/" _URLS = { "train": _URL + "train.csv", "test": _URL + "test.csv", @@ -57,7 +57,7 @@ class Medal(datasets.GeneratorBasedBuilder): """Medal: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining""" - VERSION = datasets.Version("1.0.0") + VERSION = datasets.Version("4.0.0") def _info(self): return datasets.DatasetInfo( From e5804dda42b2f34c391595ef59042a7d771c84d4 Mon Sep 17 00:00:00 2001 From: Xing Han Lu Date: Wed, 6 Oct 2021 13:42:59 -0400 Subject: [PATCH 3/7] Change dummy data folder to 4.0.0 --- .../medal/dummy/{1.0.0 => 4.0.0}/dummy_data.zip | Bin 1 file changed, 0 insertions(+), 0 deletions(-) rename datasets/medal/dummy/{1.0.0 => 4.0.0}/dummy_data.zip (100%) diff --git a/datasets/medal/dummy/1.0.0/dummy_data.zip b/datasets/medal/dummy/4.0.0/dummy_data.zip similarity index 100% rename from datasets/medal/dummy/1.0.0/dummy_data.zip rename to datasets/medal/dummy/4.0.0/dummy_data.zip From 411869b41df1dd100f77387c0ccdf65a8926bf5d Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Fri, 8 Oct 2021 18:03:10 +0200 Subject: [PATCH 4/7] Update README.md --- datasets/medal/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/datasets/medal/README.md b/datasets/medal/README.md index bb573cabce7..1508aef989a 100644 --- a/datasets/medal/README.md +++ b/datasets/medal/README.md @@ -18,8 +18,9 @@ task_categories: task_ids: - other-other-disambiguation paperswithcode_id: medal +pretty_name: MeDAL --- -# Dataset Card Creation Guide +# Dataset Card for the MeDAL dataset ## Table of Contents - [Dataset Description](#dataset-description) @@ -208,4 +209,4 @@ The original dataset was retrieved and modified from the [NLM website](https://w ### 
Contributions -Thanks to [@Narsil](https://github.com/Narsil) for adding this dataset. \ No newline at end of file +Thanks to [@Narsil](https://github.com/Narsil) for adding this dataset. From 6dfea569e072c19ac40d99817f8f4a373fb77205 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 8 Oct 2021 18:48:44 +0200 Subject: [PATCH 5/7] update infos --- datasets/medal/dataset_infos.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/medal/dataset_infos.json b/datasets/medal/dataset_infos.json index fc318c17f04..c5892d7ecaf 100644 --- a/datasets/medal/dataset_infos.json +++ b/datasets/medal/dataset_infos.json @@ -1 +1 @@ -{"default": {"description": "A large medical text dataset (14Go) curated to 4Go for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. For example, DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorragic fever or dihydroxyfumarate\n", "citation": "@inproceedings{wen-etal-2020-medal,\n title = \"{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining\",\n author = \"Wen, Zhi and\n Lu, Xing Han and\n Reddy, Siva\",\n booktitle = \"Proceedings of the 3rd Clinical Natural Language Processing Workshop\",\n month = nov,\n year = \"2020\",\n address = \"Online\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/2020.clinicalnlp-1.15\",\n pages = \"130--135\",\n abstract = \"One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. 
We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.\",\n}", "homepage": "https://github.com/BruceWen120/medal", "license": "", "features": {"abstract_id": {"dtype": "int32", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "location": {"feature": {"dtype": "int32", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "label": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "medal", "config_name": "default", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3573399948, "num_examples": 3000000, "dataset_name": "medal"}, "test": {"name": "test", "num_bytes": 1190766821, "num_examples": 1000000, "dataset_name": "medal"}, "validation": {"name": "validation", "num_bytes": 1191410723, "num_examples": 1000000, "dataset_name": "medal"}, "full": {"name": "full", "num_bytes": 15536883723, "num_examples": 14393619, "dataset_name": "medal"}}, "download_checksums": {"https://zenodo.org/record/4276178/files/train.csv": {"num_bytes": 3541556520, "checksum": "c5fef2feebd1ecd35b4fe7a0aec266b631c0ac511d4d6b685835328b1ffbf32d"}, "https://zenodo.org/record/4276178/files/test.csv": {"num_bytes": 1180152075, "checksum": "ad391a63449c2bbbdbdf8d1827da4c053607a8586f4162174ba4ccf13efd8f86"}, "https://zenodo.org/record/4276178/files/valid.csv": {"num_bytes": 1180795804, "checksum": "08a0a6c2ee40747744ec15675ab5dc1e2b04491ca951b14c15d8d7bf9d33694d"}, "https://zenodo.org/record/4276178/files/full_data.csv": {"num_bytes": 15158424679, "checksum": "70f1ad891bdf98a42395a8907b48284457ae36d17fcc5a0a9c65c0b6b45ecf8d"}}, "download_size": 21060929078, 
"post_processing_size": null, "dataset_size": 21492461215, "size_in_bytes": 42553390293}} \ No newline at end of file +{"default": {"description": "A large medical text dataset (14Go) curated to 4Go for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. For example, DHF can be disambiguated to dihydrofolate, diastolic heart failure, dengue hemorragic fever or dihydroxyfumarate\n", "citation": "@inproceedings{wen-etal-2020-medal,\n title = \"{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining\",\n author = \"Wen, Zhi and\n Lu, Xing Han and\n Reddy, Siva\",\n booktitle = \"Proceedings of the 3rd Clinical Natural Language Processing Workshop\",\n month = nov,\n year = \"2020\",\n address = \"Online\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/2020.clinicalnlp-1.15\",\n pages = \"130--135\",\n abstract = \"One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. 
We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.\",\n}", "homepage": "https://github.com/BruceWen120/medal", "license": "", "features": {"abstract_id": {"dtype": "int32", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}, "location": {"feature": {"dtype": "int32", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "label": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "medal", "config_name": "default", "version": {"version_str": "4.0.0", "description": null, "major": 4, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3573399948, "num_examples": 3000000, "dataset_name": "medal"}, "test": {"name": "test", "num_bytes": 1190766821, "num_examples": 1000000, "dataset_name": "medal"}, "validation": {"name": "validation", "num_bytes": 1191410723, "num_examples": 1000000, "dataset_name": "medal"}, "full": {"name": "full", "num_bytes": 15536883723, "num_examples": 14393619, "dataset_name": "medal"}}, "download_checksums": {"https://zenodo.org/record/4482922/files/train.csv": {"num_bytes": 3541556520, "checksum": "c5fef2feebd1ecd35b4fe7a0aec266b631c0ac511d4d6b685835328b1ffbf32d"}, "https://zenodo.org/record/4482922/files/test.csv": {"num_bytes": 1180152075, "checksum": "ad391a63449c2bbbdbdf8d1827da4c053607a8586f4162174ba4ccf13efd8f86"}, "https://zenodo.org/record/4482922/files/valid.csv": {"num_bytes": 1180795804, "checksum": "08a0a6c2ee40747744ec15675ab5dc1e2b04491ca951b14c15d8d7bf9d33694d"}, "https://zenodo.org/record/4482922/files/full_data.csv": {"num_bytes": 15158424679, "checksum": "70f1ad891bdf98a42395a8907b48284457ae36d17fcc5a0a9c65c0b6b45ecf8d"}}, 
"download_size": 21060929078, "post_processing_size": null, "dataset_size": 21492461215, "size_in_bytes": 42553390293}} \ No newline at end of file From 48a38a38da1aa8c1a5605cbfe7ada07a49aaa151 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 8 Oct 2021 18:52:43 +0200 Subject: [PATCH 6/7] nits in dataset card --- datasets/medal/README.md | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/datasets/medal/README.md b/datasets/medal/README.md index 1508aef989a..7c47a29a2fc 100644 --- a/datasets/medal/README.md +++ b/datasets/medal/README.md @@ -72,25 +72,28 @@ English (en) ## Dataset Structure Each file is a table consisting of three columns: -* TEXT: The normalized content of an abstract -* LOCATION: The location (index) of each abbreviation that was substituted -* LABEL: The word at that was substituted at the given location +* text: The normalized content of an abstract +* location: The location (index) of each abbreviation that was substituted +* label: The word at that was substituted at the given location ### Data Instances -TEXT: -> a report is given on the recent discovery of outstanding immunological properties in ba ncyanoethyleneurea having a low molecular mass m experiments in ds CS bearing wistar rats have shown that ba at a dosage of only about percent ld mg kg and negligible lethality percent results in a REC rate of percent without hyperglycemia and in one test of percent with hyperglycemia under otherwise unchanged conditions the REF substance ifosfamide if a further development of cyclophosphamide applied without hyperglycemia in its most efficient dosage of percent ld mg kg brought about a recovery rate of percent at a lethality of percent contrary to ba min hyperglycemia caused no further improvement of the REC rate however this comparison is characterized by the fact that both substances exhibit two quite different complementary mechanisms of action leucocyte counts made T3 application of the said 
cancerostatics and dosages have shown a pronounced stimulation with ba and with ifosfamide the known suppression in the posttherapeutic interval usually found with standard cancerostatics in combination with the cited PI test for ba blood pictures then allow conclusions on the immunity status since if can be taken as one of the most efficient cancerostaticsthere is no other chemotherapeutic known up to now that has a more significant effect on the ds carcinosarcoma in rats these findings are of special importance finally the total amount of leucocytes and lymphocytes as well as their time behaviour was determined from the blood picture of tumourfree rats after iv application of ba the thus obtained numerical values clearly show that further research work on the prophylactic use of this substance seems to be necessary and very promising +An example from the train split is: -LOCATION: -> 24|49|68|113|137|172 - -LABEL: -> carcinosarcoma|recovery|reference|recovery|after|plaque +``` +{'abstract_id': 14145090, + 'text': 'velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ 
expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs', + 'location': [63], + 'label': ['transverse aortic constriction']} + ``` ### Data Fields -[More Information Needed] +The column types are: +* text: content of the abstract as a string +* location: indices of the substitutions as a list of integers +* label: substituted words as a list of strings ### Data Splits From 758b21b55af4a1fb4078904ad66f6d26fff482b2 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 8 Oct 2021 18:53:13 +0200 Subject: [PATCH 7/7] add @xhlulu to the list of contributors --- datasets/medal/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datasets/medal/README.md b/datasets/medal/README.md index 7c47a29a2fc..cea3643965a 100644 --- a/datasets/medal/README.md +++ b/datasets/medal/README.md @@ -212,4 +212,4 @@ The original dataset was retrieved and modified from the [NLM website](https://w ### Contributions -Thanks to [@Narsil](https://github.com/Narsil) for adding this dataset. +Thanks to [@Narsil](https://github.com/Narsil) and [@xhlulu](https://github.com/xhlulu) for adding this dataset.
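
The dataset card in this series describes `location` as the index of each substituted abbreviation and `label` as the word that was substituted there. As a rough illustration of how the three columns fit together, here is a minimal sketch. It assumes that `location` is a 0-based index into the whitespace-split tokens of `text`, which matches both examples in the card (token 24 is `CS`, token 63 is `TAC`) but is not stated in the patches. It also assumes that the raw CSV cells use `|` as a separator, as in the first example. The helper names `parse_pipe_field` and `expand_abbreviations` are hypothetical and are not part of `medal.py`.

```python
from typing import List, Sequence, Tuple, Union


def parse_pipe_field(raw: str) -> List[str]:
    """Split a pipe-separated cell such as '24|49|68' into its parts.

    Assumption: raw CSV cells use '|' as the separator, as in the
    LOCATION/LABEL example of the dataset card.
    """
    return [part for part in raw.split("|") if part]


def expand_abbreviations(
    text: str,
    locations: Sequence[Union[int, str]],
    labels: Sequence[str],
) -> List[Tuple[int, str, str]]:
    """Pair each abbreviation in `text` with its expansion.

    Assumption: each location is a 0-based index into the
    whitespace-split tokens of `text`.
    Returns a list of (index, abbreviation, expansion) tuples.
    """
    if len(locations) != len(labels):
        raise ValueError("locations and labels must have the same length")

    tokens = text.split()
    pairs = []
    for loc, label in zip(locations, labels):
        index = int(loc)  # accepts both ints and strings from the raw CSV
        pairs.append((index, tokens[index], label))
    return pairs


# Opening tokens of the abstract quoted in the dataset card,
# together with its first location and label.
prefix = (
    "a report is given on the recent discovery of outstanding "
    "immunological properties in ba ncyanoethyleneurea having a low "
    "molecular mass m experiments in ds CS bearing wistar rats"
)
pairs = expand_abbreviations(
    prefix,
    parse_pipe_field("24"),
    parse_pipe_field("carcinosarcoma"),
)
print(pairs)  # [(24, 'CS', 'carcinosarcoma')]
```

Rows loaded through the `datasets` builder already expose `location` and `label` as lists (see the `Sequence` features in `dataset_infos.json`), so in that case the lists can be passed to `expand_abbreviations` directly, without `parse_pipe_field`.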