Adding nli_tr dataset #787
Conversation
Looks good to me!
Thanks for adding this one ;)
I left minor comments. Once they're resolved we can merge it :)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Thank you @lhoestq for taking the time to review our pull request. We appreciate your help. We've made the changes you described and hope it is now ready to be merged. Please let us know if you have any additional requests for revisions.
datasets/nli_tr/nli_tr.py
Outdated
        self.description = "The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets that were obtained by translating the foundational NLI corpora (SNLI and MNLI) using Amazon Translate."
        self.homepage = "/~https://github.com/boun-tabi/NLI-TR"
        self.citation = """\
@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}zçelik, Rıza and
      G\"{u}ng\"{o}r, Tunga",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}
"""
Sorry, maybe I wasn't clear about description, homepage and citation.
To appear on the datasets hub page on huggingface.co, those three fields actually need to be in the global variables _DESCRIPTION, _CITATION and _HOMEPAGE.
Since they're global variables, you don't need to have them in NLITRConfig; you can directly use those three variables in _info().
See squad.py for an example.
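For concreteness, a minimal sketch of the suggested structure (the class skeleton and feature schema below are illustrative assumptions, not the PR's actual code; the string contents are abbreviated):

    import datasets

    _DESCRIPTION = "The Natural Language Inference in Turkish (NLI-TR) is a set of two large scale datasets obtained by translating SNLI and MNLI using Amazon Translate."
    _HOMEPAGE = "/~https://github.com/boun-tabi/NLI-TR"
    _CITATION = """\
    @inproceedings{budur-etal-2020-data, ...}
    """

    class NLITR(datasets.GeneratorBasedBuilder):
        def _info(self):
            # The module-level variables are passed straight to DatasetInfo,
            # so they no longer need to live on the config class.
            return datasets.DatasetInfo(
                description=_DESCRIPTION,
                homepage=_HOMEPAGE,
                citation=_CITATION,
                # Illustrative schema; the real script defines its own features.
                features=datasets.Features(
                    {
                        "premise": datasets.Value("string"),
                        "hypothesis": datasets.Value("string"),
                        "label": datasets.ClassLabel(names=["entailment", "neutral", "contradiction"]),
                    }
                ),
            )
        # _split_generators and _generate_examples omitted for brevity.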
Very sorry for the confusion. I've revised the implementation accordingly. Thank you so much for taking the time to review it.
…-tr-dataset
# Conflicts:
#	datasets/nli_tr/nli_tr.py
Looks all good now, thanks :)
Hello,
In this pull request, we have implemented the necessary interface to add our recent dataset, NLI-TR. The datasets will be presented in a full paper at EMNLP 2020 this month. [arXiv link]
The dataset is a neural machine translation of the SNLI and MultiNLI datasets into Turkish, so we followed a format similar to that of the original datasets hosted on the HuggingFace datasets hub.
Our dataset is designed to be accessed as follows, following the interface of the GLUE dataset, which provides multiple datasets behind a single interface on the HuggingFace datasets hub.
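A minimal usage sketch (assuming the config names snli_tr and multinli_tr for the two translated corpora; the names are not confirmed in this excerpt):

    from datasets import load_dataset

    # Config names assumed to mirror the two translated corpora.
    snli_tr = load_dataset("nli_tr", "snli_tr")
    multinli_tr = load_dataset("nli_tr", "multinli_tr")

    print(snli_tr["train"][0])  # inspect one translated example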
Thanks for your help in reviewing our pull request.