Code, data and models supporting the experiments in the ACL 2019 paper: Unsupervised Question Answering by Cloze Translation.
Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, without using the SQuAD training data at all, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.
This repository provides code to run pre-trained models to generate synthetic question answering data. We also make a very large synthetic training dataset for extractive question answering available.
We make available a dataset of 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.
The data can be downloaded here. The data is in the SQuAD v1 format, and contains:
Fold | # Paragraphs | # QA pairs |
---|---|---|
unsupervised_qa_train.json | 782,556 | 3,915,498 |
unsupervised_qa_dev.json | 1,000 | 4,795 |
unsupervised_qa_test.json | 1,000 | 4,804 |
Using this training data to fine-tune BERT-Large for reading comprehension will achieve over 50.0 F1 on the SQuAD V1.1 development set using an appropriate early stopping strategy on the unsupervised_qa dev set.
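To sanity-check a downloaded fold against the counts in the table above, the sketch below walks the standard SQuAD v1 layout (`data` → `paragraphs` → `qas`) and counts paragraphs and QA pairs. The helper name `count_squad_v1` and the in-memory example are illustrative and not part of this repository; the commented-out file path assumes the fold sits in the working directory.

```python
import json


def count_squad_v1(dataset):
    """Return (n_paragraphs, n_qa_pairs) for a dict in the SQuAD v1 layout."""
    n_paragraphs = 0
    n_qa_pairs = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            n_paragraphs += 1
            n_qa_pairs += len(paragraph["qas"])
    return n_paragraphs, n_qa_pairs


# Tiny made-up example in the SQuAD v1 layout, used only for illustration.
example = {
    "version": "1.1",
    "data": [
        {
            "title": "example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

print(count_squad_v1(example))

# To inspect a downloaded fold instead (the path is an assumption):
# with open("unsupervised_qa_dev.json", encoding="utf-8") as f:
#     print(count_squad_v1(json.load(f)))
```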
In addition to the above data, this repository provides functionality to generate synthetic training data from user-provided documents.
The code is built to run on top of UnsupervisedMT, and requires all of its dependencies. Additional requirements are spaCy (for NER and noun chunking), attrs, and NLTK and allennlp (for constituency parsing). It was developed to run on Ubuntu Linux 18.04 and Python 3.7, with CUDA 9.
(Optionally) Create a conda environment to keep things clean:
conda create -n uqa37 python=3.7 && conda activate uqa37
The recommended way to install is shown below, which should install and handle all dependencies:
# clone the repo
git clone https://github.com/facebookresearch/UnsupervisedQA.git
cd UnsupervisedQA
# install python dependencies:
pip install -r requirements.txt
# install UnsupervisedMT and its dependencies
./install_tools.sh
Four UNMT models are made available for download:
- Sentence Cloze boundaries, Noun Phrase Answers
- Sentence Cloze boundaries, Named Entity Answers
- Sub-clause Cloze boundaries, Named Entity Answers
- Sub-clause Cloze boundaries, Named Entity Answers, Wh Heuristics (best downstream performance)
The models can be downloaded using the script:
./download_models.sh
This will download all the models and unzip them to the appropriate directory. Each unzipped model is about 850MB, so the total space requirement is about 3.5GB.
You can generate reading comprehension training data using unsupervisedqa.generate_synthetic_qa_data.
This script allows you to generate unsupervised question answering data using the identity, noisy cloze or unsupervised NMT methods explored in the paper, and to specify several different configurations (i.e. whether to use subclause shortening, named entity answers and the wh heuristic).
This script provides the following command line arguments:
usage: generate_synthetic_qa_data.py [-h] [--input_file_format {txt,jsonl}]
[--output_file_formats OUTPUT_FILE_FORMATS]
[--translation_method {identity,noisy_cloze,unmt}]
[--use_subclause_clozes]
[--use_named_entity_clozes]
[--use_wh_heuristic]
input_file output_file
Generate synthetic training data for extractive QA tasks without supervision
positional arguments:
input_file input file, see readme for formatting info
output_file Path to write generated data to, see readme for
formatting info
optional arguments:
-h, --help show this help message and exit
--input_file_format {txt,jsonl}
input file format, see readme for more info, default
is txt
--output_file_formats OUTPUT_FILE_FORMATS
comma-seperated list of output file formats, from
[jsonl, squad], an output file will be created for
each format. Default is 'jsonl,squad'
--translation_method {identity,noisy_cloze,unmt}
define the method to generate clozes -- either the
Unsupervised NMT method (unmt), or the identity or
noisy cloze baseline methods. UNMT is recommended for
downstream performance, but the noisy_cloze is
relatively stong on downstream QA and fast to
generate. Default is unmt
--use_subclause_clozes
pass this flag to shorten clozes with constituency
parsing instead of using sentence boundaries
(recommended for downstream performance)
--use_named_entity_clozes
pass this flag to use named entity answer prior
instead of noun phrases (recommended for downstream
performance)
--use_wh_heuristic pass this flag to use the wh-word heuristic
(recommended for downstream performance). Only
compatable with named entity clozes
The input format is specified by the --input_file_format argument. It can be either a .txt file of paragraphs, one per line, from which questions and answers are generated, or a .jsonl file in which each line contains a JSON-serialised dict of the format {"text": text of paragraph, "paragraph_id": your unique identifier for the paragraph}.
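As a minimal sketch of preparing a .jsonl input file in the format described above, the snippet below serialises a list of paragraphs into one JSON dict per line. The helper name `paragraphs_to_jsonl_lines`, the `para-N` identifier scheme and the output filename are assumptions for illustration; any unique identifier should work as `paragraph_id`.

```python
import json


def paragraphs_to_jsonl_lines(paragraphs):
    """Serialise each paragraph as a single-line JSON dict with text and paragraph_id."""
    lines = []
    for index, text in enumerate(paragraphs):
        # Hypothetical id scheme; use whatever identifier is unique in your corpus.
        record = {"text": text.strip(), "paragraph_id": f"para-{index}"}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record))
    return lines


paragraphs = [
    "The Eiffel Tower was completed in 1889.",
    "Marie Curie won two Nobel Prizes.",
]
lines = paragraphs_to_jsonl_lines(paragraphs)
for line in lines:
    print(line)

# To write the input file (the filename is an assumption):
# with open("my_input.jsonl", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
```

The resulting file would then be passed as the input file together with --input_file_format jsonl.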
The output format can be specified using the --output_file_formats argument. You can choose between the jsonl and squad formats. Requesting the squad format will output a file in the SQuAD v1.1 format, ready to be plugged into downstream extractive QA tasks. The jsonl format provides more metadata than the squad format; its fields are explained below:
{
"cloze_id": unique identifier for this datapoint
"paragraph": data on the paragraph this datapoint was generated from
"source_text": the text from the paragraph the cloze was generated from
"source_start": character index in paragraph where "source_text" starts
"cloze_text": the text of the cloze question the question is generated from
"answer_text": the answer text of the (cloze) question
"answer_start": the character index that the answer starts at in the paragraph
"constituency_parse": the constituency parse of the "source_text" if available, otherwise null,
"root_label": the node label of the root of the constituency parse if available, otherwise null,
"answer_type": The named entity label of the answer (if using named entity clozes) otherwise "NOUNPHRASE"
"question_text": the text of the natural question, translated from "cloze_text"
}
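Because the jsonl records carry character offsets, a quick consistency check can catch misaligned data before training. The sketch below verifies that "source_start" and "answer_start" point at "source_text" and "answer_text" within the paragraph. It assumes that the "paragraph" field is a nested dict with a "text" key, which the field list above does not state explicitly; the helper name and the example record are made up and include only the fields the check needs.

```python
import json


def offsets_are_consistent(record):
    """Check that the character offsets in a jsonl record point at the stated text."""
    # Assumption: "paragraph" is a dict holding the paragraph text under "text".
    paragraph_text = record["paragraph"]["text"]
    a_start = record["answer_start"]
    s_start = record["source_start"]
    answer_span = paragraph_text[a_start:a_start + len(record["answer_text"])]
    source_span = paragraph_text[s_start:s_start + len(record["source_text"])]
    return answer_span == record["answer_text"] and source_span == record["source_text"]


# Made-up record for illustration; offsets are computed rather than hard-coded.
paragraph_text = "Marie Curie was born in Warsaw. She won two Nobel Prizes."
source_text = "Marie Curie was born in Warsaw."
record = {
    "cloze_id": "example-0",
    "paragraph": {"paragraph_id": "p0", "text": paragraph_text},
    "source_text": source_text,
    "source_start": paragraph_text.index(source_text),
    "answer_text": "Warsaw",
    "answer_start": paragraph_text.index("Warsaw"),
}

# Round-trip through JSON, as would happen when reading one line of a jsonl file.
line = json.dumps(record)
print(offsets_are_consistent(json.loads(line)))
```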
A working example that produces unsupervised NMT-translated questions using the model trained with wh heuristics, named entity answers and subclause shortening is shown below:
python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
--input_file_format "txt" \
--output_file_formats "jsonl,squad" \
--translation_method unmt \
--use_named_entity_clozes \
--use_subclause_clozes \
--use_wh_heuristic
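When generating data for several configurations, it can help to assemble the command programmatically. The sketch below builds the argument list for the documented flags and rejects the one combination the help text rules out (the wh heuristic without named entity clozes). The function `build_generation_command` is a hypothetical helper, not part of this repository, and it only constructs the command; running it still requires the installed repository and a CUDA-enabled GPU.

```python
import shlex


def build_generation_command(input_file, output_file, translation_method="unmt",
                             use_subclause_clozes=False,
                             use_named_entity_clozes=False,
                             use_wh_heuristic=False,
                             input_file_format="txt",
                             output_file_formats="jsonl,squad"):
    """Build the argv list for unsupervisedqa.generate_synthetic_qa_data."""
    if translation_method not in {"identity", "noisy_cloze", "unmt"}:
        raise ValueError(f"unknown translation method: {translation_method}")
    if use_wh_heuristic and not use_named_entity_clozes:
        # The help text states the wh heuristic only works with named entity clozes.
        raise ValueError("--use_wh_heuristic requires --use_named_entity_clozes")
    command = [
        "python", "-m", "unsupervisedqa.generate_synthetic_qa_data",
        input_file, output_file,
        "--input_file_format", input_file_format,
        "--output_file_formats", output_file_formats,
        "--translation_method", translation_method,
    ]
    if use_named_entity_clozes:
        command.append("--use_named_entity_clozes")
    if use_subclause_clozes:
        command.append("--use_subclause_clozes")
    if use_wh_heuristic:
        command.append("--use_wh_heuristic")
    return command


command = build_generation_command(
    "example_input.txt", "example_output",
    use_subclause_clozes=True,
    use_named_entity_clozes=True,
    use_wh_heuristic=True,
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The returned list could be passed to `subprocess.run(command, check=True)` from the repository root.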
The repository requires a CUDA-enabled GPU (this is a requirement of UnsupervisedMT), but you can reduce the amount of GPU memory required by adjusting the batch sizes. This can be done by modifying the unsupervisedqa/configs.py file, adjusting CONSTITUENCY_BATCH_SIZE and UNMT_BATCH_SIZE.
This repository only provides functionality to run the pre-trained unsupervised question translation models from the paper. Users who want to train new question translation models should use the training functionality in UnsupervisedMT, or consider the newer and more powerful XLM repository.
To train question translation models in UnsupervisedMT, first prepare a large corpus of cloze questions (potentially using the functionality in this repository) and a large corpus of natural questions. Preprocess these corpora by adapting UnsupervisedMT/NMT/get_data_enfr.sh, and train using the example script in UnsupervisedMT/README, with appropriate edits to the args (e.g. en->cloze and fr->question) and paths.
Please cite [1] and [2] if you found the resources in this repository useful.
[1] P. Lewis, L. Denoyer, S. Riedel Unsupervised Question Answering by Cloze Translation
@inproceedings{lewis2019unsupervisedqa,
title={Unsupervised Question Answering by Cloze Translation},
author={Lewis, Patrick and Denoyer, Ludovic and Riedel, Sebastian},
booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year={2019}
}
[2] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato Phrase-Based & Neural Unsupervised Machine Translation
@inproceedings{lample2018phrase,
title={Phrase-Based \& Neural Unsupervised Machine Translation},
author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2018}
}
See the LICENSE file for more details.
If you run into problems installing dependencies (particularly allennlp), installing libffi may help:
apt-get install libffi6 libffi-dev