This page describes how to reproduce the doc2query document expansion experiments in the following paper:
- Rodrigo Nogueira, Wei Yang, Jimmy Lin, Kyunghyun Cho. Document Expansion by Query Prediction. arXiv:1904.08375
The basic idea is to train a model that, when given an input document, generates questions that the document might answer (or more broadly, queries for which the document might be relevant). These predicted questions (or queries) are then appended to the original documents, which are then indexed as before.
For a complete "from scratch" reproduction (in particular, training the seq2seq model), see this code repo. Here, we run through how to reproduce the BM25+doc2query condition with our copy of the predicted queries.
Note that docTTTTTquery is an improved version of the doc2query model and has largely superseded this model. However, these results remain useful as a baseline.
Here's a summary of the datasets referenced in this guide:
File | Size | MD5 | Download |
---|---|---|---|
msmarco-passage-pred-test_topk10.tar.gz | 764 MB | 241608d4d12a0bc595bed2aff0f56ea3 | [GitLab] |
paragraphCorpus.v2.0.tar.xz | 4.7 GB | a404e9256d763ddcacc3da1e34de466a | [GitLab] |
trec-car-pred-test_topk10.tar.gz | 2.7 GB | b9f98b55e6260c64e830b34d80a7afd7 | [GitLab] |
The GitLab repo is here if you want direct access.
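Since the table lists an MD5 checksum for each tarball, it may be worth verifying a download before unpacking it. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and verify are hypothetical and not part of this repo.

```python
import hashlib

# Checksums copied from the dataset table in this guide.
EXPECTED_MD5 = {
    "msmarco-passage-pred-test_topk10.tar.gz": "241608d4d12a0bc595bed2aff0f56ea3",
    "paragraphCorpus.v2.0.tar.xz": "a404e9256d763ddcacc3da1e34de466a",
    "trec-car-pred-test_topk10.tar.gz": "b9f98b55e6260c64e830b34d80a7afd7",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Hypothetical helper: True if the file at `path` matches the table entry."""
    return md5_of_file(path) == EXPECTED_MD5[filename]
```

For example, verify("collections/msmarco-passage/msmarco-passage-pred-test_topk10.tar.gz", "msmarco-passage-pred-test_topk10.tar.gz") should return True for an intact download.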
To reproduce our doc2query results on the MS MARCO Passage Ranking Task, follow these instructions. Before going through this guide, it is recommended that you reproduce our BM25 baselines first.
To start, grab the predicted queries:
# Grab tarball:
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz -P collections/msmarco-passage
# Unpack tarball:
tar -xzvf collections/msmarco-passage/msmarco-passage-pred-test_topk10.tar.gz -C collections/msmarco-passage
Check out the file:
$ wc collections/msmarco-passage/pred-test_topk10.txt
8841823 536425855 2962345659 collections/msmarco-passage/pred-test_topk10.txt
These are the queries predicted by our seq2seq model using top-k sampling, with 10 samples for each document in the corpus. There are as many lines in the above file as there are documents; all 10 predicted queries are concatenated on a single line.
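Because the predictions are aligned with the corpus purely by line position, a quick alignment check can catch a truncated download early. The sketch below pairs each document with its predictions and fails if the two inputs have different lengths. It assumes that collection.tsv holds one tab-separated docid and passage per line; the function name is hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the shorter of the two inputs


def pair_documents_with_predictions(collection_lines, prediction_lines):
    """Yield (doc_id, text, predictions) triples, aligned by line position.

    Assumes each collection line is "<docid>\t<passage>" and that line i of
    the predictions belongs to line i of the collection.
    """
    pairs = zip_longest(collection_lines, prediction_lines, fillvalue=_MISSING)
    for line_no, (doc_line, pred_line) in enumerate(pairs, start=1):
        if doc_line is _MISSING or pred_line is _MISSING:
            raise ValueError(f"line counts differ at line {line_no}")
        doc_id, text = doc_line.rstrip("\n").split("\t", 1)
        yield doc_id, text, pred_line.strip()
```

Both arguments can be open file handles, for example open("collections/msmarco-passage/collection.tsv") and open("collections/msmarco-passage/pred-test_topk10.txt"), so nothing has to be loaded into memory at once.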
Now let's create a new document collection by concatenating the predicted queries to the original documents:
python tools/scripts/msmarco/augment_collection_with_predictions.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl_expanded_topk10 \
--predictions collections/msmarco-passage/pred-test_topk10.txt --stride 1
To verify (and to track progress), the above script will generate a total of 9 JSON files, docs00.json to docs08.json.
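To make the expansion step concrete, here is a minimal sketch of what such an augmentation might look like. It is not the actual script. It assumes the output records carry id and contents fields (the fields that Anserini's JsonCollection reads) and that shards hold one million documents each, which is consistent with 8,841,823 passages landing in 9 files. All function names are hypothetical.

```python
import json

# Assumption: one million documents per output file, inferred from the
# 9-file output reported for the 8,841,823-passage MS MARCO corpus.
DOCS_PER_FILE = 1_000_000


def expanded_record(doc_id, text, predictions):
    """Build one record with the predicted queries appended to the passage."""
    return {"id": doc_id, "contents": f"{text} {predictions}"}


def shard_name(doc_index, docs_per_file=DOCS_PER_FILE):
    """Name of the output file that the doc_index-th document (0-based) goes to."""
    return f"docs{doc_index // docs_per_file:02d}.json"


def to_json_line(record):
    """Serialize a record as a single JSON line."""
    return json.dumps(record)
```

Under these assumptions, the last MS MARCO passage (index 8,841,822) lands in docs08.json, matching the file range given above.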
After the script completes, we can index the expanded documents:
bin/run.sh io.anserini.index.IndexCollection \
-collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 6 \
-input collections/msmarco-passage/collection_jsonl_expanded_topk10 \
-index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
-storePositions -storeDocvectors -storeRaw
And perform retrieval:
python -m pyserini.search.lucene \
--index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
--topics collections/msmarco-passage/queries.dev.small.tsv \
--topics-format default \
--output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv \
--output-format msmarco \
--bm25 --k1 0.82 --b 0.68 --hits 1000
Alternatively, we can use the Java implementation of the above script, which is faster (taking advantage of multi-threaded retrieval with the -threads option):
bin/run.sh io.anserini.search.SearchCollection \
-index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
-topics collections/msmarco-passage/queries.dev.small.tsv \
-topicReader TsvInt \
-output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv \
-format msmarco \
-hits 1000 \
-threads 8 \
-bm25 -bm25.k1 0.82 -bm25.b 0.68
Finally, to evaluate:
python tools/scripts/msmarco/msmarco_passage_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.expanded-topk10.tsv
The output should be:
#####################
MRR @10: 0.2213412471005586
QueriesRanked: 6980
#####################
Note that this figure is slightly higher than the value reported in our arXiv paper (0.218) due to BM25 parameter tuning (see above) and an upgrade from Lucene 7.6 to Lucene 8.0 (experiments in the paper were run with Lucene 7.6).
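For readers unfamiliar with the metric, MRR@10 averages, over all judged queries, the reciprocal rank of the first relevant passage within the top 10 hits (contributing 0 if there is none). The following is a minimal sketch of that computation, not the official evaluation script; in particular, normalizing by the number of queries in the qrels is an assumption about how the script behaves.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank at cutoff k.

    qrels: dict mapping query id -> set of relevant passage ids
    run:   dict mapping query id -> list of passage ids, best first
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy example: q1 has its relevant passage at rank 2, q2 has none in the run.
qrels = {"q1": {"d1"}, "q2": {"d5"}}
run = {"q1": ["d2", "d1"], "q2": ["d3", "d4"]}
print(mrr_at_k(qrels, run))  # (1/2 + 0) / 2 = 0.25
```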
One additional trick not explored in our arXiv paper is to weight the original document and predicted queries differently.
The augment_collection_with_predictions.py script provides an option --original-copies that duplicates the original text n times, which is an easy way to weight the original document by n.
For example, --original-copies 2 would yield the following results:
#####################
MRR @10: 0.2287041774685029
QueriesRanked: 6980
#####################
So, this simple trick improves MRR@10 from 0.2213 to 0.2287 over baseline doc2query.
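The reason duplication acts as a weight is that repeating the original text multiplies the term frequencies of its words, while the predicted queries are counted only once. The sketch below illustrates the effect. It assumes the copies and the predictions are joined with single spaces, which may differ from what the script actually does, and the function names are hypothetical.

```python
from collections import Counter


def expand_with_copies(text, predictions, original_copies=1):
    """Repeat the original text `original_copies` times, then append predictions."""
    return " ".join([text] * original_copies + [predictions])


def term_frequencies(contents):
    """Whitespace-tokenized, lowercased term counts (a stand-in for a real analyzer)."""
    return Counter(contents.lower().split())


once = term_frequencies(expand_with_copies("lucene index", "what is lucene", 1))
twice = term_frequencies(expand_with_copies("lucene index", "what is lucene", 2))
print(once["index"], twice["index"])  # 1 2: original-only term doubles
print(once["what"], twice["what"])    # 1 1: prediction-only term is unchanged
```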
We will now describe how to reproduce the TREC CAR results of our BM25+doc2query model presented in the paper.
To start, download the TREC CAR dataset and the predicted queries:
mkdir collections/trec_car
# Grab tarballs:
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/paragraphCorpus.v2.0.tar.xz -P collections/trec_car
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/trec-car-pred-test_topk10.tar.gz -P collections/trec_car
# Unpack tarballs:
tar -xf collections/trec_car/paragraphCorpus.v2.0.tar.xz -C collections/trec_car
tar -xf collections/trec_car/trec-car-pred-test_topk10.tar.gz -C collections/trec_car
Check out the file:
$ wc collections/trec_car/pred-test_topk10.txt
29794697 1767258740 11103530216 collections/trec_car/pred-test_topk10.txt
These are the queries predicted by our seq2seq model using top-k sampling, with 10 samples for each document in the corpus. There are as many lines in the above file as there are documents; all 10 predicted queries are concatenated on a single line.
Now let's create a new document collection by concatenating the predicted queries to the original documents:
python src/main/python/trec_car/augment_collection_with_predictions.py \
--collection-path collections/trec_car/paragraphCorpus/dedup.articles-paragraphs.cbor \
--output-folder collections/trec_car/collection_jsonl_expanded_topk10 \
--predictions collections/trec_car/pred-test_topk10.txt --stride 1
To verify (and to track progress), the above script will generate a total of 30 JSON files, docs00.json to docs29.json.
After the script completes, we can index the expanded documents:
bin/run.sh io.anserini.index.IndexCollection \
-collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 30 \
-input collections/trec_car/collection_jsonl_expanded_topk10 \
-index indexes/trec_car/lucene-index.car17v2.0-expanded-topk10
And perform retrieval on the test queries:
sh target/appassembler/bin/SearchCollection -topicReader Car \
-index indexes/trec_car/lucene-index.car17v2.0-expanded-topk10 \
-topics tools/topics-and-qrels/topics.car17v2.0.benchmarkY1test.txt \
-output runs/run.car17v2.0.bm25.expanded-topk10.txt -bm25
Evaluation is performed with trec_eval:
tools/eval/trec_eval.9.0.4/trec_eval -c -m map -m recip_rank \
tools/topics-and-qrels/qrels.car17v2.0.benchmarkY1test.txt \
runs/run.car17v2.0.bm25.expanded-topk10.txt
With the above commands, you should be able to reproduce the following results:
map all 0.1807
recip_rank all 0.2750
Note that this MAP is slightly higher than the value reported in the arXiv paper (0.178) because we used TREC CAR corpus v2.0 in this experiment instead of the corpus v1.5 used in the paper.
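As a reminder of what the map figure measures, the sketch below computes mean average precision over binary relevance judgments. It is an illustration rather than a reimplementation of trec_eval. Queries that are judged but missing from the run contribute 0, which mirrors the intent of the -c flag (averaging over all queries in the qrels).

```python
def average_precision(relevant, ranking):
    """Average precision for one query.

    relevant: set of relevant document ids
    ranking:  list of retrieved document ids, best first
    """
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(qrels, run):
    """Mean of per-query AP over every query in the qrels."""
    return sum(
        average_precision(relevant, run.get(qid, []))
        for qid, relevant in qrels.items()
    ) / len(qrels)
```

For a query with relevant documents {a, c} and ranking [a, b, c], the average precision is (1/1 + 2/3) / 2, or about 0.833.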
Reproduction Log*
- Results reproduced by @justram on 2019-08-09 (commit 5f098f)
- Results reproduced by @ronakice on 2019-08-13 (commit 5b29d16)
- Results reproduced by @edwinzhng on 2020-01-08 (commit 5cc923d)
- Results reproduced by @HangCui0510 on 2020-04-23 (commit 0ae567d)
- Results reproduced by @kelvin-jiang on 2020-05-25 (commit b6e0367)
- Results reproduced by @lintool on 2020-11-09 (commit 94eae4)
- Results reproduced by @b8zhong on 2024-11-29 (commit 778968f)