In our paper The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus we propose to use a web corpus as a universal, uncurated and unstructured knowledge source for multiple KI-NLP tasks at once.
We leverage an open web corpus coupled with strong retrieval baselines instead of a black-box, commercial search engine - an approach which facilitates transparent and reproducible research and opens up a path for future studies comparing search engines optimised for humans with retrieval solutions designed for neural networks. We use a subset of CCNet covering 134M documents split into 906M passages as the web corpus which we call Sphere.
In this repository we open source indices of Sphere both for the sparse retrieval baseline, compatible with Pyserini, and our best dense model compatible with distributed-faiss. We also provide instructions on how to evaluate the retrieval performance for both standard and newly introduced retrieval metrics, using the KILT API.
If you use the content of this repository in your research, please cite the following:
@article{DBLP:journals/corr/abs-2112-09924,
author = {Aleksandra Piktus and Fabio Petroni
and Vladimir Karpukhin and Dmytro Okhonko
and Samuel Broscheit and Gautier Izacard
and Patrick Lewis and Barlas Oguz
and Edouard Grave and Wen{-}tau Yih
and Sebastian Riedel},
title = {The Web Is Your Oyster - Knowledge-Intensive {NLP} against a Very
Large Web Corpus},
journal = {CoRR},
volume = {abs/2112.09924},
year = {2021},
url = {https://arxiv.org/abs/2112.09924},
eprinttype = {arXiv},
eprint = {2112.09924},
timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2112-09924.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
git clone git@github.com:facebookresearch/Sphere.git
cd Sphere
conda create -n sphere -y python=3.7 && conda activate sphere
pip install -e .
We open source pre-built Sphere indices:
- a Pyserini-compatible sparse BM25 index: sphere_sparse_index.tar.gz - 775.6 GiB
- a distributed-faiss-compatible dense DPR index: sphere_sparse_index.tar.gz - 1.2 TiB
You can download and unpack respective index files directly e.g. via the browser of wget
:
mkdir -p faiss_index
wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz
tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index
wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_dense_index.tar.gz
tar -xzvf faiss_index/sphere_dense_index.tar.gz -C faiss_index
Evaluation with KILT
We implement the retrieval metrics introduced in the paper:
- the
answer-in-context@k
, - the
answer+entity-in-context@k
, - as well as the
entity-in-input
ablation metric
within the KILT repository. Follow instruction below to perform and evaluate retrieval on KILT tasks for both sparse and dense Sphere indices.
pip install -e git+/~https://github.com/facebookresearch/KILT#egg=KILT
Download KILT data. Check out instructions in the KILT repo for more details.
mkdir -p data
python src/kilt/scripts/download_all_kilt_data.py
python src/kilt/scripts/get_triviaqa_input.py
pip install -e git+/~https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss
pip install -e git+/~https://github.com/facebookresearch/DPR@multi_task_training#egg=DPR
pip install spacy==2.1.8
python -m spacy download en
More details here.
python src/distributed-faiss/scripts/server_launcher.py \
--log-dir logs \
--discovery-config faiss_index/disovery_config.txt \
--num-servers 32 \
--num-servers-per-node 4 \
--timeout-min 4320 \
--save-dir faiss_index/ \
--mem-gb 500 \
--base-port 13034 \
--partition dev &
- The DPR_web model: dpr_web_biencoder.cp
- The configuration file: dpr_web_sphere.yaml
mkdir -p checkpoints
wget -P checkpoints http://dl.fbaipublicfiles.com/sphere/dpr_web_biencoder.cp
mkdir -p configs
wget -P configs https://dl.fbaipublicfiles.com/sphere/dpr_web_sphere.yaml
Subsequently update the following fields in the dpr_web_sphere.yaml
configuration file:
n_docs: 100 # the number of documents to retrieve per query
model_file: checkpoints/dpr_web_biencoder.cp # path to the downloaded model file
rpc_retriever_cfg_file: faiss_index/disovery_config.txt # path to the discovery config file used when launching the distributed-faiss server
rpc_index_id: dense # the name of the folder contaning dense index partitions
In order to perform retrieval from the dense index you first need to launch the distributed-faiss server as described above. You can control the KILT datasets you perform retrieval for by modifying respective config files, e.g. src/kilt/configs/dev_data.json
.
python src/kilt/scripts/execute_retrieval.py \
--model_name dpr_distr \
--model_configuration configs/dpr_web_sphere.yaml \
--test_config src/kilt/kilt/configs/dev_data.json \
--output_folder output/dense/
Our sparse index relies on Pyserini, and therfore requires an install of Java 11 to be available on the machine.
pip install jnius
pip install pyserini==0.9.4.0
Next, download the following file:
- The configuration file: bm25_sphere.json
mkdir -p configs
wget -P configs https://dl.fbaipublicfiles.com/sphere/bm25_sphere.json
Subsequently update the following field in the bm25_sphere.json
configuration file:
"k": 100, # the number of documents to retrieve per query
"index": "faiss_index/sparse", # path to the unpacked sparse BM25 index
python src/kilt/scripts/execute_retrieval.py \
--model_name bm25 \
--model_configuration configs/bm25_sphere.json \
--test_config src/kilt/kilt/configs/dev_data.json \
--output_folder output/sparse/
python src/kilt/kilt/eval_retrieval.py \
output/$index/$dataset-dev-kilt.jsonl \ # retrieval results - the output of running eval_retrieval.py
data/$dataset-dev-kilt.jsonl \ # gold KILT file (available for download in the KILT repo)
--ks="1,20,100"
Install and launch distributed-faiss
. More details on the distributed-faiss
server here.
pip install -e git+/~https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss
python src/distributed-faiss/scripts/server_launcher.py \
--log-dir logs/ \
--discovery-config faiss_index/disovery_config.txt \
--num-servers 32 \
--num-servers-per-node 4 \
--timeout-min 4320 \
--save-dir faiss_index/ \
--mem-gb 500 \
--base-port 13034 \
--partition dev &
For a minimal working example of querying the Sphere dense index, we propose to interact with the DPR model via transformers
API. To that end please install dependencies:
pip install transformers==4.17.0
Using the DPR checkpoing with transformers API requires reformatting the original checkpoint. You can download and unpack the transformers
-complatible DPR_web query encoder here:
mkdir -p checkpoints
wget -P checkpoints https://dl.fbaipublicfiles.com/sphere/dpr_web_query_encoder_hf.tar.gz
tar -xzvf checkpoints/dpr_web_query_encoder_hf.tar.gz -C checkpoints/
Alternatively, you can convert the dpr_web_biencoder.cp
model yourself using available scripts.
Then you can run the interactive demo:
python scripts/sphere_client_demo_hf.py \
--encoder checkpoints/dpr_web_query_encoder_hf \
--discovery-config faiss_index/disovery_config.txt \
--index-id dense
Sphere
is released under the CC-BY-NC 4.0 license. See the LICENSE
file for details.