This repo benchmarks Approximate Nearest Neighbor algorithms supported by Faiss for dense text retrieval (powered by SBERT).
- Install the Python requirements (Faiss & SBERT):

```bash
conda install -c pytorch faiss-cpu
pip install sentence-transformers
```
- First, download the MS MARCO dataset (in BeIR format):

```bash
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip
unzip msmarco.zip
```
- Embed the queries and the documents with the state-of-the-art TAS-B model:

```bash
python embed.py --data_path=./msmarco --ndocs=8841823 --nqueries=509962 --model_name="msmarco-distilbert-base-tas-b" --embedded_dir=./msmarco-embedded
```
- Run the full benchmark:

```bash
bash ./benchmark.sh
```

or run an individual experiment:

```bash
python benchmark.py --eval_string "pq(384, 8)" --embedded_dir=./msmarco-embedded --output_dir=./msmarco-benchmarking
```

where the `eval_string` defines how to build the Faiss index (please refer to benchmark.py for more details and to benchmark.sh for the other experiments).
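To give an intuition for what an `eval_string` like `pq(384, 8)` configures, here is a minimal NumPy sketch of product quantization (M sub-quantizers, 2^nbits centroids each). This is an illustrative re-implementation, not the Faiss API or this repo's code, and it uses a small M and codebook size for the demo:

```python
import numpy as np

def pq_train(x, m, ksub, seed=0):
    """Train one codebook of ksub centroids per sub-space
    (toy k-means: random init plus a single Lloyd iteration)."""
    n, d = x.shape
    dsub = d // m
    rng = np.random.default_rng(seed)
    codebooks = []
    for i in range(m):
        sub = x[:, i * dsub:(i + 1) * dsub]
        centroids = sub[rng.choice(n, ksub, replace=False)].copy()
        dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(ksub):
            members = sub[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(x, codebooks):
    """Compress each vector to m uint8 codes: the index of the
    nearest centroid in each sub-space."""
    m = len(codebooks)
    dsub = x.shape[1] // m
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * dsub:(i + 1) * dsub]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = dists.argmin(1)
    return codes

# 768-dim vectors (the TAS-B embedding size); small m/ksub for the demo
x = np.random.default_rng(1).standard_normal((500, 768)).astype("float32")
codebooks = pq_train(x, m=4, ksub=16)
codes = pq_encode(x, codebooks)  # each vector: 768 floats -> 4 code bytes
```

In Faiss proper, `pq(384, 8)` would instead split each vector into 384 sub-vectors and spend 8 bits (256 centroids) per sub-quantizer.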
To skip steps 2 and 3, you can instead download our pre-computed embedding files:
```bash
mkdir msmarco-embedded
cd msmarco-embedded
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/ids.txt
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/qrels.json
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.queries.pkl
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.documents.pkl
```
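The exact pickle layout is defined by embed.py, so the sketch below uses random stand-in embeddings. It shows the exact (brute-force) dot-product search that every ANN index in this benchmark approximates; TAS-B embeddings are scored by inner product:

```python
import numpy as np

def exact_search(doc_emb, query_emb, k=10):
    """Return the top-k document indices per query by inner product,
    sorted by descending score (brute-force, no index)."""
    scores = query_emb @ doc_emb.T                      # (nqueries, ndocs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]   # unsorted top-k
    order = np.take_along_axis(scores, topk, axis=1).argsort(axis=1)[:, ::-1]
    return np.take_along_axis(topk, order, axis=1)

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 768)).astype("float32")
queries = rng.standard_normal((5, 768)).astype("float32")
hits = exact_search(docs, queries, k=10)                # (5, 10) doc indices
```

Comparing an ANN index's top-k against this exhaustive baseline is what yields the recall numbers reported in the benchmark.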
Results are available in the Benchmarking-ANN Google sheet.
Note: The benchmarks were run on a shared DGX-2 machine, so the time measurements may not be strictly comparable across runs.