This repo benchmarks Approximate Nearest Neighbor algorithms supported by Faiss for dense text retrieval (powered by SBERT).
- Install the Python requirements (Faiss & SBERT):

```bash
conda install -c pytorch faiss-cpu
pip install sentence-transformers
```
- First, download the MS MARCO dataset (in BeIR format):

```bash
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip
unzip msmarco.zip
```
- Embed the queries and the documents with the state-of-the-art TAS-B model:

```bash
python embed.py --data_path=./msmarco --ndocs=8841823 --nqueries=509962 --model_name="msmarco-distilbert-base-tas-b" --embedded_dir=./msmarco-embedded
```
- Run the full benchmark:

```bash
bash ./benchmark.sh
```

or run an individual experiment:

```bash
python benchmark.py --eval_string "pq(384, 8)" --embedded_dir=./msmarco-embedded --output_dir=./msmarco-benchmarking
```

where the `eval_string` defines how to build the Faiss index (please refer to benchmark.py for more details and to benchmark.sh for the other experiments).
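To give an intuition for what an `eval_string` like `pq(384, 8)` configures, here is a minimal NumPy sketch of product quantization (M sub-quantizers, 2^nbits centroids each). This is an illustrative re-implementation, not the Faiss API or this repo's code, and it uses a small M and codebook size for the demo:

```python
import numpy as np

def pq_train(x, m, ksub, seed=0):
    """Train one codebook of ksub centroids per sub-space
    (toy k-means: random init plus a single Lloyd iteration)."""
    n, d = x.shape
    dsub = d // m
    rng = np.random.default_rng(seed)
    codebooks = []
    for i in range(m):
        sub = x[:, i * dsub:(i + 1) * dsub]
        centroids = sub[rng.choice(n, ksub, replace=False)].copy()
        dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(ksub):
            members = sub[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(x, codebooks):
    """Compress each vector to m uint8 codes: the index of the
    nearest centroid in each sub-space."""
    m = len(codebooks)
    dsub = x.shape[1] // m
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * dsub:(i + 1) * dsub]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = dists.argmin(1)
    return codes

# 768-dim vectors (the TAS-B embedding size); small m/ksub for the demo
x = np.random.default_rng(1).standard_normal((500, 768)).astype("float32")
codebooks = pq_train(x, m=4, ksub=16)
codes = pq_encode(x, codebooks)  # each vector: 768 floats -> 4 code bytes
```

In Faiss proper, `pq(384, 8)` would instead split each vector into 384 sub-vectors and spend 8 bits (256 centroids) per sub-quantizer.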
To skip steps 2 and 3, you can instead download our pre-computed embedding files:
```bash
mkdir msmarco-embedded
cd msmarco-embedded
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/ids.txt
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/qrels.json
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.queries.pkl
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.documents.pkl
```
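The exact pickle layout is defined by embed.py, so the sketch below uses random stand-in embeddings. It shows the exact (brute-force) dot-product search that every ANN index in this benchmark approximates; TAS-B embeddings are scored by inner product:

```python
import numpy as np

def exact_search(doc_emb, query_emb, k=10):
    """Return the top-k document indices per query by inner product,
    sorted by descending score (brute-force, no index)."""
    scores = query_emb @ doc_emb.T                      # (nqueries, ndocs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]   # unsorted top-k
    order = np.take_along_axis(scores, topk, axis=1).argsort(axis=1)[:, ::-1]
    return np.take_along_axis(topk, order, axis=1)

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 768)).astype("float32")
queries = rng.standard_normal((5, 768)).astype("float32")
hits = exact_search(docs, queries, k=10)                # (5, 10) doc indices
```

Comparing an ANN index's top-k against this exhaustive baseline is what yields the recall numbers reported in the benchmark.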
Results are available in the Benchmarking-ANN Google sheet.
Note: The benchmarks were run on a shared DGX-2 machine, so the time measurements may not be strictly comparable across runs.