Benchmarking Approximate Nearest Neighbor (ANN) algorithms for dense text retrieval.

kwang2049/benchmarking-ann

Benchmarking ANN

This repo benchmarks Approximate Nearest Neighbor algorithms supported by Faiss for dense text retrieval (powered by SBERT).
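For context, an ANN benchmark usually compares what an approximate index returns against exact (brute-force) search. The following is a minimal pure-Python sketch of that comparison, not code from this repo: the helper names `exact_top_k` and `recall_at_k`, the toy vectors, and the use of dot-product similarity are illustrative assumptions, and the metrics actually reported by benchmark.py may differ.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Sequence[float], docs: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force search: score every document and keep the k best ids."""
    scored = [(dot(query, doc), doc_id) for doc_id, doc in enumerate(docs)]
    # Highest score first; ties are broken by the smaller document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(approx_ids: List[int], exact_ids: List[int]) -> float:
    """Fraction of the exact top-k that the approximate result recovered."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data (hypothetical): four 2-dimensional "document embeddings".
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

exact = exact_top_k(query, docs, k=2)
approx = [0, 3]  # hypothetical output of an approximate index
print(exact, recall_at_k(approx, exact))
```

An approximate index trades some of this recall for lower latency and memory, which is the trade-off such a benchmark measures.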

Usage: Experiments on MS-MARCO

  1. Install the Python requirements (Faiss and SBERT):
    conda install -c pytorch faiss-cpu
    pip install sentence-transformers
    
  2. Download the MS-MARCO dataset (in BEIR format):
    wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip
    unzip msmarco.zip
  3. Embed the queries and the documents with the state-of-the-art TAS-B model:
    python embed.py --data_path=./msmarco --ndocs=8841823 --nqueries=509962 --model_name="msmarco-distilbert-base-tas-b" --embedded_dir=./msmarco-embedded
  4. Run the benchmark:
    bash ./benchmark.sh
    or run an individual experiment:
    python benchmark.py --eval_string "pq(384, 8)" --embedded_dir=./msmarco-embedded --output_dir=./msmarco-benchmarking
    Here, eval_string defines how the Faiss index is built (see benchmark.py for details and benchmark.sh for other experiments).
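To make the `eval_string` idea concrete, here is a hedged sketch of how a string such as `"pq(384, 8)"` could be split into an index name and numeric arguments, and what it would imply for index size. It is not taken from benchmark.py: `parse_eval_string` and `pq_code_bytes` are hypothetical helpers, and reading `pq(384, 8)` as 384 sub-quantizers with 8 bits each is an assumption that should be checked against benchmark.py.

```python
import ast
import re
from typing import Tuple


def parse_eval_string(eval_string: str) -> Tuple[str, Tuple[int, ...]]:
    """Split 'name(arg1, arg2)' into ('name', (arg1, arg2)).

    Hypothetical helper; benchmark.py may interpret the string differently.
    """
    match = re.fullmatch(r"\s*(\w+)\s*\((.*)\)\s*", eval_string)
    if match is None:
        raise ValueError(f"Unrecognized eval_string: {eval_string!r}")
    name, raw_args = match.groups()
    # literal_eval only accepts Python literals, so no arbitrary code runs.
    args = ast.literal_eval(f"({raw_args},)") if raw_args.strip() else ()
    return name, tuple(args)


def pq_code_bytes(n_vectors: int, n_subquantizers: int, n_bits: int) -> int:
    """Bytes needed for the PQ codes alone (codebooks excluded)."""
    bytes_per_vector = (n_subquantizers * n_bits + 7) // 8  # round up to whole bytes
    return n_vectors * bytes_per_vector


name, args = parse_eval_string("pq(384, 8)")
# Assumption: args are (number of sub-quantizers, bits per sub-quantizer code).
size = pq_code_bytes(8841823, *args)
print(name, args, f"{size / 1e9:.2f} GB")
```

Under that reading, the codes for the 8,841,823 documents would occupy about 3.4 GB. Assuming 768-dimensional float32 embeddings, storing the uncompressed vectors would take roughly 27 GB.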

Pre-computed embedding files & results

To skip steps 2 and 3, you can download our pre-computed embedding files:

mkdir msmarco-embedded
cd msmarco-embedded
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/ids.txt
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/qrels.json
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.queries.pkl
wget https://public.ukp.informatik.tu-darmstadt.de/kwang/benchmarking-ann/msmarco-embedded/embeddings.documents.pkl
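After downloading, it can help to check what the files contain before benchmarking. The sketch below only loads each file and reports its Python type and length. The internal layout of the pickles is not documented here, so nothing about their structure is assumed. `describe` and `describe_embedded_dir` are hypothetical helpers. As with any pickle, load these files only if you trust the source.

```python
import json
import pickle
from pathlib import Path
from typing import Dict


def describe(obj: object) -> str:
    """Return the type name, plus the length when the object has one."""
    size = f", len={len(obj)}" if hasattr(obj, "__len__") else ""
    return f"{type(obj).__name__}{size}"


def describe_embedded_dir(embedded_dir: str) -> Dict[str, str]:
    """Summarize every .pkl, .json, and .txt file found in the directory."""
    summary: Dict[str, str] = {}
    for path in sorted(Path(embedded_dir).iterdir()):
        if path.suffix == ".pkl":
            with path.open("rb") as f:
                summary[path.name] = describe(pickle.load(f))
        elif path.suffix == ".json":
            with path.open() as f:
                summary[path.name] = describe(json.load(f))
        elif path.suffix == ".txt":
            with path.open() as f:
                summary[path.name] = f"{sum(1 for _ in f)} lines"
    return summary


# Only runs when the directory from the download step is present.
if Path("./msmarco-embedded").is_dir():
    for file_name, info in describe_embedded_dir("./msmarco-embedded").items():
        print(f"{file_name}: {info}")
```

The document embeddings file is likely to be large, so loading it may take a while and need a substantial amount of memory.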

Results are available at: Benchmarking-ANN Google sheet

Note: The experiments were run on a shared DGX-2 machine, so the timing metrics may not be strictly comparable across runs.
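Because timings on a shared machine are noisy, one common mitigation is to repeat each measurement and report the median instead of a single run. The following is a generic sketch of that idea, not the timing code used by benchmark.py; `time_search` and its `search_fn` argument are hypothetical names.

```python
import statistics
import time
from typing import Callable, Dict


def time_search(search_fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Run search_fn several times and summarize wall-clock latency in seconds."""
    for _ in range(warmup):
        search_fn()  # warm-up calls are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in workload; with Faiss this could be `lambda: index.search(queries, k)`.
stats = time_search(lambda: sum(i * i for i in range(100_000)))
print(stats)
```

A wide gap between the minimum and maximum is a hint that other jobs on the machine interfered with the measurement.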
