Optimizing Bi-Encoder Embedders with optimum-intel
Embedders are key components of Retrieval Augmented Generation (RAG) pipelines, mainly used for indexing documents and for online re-ranking.
We showcase a recipe for improving the performance (latency at batch size 1, and throughput) of embedders using `optimum-intel`: quantizing a model to int8 and running the optimized model with the IPEX backend.
The produced quantized models can be used with fastRAG's `QuantizedBiEncoderRanker` and `QuantizedBiEncoderRetriever` components.
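As an illustration of how a quantized model might then be wired into a pipeline, here is a minimal sketch; the import path, the `document_store`/`embedding_model` parameter names, and the in-memory store are assumptions about the fastRAG/Haystack interface, so consult the fastRAG documentation for the exact API:

```python
# Sketch only: the import paths and constructor arguments below are assumptions,
# not verified API; check the fastRAG documentation before using.
from haystack.document_stores import InMemoryDocumentStore
from fastrag.retrievers import QuantizedBiEncoderRetriever

document_store = InMemoryDocumentStore(embedding_dim=384)  # bge-small-en-v1.5 produces 384-dim embeddings
retriever = QuantizedBiEncoderRetriever(
    document_store=document_store,
    embedding_model="quantized_model/",  # path produced by quantize_embedder.py below
)
```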
Steps required for quantization:
- Installing the required Python packages (mainly `optimum-intel` and `intel-extension-for-transformers`): `pip install -r requirements.txt`
- Quantizing a model by running `python quantize_embedder.py --quantize`:
  - with provided calibration data and a model from the Hugging Face model hub
- Benchmarking a quantized model or a vanilla (non-quantized) model using `quantize_embedder.py --benchmark`:
  - running an evaluation on a subset of the Reranking or Retrieval tasks of the MTEB benchmark suite
Quantize `BAAI/bge-small-en-v1.5` using 100 samples:

```sh
python quantize_embedder.py --quantize --model_name BAAI/bge-small-en-v1.5 --output_path quantized_model/ --sample_size 100
```
Benchmark a quantized model on the Reranking tasks of MTEB. Use `--benchmark` to run only the benchmark and `--opt` to benchmark a quantized model:

```sh
python quantize_embedder.py --benchmark --opt --model_name quantized_model/ --task rerank
```
Running inference with a quantized model is similar to using the Hugging Face API; we use the `optimum-intel` auto model classes to load the model.
Loading a model:

```python
from optimum.intel import IPEXModel

model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
```
Inference with auto-mixed precision (bf16):

```python
# run under CPU autocast so activations are computed in bf16
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(**inputs)
```
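Putting these pieces together, the following is a minimal end-to-end sketch; the example sentences, the `padding`/`truncation` tokenizer options, and the CLS-pooling-plus-normalization step (the usual convention for BGE-style embedders) are assumptions added for illustration, not taken from this guide:

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import IPEXModel

model_id = "Intel/bge-small-en-v1.5-rag-int8-static"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = IPEXModel.from_pretrained(model_id)

sentences = [
    "What is retrieval augmented generation?",
    "RAG combines a retriever with a generative reader.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# wrap with torch.cpu.amp.autocast(dtype=torch.bfloat16) as shown above for bf16 inference
with torch.no_grad():
    outputs = model(**inputs)
    # BGE-style embedders use the [CLS] token embedding as the sentence embedding
    embeddings = outputs[0][:, 0]
    # L2-normalize so that dot products correspond to cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # e.g. (2, 384) for bge-small-en-v1.5
```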
Quantized and vanilla Hugging Face models can be benchmarked for latency (batch size = 1) and throughput using the provided `benchmark_speed.py` script.
The benchmarking script uses Aim to log experiment settings and metrics. To install Aim, run `pip install aim`; then run `aim up` to launch the UI.
The script can benchmark the following model backends:
- Vanilla PyTorch
- IPEX w/ and w/o bf16
- IPEX torch-script (traced model) w/ and w/o bf16
- Optimum-intel quantized model with IPEX backend
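The backends above roughly correspond to different ways of preparing the same model. As a rough illustration (this is not taken from `benchmark_speed.py`; the model name and tracing details are assumptions), the IPEX and torch-script variants are typically built like this:

```python
# Illustrative sketch only: shows how the IPEX and traced (torch-script) backends
# listed above are commonly constructed; details may differ from benchmark_speed.py.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5").eval()
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

# IPEX backend (optionally with bf16)
model = ipex.optimize(model, dtype=torch.bfloat16)

# IPEX torch-script backend: trace the model on an example input and freeze it
example = tokenizer("warmup text", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    traced = torch.jit.trace(model, example_kwarg_inputs=dict(example), strict=False)
    traced = torch.jit.freeze(traced)
```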
The benchmarking script has several arguments that define the benchmark:

- `--model-name`: path to a quantized model or a Hugging Face model name
- `--mode`: model type; one of `inc`, `hf`, `ipex`, `ipex-ts`
- `--bf16`: activate `bf16` inference
- `--samples`: the number of samples to run in the benchmark
- `--bs`: batch size
- `--seq-len`: the sequence length of each sample when running the benchmark
- `--warmup`: the number of warmup cycles to run before measuring latency/throughput
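For example, an illustrative invocation sketch built from the arguments above (the specific values, and the assumption that `inc` is the mode for optimum-intel quantized models, are not taken from this guide):

```sh
# vanilla model with the IPEX backend and bf16, batch size 1, 256-token sequences (values are illustrative)
python benchmark_speed.py --model-name BAAI/bge-small-en-v1.5 --mode ipex --bf16 --samples 100 --bs 1 --seq-len 256 --warmup 10

# quantized model produced by quantize_embedder.py (assuming the `inc` mode covers it)
python benchmark_speed.py --model-name quantized_model/ --mode inc --samples 100 --bs 1 --seq-len 256 --warmup 10
```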
To effectively utilize the CPU resources when running on Intel Xeon processors, we should limit the processes to run on a single socket. This can be done using `numactl`.
In addition, it is recommended to use TCMalloc for better performance when accessing commonly-used objects.
How to install `numactl`:

```sh
sudo apt-get update
sudo apt-get install numactl
```

How to run:

- `-C`: specify the core indexes to use; for example, `0-31` instructs to use cores 0 to 31 (32 in total)
- `-m`: bind memory allocation to the given NUMA node (e.g., `0`)

```sh
numactl -C 0-31 -m 0 python script.py args ...
```
Further info on TCMalloc is available here:
- How to install TCMalloc.
- Once installed, expose it via an environment variable: `export LD_PRELOAD=/path/to/libtcmalloc.so`
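Putting it together, a combined invocation might look like the following (the library path, core range, and benchmark arguments are illustrative and depend on your system):

```sh
# preload TCMalloc and pin the benchmark to cores 0-31 and memory node 0 (illustrative values)
export LD_PRELOAD=/path/to/libtcmalloc.so
numactl -C 0-31 -m 0 python benchmark_speed.py --model-name quantized_model/ --mode inc --bs 1 --seq-len 256
```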