llama : support reranking API endpoint and models #8555

ciekawy · 2024-07-18T07:44:30Z

Prerequisites

I am running the latest code. Mention the version if possible as well.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Support reranking API and models.

Motivation

Reranking is currently very common techniques used along with embeddings in RAG systems. Also there are models where same model instance can be used for both embeddings and reranking - that is great resource optimisation.

Possible Implementation

Reranking is relatively close to embeddings and there are models for both embed/rerank like bge-m3 - supported by llama.cpp with --embed. I'm guessing that one possible challenge/dilemma is that for inference and embed the OpenAI API schema is being used and OpenAI does not offer rerank API. I think currently there is Jina rerank API commonly used in other projects.
I think that in terms of actual reranking there should not be very complex as it is quite related to embedding calls.

ciekawy · 2024-07-18T07:48:42Z

I saw just one discussion opened for reranking #8216, and possibly two loosely related tickets - linking for visibility

#5403
#5954

and so slightly related but rather out of scope of this ticket is also support for more formats to be converted to gguf

rujialiu · 2024-07-18T13:22:42Z

I'm developing a lightweight (in terms of disk usage) local RAG application. Embedding/LLM is handled very well by llama.cpp, but reranker is a headable. My reranker of choice is 2GB in disk space (bge-reranker-v2-m3), which is bigger than embedding+LLM together. Huggingface's text-embedding-inference is fast, but it doesn't support any quatization (at least in an obvious way); infinity_emb supports onnx's int8 quantization but not lightweight. If llama.cpp supports reranker, I would definitely use it for all embedding/reranking/LLM.

ggerganov · 2024-07-18T14:04:58Z

I am not familiar with the concept of "reranking" - do you have some good resource, or can you explain it in simple terms here?

ciekawy · 2024-07-18T14:09:33Z

TL;DR Reranking involves taking a set of search results and reordering them based on a specific query to better match the query :)

here all is nicely described:
https://jina.ai/reranker/

rujialiu · 2024-07-19T01:39:39Z

We can also reduce token usage and hallucination by filtering out low-score documents before feeding to LLM, which is especially useful when developing tool-using agents: suppose you have 1000 built-in tools and don't want to pass all of them to LLM, then a good way is to use embedding to get, say, top-30 similar tools first and then use reranker to retrieve highly relavent tools only. Embedding + vector search is fast, but much less accurate than reranker, so this embedding+reranker+LLM workflow works very well in practice.

foldl · 2024-07-21T15:43:16Z

FYI: chatllm.cpp supports 2 re-ranker models, and RAG of course.

foldl · 2024-07-21T15:47:33Z

Re-ranking models output a score on a pair of a question and a text chunk, measuring how the chunk fit as an answer.

ggerganov · 2024-07-22T08:15:56Z

Got it. I assume there are some special tokens that are used to specify which text is the question and which text is the answer? And it seems that instead of a LM head, the model ends with a classification head. Is the attention non-causal?

foldl · 2024-07-22T08:39:07Z

In the case of XLMRobertaForSequenceClassification, used by bge-rereanker-m3, bce-reranker, etc, Q&A are encoded as:

~~bos question eos bos answer eos~~

It is non-causal.

The correct pattern is:

cls question sep sep answer sep

which is equivalent to in the case of BGE and BCE:

bos question eos eos answer eos

ciekawy · 2024-07-22T10:16:36Z

it may be worth having a look at the actual rerankers, and their config files

ggerganov · 2024-09-16T11:22:42Z

I'll give this a try

thiner · 2025-02-05T11:13:05Z

@ggerganov
I am running bge-rerank-v2-m3 model with the llama.cpp server b4641. The rerank api is working, but the score seems weird. it always returns "1" for the most matching one, that is not expected. Is this by design, or something wrong with my configuration?

ggerganov · 2025-02-05T11:19:09Z

Can you provide the commands that you are using?

This works for me:

./bin/llama-server \
    -m ../models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    -c 65536 -np 8 -b 8192 -ub 8192 -fa \
    --host 127.0.0.1 --port 8012 -lv 1 \
    --reranking

thiner · 2025-02-06T03:40:16Z

I copied your settings, and tested the API with below request:

{
    "model": "bge-reranker",
    "query": "A man is eating pasta.",
    "documents": [
        "A man is eating food.",
        "A man is eating a piece of bread.",
      "一个中国男人在吃面条",
        "The girl is carrying a baby.",
        "A man is riding a horse.",
        "A young girl is playing violin."
    ]
  }

The response:

{
  "id": null,
  "results": [
    {
      "index": 0,
      "relevance_score": 7.741800308227539
    },
    {
      "index": 1,
      "relevance_score": -2.33689022064209
    },
    {
      "index": 2,
      "relevance_score": 3.8466310501098633
    },
    {
      "index": 3,
      "relevance_score": -11.016427993774414
    },
    {
      "index": 4,
      "relevance_score": -10.9613037109375
    },
    {
      "index": 5,
      "relevance_score": -11.018434524536133
    }
  ],
  "meta": null
}

This seems normal. I saw the score "1" from Dify which is the system using the reranker model. It's very possible the problem of Dify. Thanks for your help.

BTW, most of the rerank API returns a score in range from 0 to 1. Can llama.cpp server implement this feature?

ggerganov · 2025-02-06T07:30:29Z

We can, I just thought it is something very simple that the clients can do on their end. But it's fine to have an option to do it on the server. PRs welcome.

foldl · 2025-02-06T07:46:36Z

@thiner FYI: You can apply the sigmoid function on relevance_score to get a score in range from 0 to 1: f(x) = 1 / (1 + exp(-x)).

thiner · 2025-02-06T08:41:35Z

@foldl thanks for your advice. But I program in Java only... Could you kind help to create a PR?

foldl · 2025-02-06T09:22:42Z

It's something like this in Java.

public static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
}

thiner · 2025-02-07T10:56:47Z

@foldl Thanks, but I meant could you implement the feature for llama.cpp server? Just like @ggerganov mentioned, maybe a new argument for starting llama.cpp server to enable the feature. I am using this rerank API in Dify, I am not able to(or not the correct way to) modify the source code of Dify.

foldl · 2025-02-07T11:08:00Z

@thiner Before a PR for this is landed, you can try scripting in Dify:

https://docs.dify.ai/guides/workflow/node/code

ciekawy added the enhancement New feature or request label Jul 18, 2024

ggerganov added this to ggml : roadmap Jul 18, 2024

ggerganov moved this to Todo in ggml : roadmap Jul 18, 2024

ggerganov changed the title ~~Feature Request: support reranking API endpoint and models~~ llama : support reranking API endpoint and models Jul 18, 2024

github-actions bot added the stale label Aug 22, 2024

ggerganov removed the stale label Aug 24, 2024

Nordln mentioned this issue Aug 26, 2024

Rerankers and Embeddings ollama/ollama#3749

Closed

ggerganov self-assigned this Sep 16, 2024

ggerganov moved this from Todo to In Progress in ggml : roadmap Sep 16, 2024

ggerganov mentioned this issue Sep 16, 2024

llama : add reranking support #9510

Merged

2 tasks

ggerganov closed this as completed Sep 29, 2024

ggerganov moved this from In Progress to Done in ggml : roadmap Sep 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : support reranking API endpoint and models #8555

llama : support reranking API endpoint and models #8555

ciekawy commented Jul 18, 2024

ciekawy commented Jul 18, 2024

rujialiu commented Jul 18, 2024 •

edited

Loading

ggerganov commented Jul 18, 2024

ciekawy commented Jul 18, 2024

rujialiu commented Jul 19, 2024

foldl commented Jul 21, 2024

foldl commented Jul 21, 2024

ggerganov commented Jul 22, 2024

foldl commented Jul 22, 2024 •

edited

Loading

ciekawy commented Jul 22, 2024

ggerganov commented Sep 16, 2024

thiner commented Feb 5, 2025

ggerganov commented Feb 5, 2025

thiner commented Feb 6, 2025

ggerganov commented Feb 6, 2025

foldl commented Feb 6, 2025

thiner commented Feb 6, 2025

foldl commented Feb 6, 2025

thiner commented Feb 7, 2025 •

edited

Loading

foldl commented Feb 7, 2025

llama : support reranking API endpoint and models #8555

llama : support reranking API endpoint and models #8555

Comments

ciekawy commented Jul 18, 2024

Prerequisites

Feature Description

Motivation

Possible Implementation

ciekawy commented Jul 18, 2024

rujialiu commented Jul 18, 2024 • edited Loading

ggerganov commented Jul 18, 2024

ciekawy commented Jul 18, 2024

rujialiu commented Jul 19, 2024

foldl commented Jul 21, 2024

foldl commented Jul 21, 2024

ggerganov commented Jul 22, 2024

foldl commented Jul 22, 2024 • edited Loading

ciekawy commented Jul 22, 2024

ggerganov commented Sep 16, 2024

thiner commented Feb 5, 2025

ggerganov commented Feb 5, 2025

thiner commented Feb 6, 2025

ggerganov commented Feb 6, 2025

foldl commented Feb 6, 2025

thiner commented Feb 6, 2025

foldl commented Feb 6, 2025

thiner commented Feb 7, 2025 • edited Loading

foldl commented Feb 7, 2025

rujialiu commented Jul 18, 2024 •

edited

Loading

foldl commented Jul 22, 2024 •

edited

Loading

thiner commented Feb 7, 2025 •

edited

Loading