Add support for BERT embedding models #5423

Merged
merged 21 commits into from
Feb 11, 2024

iamlemec
Collaborator

@iamlemec iamlemec commented Feb 8, 2024

Following discussion in #2872, adds support for BERT model architecture. Built on top of various contributions from @skeskinen, @xyzhang626, and @cebtenzzre. Includes:

  • New WordPiece tokenizer llm_tokenize_wpm, needed because its behavior differs slightly from SentencePiece. On conversion, the vocab is mapped from the ## subword scheme to a prefix scheme to allow for unified vocab mappings.
  • New model field bert.attention.causal, which controls whether the attention mask is causal (default is true), and tokenizer.ggml.token_type_count, which accounts for token type info, though token types are typically ignored in actual computations.
  • Addition of build_bert for graph construction. This is fairly standard; the only difference is the pooling layer at the end. Currently it pools over the entire batch. Ideally, it would pool only within each sequence.
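
The ## to prefix remapping mentioned in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's conversion code: the function name to_prefix_scheme is hypothetical, the prefix marker is assumed to be the SentencePiece-style U+2581 character, and leaving bracketed special tokens such as [CLS] untouched is also an assumption.

```python
# Hypothetical helper: remap WordPiece "##" continuation tokens to a
# word-initial prefix scheme. The marker U+2581 is an assumption.
PREFIX = "\u2581"

def to_prefix_scheme(token: str) -> str:
    # Bracketed special tokens such as [CLS] are kept as-is (assumption).
    if token.startswith("[") and token.endswith("]"):
        return token
    # Continuation pieces lose their "##" marker.
    if token.startswith("##"):
        return token[2:]
    # Word-initial pieces gain the prefix marker instead.
    return PREFIX + token

vocab = ["[CLS]", "this", "##ml", "[SEP]"]
mapped = [to_prefix_scheme(t) for t in vocab]
print(mapped)
```

With this mapping, word-initial and continuation pieces are distinguished the same way as in a prefix-marked vocab, which is presumably what allows the vocab handling to be shared.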

In terms of which models actually work, the main limitation is tokenization. I have tested with all-MiniLM-L6-v2 and BAAI/bge-*-*-v1.5 (small, base, and large plus en and zh). They seem to work, and the embedding values look similar to those from the Hugging Face implementations. The newer BAAI/bge-m3 uses a SentencePiece tokenizer, so it should be doable, but I haven't tested it.
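
One way to make "look similar" concrete is to compare an embedding from the converted model against a reference embedding with cosine similarity. The sketch below is only an illustration: the two vectors are made-up placeholders, not real model output, and the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for a GGUF embedding and a reference one.
emb_gguf = [0.10, -0.20, 0.30]
emb_ref = [0.11, -0.19, 0.29]

sim = cosine_similarity(emb_gguf, emb_ref)
print(f"cosine similarity: {sim:.4f}")
```

A value close to 1 suggests the two implementations agree up to small numerical differences such as those introduced by f16 weights.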

convert-hf-to-gguf.py (outdated)
        self.block_count = self.hparams["num_hidden_layers"]

    def set_gguf_parameters(self):
        # TODO(cebtenzzre): merge with parent class
Collaborator

Note to self: resolve this before merge

Contributor

@iacore iacore Jun 29, 2024

have you... have you forgotten about this...

iamlemec and others added 2 commits February 8, 2024 17:33
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
@ggerganov
Owner

In terms of which models actually work, the main limitation is tokenization.

When I was playing with bert.cpp the other day, I noticed some potential problems with the tokenization when using a bge-base model. For example:

./build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "This is a ggml"

Tokenizes to:

101 -> [CLS]
2023 -> this
2003 -> is
1037 -> a
2290 -> ##g
19968 -> ##ml
102 -> [SEP]

It seems a g is missing, and there is an extra ## concatenation marker on token 2290. So the tokenization might need some more work, but this can be improved later.
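
For reference, greedy longest-match WordPiece splitting can be sketched as below. This is a generic illustration under stated assumptions, not the llm_tokenize_wpm implementation: the vocab is a tiny made-up set rather than the real bge-base vocab, and the function name is hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation marker.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Toy vocab (assumption), chosen so the example word splits into three pieces.
toy_vocab = {"this", "is", "a", "g", "##g", "##ml"}
print(wordpiece("ggml", toy_vocab))
```

With this toy vocab the first piece is emitted without the ## marker and the second g is kept as its own continuation piece, which is the shape of output the comment above expects.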

@iamlemec
Collaborator Author

iamlemec commented Feb 9, 2024

When I was playing with bert.cpp the other day, I noticed some potential problems with the tokenization when using a bge-base model.

Ah yeah, that was a bug in bert.cpp that was fixed a few days ago. It's correct in this PR.

@iamlemec
Collaborator Author

I have batched embedding working now (bert-batched). Basically just matmul an [n_tokens, n_tokens] pooling matrix at the end. It would make more sense for it to be [n_tokens, n_seq_max], but we don't actually know n_seq_max, so this is a worst case scenario. Embeddings can be fetched by seq_id just like with logits using get_embeddings_ith. Updated the embeddings example to split by lines and embed as separate sequences in one batch.
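
The pooling-matrix idea described above can be sketched in plain Python. This is an illustration under assumptions, not the bert-batched code: it assumes mean pooling, assumes sequence ids are integers in the range 0 to n_tokens - 1, and uses hypothetical function names.

```python
def build_pool_matrix(seq_ids):
    """Build an [n_tokens, n_tokens] mean-pooling matrix from per-token seq ids."""
    n_tokens = len(seq_ids)
    counts = {}
    for s in seq_ids:
        counts[s] = counts.get(s, 0) + 1
    # Worst case: every token is its own sequence, hence n_tokens rows.
    pool = [[0.0] * n_tokens for _ in range(n_tokens)]
    for t, s in enumerate(seq_ids):
        # Row s averages over all tokens that belong to sequence s.
        pool[s][t] = 1.0 / counts[s]
    return pool

def matmul(a, b):
    # Naive matrix product, adequate for a small illustration.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

seq_ids = [0, 0, 1]                              # tokens 0-1 in seq 0, token 2 in seq 1
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # placeholder [n_tokens, n_embd] states
pooled = matmul(build_pool_matrix(seq_ids), hidden)
print(pooled)
```

Row i of the result is the pooled embedding of sequence i; rows beyond the number of sequences actually present are all zeros, which is the cost of sizing the matrix for the worst case.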

Should I push this to this PR or wait until this goes through and start a new one?

llama.cpp Outdated
Comment on lines 7270 to 7489
- // the output is always the last tensor in the graph
- struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
- GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+ // get logits and embeddings
+ struct ggml_tensor * res = ggml_graph_get_tensor(gf, "result_output");
+ struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "result_norm");
Owner

Using ggml_graph_get_tensor is not recommended here because it does a strcmp against every tensor in the graph, which can become noticeable in terms of speed. For now, we should be "poking" at the last few tensors to find what we need. It's not great, but it will improve in the future.
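
The difference between a full name search and "poking" at the last few tensors can be illustrated with a list of node names. This is only a Python sketch of the idea, not ggml code; the function name, the max_back parameter, and the node names other than result_norm and result_output are hypothetical.

```python
def find_in_tail(node_names, target, max_back=3):
    """Return the index of target among the last max_back nodes, else None."""
    limit = min(max_back, len(node_names))
    # Walk backwards from the final node; at most max_back comparisons.
    for offset in range(1, limit + 1):
        if node_names[-offset] == target:
            return len(node_names) - offset
    return None

names = ["inp_embd", "attn_out", "result_norm", "result_output"]
print(find_in_tail(names, "result_output"))
print(find_in_tail(names, "result_norm"))
```

The number of string comparisons is bounded by max_back regardless of graph size, whereas a search over the whole graph grows with the number of nodes.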

Owner

@ggerganov ggerganov left a comment

Let's fix the ggml_graph_get_tensor comment and merge. After that, we can look into batching support in a separate PR.

@iamlemec iamlemec merged commit 2891c8a into ggerganov:master Feb 11, 2024
46 of 54 checks passed
@ditsuke
Contributor

ditsuke commented Feb 27, 2024

EDIT: I was able to convert all-MiniLM-L6-v2 with a fork of bert.cpp but the resulting gguf model doesn't load in llama.cpp.

You have to use the hf-to-gguf script that ships with this repo, bert.cpp's conversion script doesn't produce a llama.cpp-compatible version.

Yes, that's what I'm asking for: for the hf-to-gguf script here to support converting the all-mpnet-base-v2 model. It crashes at the moment with the NotImplemented error I describe in my earlier post.

Oh, I see. I think your last statement was meant for all-mpnet then, you should open a new issue.

@iamlemec
Collaborator Author

It looks like all-mpnet has T5-style relative position embeddings. I don't think those are supported here yet.

@mofanke

mofanke commented Mar 10, 2024

I tried BAAI/bge-m3, but it does not work for now, because the model architecture is XLMRobertaModel, not Bert, and "tokenizer_class": "XLMRobertaTokenizer".

@cebtenzzre
Collaborator

I tried BAAI/bge-m3, but it does not work for now, because the model architecture is XLMRobertaModel, not Bert, and "tokenizer_class": "XLMRobertaTokenizer".

You could open a feature request if you haven't already.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@mofanke

mofanke commented Mar 14, 2024

#6007 already done

@hiepxanh

#6007 already done

What do you mean? I don't think this is supported yet.

I tried converting today and got this:

(llama.cpp) E:\pre-built\llama.cpp>python convert-hf-to-gguf.py models/multilingual-e5-large/
Loading model: multilingual-e5-large
Traceback (most recent call last):
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 2073, in <module>
    main()
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 2053, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 204, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'XLMRobertaModel' not supported!
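
The traceback shows the converter looking up a model class by the first entry of the "architectures" field and raising when no class is registered for it. A minimal sketch of that dispatch pattern follows. Only the from_model_architecture name and the error message come from the traceback; the registry attribute, the register decorator, and the class body are hypothetical and may not match the script.

```python
class Model:
    # Hypothetical registry mapping architecture names to converter classes.
    _registry = {}

    @classmethod
    def register(cls, *names):
        def wrap(model_cls):
            for name in names:
                cls._registry[name] = model_cls
            return model_cls
        return wrap

    @classmethod
    def from_model_architecture(cls, arch):
        try:
            return cls._registry[arch]
        except KeyError:
            # Same message format as in the traceback above.
            raise NotImplementedError(f"Architecture {arch!r} not supported!") from None

@Model.register("BertModel")
class BertModel(Model):
    pass

print(Model.from_model_architecture("BertModel").__name__)
```

Under this pattern, supporting a new architecture such as XLMRobertaModel means registering a class for that name, in addition to whatever tokenizer and tensor-mapping work the model needs.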

@mofanke

mofanke commented Mar 18, 2024

Aha, I'm sorry for causing confusion. I just meant that I opened a feature request.

howlger added a commit to howlger/djl that referenced this pull request Apr 1, 2024
In order to get support for BERT based sentence embedding models like BAAI/bge-base-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, or others, update llama.cpp from

b1696 (2023-12-12):
https://github.com/ggerganov/llama.cpp/releases/tag/b1696

to the current latest release

b2581 (2024-03-30):
https://github.com/ggerganov/llama.cpp/releases/tag/b2581

BERT support was added to llama.cpp in February 2024:
ggerganov/llama.cpp#5423
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@jkgenser

jkgenser commented Apr 2, 2024

Just tried google bert uncased and raised NotImplementedError: Architecture "BertForMaskedLM" not supported! I probably miss something here.
another model NotImplementedError: Architecture "BertForSequenceClassification" not supported!

These are BERT models that have been pretrained to create embeddings for individual words, but this PR is for BERT models that have been trained to generate embeddings for entire sentences and paragraphs, and those will not produce good results here.

The keyword you are looking for is "SBert", or Sentence Transformers in general (paper, website, HF).

nomic-embed-text-v1 is a good model to start with. Disclosure: I work for Nomic.

bge-base-en-v1.5 is another BERT of similar size.

So if I fine-tune a BERT model for a classification task, converting it to GGML would not work? I've been watching this work and am really excited to be able to deploy my fine-tuned BERT models on llama.cpp.

@beyondskyway

I get the same XLMRobertaModel error when converting https://huggingface.co/maidalun1020/bce-embedding-base_v1/tree/main

@ggerganov
Owner

Where is the reference implementation of XLMRobertaModel for models such as https://huggingface.co/intfloat/multilingual-e5-base/tree/main? Would like to add support for these

@mofosyne added the enhancement, model, and Review Complexity : High labels May 13, 2024
@iamlemec
Collaborator Author

I think the original is here at fairseq: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta. There's also an implementation in transformers: https://github.com/huggingface/transformers/tree/main/src/transformers/models/xlm_roberta. I've actually been looking into XLMRobertaModel to run BAAI/bge-m3, and comparing the transformers implementations, I think the model side is actually identical to BERT.

But there are differences in the tokenization that have driven me slightly mad trying to understand. The model file is called sentencepiece.bpe.model, but it appears to be an actual SentencePiece (unigram) style model, not BPE. Even then the way it handles spaces with metaspaces looks to be a little different from the way SPM works in the current llama.cpp implementation.
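
For readers unfamiliar with metaspaces: SentencePiece-style tokenizers replace spaces with the U+2581 marker before segmenting. The sketch below shows the basic convention only. It is a simplified illustration with a hypothetical function name; whether a marker is prepended to the start of the text is exactly the kind of detail that can differ between implementations, so it is exposed here as a parameter rather than asserted for any specific tokenizer.

```python
METASPACE = "\u2581"

def to_metaspace(text, add_prefix=True):
    # Replace every space with the metaspace marker.
    out = text.replace(" ", METASPACE)
    # Optionally mark the start of the text as a word boundary too.
    if add_prefix and not out.startswith(METASPACE):
        out = METASPACE + out
    return out

print(to_metaspace("hello world"))
print(to_metaspace("hello world", add_prefix=False))
```

Two tokenizers that differ only in the add_prefix behavior will produce different first tokens for the same input, which is enough to change the resulting embeddings.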

@ggerganov
Owner

Thanks!

Even then the way it handles spaces with metaspaces looks to be a little different from the way SPM works in the current llama.cpp implementation.

Maybe the "clean_up_tokenization_spaces" parameter is controlling this behaviour?

https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json#L3

@sragrawal

Hi all, is there any plan to support XLMRobertaModel? https://huggingface.co/intfloat/multilingual-e5-small works very well for multilingual embeddings for its size (https://huggingface.co/spaces/mteb/leaderboard). Please let me know if I should open a new issue for this.

@iamlemec
Collaborator Author

@sragrawal I believe that Unigram support from #8089 will get us most of the way there on the XLMRoberta tokenizer (which is featured in this and others such as BAAI/bge-m3). The main thing is loading and using the trie structure stored in precompiled_charsmap. There may be some additional pretokenization stuff, but that should be easier to handle.
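
The role of a normalization trie can be illustrated with a longest-prefix-match replacement over a small mapping. This is a conceptual sketch only: it uses a plain nested-dict trie and a made-up mapping, whereas the real precompiled_charsmap is a compact binary structure that would need its own parser. All names here are hypothetical.

```python
def build_trie(mapping):
    """Build a nested-dict trie; the key None marks the end of a source string."""
    root = {}
    for src, dst in mapping.items():
        node = root
        for ch in src:
            node = node.setdefault(ch, {})
        node[None] = dst
    return root

def normalize(text, trie):
    """Replace the longest matching source string at each position."""
    out = []
    i = 0
    while i < len(text):
        node = trie
        best_len, best_dst = 0, None
        j = i
        # Walk the trie as far as the text allows, remembering the last hit.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best_len, best_dst = j - i, node[None]
        if best_dst is None:
            out.append(text[i])   # no rule applies: copy the character
            i += 1
        else:
            out.append(best_dst)  # apply the longest matching rule
            i += best_len
    return "".join(out)

# Made-up rules: expand the "fi" ligature, map a no-break space to a space.
charsmap = build_trie({"\ufb01": "fi", "\u00a0": " "})
print(normalize("\ufb01le\u00a0x", charsmap))
```

The longest-match walk matters when one source string is a prefix of another, since the longer rule should take precedence.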

@grigohas

grigohas commented Oct 9, 2024

Hello, is there a work flow on how to build and run bert through llama.cpp ?

@iacore
Contributor

iacore commented Oct 9, 2024

Hello, is there a work flow on how to build and run bert through llama.cpp ?

I wrote about it here. Not sure what "workflow" you are referring to.

@grigohas

Is there a way to use llama.cpp to generate text with bert ?

@iacore
Contributor

iacore commented Nov 3, 2024

BERT is not an LLM, afaik.
