Add support for BERT embedding models #5423
Conversation
        self.block_count = self.hparams["num_hidden_layers"]

    def set_gguf_parameters(self):
        # TODO(cebtenzzre): merge with parent class
Note to self: resolve this before merge
have you... have you forgotten about this...
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
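For readers following along, here is a minimal standalone sketch of the kind of hyperparameter mapping a BERT-specific `set_gguf_parameters` needs to cover, using the standard Hugging Face `config.json` field names and the `bert.*` GGUF key names discussed in this PR. The local path and the exact selection of keys are illustrative assumptions, not the converter code itself:

```python
import json

# Hypothetical local path to a downloaded BERT-style model.
with open("models/bge-base-en-v1.5/config.json") as f:
    hparams = json.load(f)

# Map HF config fields to GGUF metadata keys (illustrative subset).
gguf_kv = {
    "bert.block_count":                  hparams["num_hidden_layers"],
    "bert.context_length":               hparams["max_position_embeddings"],
    "bert.embedding_length":             hparams["hidden_size"],
    "bert.feed_forward_length":          hparams["intermediate_size"],
    "bert.attention.head_count":         hparams["num_attention_heads"],
    "bert.attention.layer_norm_epsilon": hparams["layer_norm_eps"],
    "bert.attention.causal":             False,  # new flag from this PR: BERT attention is non-causal
}
print(json.dumps(gguf_kv, indent=2))
```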
When I was playing with ./build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "This is a ggml", it tokenizes to:
Seems like a
Ah yeah, that was a bug in
I have batched embedding working now (bert-batched). Basically just matmul an Should I push this to this PR or wait until this goes through and start a new one?
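Not the code from the bert-batched branch, but to illustrate the general idea with numpy: per-sequence pooling over a packed batch can be expressed as a single matmul with a normalized sequence-membership matrix, which is the kind of operation that fits naturally into a ggml graph. The shapes and variable names below are assumptions for the sketch:

```python
import numpy as np

# Toy packed batch: 5 token embeddings of dim 4 belonging to two sequences
# (tokens 0-2 -> sequence 0, tokens 3-4 -> sequence 1).
token_embd = np.random.rand(5, 4).astype(np.float32)   # [n_tokens, n_embd]
seq_id     = np.array([0, 0, 0, 1, 1])                  # sequence id per token

n_seq = int(seq_id.max()) + 1
# Pooling matrix P: [n_seq, n_tokens]; row i averages the tokens of sequence i.
P = np.zeros((n_seq, len(seq_id)), dtype=np.float32)
for tok, sid in enumerate(seq_id):
    P[sid, tok] = 1.0
P /= P.sum(axis=1, keepdims=True)

# One matmul yields per-sequence mean-pooled embeddings: [n_seq, n_embd].
seq_embd = P @ token_embd

# Sanity check against explicit per-sequence means.
assert np.allclose(seq_embd[0], token_embd[:3].mean(axis=0))
assert np.allclose(seq_embd[1], token_embd[3:].mean(axis=0))
print(seq_embd.shape)  # (2, 4)
```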
llama.cpp (outdated)
-    // the output is always the last tensor in the graph
-    struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
-    GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+    // get logits and embeddings
+    struct ggml_tensor * res = ggml_graph_get_tensor(gf, "result_output");
+    struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "result_norm");
Using `ggml_graph_get_tensor` is not recommended here because it will do a `strcmp` with the entire graph, which can become noticeable in terms of speed. For now, we should be "poking" at the last few tensors to find what we need - not great, but will improve in the future.
Let's fix the `ggml_graph_get_tensor` comment and merge. After that, we can look into batching support in a separate PR.
Oh, I see. I think your last statement was meant for
It looks like
I tried BAAI/bge-m3, but it does not work right now, because the model architecture is XLMRobertaModel rather than Bert, and its "tokenizer_class" is "XLMRobertaTokenizer".
You could open a feature request if you haven't already.
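For anyone unsure whether a given model falls within what this PR covers (a plain BERT architecture with a WordPiece tokenizer), here is a rough sketch of the check described above. The local paths are assumptions and the repo has to be downloaded first; this is only a heuristic, not an official compatibility test:

```python
import json
from pathlib import Path

# Hypothetical local snapshot of a Hugging Face model repo.
model_dir = Path("models/bge-m3")

config     = json.loads((model_dir / "config.json").read_text())
tok_config = json.loads((model_dir / "tokenizer_config.json").read_text())

archs     = config.get("architectures", [])
tok_class = tok_config.get("tokenizer_class", "")

# This PR targets plain BERT models with a WordPiece (BertTokenizer) vocab.
# XLMRobertaModel / XLMRobertaTokenizer (e.g. bge-m3) use SentencePiece and
# are not handled by it.
if any("Bert" in a for a in archs) and tok_class.startswith("Bert"):
    print("looks like a BERT/WordPiece model - likely convertible")
else:
    print(f"not covered by this PR: architectures={archs}, tokenizer_class={tok_class}")
```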
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
#6007 already done
What do you mean? I think the PR does not support it yet. I tried to convert today and saw this one?
Aha, I'm sorry for causing you confusion. I just meant that I opened a feature request.
In order to get support for BERT-based sentence embedding models like BAAI/bge-base-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, or others, update llama.cpp from b1696 (2023-12-12): https://github.com/ggerganov/llama.cpp/releases/tag/b1696 to the current latest release, b2581 (2024-03-30): https://github.com/ggerganov/llama.cpp/releases/tag/b2581. BERT support was added to llama.cpp in February 2024: ggerganov/llama.cpp#5423
So if I fine-tune a BERT model for a classification task, it would not work to convert it to GGML? I've been watching this work and am really excited to be able to deploy my fine-tuned BERT models on llama.cpp.
Same with converting https://huggingface.co/maidalun1020/bce-embedding-base_v1/tree/main
Where is the reference implementation of
I think the original is here at fairseq: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta. There's also an implementation in But there are differences in the tokenization that have driven me slightly mad trying to understand. The model file is called
Thanks!
Maybe the https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json#L3
Hi all, is there any plan to support XLMRobertaModel? https://huggingface.co/intfloat/multilingual-e5-small works very well for multilingual embeddings for its size (https://huggingface.co/spaces/mteb/leaderboard). Please let me know if I should open a new issue for this.
@sragrawal I believe that Unigram support from #8089 will get us most of the way there on the
Hello, is there a workflow on how to build and run BERT through llama.cpp?
I wrote about it here. Not sure what "workflow" you are referring to.
Is there a way to use llama.cpp to generate text with BERT?
BERT is not an LLM, afaik.
Following discussion in #2872, adds support for BERT model architecture. Built on top of various contributions from @skeskinen, @xyzhang626, and @cebtenzzre. Includes:

* WordPiece tokenizer `llm_tokenize_wpm`. Needed for slightly different behavior from SentencePiece. On conversion, the vocab is mapped from the `##` subword scheme to the `▁` prefix scheme to allow for unified vocab mappings (see the sketch below).
* A flag `bert.attention.causal` that controls whether the attention mask is causal or not (default is `true`). Also `tokenizer.ggml.token_type_count`, which accounts for token type info, though these are typically ignored in actual computations.
* `build_bert` for graph construction. This is fairly standard. The only difference is the pooling layer at the end. Currently it will pool the entire batch. Ideally, it could be made to pool only within sequence.

In terms of which models actually work, the main limitation is tokenization. I have tested with `all-MiniLM-L6-v2` and `BAAI/bge-*-*-v1.5` (`small`, `base`, and `large`, plus `en` and `zh`) and they seem to work, and the embedding numbers look similar to the Huggingface implementations. The newer `BAAI/bge-m3` uses a SentencePiece tokenizer, so it should be doable but I haven't tested it.
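To make the vocab-mapping point above concrete, here is a rough standalone sketch (not the actual conversion code): WordPiece marks word continuations with a `##` prefix, whereas the SentencePiece-style scheme marks word starts with `▁`, so continuation pieces are left bare and word-initial pieces gain the `▁` prefix. Whether special tokens are exempted exactly this way is an assumption of the sketch:

```python
# Illustrative remapping of a WordPiece vocab from the "##" continuation
# convention to a "▁" word-start convention, so both tokenizer families can
# share one vocab representation.
def wpm_to_spm_vocab(wordpiece_vocab: list[str]) -> list[str]:
    remapped = []
    for piece in wordpiece_vocab:
        if piece.startswith("[") and piece.endswith("]"):
            remapped.append(piece)             # special tokens like [CLS], [SEP] stay as-is
        elif piece.startswith("##"):
            remapped.append(piece[2:])         # continuation piece: drop the "##" marker
        else:
            remapped.append("\u2581" + piece)  # word-initial piece: add the "▁" prefix
    return remapped

vocab = ["[CLS]", "[SEP]", "this", "ggml", "##s", "token", "##izer"]
print(wpm_to_spm_vocab(vocab))
# ['[CLS]', '[SEP]', '▁this', '▁ggml', 's', '▁token', 'izer']
```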