
Support batched embeddings #5466

Merged — 5 commits merged into ggerganov:master on Feb 13, 2024
Conversation

iamlemec (Collaborator)

This allows for efficient high-volume embedding. Changes include:

  • Final pooling layer now sums by sequence id to compute correct embeddings for batches with multiple sequences
  • Pooling layer can be toggled with do_pooling. Defaults to true; false may be useful for ColBERT-style approaches
  • Bring back the non-causal attention mask, which was lost when llama_set_inputs was introduced during the move to alloc v3
  • Embeddings can be accessed individually by seq_id with llama_get_embeddings_ith
  • Updated embedding example to split input by newline and group lines into batches by default (a rough usage sketch follows this list)
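For reference, here is a minimal sketch of the batched flow described above: several lines of text are tokenized, packed into one llama_batch under distinct sequence ids, decoded once, and the pooled embedding for each sequence is read back with llama_get_embeddings_ith. It assumes a context created with embeddings and pooling enabled, and it uses the llama_tokenize / llama_batch_add helpers from common.h; exact signatures may differ across llama.cpp versions, so treat this as a sketch rather than the example's actual code.

```cpp
// Minimal sketch (not the actual example code): embed several lines of text
// in a single llama_decode call and read one pooled embedding per sequence.
// Assumes ctx was created with embeddings and pooling enabled
// (embedding / do_pooling in llama_context_params, per this PR).
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

static void embed_lines(llama_context * ctx, const std::vector<std::string> & lines) {
    // one sequence id per input line, all packed into a single batch
    llama_batch batch = llama_batch_init(2048, 0, (int) lines.size());

    for (int s = 0; s < (int) lines.size(); s++) {
        std::vector<llama_token> toks = llama_tokenize(ctx, lines[s], true);
        for (int i = 0; i < (int) toks.size(); i++) {
            // request output on the last token of each sequence; with pooling
            // enabled, the pooled embedding for the sequence is produced there
            const bool is_last = (i == (int) toks.size() - 1);
            llama_batch_add(batch, toks[i], i, { s }, is_last);
        }
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
        llama_batch_free(batch);
        return;
    }

    // read back one embedding per requested output (here: one per sequence)
    const int n_embd = llama_n_embd(llama_get_model(ctx));
    for (int i = 0; i < batch.n_tokens; i++) {
        if (!batch.logits[i]) {
            continue;
        }
        const float * embd = llama_get_embeddings_ith(ctx, i);
        printf("seq %d: n_embd = %d, embd[0] = %f\n", batch.seq_id[i][0], n_embd, embd[0]);
    }

    llama_batch_free(batch);
}
```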

Performance looks to be on par with ONNX on CPU/GPU, at least for relatively large models such as bge-base, where tokenization is not a bottleneck.

@ggerganov ggerganov merged commit 03bf161 into ggerganov:master Feb 13, 2024
48 of 54 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024