
Support batched embeddings #5466

Merged — 5 commits merged into ggerganov:master on Feb 13, 2024
Conversation

iamlemec (Collaborator)

This allows for efficient high-volume embedding. Changes include:

  • Final pooling layer now sums by sequence id to compute correct embeddings for batches with multiple sequences
  • Pooling layer can be toggled with do_pooling. Defaults to true; false may be useful for ColBERT-style approaches
  • Bring back the non-causal attention mask, which was lost when llama_set_inputs was introduced during the move to alloc v3
  • Embeddings can be accessed individually by seq_id with llama_get_embeddings_ith
  • Updated embedding example to split input by newline and group lines into batches by default (a rough usage sketch follows this list)
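For reference, here is a minimal sketch of the batched flow described above: several lines of text are tokenized, packed into one llama_batch under distinct sequence ids, decoded once, and the pooled embedding for each sequence is read back with llama_get_embeddings_ith. It assumes a context created with embeddings and pooling enabled, and it uses the llama_tokenize / llama_batch_add helpers from common.h; exact signatures may differ across llama.cpp versions, so treat this as a sketch rather than the example's actual code.

```cpp
// Minimal sketch (not the actual example code): embed several lines of text
// in a single llama_decode call and read one pooled embedding per sequence.
// Assumes ctx was created with embeddings and pooling enabled
// (embedding / do_pooling in llama_context_params, per this PR).
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

static void embed_lines(llama_context * ctx, const std::vector<std::string> & lines) {
    // one sequence id per input line, all packed into a single batch
    llama_batch batch = llama_batch_init(2048, 0, (int) lines.size());

    for (int s = 0; s < (int) lines.size(); s++) {
        std::vector<llama_token> toks = llama_tokenize(ctx, lines[s], true);
        for (int i = 0; i < (int) toks.size(); i++) {
            // request output on the last token of each sequence; with pooling
            // enabled, the pooled embedding for the sequence is produced there
            const bool is_last = (i == (int) toks.size() - 1);
            llama_batch_add(batch, toks[i], i, { s }, is_last);
        }
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
        llama_batch_free(batch);
        return;
    }

    // read back one embedding per requested output (here: one per sequence)
    const int n_embd = llama_n_embd(llama_get_model(ctx));
    for (int i = 0; i < batch.n_tokens; i++) {
        if (!batch.logits[i]) {
            continue;
        }
        const float * embd = llama_get_embeddings_ith(ctx, i);
        printf("seq %d: n_embd = %d, embd[0] = %f\n", batch.seq_id[i][0], n_embd, embd[0]);
    }

    llama_batch_free(batch);
}
```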

Performance looks to be on par with ONNX on CPU/GPU, at least for relatively large models such as bge-base, where tokenization is not a bottleneck.

@ggerganov ggerganov merged commit 03bf161 into ggerganov:master Feb 13, 2024
48 of 54 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024