Replies: 4 comments 11 replies
-
This probably isn't the right place to ask this question. Embeddings are more of a general machine learning topic than they are llama-specific. You can read more about how they work:
You would use the embeddings returned by the model in a precomputed dataset, or perhaps in a vector database, to find similar text chunks. For vector search you would probably use a specialized embedding model, like BGE, to generate the embeddings, though. Check out the llama-index RAG tutorials to learn more about how this works. Good luck and welcome to LLMs 👍
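To make the idea concrete, here is a minimal sketch of "embed chunks, then retrieve the closest one" using llama-cpp-python and plain NumPy cosine similarity instead of a real vector database. The model path is a placeholder, and it assumes a pooled embedding model (e.g. a GGUF conversion of BGE) so `create_embedding` returns one vector per input:

```python
import numpy as np
from llama_cpp import Llama

# Placeholder path to a small GGUF embedding model (assumption, not from this thread).
emb_model = Llama(model_path="bge-small-en.Q8_0.gguf", embedding=True)

chunks = [
    "My name is Anna.",
    "The office closes at 6 pm.",
    "Llamas are domesticated South American camelids.",
]

def embed(text: str) -> np.ndarray:
    # create_embedding returns an OpenAI-style response dict;
    # the vector itself sits under data[0]["embedding"].
    vec = emb_model.create_embedding(text)["data"][0]["embedding"]
    return np.asarray(vec, dtype=np.float32)

chunk_vecs = np.stack([embed(c) for c in chunks])
query_vec = embed("What is your name?")

# Cosine similarity between the query and every stored chunk.
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(chunks[int(np.argmax(sims))])  # hopefully "My name is Anna."
```

A vector database does the same nearest-neighbor lookup, just at scale and with persistence.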
-
Agree with the above, just want to add for emphasis: typically your embedding model will be different from your text generation model. Though embedding models and generative language models share similar architectures, embedding models are usually much smaller. So to adapt the example you gave, you'd do something like:

```python
mod_emb = Llama(model_path=EMBEDDING_MODEL_PATH, embedding=True)
embeds = mod_emb.create_embedding(["My name is Anna."])
```

Then store this in a vector database of some sort, and later query it with the embedding for "What is your name?" to get the relevant text back (hopefully "My name is Anna."). Put that in `CONTEXT` and generate:

```python
mod_gen = Llama(model_path=GENERATION_MODEL_PATH)
prompt = CONTEXT + '\n\n' + "What is your name?"
result = mod_gen(prompt, max_tokens=max_tokens)
```

There are ways to improve this substantially, described in the above links, but this is the basic idea.
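Since the original question mentions Mistral 7B Instruct, the retrieved chunk would usually also be wrapped in the model's chat template rather than concatenated raw. A rough sketch of that last step; the model path and the message layout are my assumptions, not something from this thread:

```python
from llama_cpp import Llama

# Hypothetical path to a GGUF of Mistral 7B Instruct.
mod_gen = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=4096)

# Chunk returned by the embedding search above.
retrieved = "My name is Anna."

# create_chat_completion applies the chat template stored in the GGUF metadata
# (for Mistral Instruct, the [INST] ... [/INST] wrapping), so the context and
# the question can simply go into the user message.
out = mod_gen.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": f"Context:\n{retrieved}\n\nQuestion: What is your name?",
        }
    ],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```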
-
@gardner @iamlemec But how exactly are these embeddings computed? Are they the last layer's hidden states? The question in #3643 has been left unanswered. From the source code I understood that it's the
-
I've noticed differences when running the same model using different libraries. I would like to further understand the differences between these libraries:
-
Hi,
I am working with llama.cpp (Python) and the Mistral 7B Instruct model. All works fine so far.
Now I wonder: what are embeddings and how do I use them?
As far as I understand, embeddings are used to support the LLM with additional context (e.g. data fetched from an internal database).
I also see that calling something like `create_embedding(...)`
will generate a vector with a lot of numbers.
But can someone explain how to use embeddings so that my LLM actually makes use of them in an inference life cycle?
I expected some code example like
But it does not look like `create_embedding` has any effect.