Conversation
llama-cli/src/main.rs
Outdated
// Try other words: 'dog', 'cat', 'potato', '$' -> to see progressively lower dot product values.
let dog2 = model.tokenize(&vocab, "dog", false).unwrap();
What I'm doing here is feeding the following two sentences through the transformer:
- "My favourite animal is the dog"
- "I just adopted a cute dog"
Afterwards, I retrieve the embeddings for the last token (dog), and compute their similarity with a simple dot product.
Then I tried changing "dog" in the second sentence to 'cat', 'potato', and '$' respectively, and the semantic similarity dropped accordingly, with '$' ranking the lowest.
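For reference, the comparison described above boils down to a dot product between the two embedding vectors (optionally normalized into a cosine similarity). A minimal, self-contained sketch in plain Rust, with toy vectors standing in for the 4096-float embeddings returned by the model:

```rust
/// Dot product between two embedding vectors of equal length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity normalizes the dot product so vector magnitude
/// does not dominate the comparison.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
}

fn main() {
    // Toy stand-ins for the embeddings of "dog" in the two prompts;
    // the real vectors from the model are 4096 floats long.
    let dog_1 = vec![0.9_f32, 0.1, 0.3];
    let dog_2 = vec![0.8_f32, 0.2, 0.25];
    println!("dot product       = {}", dot(&dog_1, &dog_2));
    println!("cosine similarity = {}", cosine_similarity(&dog_1, &dog_2));
}
```

With real embeddings, the 'dog'/'dog' pair should score highest, followed by 'cat', 'potato', and '$', matching the behaviour described above.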
@setzer22 Will feeding the prompt before eval produce different embeddings compared to evaluating all the tokens together?
@hlhr202 The embeddings wouldn't be affected, but you shouldn't call `evaluate` with the whole prompt like that, for a couple of reasons:
- A call to `evaluate` runs all the tokens you give it as a single batch, which means increased memory usage. For very long prompts, this could become very expensive.
- The output will contain the output embeddings for every token that you fed through `evaluate`. This means you would be retrieving a lot more embedding data than for just the word "dog".

This is why the test code uses `feed_prompt` first, to set up the context, and then makes a call to `evaluate` with a single token to retrieve the embeddings for a single word.
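To put rough numbers on the difference, here is a small back-of-the-envelope sketch (the 512-token prompt length is an assumption for illustration; 4096 is the embedding width mentioned later in this thread):

```rust
fn main() {
    // Assumed sizes for illustration: a 512-token prompt and a
    // 4096-dimensional embedding (4 bytes per f32).
    let n_tokens = 512;
    let n_embd = 4096;

    // evaluate() on the whole prompt as one batch returns one embedding
    // row per token...
    let batch_floats = n_tokens * n_embd;
    // ...while feed_prompt() followed by a single-token evaluate() call
    // only returns the row for that last token.
    let single_floats = n_embd;

    println!(
        "whole-prompt batch: {} f32s (~{} MiB)",
        batch_floats,
        batch_floats * 4 / (1024 * 1024)
    );
    println!(
        "single token: {} f32s (~{} KiB)",
        single_floats,
        single_floats * 4 / 1024
    );
}
```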
@setzer22 I think I understand your comments now. This means we can only extract embeddings for a single token (which may also carry hidden information mixed in from the context of the whole sentence). That seems a little different from OpenAI's embedding function: as I understand it, OpenAI's embedding is for the whole sentence, yet is returned as a fixed-size tensor... that is quite beyond my knowledge though.
Well, I guess I might find a way to implement such a 'sentence embedding'. I will try adding a special end token and extracting the hidden layer once the end token has been evaluated. Not sure if it works, but it's worth a try.
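For what it's worth, the two obvious pooling strategies for a sentence embedding are taking the embedding of a designated final token, as suggested above, or mean-pooling across all token embeddings. A small sketch of both, independent of llama-rs and using toy data:

```rust
/// Mean-pool per-token embeddings into a single fixed-size sentence vector.
/// `token_embeddings` holds one Vec<f32> of length n_embd per evaluated token.
fn mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let n_embd = token_embeddings[0].len();
    let mut pooled = vec![0.0_f32; n_embd];
    for emb in token_embeddings {
        for (p, x) in pooled.iter_mut().zip(emb) {
            *p += *x;
        }
    }
    let n = token_embeddings.len() as f32;
    pooled.iter_mut().for_each(|p| *p /= n);
    pooled
}

fn main() {
    // Toy per-token embeddings (3 tokens, n_embd = 4); real rows are 4096 wide.
    let tokens = vec![
        vec![0.1, 0.2, 0.3, 0.4],
        vec![0.0, 0.1, 0.5, 0.2],
        vec![0.3, 0.3, 0.1, 0.0],
    ];
    // Option A: use the embedding of the final token (e.g. an appended end token).
    let last_token = tokens.last().unwrap();
    // Option B: mean-pool across all token embeddings.
    let pooled = mean_pool(&tokens);
    println!("last-token embedding:  {:?}", last_token);
    println!("mean-pooled embedding: {:?}", pooled);
}
```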
LGTM once the other review feedback's sorted out. For exposing it from the CLI, I'm not sure... people might use it as a step in a CLI pipeline (get the embeddings of two texts and then compare them), but I'm not sure what that would look like or how people would do that. (What output format would we use?) Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.
It would be nice for me to have such a get-embedding function exposed in the library crate; I don't care much about CLI exposure. From what I've seen, llama.cpp provides an --embedding parameter for output purposes, but they still haven't found a way to expose it, which is why I currently can't get the embeddings from their CLI.
@hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.
Yes, absolutely. I'm porting llama-rs to llama-node, so I just need the library to expose a pub function.
I already addressed the review feedback and removed the ad-hoc test code. So I take it a good plan now would be to merge this as-is and have embedding extraction as a low-level feature of llama-rs, but simply not expose it in the CLI?
LGTM, ready to merge after the comment's fixed
Since I added the
If there's demand, I'm happy to do so - just not sure what the output format should be. JSON array or newline-delimited floats?
Is it a lot of data? You could probably just print it in the normal Rust debug format, which should look like a comma-separated list if it's in a `Vec`. This is the related issue: ggerganov/llama.cpp#224 (there was actually only one person who wanted it as an option).
It is quite a lot of data for comfortably printing to stdout. It's 4096 floats per token. Not that it wouldn't work, but it's a bit uncomfortable.
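For illustration, the two output formats floated above would look roughly like this (with a tiny stand-in vector in place of the 4096 floats per token):

```rust
fn main() {
    // A stand-in for one embedding row; the real thing is 4096 floats per token.
    let embedding: Vec<f32> = vec![0.013, -0.207, 0.118, 0.054];

    // Rust debug format: a bracketed, comma-separated list.
    println!("{:?}", embedding);

    // Newline-delimited floats: one value per line.
    for value in &embedding {
        println!("{}", value);
    }
}
```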
Ahh, then it seems like it probably isn't worth even bothering to add to the CLI right now unless someone comes here and requests it. Or they could probably just write their own little utility to load a model, feed a prompt, and print out the embeddings however they wanted.
I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.
The vector is around 4096 elements long for a single token, which is not very suitable for printing nicely in a CLI. I guess you need to call it through the Rust API.
I'm open to adding a way for the CLI to output embeddings if people find this an interesting use case. The main blocker here is that the use case is not clear to me, and thus I can't figure out the right API and output format. What we need here is someone who understands how embeddings in an LLM like LLaMA work, has a clear use case for extracting them, and can tell us how they would expect an API like this to work. If anyone wants to open an issue with a clear description of what we need to provide, I'd be happy to add an implementation 🙂
@setzer22 I have made a new embedding extraction example. You can check it here: https://github.com/hlhr202/llama-node/blob/develop/packages/core/example/semantic-compare/compare.py
I'm working on a large dense-vector embedding database (about 2 million data points from books), which is currently using OpenAI's Ada embeddings (~1600 dimensions). I can do a comparison of performance between those and the 4k LLaMa embeds if needed.
From an ops perspective, ideally one could provide a batch input and get a batch output (just like OpenAI's API) via the CLI. The format doesn't matter much - it can be JSONL or a binary format. I'd personally recommend sticking to those two, since they are supported by most VSS databases (e.g. Redis RediSearch).
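As a sketch of what a JSONL batch output could look like (the `id`/`embedding` field names are placeholders for this example, not an agreed format):

```rust
/// Format one JSON line of the form {"id": ..., "embedding": [...]}.
/// Hand-rolled to keep the sketch dependency-free; a real tool would
/// likely use serde_json instead.
fn to_jsonl_line(id: usize, embedding: &[f32]) -> String {
    let values: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{\"id\":{},\"embedding\":[{}]}}", id, values.join(","))
}

fn main() {
    // Toy batch: two short embedding rows standing in for 4096-float ones.
    let batch = vec![
        vec![0.1_f32, 0.2, 0.3],
        vec![0.4_f32, 0.5, 0.6],
    ];
    for (id, embedding) in batch.iter().enumerate() {
        println!("{}", to_jsonl_line(id, embedding));
    }
}
```

One JSON object per line keeps the output streamable and easy to bulk-load into a vector store.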
My use case here: if you have a set of documents and you can get the embeddings of those documents, then whenever a new question comes in you can embed the question and find the most relevant documents to send along with your prompt. So basically you can have a natural Q&A chat bot based on your own data.
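A minimal sketch of that retrieval step, assuming the document embeddings have already been computed and stored. Cosine similarity is used here, though a plain dot product would work the same way:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return document indices sorted from most to least similar to the query.
fn rank_documents(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut ranked: Vec<usize> = (0..docs.len()).collect();
    ranked.sort_by(|&i, &j| {
        cosine(query, &docs[j])
            .partial_cmp(&cosine(query, &docs[i]))
            .unwrap()
    });
    ranked
}

fn main() {
    // Pre-computed document embeddings (toy 3-dim vectors in place of 4096).
    let docs = vec![
        vec![0.9_f32, 0.1, 0.0], // doc 0
        vec![0.1_f32, 0.8, 0.1], // doc 1
        vec![0.2_f32, 0.2, 0.9], // doc 2
    ];
    // Embedding of the incoming question, produced by the same model.
    let question = vec![0.85_f32, 0.15, 0.05];
    // The top-ranked documents would be prepended to the prompt.
    println!("ranking: {:?}", rank_documents(&question, &docs));
}
```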
Implements #56.

I ported the llama.cpp code to allow extracting word embeddings and logits from a call to `evaluate`. I validated this using an `ad_hoc_test` (currently hard-coded in `main`), and the results seem to make sense: the dot product between two embeddings is higher the more similar the two words are, which is exactly how embeddings should work.

This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an `EvaluateOutputRequest` struct so we can expand this in the future, allowing retrieval of other interesting bits of the inference process, but these values are not easily obtainable using the regular APIs (i.e. `feed_prompt`, `infer_next_token`). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level `evaluate` function when they need to retrieve this kind of information?

On a different note, I would really like someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?

Finally, should we consider exposing this to `llama-cli` at all?
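To make the API question concrete, here is a rough mock-up of how a caller might request embeddings through an `EvaluateOutputRequest`-style struct. The field names and signatures below are stand-ins invented for this sketch, not the actual llama-rs definitions from this PR; only the names `EvaluateOutputRequest`, `evaluate`, and `feed_prompt` come from the discussion above:

```rust
// Stand-in types: these mirror the *shape* discussed in the PR, not the
// real llama-rs definitions.
#[derive(Default)]
struct EvaluateOutputRequest {
    /// When Some, evaluate() fills this with the output embeddings
    /// (n_embd floats per evaluated token).
    all_embeddings: Option<Vec<f32>>,
    /// When Some, evaluate() fills this with the logits for every token.
    all_logits: Option<Vec<f32>>,
}

/// Mock evaluate(): a real call would run the transformer over `tokens`
/// and populate whichever outputs were requested.
fn evaluate(tokens: &[u32], output: &mut EvaluateOutputRequest) {
    let n_embd = 4096_usize;
    if let Some(embeddings) = output.all_embeddings.as_mut() {
        *embeddings = vec![0.0; tokens.len() * n_embd]; // placeholder values
    }
    if let Some(logits) = output.all_logits.as_mut() {
        logits.clear(); // logits handling omitted in this sketch
    }
}

fn main() {
    // After feeding the prompt ("My favourite animal is the ..."), evaluate a
    // single token and request only its embeddings.
    let dog_token: &[u32] = &[1234]; // hypothetical token id for "dog"
    let mut request = EvaluateOutputRequest {
        all_embeddings: Some(Vec::new()),
        ..Default::default()
    };
    evaluate(dog_token, &mut request);
    let embedding = request.all_embeddings.unwrap();
    println!("retrieved {} floats for one token", embedding.len());
}
```

The opt-in `Option` fields keep the common inference path free of extra allocations while still letting library users drop down to `evaluate` when they need embeddings or logits.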