Conversation
llama-cli/src/main.rs
Outdated
// Try other words: 'dog', 'cat', 'potato', '$' -> to see progressively lower dot product values.
let dog2 = model.tokenize(&vocab, "dog", false).unwrap();
What I'm doing here is feeding the following two sentences through the transformer:
- "My favourite animal is the dog"
- "I just adopted a cute dog"
Afterwards, I retrieve the embeddings for the last token (dog), and compute their similarity with a simple dot product.
Then I tried changing "dog" in the second sentence to 'cat', 'potato', and '$' respectively, and the semantic similarity dropped accordingly, with '$' ranking the lowest.
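For reference, the comparison described above boils down to a dot product between the two embedding vectors (optionally normalized into a cosine similarity). A minimal, self-contained sketch in plain Rust, with toy vectors standing in for the 4096-float embeddings returned by the model:

```rust
/// Dot product between two embedding vectors of equal length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity normalizes the dot product so vector magnitude
/// does not dominate the comparison.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
}

fn main() {
    // Toy stand-ins for the embeddings of "dog" in the two prompts;
    // the real vectors from the model are 4096 floats long.
    let dog_1 = vec![0.9_f32, 0.1, 0.3];
    let dog_2 = vec![0.8_f32, 0.2, 0.25];
    println!("dot product       = {}", dot(&dog_1, &dog_2));
    println!("cosine similarity = {}", cosine_similarity(&dog_1, &dog_2));
}
```

With real embeddings, the 'dog'/'dog' pair should score highest, followed by 'cat', 'potato', and '$', matching the behaviour described above.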
@setzer22 Will feeding the prompt before eval produce different embeddings compared to evaluating all the tokens together?
@hlhr202 The embeddings wouldn't be affected, but you shouldn't call `evaluate` with the whole prompt like that, for a couple of reasons:
- A call to `evaluate` runs all the tokens you give it as a single batch, which means increased memory usage. For very long prompts, this could become very expensive.
- The output will contain the output embeddings for every token that you fed through `evaluate`. This means you would be retrieving a lot more embedding data than for just the word "dog".

This is why the test code uses `feed_prompt` first, to set up the context, and then makes a call to `evaluate` with a single token to retrieve the embeddings for a single word.
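To put rough numbers on the difference, here is a small back-of-the-envelope sketch (the 512-token prompt length is an assumption for illustration; 4096 is the embedding width mentioned later in this thread):

```rust
fn main() {
    // Assumed sizes for illustration: a 512-token prompt and a
    // 4096-dimensional embedding (4 bytes per f32).
    let n_tokens = 512;
    let n_embd = 4096;

    // evaluate() on the whole prompt as one batch returns one embedding
    // row per token...
    let batch_floats = n_tokens * n_embd;
    // ...while feed_prompt() followed by a single-token evaluate() call
    // only returns the row for that last token.
    let single_floats = n_embd;

    println!(
        "whole-prompt batch: {} f32s (~{} MiB)",
        batch_floats,
        batch_floats * 4 / (1024 * 1024)
    );
    println!(
        "single token: {} f32s (~{} KiB)",
        single_floats,
        single_floats * 4 / 1024
    );
}
```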
@setzer22 I think I understand your comments now. This means we can only extract embeddings for a single token (which may also carry hidden information mixed in from the context of the whole sentence). That seems a little different from OpenAI's embedding function: as I understand it, OpenAI's embedding is for the whole sentence, yet is returned as a fixed-size tensor... that is quite beyond my knowledge though.
Well, I guess I might find a way to implement such a 'sentence embedding'. I will try adding a special end token and extracting the hidden layer once the end token has been evaluated. Not sure if it works, but it's worth a try.
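For what it's worth, the two obvious pooling strategies for a sentence embedding are taking the embedding of a designated final token, as suggested above, or mean-pooling across all token embeddings. A small sketch of both, independent of llama-rs and using toy data:

```rust
/// Mean-pool per-token embeddings into a single fixed-size sentence vector.
/// `token_embeddings` holds one Vec<f32> of length n_embd per evaluated token.
fn mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let n_embd = token_embeddings[0].len();
    let mut pooled = vec![0.0_f32; n_embd];
    for emb in token_embeddings {
        for (p, x) in pooled.iter_mut().zip(emb) {
            *p += *x;
        }
    }
    let n = token_embeddings.len() as f32;
    pooled.iter_mut().for_each(|p| *p /= n);
    pooled
}

fn main() {
    // Toy per-token embeddings (3 tokens, n_embd = 4); real rows are 4096 wide.
    let tokens = vec![
        vec![0.1, 0.2, 0.3, 0.4],
        vec![0.0, 0.1, 0.5, 0.2],
        vec![0.3, 0.3, 0.1, 0.0],
    ];
    // Option A: use the embedding of the final token (e.g. an appended end token).
    let last_token = tokens.last().unwrap();
    // Option B: mean-pool across all token embeddings.
    let pooled = mean_pool(&tokens);
    println!("last-token embedding:  {:?}", last_token);
    println!("mean-pooled embedding: {:?}", pooled);
}
```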
LGTM once the other review feedback's sorted out. For exposing it from the CLI, I'm not sure... people might use it as a step in a CLI pipeline (get the embeddings of two texts and then compare them), but I'm not sure what that would look like or how people would do that. (What output format would we use?) Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.
It would be nice for me to have such a get-embedding function exposed in the library crate; I don't care much about CLI exposure. From what I've seen, llama.cpp provides an --embedding parameter for output purposes, but they still haven't found a way to expose it, which is why I currently can't get the embeddings from their CLI.
@hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.
Yes, absolutely. I'm porting llama-rs to llama-node, so I just need the library to expose a pub function.
I already addressed the review feedback and removed the ad-hoc test code. So I take it a good plan now would be to merge this as-is and have embedding extraction as a low-level feature of llama-rs, but simply not expose it in the CLI?
LGTM, ready to merge after the comment's fixed
Since I added the
If there's demand, I'm happy to do so - just not sure what the output format should be. JSON array or newline-delimited floats?
Is it a lot of data? You could probably just print it in the normal Rust debug format, which should look like a comma-separated list if it's in a `Vec`. This is the related issue: ggerganov/llama.cpp#224 (there was actually only one person who wanted it as an option).
It is quite a lot of data for comfortably printing to stdout. It's 4096 floats per token. Not that it wouldn't work, but it's a bit uncomfortable.
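For illustration, the two output formats floated above would look roughly like this (with a tiny stand-in vector in place of the 4096 floats per token):

```rust
fn main() {
    // A stand-in for one embedding row; the real thing is 4096 floats per token.
    let embedding: Vec<f32> = vec![0.013, -0.207, 0.118, 0.054];

    // Rust debug format: a bracketed, comma-separated list.
    println!("{:?}", embedding);

    // Newline-delimited floats: one value per line.
    for value in &embedding {
        println!("{}", value);
    }
}
```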
Ahh, then it seems like it probably isn't worth even bothering to add to the CLI right now unless someone comes here and requests it. Or they could probably just write their own little utility to load a model, feed a prompt, and print out the embeddings however they wanted.
I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.
The vector is around 4096 elements long for a single token, which is not very suitable for printing nicely in a CLI. I guess you need to call it through the Rust API.
I'm open to adding a way for the CLI to output embeddings if people find this an interesting use case. The main blocker here is that the use case is not clear to me, and thus I can't figure out the right API and output format. What we need here is someone who understands how embeddings in an LLM like LLaMA work, has a clear use case for extracting them, and can tell us how they would expect an API like this to work. If anyone wants to open an issue with a clear description of what we need to provide, I'd be happy to add an implementation 🙂
@setzer22 I have made a new embedding extraction example. You can check it here: https://github.com/hlhr202/llama-node/blob/develop/packages/core/example/semantic-compare/compare.py
I'm working on a large dense-vector embedding database (about 2 million data points from books), which is currently using OpenAI's Ada embeddings (~1600 dimensions). I can do a comparison of performance between those and the 4k LLaMa embeds if needed.
From an ops perspective, ideally one could provide a batch input and get a batch output (just like OpenAI's API) via the CLI. The format doesn't matter much - it can be JSONL or a binary format. I'd personally recommend sticking to those two, since they are supported by most VSS databases (e.g. Redis RediSearch).
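As a sketch of what a JSONL batch output could look like (the `id`/`embedding` field names are placeholders for this example, not an agreed format):

```rust
/// Format one JSON line of the form {"id": ..., "embedding": [...]}.
/// Hand-rolled to keep the sketch dependency-free; a real tool would
/// likely use serde_json instead.
fn to_jsonl_line(id: usize, embedding: &[f32]) -> String {
    let values: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{\"id\":{},\"embedding\":[{}]}}", id, values.join(","))
}

fn main() {
    // Toy batch: two short embedding rows standing in for 4096-float ones.
    let batch = vec![
        vec![0.1_f32, 0.2, 0.3],
        vec![0.4_f32, 0.5, 0.6],
    ];
    for (id, embedding) in batch.iter().enumerate() {
        println!("{}", to_jsonl_line(id, embedding));
    }
}
```

One JSON object per line keeps the output streamable and easy to bulk-load into a vector store.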
My use case here: if you have a set of documents and you can get the embeddings of those documents, then whenever a new question comes in you can embed the question and find the most relevant documents to send along with your prompt. So basically you can have a natural Q&A chat bot based on your own data.
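A minimal sketch of that retrieval step, assuming the document embeddings have already been computed and stored. Cosine similarity is used here, though a plain dot product would work the same way:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return document indices sorted from most to least similar to the query.
fn rank_documents(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut ranked: Vec<usize> = (0..docs.len()).collect();
    ranked.sort_by(|&i, &j| {
        cosine(query, &docs[j])
            .partial_cmp(&cosine(query, &docs[i]))
            .unwrap()
    });
    ranked
}

fn main() {
    // Pre-computed document embeddings (toy 3-dim vectors in place of 4096).
    let docs = vec![
        vec![0.9_f32, 0.1, 0.0], // doc 0
        vec![0.1_f32, 0.8, 0.1], // doc 1
        vec![0.2_f32, 0.2, 0.9], // doc 2
    ];
    // Embedding of the incoming question, produced by the same model.
    let question = vec![0.85_f32, 0.15, 0.05];
    // The top-ranked documents would be prepended to the prompt.
    println!("ranking: {:?}", rank_documents(&question, &docs));
}
```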
Implements #56.

I ported the llama.cpp code to allow extracting word embeddings and logits from a call to `evaluate`. I validated this using an `ad_hoc_test` (currently hard-coded in `main`), and the results seem to make sense: the dot product between two embeddings is higher the more similar the two words are, which is exactly how embeddings should work.

This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an `EvaluateOutputRequest` struct so we can expand this in the future, allowing retrieval of other interesting bits of the inference process, but these values are not easily obtainable using the regular APIs (i.e. `feed_prompt`, `infer_next_token`). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level `evaluate` function when they need to retrieve this kind of information?

On a different note, I would really like someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?

Finally, should we consider exposing this to `llama-cli` at all?
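To make the API question concrete, here is a rough mock-up of how a caller might request embeddings through an `EvaluateOutputRequest`-style struct. The field names and signatures below are stand-ins invented for this sketch, not the actual llama-rs definitions from this PR; only the names `EvaluateOutputRequest`, `evaluate`, and `feed_prompt` come from the discussion above:

```rust
// Stand-in types: these mirror the *shape* discussed in the PR, not the
// real llama-rs definitions.
#[derive(Default)]
struct EvaluateOutputRequest {
    /// When Some, evaluate() fills this with the output embeddings
    /// (n_embd floats per evaluated token).
    all_embeddings: Option<Vec<f32>>,
    /// When Some, evaluate() fills this with the logits for every token.
    all_logits: Option<Vec<f32>>,
}

/// Mock evaluate(): a real call would run the transformer over `tokens`
/// and populate whichever outputs were requested.
fn evaluate(tokens: &[u32], output: &mut EvaluateOutputRequest) {
    let n_embd = 4096_usize;
    if let Some(embeddings) = output.all_embeddings.as_mut() {
        *embeddings = vec![0.0; tokens.len() * n_embd]; // placeholder values
    }
    if let Some(logits) = output.all_logits.as_mut() {
        logits.clear(); // logits handling omitted in this sketch
    }
}

fn main() {
    // After feeding the prompt ("My favourite animal is the ..."), evaluate a
    // single token and request only its embeddings.
    let dog_token: &[u32] = &[1234]; // hypothetical token id for "dog"
    let mut request = EvaluateOutputRequest {
        all_embeddings: Some(Vec::new()),
        ..Default::default()
    };
    evaluate(dog_token, &mut request);
    let embedding = request.all_embeddings.unwrap();
    println!("retrieved {} floats for one token", embedding.len());
}
```

The opt-in `Option` fields keep the common inference path free of extra allocations while still letting library users drop down to `evaluate` when they need embeddings or logits.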