
Obtaining an embeddings vector for a larger text #2712

Closed · 4 tasks done
s-trooper opened this issue Aug 22, 2023 · 5 comments · Fixed by #2713

Comments

@s-trooper

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I would like to obtain an embedding vector for larger texts, e.g. 4K, 8K or more.

Current Behavior

I get an error:
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)

Environment and Context

When I create a text file with 197 lines of "Hello World", like:

Hello World
Hello World
...

I get the embedding vector as expected.
However, when I add just one more line, I receive the "not enough space in the context's memory pool" error.
Yet my RAM/VRAM usage is below 15%!

I know there are many issues related to this error, but I haven't found any solution for embeddings.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

    • CPU i7-9700K @ 3.60GHz
    • GPU RTX 3090 TI VRAM 24 GB
    • RAM 80 GB
  • Operating System, e.g. for Linux:

    • Windows 11

Failure Information (for bugs)

ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)

Steps to Reproduce

  1. Create a text file named "text-of-2367-bytes.txt" containing 198 or more lines of "Hello World".
  2. .\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt

Failure Logs

Example run with the Windows embedding command

.\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt 
main: build = 1010 (cb1c072)
main: seed  = 1692704725
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
llama.cpp: loading model from .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9493 MB
llama_new_context_with_model: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
@slaren
Member

slaren commented Aug 22, 2023

The embedding example does not respect the batch size and passes the entire prompt to llama_eval. It should be fixed to split the prompt into multiple batches if needed. In the meantime, #2684 will allow you to increase the batch size to the same size as the prompt, at the cost of higher memory usage.
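
Roughly, the batching loop would look something like this (a sketch only, assuming the current llama_eval(ctx, tokens, n_tokens, n_past, n_threads) API and the embd_inp/params names used in the example; not necessarily what the actual fix will do):

// sketch: evaluate the tokenized prompt in n_batch-sized chunks instead of
// passing the entire prompt to llama_eval in a single call
int n_past = 0;
for (int i = 0; i < (int) embd_inp.size(); i += params.n_batch) {
    // the last chunk may be shorter than n_batch
    const int n_eval = std::min((int) embd_inp.size() - i, (int) params.n_batch);
    if (llama_eval(ctx, embd_inp.data() + i, n_eval, n_past, params.n_threads)) {
        fprintf(stderr, "%s : failed to eval\n", __func__);
        return 1;
    }
    n_past += n_eval;
}

// once the whole prompt has been evaluated, the embedding is read as before
const int n_embd = llama_n_embd(ctx);
const float * embeddings = llama_get_embeddings(ctx);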

@s-trooper
Author

Thank you very much. It works with that patch! 👍
For anyone who finds this later: you can now simply increase the context size to use more RAM, e.g.
./embedding.exe -c 4096
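
(If I read #2684 correctly, the batch size can also be raised to match the prompt with the -b flag, e.g. ./embedding.exe -c 4096 -b 4096, at the cost of higher memory usage, as slaren mentioned above.)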

@dspasyuk
Contributor

dspasyuk commented Aug 23, 2023

Hi guys @s-trooper @slaren, sorry to ask this question here, but after converting my text to vectors, what do I do with the output? Does it get saved somewhere? How can I use it with llama.cpp? I can't seem to find much information about embedding with llama.cpp. Many thanks in advance. I generate the output the following way: ./llama.cpp/embedding -ngl 0 -c 4096 -m ../models/vicuna-7b-v1.5.ggmlv3.q5_1.bin -f ~/test.txt

@s-trooper
Author

s-trooper commented Aug 24, 2023

Hello @deonis1, on Windows, when I redirect the output to a file, only the vector is written to the file and not the informational text (presumably because the log messages go to stderr while the vector itself goes to stdout). I can't test it on Linux, but try it out yourself:
./llama.cpp/embedding -ngl 0 -c 4096 -m ../models/vicuna-7b-v1.5.ggmlv3.q5_1.bin -f ~/test.txt > ~/test.embedding

I assume many people use "llama-cpp-python" for embeddings. I haven't been able to get it to work myself yet, but if you can, here's the API:

from llama_cpp import Llama

# load the model with embedding support enabled
llm = Llama(model_path=r".\models\ggml-vic13b-q5_1.bin", embedding=True)
# create_embedding returns an OpenAI-style response dict
output = llm.create_embedding(open("./embedding-test.txt").read())
# the embedding itself is a list of floats
emb_vector = output['data'][0]['embedding']
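
If it works for you, emb_vector should just be a plain Python list of n_embd floats (5120 for the 13B model in the log above), which you can then store or compare however you like.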

@dspasyuk
Contributor

Hi @s-trooper, thank you for the reply. I use pure C or Node.js; I am building a small chat application for local LLMs (https://github.com/deonis1/llcui) in Node.js and wanted to incorporate embedding. It looks like the server application in llama.cpp supports embedding, but I have not tried it yet. I might need to dig through the code to see what makes it tick.
