
Obtaining an embeddings vector for a larger text #2712

Closed · 4 tasks done
s-trooper opened this issue Aug 22, 2023 · 5 comments · Fixed by #2713

Comments

@s-trooper

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I would like to obtain an embedding vector for larger texts, e.g. 4K, 8K or more.

Current Behavior

I get an error:
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)

Environment and Context

When I create a text file with 197 lines of "Hello World", like:

Hello World
Hello World
...

I get the embedding vector as expected.
However, when I add just one more line, I receive the "not enough space in the context's memory pool" error.
Yet my RAM/VRAM usage is below 15%!

I know there are many issues related to this error, but I haven't found any solution for embeddings.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

    • CPU i7-9700K @ 3.60GHz
    • GPU RTX 3090 TI VRAM 24 GB
    • RAM 80 GB
  • Operating System, e.g. for Linux:

    • Windows 11

Failure Information (for bugs)

ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)

Steps to Reproduce

  1. Create a text file named "text-of-2367-bytes.txt" containing 198 or more lines of "Hello World".
  2. .\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt

Failure Logs

Example run with the Windows embedding command

.\llama-master-cb1c072-bin-win-cublas-cu11.7.1-x64\embedding.exe -ngl 80 -c 2048 -m .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin -f .\text-of-2367-bytes.txt 
main: build = 1010 (cb1c072)
main: seed  = 1692704725
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
llama.cpp: loading model from .\models\wizard-vicuna-13b-uncensored-superhot-8k.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  582.00 MB (+ 1600.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9493 MB
llama_new_context_with_model: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
ggml_new_object: not enough space in the context's memory pool (needed 12747504, available 12747472)
@slaren
Member

slaren commented Aug 22, 2023

The embedding example does not respect the batch size and passes the entire prompt to llama_eval. It should be fixed to split the prompt into multiple batches if needed. In the meantime, #2684 will allow you to increase the batch size to the same size as the prompt, at the cost of higher memory usage.
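
Roughly, the batching loop would look something like this (a sketch only, assuming the current llama_eval(ctx, tokens, n_tokens, n_past, n_threads) API and the embd_inp/params names used in the example; not necessarily what the actual fix will do):

// sketch: evaluate the tokenized prompt in n_batch-sized chunks instead of
// passing the entire prompt to llama_eval in a single call
int n_past = 0;
for (int i = 0; i < (int) embd_inp.size(); i += params.n_batch) {
    // the last chunk may be shorter than n_batch
    const int n_eval = std::min((int) embd_inp.size() - i, (int) params.n_batch);
    if (llama_eval(ctx, embd_inp.data() + i, n_eval, n_past, params.n_threads)) {
        fprintf(stderr, "%s : failed to eval\n", __func__);
        return 1;
    }
    n_past += n_eval;
}

// once the whole prompt has been evaluated, the embedding is read as before
const int n_embd = llama_n_embd(ctx);
const float * embeddings = llama_get_embeddings(ctx);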

@s-trooper
Author

Thank you very much. It works with that patch! 👍
For anyone who finds this later: you can now simply increase the context size to use more RAM, e.g.
./embedding.exe -c 4096
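
(If I read #2684 correctly, the batch size can also be raised to match the prompt with the -b flag, e.g. ./embedding.exe -c 4096 -b 4096, at the cost of higher memory usage, as slaren mentioned above.)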

@dspasyuk
Contributor

dspasyuk commented Aug 23, 2023

Hi guys @s-trooper @slaren, sorry to ask this question here, but after converting my text to vectors, what do I do with the output? Does it get saved somewhere? How can I use it with llama.cpp? I can't seem to find much information about embedding with llama.cpp. Many thanks in advance. I generate the output the following way: ./llama.cpp/embedding -ngl 0 -c 4096 -m ../models/vicuna-7b-v1.5.ggmlv3.q5_1.bin -f ~/test.txt

@s-trooper
Author

s-trooper commented Aug 24, 2023

Hello @deonis1, on Windows, when I redirect the output to a file, only the vector is written to the file and not the informational text (presumably because the log messages go to stderr while the vector itself goes to stdout). I can't test it on Linux, but try it out yourself:
./llama.cpp/embedding -ngl 0 -c 4096 -m ../models/vicuna-7b-v1.5.ggmlv3.q5_1.bin -f ~/test.txt > ~/test.embedding

I assume many people use "llama-cpp-python" for embeddings. I haven't been able to get it to work myself yet, but if you can, here's the API:

from llama_cpp import Llama

# load the model with embedding support enabled
llm = Llama(model_path=r".\models\ggml-vic13b-q5_1.bin", embedding=True)
# create_embedding returns an OpenAI-style response dict
output = llm.create_embedding(open("./embedding-test.txt").read())
# the embedding itself is a list of floats
emb_vector = output['data'][0]['embedding']
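
If it works for you, emb_vector should just be a plain Python list of n_embd floats (5120 for the 13B model in the log above), which you can then store or compare however you like.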

@dspasyuk
Contributor

Hi @s-trooper, thank you for the reply. I use pure C or Node.js; I am building a small chat application for local LLMs (https://github.com/deonis1/llcui) in Node.js and wanted to incorporate embedding. It looks like the server application in llama.cpp supports embedding, but I have not tried it yet. I might need to dig through the code to see what makes it tick.
