Llama2 #488
Comments
I can confirm that the Llama 2 13B model works. However, I am getting bizarre errors when I try to use it together with LangChain to create embeddings: it reports inf tokens per second and runs for ages. @abetlen I suggest also testing that embedding creation with a 4K context works smoothly, as I often get "too many tokens" errors even when splitting at 2K tokens.
@antonkulaga will investigate. Does this issue only come up with the embeddings? It could be an upstream issue in llama.cpp.
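For reproducing the timing problem, here is a minimal sketch of the embedding path in llama-cpp-python; the model path, n_ctx value, and chunk text are placeholders, not the reporter's actual setup:

# Minimal repro sketch for the embedding path; MODEL_PATH and n_ctx are placeholders.
import time
from llama_cpp import Llama

MODEL_PATH = "./models/llama-2-13b.ggmlv3.q4_0.bin"  # hypothetical path

# embedding=True is required for create_embedding() to work
llm = Llama(model_path=MODEL_PATH, embedding=True, n_ctx=2048)

start = time.time()
result = llm.create_embedding("The quick brown fox jumps over the lazy dog.")
vector = result["data"][0]["embedding"]
print(f"dims={len(vector)}, elapsed={time.time() - start:.2f}s")

If the slowdown reproduces with a direct call like this, it points at llama-cpp-python or llama.cpp rather than the LangChain glue around it.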
Testing out 70B (quantized) on an M1 Max with 64 GB of RAM:

[ins] In [2]: from llama_cpp import Llama
[ins] In [3]: MODEL_PATH = "./llama2/70b-v2-q4_0.bin"
[ins] In [4]: model = Llama(model_path=MODEL_PATH)
llama.cpp: loading model from ./llama2/70b-v2-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 24576
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024
llama_load_model_from_file: failed to load model
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = Llama(model_path=MODEL_PATH)
File .venv/lib/python3.9/site-packages/llama_cpp/llama.py:305, in Llama.__init__(self, model_path, n_ctx, n_parts, n_gpu_layers, seed, f16_kv, logits_all, vocab_only, use_mmap, use_mlock, embedding, n_threads, n_batch, last_n_tokens_size, lora_base, lora_path, low_vram, tensor_split, rope_freq_base, rope_freq_scale, verbose)
300 raise ValueError(f"Model path does not exist: {model_path}")
302 self.model = llama_cpp.llama_load_model_from_file(
303 self.model_path.encode("utf-8"), self.params
304 )
--> 305 assert self.model is not None
307 self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
309 assert self.ctx is not None
AssertionError:

Seems to be expecting the wrong shape.
Note that this works just fine with llama.cpp itself:

# convert model to ggml
python convert.py --outfile models/70B-v2/ggml-model-f16.bin --outtype f16 ../llama2/llama/llama-2-70b/
# quantize it to q4_0
./quantize ./models/70B-v2/ggml-model-f16.bin ./models/70B-v2/ggml-model-q4_0.bin q4_0
# inference runs fine
./main -m ./models/70B-v2/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap --ignore-eos -n 64 -t 8 -gqa 8

Extra info: installed with
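Worth noting: the working CLI run above passes -gqa 8, while the Python constructor shown in the traceback does not appear to accept a grouped-query-attention setting, which matches the shape mismatch. If a build of the bindings exposes that setting (newer releases reportedly add an n_gqa argument mirroring the CLI flag), a sketch under that assumption would be:

# Sketch assuming a llama-cpp-python build that exposes n_gqa, the Python
# analogue of llama.cpp's -gqa flag; the constructor in the traceback above
# does not list it, so this is not guaranteed to work on that version.
from llama_cpp import Llama

MODEL_PATH = "./llama2/70b-v2-q4_0.bin"
model = Llama(model_path=MODEL_PATH, n_gqa=8)  # Llama 2 70B uses grouped-query attention
out = model("I believe the meaning of life is", max_tokens=64)
print(out["choices"][0]["text"])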
Currently not getting good results with llama2-13b. The outputs are actually quite off and frequently turn into rambling; see #596. The model exposed through FastAPI has better responses, though still not as good. I'm not exactly sure which hyperparameters would need to be tuned.
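For anyone comparing, here is a sketch of the sampling knobs that are usually the first thing to tune with these bindings; the model path, prompt, and values are only illustrative guesses, not a confirmed fix for the rambling:

# Common sampling parameters to experiment with; the values below are
# illustrative, not a known-good configuration for llama2-13b.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin")  # hypothetical path
out = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
    repeat_penalty=1.1,   # discourages the repetition/rambling described above
    stop=["Q:", "\n\n"],
)
print(out["choices"][0]["text"])

For the chat-tuned 13B variant, wrapping the prompt in Llama 2's [INST] ... [/INST] chat template may also make a noticeable difference.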
llama2 13b models are working for me, but 22b models (e.g. llama2-22b-gplatty.ggmlv3.q5_K_M.bin and others) segfault with "ggml_new_object: not enough space in the context's memory pool (needed 13798672, available 12747472)" when I send more than just a handful of tokens. No issues with older 30b models (n_ctx 2048), only the Llama 2 22b ones. Any debug info that could be useful? I'm using llama-cpp-python through text-generation-webui.
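One way to gather useful debug info would be to take text-generation-webui out of the loop and trigger the crash from llama-cpp-python directly with verbose logging; this is only a sketch, with the path, context size, and prompt length as placeholders for whatever the webui is configured with:

# Repro sketch outside text-generation-webui; model path, n_ctx, and the
# prompt length are placeholders, chosen only to exceed "a handful of tokens".
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama2-22b-gplatty.ggmlv3.q5_K_M.bin",
    n_ctx=2048,
    verbose=True,  # prints the full llama.cpp load log, handy for a bug report
)

prompt = "The quick brown fox jumps over the lazy dog. " * 50
out = llm(prompt, max_tokens=32)
print(out["choices"][0]["text"])

If that reproduces the ggml_new_object error, the load log plus the exact prompt length would help pin down whether it is an upstream issue with the 22B merges rather than something in the webui.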