
Failed to apply lora adapter after fine-tuning a llama-2-13B-chat with ./finetune #4499

Closed
xcottos opened this issue Dec 16, 2023 · 4 comments · Fixed by #3333


xcottos commented Dec 16, 2023

Hi everybody,

I am trying to fine-tune a llama-2-13B-chat model and I think I did everything correctly but I still cannot apply my lora.

What I did was:

  1. I converted the llama2 weights into HF format using this script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

python convert_llama_weights_to_hf.py --input_dir ../models/llama-2-13b-chat --output_dir ../models/llama-2-13b-chat/llama-2-13b-chat-hf --model_size 13B

  2. Then I ran convert.py and converted the model to fp32:

python convert.py ../models/llama-2-13b-chat/llama-2-13b-chat-hf --outtype f32 --outfile ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin

  3. Then I quantised it (I am using q5_k_m and not q4_0):

./quantize ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin q5_k_m

At this point I tested the models with ./main and they worked perfectly.

  4. So I created a dataset; after trying to orient myself among many contradictory answers about the format, I opted in the end for the one that seemed the most used:

<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is quantum entanglement? [/INST] Quantum entanglement is the phenomenon that occurs when a group of particles are generated, interact, or share in such a way that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance.


(I didn't put any </s> at the end because, for some reason, the loss became NaN after fewer than 10 iterations.)
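For reference, here is a minimal sketch of how a training file in that format could be assembled so that every sample begins with the <s> marker later passed as --sample-start. The output path matches the one used in the next step, but the second question/answer pair is only an illustrative placeholder, not part of the actual dataset:

cat > ../datasets/FineTune/train_llamacpp.txt << 'EOF'
<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is quantum entanglement? [/INST] Quantum entanglement is the phenomenon that occurs when a group of particles are generated, interact, or share in such a way that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance.
<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is superposition? [/INST] Superposition is the principle that a quantum system can exist in a combination of several states at once until it is measured.
EOF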

  5. I started the fine-tuning and left it running for almost 12 hours to complete 1 epoch:

./finetune --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --train-data ../datasets/FineTune/train_llamacpp.txt --threads 26 --sample-start "<s>" --ctx 512 -ngl 32
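As a side note, the finetune example also accepts flags for naming the LoRA output and for writing training checkpoints periodically (flag names as documented in the finetune example's README around that time; they may differ between versions, and the output filenames below are just placeholders). A variant of the command above might look like:

./finetune \
  --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin \
  --train-data ../datasets/FineTune/train_llamacpp.txt \
  --lora-out ../models/llama-2-13b-chat/lora-physics-f32.gguf \
  --checkpoint-out ../models/llama-2-13b-chat/chk-physics.gguf \
  --save-every 10 \
  --threads 26 --sample-start "<s>" --ctx 512 -ngl 32

Without --lora-out, the adapter is written under a default name such as ggml-lora-LATEST-f32.gguf, which is the file used in the next step.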

  6. Then I tested the model:

./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --lora-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin --lora ../models/llama-2-13b-chat/ggml-lora-LATEST-f32.gguf --color -p "What is entanglement in physics?"

I always get these as the last lines of the logs:

.....
llama_apply_lora_from_file_internal: unsupported tensor dimension 1
llama_init_from_gpt_params: error: failed to apply lora adapter
ggml_metal_free: deallocating
main: error: unable to load model

I am running everything on an M1 Max with 64 GB of RAM and a 32-core GPU.

What could be the problem? I have already tried different things without success, which is why I'm writing here...

Thank you for any help

Luca

slaren self-assigned this Dec 16, 2023
slaren (Collaborator) commented Dec 16, 2023

Can you test if the lora works with #3333?
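One common way to try an open PR locally, assuming a git clone of llama.cpp, is to fetch its head into a temporary branch and rebuild (the branch name below is arbitrary):

git fetch origin pull/3333/head:pr-3333
git checkout pr-3333
make clean && make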

slaren linked a pull request Dec 16, 2023 that will close this issue
xcottos (Author) commented Dec 16, 2023

Hi slaren,

I tried it and it moved forward; now I am receiving this:

ggml_new_object: not enough space in the context's memory pool (needed 983635488, available 849346560)
zsh: segmentation fault ./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin

In the meantime, I'm trying to convert the model to fp16, quantise it to q4_0, and retrain it until the first checkpoint (it usually takes roughly 1 hour).
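A sketch of that fp16 + q4_0 path, assuming the same directory layout and flags as the earlier commands (the f16 output filename is just a placeholder):

python convert.py ../models/llama-2-13b-chat/llama-2-13b-chat-hf --outtype f16 --outfile ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f16.bin
./quantize ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f16.bin ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin q4_0
./finetune --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin --train-data ../datasets/FineTune/train_llamacpp.txt --threads 26 --sample-start "<s>" --ctx 512 -ngl 32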

Thank you for your support!

Below is the full output right after the start of loading the LoRA (where my previous attempts stopped):
......
llama_apply_lora_from_file_internal: applying lora adapter from '../llama.cpp/ggml-lora-LATEST-f32.gguf' - please wait ...
llama_apply_lora_from_file_internal: r = 4, alpha = 4, scaling = 1.00
llama_apply_lora_from_file_internal: allocating 810 MB for lora temporary buffer
llama_apply_lora_from_file_internal: loading base model from '../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin'
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 5120, 5120, 1, 1 ]
..........
..........
llama_model_loader: - tensor 358: blk.39.ffn_down.weight f16 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: output.weight f16 [ 5120, 32000, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama-2-13b-chat
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
ggml_new_object: not enough space in the context's memory pool (needed 983635488, available 849346560)
zsh: segmentation fault ./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bi

slaren (Collaborator) commented Dec 16, 2023

Thanks for testing! That issue should be fixed now; the LoRAs created by finetune are larger than I realized.

xcottos (Author) commented Dec 16, 2023

It worked! Thanks slaren, I hope it's merged into the main branch as soon as possible.

I appreciate your fast response and solution!

Luca
