
Failed to apply lora adapter after fine-tuning a llama-2-13B-chat with ./finetune #4499

Closed
xcottos opened this issue Dec 16, 2023 · 4 comments · Fixed by #3333


xcottos commented Dec 16, 2023

Hi everybody,

I am trying to fine-tune a llama-2-13B-chat model and I think I did everything correctly but I still cannot apply my lora.

What I did was:

  1. I converted the llama2 weights into HF format using this script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

python convert_llama_weights_to_hf.py --input_dir ../models/llama-2-13b-chat --output_dir ../models/llama-2-13b-chat/llama-2-13b-chat-hf --model_size 13B

  2. Then I ran convert.py and converted the model to fp32:

python convert.py ../models/llama-2-13b-chat/llama-2-13b-chat-hf --outtype f32 --outfile ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin

  3. Then I quantised it (I am using q5_k_m and not q4_0):

./quantize ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin q5_k_m

At this point I tested the models with ./main and they worked perfectly.

  4. So I created a dataset; after trying to orient myself among many contradictory answers about the format, I opted in the end for the one that seemed the most used:

<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is quantum entanglement? [/INST] Quantum entanglement is the phenomenon that occurs when a group of particles are generated, interact, or share in such a way that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance.


(I didn't put any </s> at the end because, for some reason, the loss became NaN after fewer than 10 iterations.)
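For reference, here is a minimal sketch of how a training file in that format could be assembled so that every sample begins with the <s> marker later passed as --sample-start. The output path matches the one used in the next step, but the second question/answer pair is only an illustrative placeholder, not part of the actual dataset:

cat > ../datasets/FineTune/train_llamacpp.txt << 'EOF'
<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is quantum entanglement? [/INST] Quantum entanglement is the phenomenon that occurs when a group of particles are generated, interact, or share in such a way that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance.
<s>[INST] <<SYS>>
In the context of physics
<</SYS>>

What is superposition? [/INST] Superposition is the principle that a quantum system can exist in a combination of several states at once until it is measured.
EOF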

  5. I started the fine-tuning and left it running for almost 12 hours to complete 1 epoch:

./finetune --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --train-data ../datasets/FineTune/train_llamacpp.txt --threads 26 --sample-start "<s>" --ctx 512 -ngl 32
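As a side note, the finetune example also accepts flags for naming the LoRA output and for writing training checkpoints periodically (flag names as documented in the finetune example's README around that time; they may differ between versions, and the output filenames below are just placeholders). A variant of the command above might look like:

./finetune \
  --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin \
  --train-data ../datasets/FineTune/train_llamacpp.txt \
  --lora-out ../models/llama-2-13b-chat/lora-physics-f32.gguf \
  --checkpoint-out ../models/llama-2-13b-chat/chk-physics.gguf \
  --save-every 10 \
  --threads 26 --sample-start "<s>" --ctx 512 -ngl 32

Without --lora-out, the adapter is written under a default name such as ggml-lora-LATEST-f32.gguf, which is the file used in the next step.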

  6. Then I tested the model:

./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q5_k_m.bin --lora-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin --lora ../models/llama-2-13b-chat/ggml-lora-LATEST-f32.gguf --color -p "What is entanglement in physics?"

I always get these as the last lines of the logs:

.....
llama_apply_lora_from_file_internal: unsupported tensor dimension 1
llama_init_from_gpt_params: error: failed to apply lora adapter
ggml_metal_free: deallocating
main: error: unable to load model

I am running everything on an M1 Max with 64 GB of RAM and a 32-core GPU.

What could be the problem? I have already tried different things without success, which is why I'm writing here...

Thank you for any help

Luca

slaren self-assigned this Dec 16, 2023
slaren (Collaborator) commented Dec 16, 2023

Can you test if the lora works with #3333?
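One common way to try an open PR locally, assuming a git clone of llama.cpp, is to fetch its head into a temporary branch and rebuild (the branch name below is arbitrary):

git fetch origin pull/3333/head:pr-3333
git checkout pr-3333
make clean && make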

slaren linked a pull request Dec 16, 2023 that will close this issue
xcottos (Author) commented Dec 16, 2023

Hi slaren,

I tried it and it moved forward; now I am receiving this:

ggml_new_object: not enough space in the context's memory pool (needed 983635488, available 849346560)
zsh: segmentation fault ./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin

In the meantime, I'm trying to convert the model to fp16, quantise it to q4_0, and retrain it until the first checkpoint (it usually takes roughly 1 hour).
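A sketch of that fp16 + q4_0 path, assuming the same directory layout and flags as the earlier commands (the f16 output filename is just a placeholder):

python convert.py ../models/llama-2-13b-chat/llama-2-13b-chat-hf --outtype f16 --outfile ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f16.bin
./quantize ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f16.bin ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin q4_0
./finetune --model-base ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bin --train-data ../datasets/FineTune/train_llamacpp.txt --threads 26 --sample-start "<s>" --ctx 512 -ngl 32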

Thank you for your support!

Below is the full output right after the start of loading the LoRA (where my previous attempts stopped):
......
llama_apply_lora_from_file_internal: applying lora adapter from '../llama.cpp/ggml-lora-LATEST-f32.gguf' - please wait ...
llama_apply_lora_from_file_internal: r = 4, alpha = 4, scaling = 1.00
llama_apply_lora_from_file_internal: allocating 810 MB for lora temporary buffer
llama_apply_lora_from_file_internal: loading base model from '../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin'
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../models/llama-2-13b-chat/llama-2-13b-chat-hf-f32.bin (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 5120, 5120, 1, 1 ]
..........
..........
llama_model_loader: - tensor 358: blk.39.ffn_down.weight f16 [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: output.weight f16 [ 5120, 32000, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llama-2-13b-chat
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
ggml_new_object: not enough space in the context's memory pool (needed 983635488, available 849346560)
zsh: segmentation fault ./main -i -m ../models/llama-2-13b-chat/llama-2-13b-chat-hf-quantized_q4_0.bi

slaren (Collaborator) commented Dec 16, 2023

Thanks for testing! That issue should be fixed now; the LoRAs created by finetune are larger than I realized.

xcottos (Author) commented Dec 16, 2023

It worked! Thanks slaren, I hope it's merged into the main branch as soon as possible.

I appreciate your fast response and solution!

Luca
