How to fine tune it? #8
Fine-tuning is not possible at the moment. You can fine-tune a model with some other implementation and then convert it and use it with https://github.com/ggerganov/whisper.cpp/tree/master/models#fine-tuned-models |
Hey Georgi, just curious: why is fine-tuning not possible, technically speaking? Let's ignore CUDA for now and assume it would work CPU-only - what is missing? Thank you! |
We need the |
Thanks - after I asked the question here I did some research and understood that your software provides an interface, not the tooling for fine-tuning. I was a noob back when I asked this. |
Training directly with ggml would be really nice. I had to add another ggml operation, GGML_OP_ADD_AT, as a counterpart for GGML_OP_VIEW in the backward pass. This duplicated the code for the add functions. Maybe the offset parameter can simply be moved into the regular add functions, which could then be used for ADD_AT. I was not sure about the performance of doing it that way, so I just duplicated the functions for now. I will continue with the rest, test it with this repo's test_grad, and make a pull request when I think it is ready. |
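To illustrate the idea with a minimal plain-C sketch (not ggml code): the backward pass of a view has to scatter the view's gradient back into the parent tensor at the view's offset, which is exactly an "add at offset" operation - a regular add with one extra offset parameter.

```c
// Plain-C sketch of why a view op needs an "add at" counterpart:
// forward:  y[i] = x[offset + i]          (a view into x)
// backward: dx[offset + i] += dy[i]       (accumulate into the parent gradient)
#include <stddef.h>

static void view_backward(float * dx, const float * dy, size_t offset, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dx[offset + i] += dy[i]; // identical to a normal add, just shifted by offset
    }
}
```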
Only GGML_OP_ROPE is still missing for llama, plus GGML_OP_GET_ROWS, but the latter is only required for training the token embeddings. So far the gradients are untested; that will come next, right after the rope backward pass is implemented. Unfortunately a bunch of new operations had to be added: GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO and GGML_OP_ROPE_BACK. GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX. GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE; the backward passes of these functions cannot easily be composed of existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of the operations that appear in those backward passes - sounds a bit confusing at first, I know... GGML_OP_DIAG_MASK_ZERO is necessary for the backward pass of GGML_OP_DIAG_MASK_INF. Some operations were previously inplace-only; the backward pass needs non-inplace variants. |
@xaedes |
I successfully tested every backward pass except rms_norm. When rms_norm also works, I will push it to a proper branch of my llama fork, as I currently have Python bindings in my working branch. If necessary I can also rewrite that stuff to integrate with the mentioned refactoring, but I will make it work here first^^ |
@xaedes have you looked into Windows Subsystem for Linux? |
@danforbes Yep, but for easier development with the setup I am used to, I wanted it to work in my usual environment and didn't want to get lost fiddling with platform stuff^^ |
@ggerganov I have now successfully tested all backward passes necessary for llama. https://github.com/xaedes/llama.cpp/tree/training-integrate List of all new operations that I had to add:
Notable other changes:
Next I will look into making an example for training a baby llama, or a small LoRA finetune on some late layer. I might find some still-undiscovered issues during this^^ |
I guess
No strong preference
Don't think it is important. At some point I was thinking that one of the methods is better than the other, but I think in the end they give pretty much the same performance. Amazing work! |
Got a baby llama model trained from scratch to output a sin signal: After training with one call to ggml_opt (default ADAM settings) on only one example, its output is better, but still not really good - as expected after seeing just one example^^. |
Ha, I didn't realize we can simply train with mathematical functions. I was always thinking we needed to get some text data into this. Ok, I understand the idea - the cost function is |
Training directly on sin etc., given the required ggml operations, would also be possible, but here I just tokenized the sinus output (floats in [-1.0, +1.0]) to token ids in [0..n_vocab-1]. The cost function I used in this first attempt is probably not good - I just took what I used in test_opt.c for a first test^^ |
Not sure what is normally used in practice. |
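A hypothetical sketch of the kind of tokenization described above (names and constants are illustrative, not from the actual code): map each sine sample in [-1.0, +1.0] onto one of n_vocab equally spaced token ids, and map back for generation.

```c
#include <math.h>

// float in [-1, 1]  ->  token id in [0, n_vocab-1]
static int tokenize_sample(float x, int n_vocab) {
    if (x < -1.0f) x = -1.0f;
    if (x >  1.0f) x =  1.0f;
    return (int)((x + 1.0f) * 0.5f * (float)(n_vocab - 1) + 0.5f);
}

// token id -> float in [-1, 1]
static float detokenize_sample(int id, int n_vocab) {
    return 2.0f * (float)id / (float)(n_vocab - 1) - 1.0f;
}

// build one training example: a window of n_ctx tokens sampled from a sine wave
static void make_sin_example(int * tokens, int n_ctx, int n_vocab, float phase) {
    for (int i = 0; i < n_ctx; ++i) {
        tokens[i] = tokenize_sample(sinf(phase + 0.1f * (float)i), n_vocab);
    }
}
```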
Ohh, switching from adam to lbfgs produces MUCH better results! best samples before optimization:
best samples after optimization:
When optimized with adam, best samples after optimization:
|
Yeah, I have always wondered why ADAM is considered state-of-the-art |
Maybe a bug is lingering somewhere in opt_adam? Anyway, I'm just sticking with lbfgs for now - that sinus looked really good :) |
Adam or AdamW? The latter should be preferred... |
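For reference, the difference between the two is only in where weight decay enters the update (this is the standard formulation, not a claim about what ggml_opt currently implements):

$$
\begin{aligned}
&\text{Adam with L2:} && g_t = \nabla_\theta L(\theta_{t-1}) + \lambda\,\theta_{t-1}, && \theta_t = \theta_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},\\
&\text{AdamW:} && g_t = \nabla_\theta L(\theta_{t-1}), && \theta_t = \theta_{t-1} - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_{t-1}\right),
\end{aligned}
$$

where $\hat m_t$ and $\hat v_t$ are the bias-corrected first and second moment estimates of $g_t$. With $\lambda = 0$ the two are identical.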
@xaedes how's your progress on the training? Is it ready for some tests? I have datasets and hardware sitting around; I'd be happy to take it for a test drive and deliver some stats on the performance.

Also, assuming you can train it for anything, it might be a more interesting development than people realize. If you can finetune on CPU at some kind of reasonable speed, it means you can use the technology in a different context. For example, you could have an application with a thread running that continuously and incrementally trains on whatever the user adds to it, plus their previous conversations with it, timestamped. That would mean the user could add a GitHub repository to be ingested, add their own code, or say "hey, remember yesterday we were talking about xyz, I just had a thought..." It would be a breakthrough. GPU training has to be configured for the hardware, but CPU training can be run on anything in the background. No need to install dependencies, no need to overheat the room. You could use it in an app and train on transcripts of phone calls, so the user could ask about previous conversations they've had: "what time did Sarah say we were meeting?". Gamechanger. I also suspect we're going to see next-generation CPUs bridging the gap even more. |
Found some bugs along the way that needed some time to fix...^^ In the first tests the gradient did not actually get propagated to all model parameters. At first I also trained it to predict the current token instead of the next token and wondered for quite some time why it would only generate flat lines, despite hitting the target logits very well during training. Now a from-scratch llama generating an endless sinus wave works correctly :) https://github.com/xaedes/llama.cpp/commits/train-example

Training on multiple examples now also works. Just calling ggml_opt with a low max-iterations setting in a loop and properly cleaning up the created tensors each iteration was enough. But it generates some unnecessary overhead by recreating the whole forward and backward computation graphs each time. With more refactoring of ggml_opt we could just reuse the forward and backward computation graphs and the optimizer state. I experimented a bit with that but decided against using it for now, because I did not want to touch the ggml_opt functions unless absolutely necessary.

A parallel batched forward function would probably be a good improvement. Training on multiple examples in a (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since the backward pass is necessary as well.

Changing the target logits from 0 & 1 to -1 & +1 greatly improved the training. I tried cross-entropy loss on the softmaxed probabilities instead of the sum of squared logit errors, but it was consistently worse. I did not look into training a LoRA finetune yet, but the necessary machinery for that seems to be working.
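For reference, the two losses compared above behave quite differently at the gradient level (standard results, stated here with target logits $t$, target distribution $q$, model logits $z$ and $p = \mathrm{softmax}(z)$):

$$
L_{\text{sq}} = \sum_i (z_i - t_i)^2, \quad \frac{\partial L_{\text{sq}}}{\partial z_i} = 2\,(z_i - t_i);
\qquad
L_{\text{ce}} = -\sum_i q_i \log p_i, \quad \frac{\partial L_{\text{ce}}}{\partial z_i} = p_i - q_i.
$$

The squared-error loss pushes the raw logits toward fixed target values, while cross entropy only cares about the resulting probability distribution, which matches how the model is later sampled.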
Ok, then I'll change it to use that name and prepare a pull request with the training-from-scratch example before I get lost any longer on the LoRA finetune - that can come next. Just to make sure there is no misunderstanding about what the GGML_OP_ACC function does: the corresponding function signature currently looks like below.
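The signature itself did not survive the export of this thread; for reference, the accumulate operation in current ggml.h looks like this (a reconstruction, not a quote of the original comment):

```c
// dst = a, then: view(dst, nb1, nb2, nb3, offset) += b, returns dst
GGML_API struct ggml_tensor * ggml_acc(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        size_t                nb1,
        size_t                nb2,
        size_t                nb3,
        size_t                offset);
```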
|
@xaedes you are a hero! I'm going to address some of those things you mentioned, but I might just be highlighting my ignorance, because while I have a lot of experience with old-school classification, I've not been working for a few years and am only just updating myself on how transformers work.

(1) Recreating the computation graphs has always seemed inefficient to me, but it does the same thing during inference, no? If so, a solution here could be a breakthrough for faster inference. It would have to be a perfect reproduction though, not an approximation, because otherwise you would not be training for the same result (although it's possible that it only needs to be approximate). It's always struck me as odd that the full computation is done for every token, again and again. Especially because it's obvious that if the model can write "and" as the next word to anything at all, it already calculated, in some sense, what was going to come next. That's just wasted the way these models currently work. Some people say it's just a next-token predictor, but after my struggle to understand how it works, I see now that that's not true - it does, in some sense, understand.

(2) Training on multiple examples: there should be a sweet spot in theory. However, when you're doing this on natural language you always have a batch, because there are a ton of tokens to predict for every one new "document" trained on. BTW, for ingesting a document every word is masked except the first, and for instruct or finetuning generally the input/question is not masked and the output/answer is masked. There is also the question of left- vs right-hand padding, and the padding token. The padding token for LLaMA is 0, as can be seen here. However, it doesn't have a default left or right padding because there is no padding in the base model; every batch was always 2048 in length. In the finetuning implementations online, people are using left or right - there is no standard. I recommend left padding because it ensures that the attention mechanism can focus on the actual text without being affected by padding tokens that have no meaningful content. In contrast, right padding could potentially introduce noise in the attention scores, as the model needs to learn to ignore the padding tokens.

(3) Changing the logits target from 0 & 1 to -1 & 1: in theory that should make no difference, because it's immediately softmaxed after this, no? If it makes a difference then I would first ask how many bits you're storing that with. Potentially the reason it's helping is that you're trying to predict a sin wave, and so any lossiness is going to normalize -1 & 1 towards 0, which is better for your sin wave (it's because you have an even distribution, whereas you would not have an even distribution with natural language). Whatever the reason, it's an indication of a problem elsewhere.

(4) Cross-entropy vs sum of squared logits: be careful here, because you're training a text-based sin wave generator, right? Cross-entropy is recommended for transformers & natural language, but your test is not doing that. Your sin wave test can't really be used as an example of what natural language looks like, so I wouldn't try optimizing for it. I can provide you with datasets if you need.

(5) LoRA is the future for sure, because it allows incremental finetunes over and over. But yeah, gotta get it working first - it's already impressive!

(6) Are you using the SELU activation function? That's what's recommended for LLaMA.
Also, look into flash attention. I've talked to a few people about this and it's considered the best attention mechanism for LLaMA by a country mile. I understand @ggerganov tested this for inference and found it not really any better, but for training at least I've anecdotally heard of 20x speed and 90% memory improvements. More specifically, it reduces the memory requirements for longer context lengths to O(n). |
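For context (not part of the original comment): the O(n) claim comes from the fact that standard attention materializes the full $n \times n$ score matrix per head, while FlashAttention processes it in tiles and never stores it:

$$
M_{\text{standard}} = O(n^2)\ \text{per head (for } S = QK^\top \text{ and } P = \mathrm{softmax}(S)\text{)},
\qquad
M_{\text{flash}} = O(n)\ \text{extra memory}.
$$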
At least with the kv_cache a lot of computation can be avoided during inference.
For this case the training should be able to make use of the kv_cache with n_past = 2. Your notes regarding the padding are interesting; I will keep them in mind when working on actual model LoRA finetunes!

After some further tests, having other issues resolved and now with parallel batched training, I find that adam with cross entropy works as well as, if not sometimes better than, lbfgs with squared error sum. As adam provides an easier parameter for a learning schedule (still todo - some very first tests with exp-decay were meh), I will probably focus more on that. @loretoparisi suggested that AdamW should be preferred. I don't know which one we use, probably the one without W, but after a short skim it should just be a matter of computing a different scalar somewhere. It would make sense to implement and test it for training if we are not already using it.

Do you mean SELU instead of SwiGLU? SwiGLU (internally using SILU), as used in the paper and the official llama inference code, is also used in llama.cpp.

Flash attention could be really interesting if it helps improve training performance that much! The ggml forward-pass implementation looks way less intimidating than I remember it from the paper. Maybe I should look into implementing the backward pass for that as well some time. |
Here is the flash attention implementation that I've tried without gaining any performance: ggerganov/llama.cpp#778. As a side note, today I was intrigued by the "multi-query" attention paper that uses |
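For context (not part of the original comment): multi-query attention shares a single key/value head across all query heads, which among other things shrinks the kv cache by a factor of $n_{\text{head}}$ (elements stored for K and V across all layers and context positions):

$$
M_{\text{MHA}} = 2\,n_{\text{layer}}\,n_{\text{ctx}}\,n_{\text{head}}\,d_{\text{head}},
\qquad
M_{\text{MQA}} = 2\,n_{\text{layer}}\,n_{\text{ctx}}\,d_{\text{head}}.
$$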
Hahahaha, me too - I was wondering if multi-query attention was possible!

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = self.hid_dim // self.n_heads
        # queries keep n_heads projections; keys/values are shared across all heads
        self.fc_q = nn.Linear(self.hid_dim, self.hid_dim)
        self.fc_k = nn.Linear(self.hid_dim, self.head_dim)
        self.fc_v = nn.Linear(self.hid_dim, self.head_dim)
        self.fc_o = nn.Linear(self.hid_dim, self.hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # query = [batch size, query len, hid dim]
        # key   = [batch size, key len, hid dim]
        # value = [batch size, value len, hid dim]
        Qbank = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        Kbank = self.fc_k(key).view(batch_size, -1, 1, self.head_dim).permute(0, 2, 3, 1)
        Vbank = self.fc_v(value).view(batch_size, -1, 1, self.head_dim).permute(0, 2, 1, 3)
        # Qbank = [batch size, n heads, query len, head dim]
        # Kbank = [batch size, 1, head dim, key len]
        # Vbank = [batch size, 1, value len, head dim]
        energy = torch.matmul(Qbank, Kbank) / self.scale
        # energy = [batch size, n heads, query len, key len]
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        attention = F.softmax(energy, dim=-1)
        # attention = [batch size, n heads, query len, key len]
        x = torch.matmul(self.dropout(attention), Vbank)
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(batch_size, -1, self.hid_dim)
        # x = [batch size, seq len, hid dim]
        x = self.fc_o(x)
        return x, attention
```
|
Oh wow! Interestingly, there is a more recent Multi-Query Attention implementation by the MosaicML team for MPT-7B here. I did not know they were actually using Multi-Query Attention for the MPT models - did you?

```python
import math
import warnings
from typing import Optional

import torch
import torch.nn as nn

# LPLayerNorm, flash_attn_fn, triton_flash_attn_fn and
# scaled_multihead_dot_product_attention are defined elsewhere in the
# MPT codebase this snippet was copied from.

class MultiQueryAttention(nn.Module):
    """Multi-Query self attention.

    Using torch or triton attention implementation enables user to also use
    additive bias.
    """

    def __init__(self, d_model: int, n_heads: int, attn_impl: str = 'triton',
                 clip_qkv: Optional[float] = None, qk_ln: bool = False,
                 softmax_scale: Optional[float] = None, attn_pdrop: float = 0.0,
                 low_precision_layernorm: bool = False, device: Optional[str] = None):
        super().__init__()
        self.attn_impl = attn_impl
        self.clip_qkv = clip_qkv
        self.qk_ln = qk_ln
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.softmax_scale = softmax_scale
        if self.softmax_scale is None:
            self.softmax_scale = 1 / math.sqrt(self.head_dim)
        self.attn_dropout_p = attn_pdrop
        # one fused projection: d_model for Q plus a single head each for K and V
        self.Wqkv = nn.Linear(d_model, d_model + 2 * self.head_dim, device=device)
        fuse_splits = (d_model, d_model + self.head_dim)
        self.Wqkv._fused = (0, fuse_splits)
        if self.qk_ln:
            layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
            self.q_ln = layernorm_class(d_model, device=device)
            self.k_ln = layernorm_class(self.head_dim, device=device)
        if self.attn_impl == 'flash':
            self.attn_fn = flash_attn_fn
        elif self.attn_impl == 'triton':
            self.attn_fn = triton_flash_attn_fn
            warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` '
                          'it uses more memory. When training larger models this can trigger '
                          'alloc retries which hurts performance. If encountered, we recommend '
                          'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
        elif self.attn_impl == 'torch':
            self.attn_fn = scaled_multihead_dot_product_attention
            if torch.cuda.is_available():
                warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or '
                              '`prefix_lm` we recommend using `attn_impl: flash` otherwise '
                              'we recommend using `attn_impl: triton`.')
        else:
            raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
        self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
        self.out_proj._is_residual = True

    def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None,
                is_causal=True, needs_weights=False):
        qkv = self.Wqkv(x)
        if self.clip_qkv:
            qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
        (query, key, value) = qkv.split([self.d_model, self.head_dim, self.head_dim], dim=2)
        key_padding_mask = attention_mask
        if self.qk_ln:
            dtype = query.dtype
            query = self.q_ln(query).to(dtype)
            key = self.k_ln(key).to(dtype)
        if past_key_value is not None:
            if len(past_key_value) != 0:
                key = torch.cat([past_key_value[0], key], dim=1)
                value = torch.cat([past_key_value[1], value], dim=1)
            past_key_value = (key, value)
        if attn_bias is not None:
            attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
        (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads,
                                               softmax_scale=self.softmax_scale,
                                               attn_bias=attn_bias,
                                               key_padding_mask=key_padding_mask,
                                               is_causal=is_causal,
                                               dropout_p=self.attn_dropout_p,
                                               training=self.training,
                                               needs_weights=needs_weights,
                                               multiquery=True)
        return (self.out_proj(context), attn_weights, past_key_value)
```
|
@xaedes I see the baby-llama has been merged into the master branch and draws a pretty sin wave. What's the intention for this going forward? Is it just a proof of concept and you're happy, or do you intend to expand it to the point of realistically being able to train an LLM from scratch? I've been working on an optimal tokenizer (I've just completed an ungreedy version and will be putting it up in a few days, once the vocabs are built) and a text normalizer to support it. The idea of having CPU-based training that I could use to train from scratch on my own tokenizer is appealing to me. My goal is to have training running 24/7 on a server, just slowly learning forever, whilst saving its state out regularly. How far away are we from that dream? |
@alasdairforsythe I am still working on the text training. https://github.com/xaedes/llama.cpp/tree/text-from-scratch

To make it work at all with a 32001-sized vocabulary I had to improve training performance quite a bit by replacing some slow functions and avoiding the need to create huge intermediate matrices. Got a small example working to train a small model from scratch on some text. It works okay, but I'm still not really happy with the cross-entropy loss function. It often lands in bad local minima that it does not really escape from. But I recently found a bug in my cross-entropy loss function; fixing it may improve that. Squared error converges a lot faster, but I suspect it also overfits greatly to the given examples, because it essentially trains directly towards a specific target probability distribution (what I define in the examples) instead of slowly training towards the distribution of the actual whole dataset.

Training a real-sized llama from scratch with a 4M batch size like they did in the original paper will probably require lots of memory and runtime. Batch size > 1 seems to be absolutely necessary for good training, but it multiplies the memory and runtime. I might look into LoRA finetuning first before further pursuing the batch size issue, because for finetuning smaller batch sizes are supposed to be ok. The other suggestion of multi-query attention sounds very interesting for training from scratch; I might look into that - it doesn't look too hard to implement and test. |
@xaedes you should initialize the parameters of the model to small values, to converge faster. I know you use a normal distribution, which already lands them mostly in the small-value range, but most models use something like -0.02 to 0.02, or similar.
Maybe tweak the learning rate or something? |
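A hypothetical init sketch (plain C, illustrative only, not the actual code): draw from a normal distribution via Box-Muller and scale it down, either by a fixed 0.02 as many GPT-style models do, or by 1/sqrt(n_embd) as tried later in this thread.

```c
#include <math.h>
#include <stdlib.h>

// one standard-normal sample via the Box-Muller transform
static float frand_normal(void) {
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(2.0f * 3.14159265358979f * u2);
}

// initialize n weights with small values so early training is well-behaved
static void init_weights(float * w, int n, int n_embd) {
    const float scale = 1.0f / sqrtf((float)n_embd); // alternatively a fixed 0.02f
    for (int i = 0; i < n; ++i) {
        w[i] = scale * frand_normal();
    }
}
```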
Regarding batch size: usually when these are trained they're using multiple GPUs, so each "batch" is broken down into multiple micro-batches; each GPU processes one micro-batch and then these are merged into the single batch. But an advantage of doing it on CPU is using system RAM, which I assume you have a lot of? I can lend you a 256GB RAM server for testing if you need.

From my understanding it's the activations/intermediates that are increasing the memory usage at higher batch sizes. Is that right? You only need enough memory for 2 sets of upstream gradients at any point in time, and the rest of it is the fixed size of the model. Aren't the rules different because you're doing this on CPU? A significant bottleneck for the GPU is transferring the data from system memory to GPU memory, but you're not using a GPU, and you're not multiplying all those matrices in parallel. So technically they don't all need to be in memory at the same time. IO may be fast enough to load them in and out, seeing as your operations are largely sequential?

Like I said, I'm just learning, so I may be talking rubbish. But to state the obvious: what stops you from saving these to disk and then later loading them back again? Or at the very least you could load in the activations one layer at a time, since you only need the activations for each layer one at a time. You could memory-map it. Or, seeing as you know ahead of time how much you need to load in and when, you might be able to do better than memory mapping. You could read an array of them, statically cast, directly into a memory location that is already defined as whatever the struct is. One thread could be loading in the next batch of intermediates whilst you're processing this one. The bottleneck might not even be IO, but if it is, it's not necessarily worse than making a sacrifice somewhere else. If IO is the bottleneck, gradient checkpointing reduces memory significantly by recalculating some of those values, which would be exactly what you want in that context. You could even have it as an option, depending on the IO speed. I'd also say it's not unreasonable to "expect" NVMe SSDs if there is not enough RAM. It is 2023. If I'm out of my depth and wasting your time, just let me know. |
@xaedes I've been pondering this problem whilst attempting to understand more about the problem, and I've come up with the following:
That's essentially it, right? Data can be compressed, reduced, recreated, stored, or not needed. What else can you do with it? Given that IO is the unused factor so far, that seems to be where the obvious "free" gains are.

To do it well would, I suspect, mean building a little memory-management system. If the structure were arranged so that data that are used together are stored together in memory, they could easily be written out and read in asynchronously. There could be a defined number of buffers that contain the working data. You know exactly how much data you need to store that must be accessed at the same time, so from that you can determine the correct size for a buffer. And since you also know the total memory requirements for the model size, etc., and you know dynamically how much memory is available on the machine, it's easy to dynamically calculate the number of these "buffers" to use, based on those figures. It means there can be a user-defined peak memory usage, from which you calculate the number of buffers.

On the forward pass you would fill a buffer and send the pointer & identifier to the memory manager, which would asynchronously save it and send you back a pointer to a free buffer, or return a nil pointer if none are available (at which point you could either wait or do something else). And during backpropagation you send a pointer to the buffer you want to "free" and it sends you either a pointer to the next loaded data or, if it's still reading it in, a nil pointer, at which point you can revert to doing something else, such as recreating the data if that's practical. If the struct is trivial and it begins at an aligned position, you can write it without any additional buffer or serialization by statically casting it to an array of bytes, and vice versa to get it back.

If IO speed is the bottleneck, there are specialized compression formats, such as Snappy, which was designed for live compression and decompression for the purpose of reducing IO bottlenecks. You could in fact use Snappy as a CPU/IO tradeoff, enabling it automatically if the memory manager detects that IO is the bottleneck, which is determined by counting the number of times you try to write a buffer when all the buffers are still being written, or read a buffer when they're still being read. If at least 20% of the requests are met with a failure (no buffer available), switch that boolean flag and from then on use Snappy (or perhaps a compression format better suited to floats) to compress and decompress the data, which will probably halve the IO in exchange for greater CPU load - exactly what you want in that circumstance. |
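One concrete way to get much of this "for free" is the memmap idea picked up in the reply below: back a large ggml context with a memory-mapped file so the OS pages activations in and out on demand. A rough POSIX sketch under that assumption - ggml_init and ggml_init_params are real ggml API, while the ggml_init_mmap helper is hypothetical:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include "ggml.h"

// hypothetical helper: create a ggml context whose memory pool lives in a file
struct ggml_context * ggml_init_mmap(const char * path, size_t mem_size) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t) mem_size) != 0) { close(fd); return NULL; }

    void * buf = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor
    if (buf == MAP_FAILED) return NULL;

    struct ggml_init_params params = {
        /*.mem_size   =*/ mem_size,
        /*.mem_buffer =*/ buf,   // ggml places all tensors inside this mapping
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);
}
```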
@Green-Sky Good point. I divided by the sqrt of the number of dimensions, as suggested elsewhere, and it really helped improve initial convergence.

@alasdairforsythe Thanks for your input. I think using memmap files as the backend for ggml contexts with large memory requirements would give a lot of the features that you described. So maybe we should try that at some point. Maybe gradient accumulation - just looping over different data and summing the gradients used for the optimizer - is faster when it can avoid the swapping.

I fixed the cross-entropy loss function and now it works as it should. Overall I am pretty happy with the current state; it actually learns to generate plausible text. Trained on Genesis 1 for 64x16 iterations (256 n_emb, 4 n_layer, 32 n_ctx, 16 n_batch):
Will soon make a llama pull request with an example of how to train a small llama-compatible (i.e. loadable by main) model from scratch on custom text data. After the pull request I will continue by experimenting with LoRA finetuning, multi-query attention, flash attention, gradient accumulation & memmap-based ctx to train with larger batch sizes. |
One possible application of these "baby" LLaMA models is "speculative sampling": ggerganov/llama.cpp#630 (comment) A paper claims about 2x faster inference can be achieved with such an approach: https://arxiv.org/abs/2302.01318 |
I'm looking at the code right now! The author of picoGPT (GPT-2 inference in pure NumPy) implemented speculative sampling in Python and tested it on GPT-2, achieving a 2x speed-up. |
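A rough sketch of the speculative loop under discussion (a greedy-acceptance simplification of the linked paper; draft_next, target_logits and argmax_token are hypothetical stand-ins for a small "baby" model and the full model):

```c
// hypothetical model interfaces
int  draft_next(const int * tokens, int n);                     // cheap draft model
void target_logits(const int * tokens, int n, float * logits);  // full model
int  argmax_token(const float * logits, int n_vocab);

// propose k tokens with the cheap draft model, then let the big model verify them
int speculative_step(int * tokens, int n, int n_vocab, int k, float * logits) {
    for (int i = 0; i < k; ++i) {
        tokens[n + i] = draft_next(tokens, n + i);   // cheap: small model
    }
    int accepted = 0;
    for (int i = 0; i < k; ++i) {
        target_logits(tokens, n + i, logits);        // in practice: one batched pass
        const int best = argmax_token(logits, n_vocab);
        accepted = i + 1;
        if (best != tokens[n + i]) {                 // first disagreement:
            tokens[n + i] = best;                    // keep the target's token
            break;                                   // and discard the rest
        }
    }
    return n + accepted;                             // new sequence length
}
```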
It's worth noting that Sophia could be a valid alternative to the AdamW optimizer; the code is now available: |
The "scratch buffer" mechanism is something in this direction. It's not ideal and can be improved in many ways.
|
Hey @xaedes, how is it going with LoRA fine-tuning? Would be so cool to have this. Thanks for the great work! |
Also curious for an update on this - it's still extremely relevant for so many people to have LoRA/QLoRA support on Metal. |
Hi there, sorry for the long wait! I was on vacation for a few weeks and am now back working on this :) Memory usage improvements (mainly gradient checkpointing & opt-adam improvements) for training are done, and I will now start to make a pull request for them on the llama repo - lots of changes from master to merge... Development of LoRA finetuning will then start based on this. |
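For context (not from the original comment): gradient checkpointing stores activations only every $k$ layers and recomputes the intermediate ones during the backward pass, trading roughly one extra forward pass for a large memory saving:

$$
M_{\text{act}} \;\propto\; \frac{n_{\text{layer}}}{k} + k,
\quad\text{minimized at } k \approx \sqrt{n_{\text{layer}}},
\text{ i.e. } O(\sqrt{n_{\text{layer}}}) \text{ instead of } O(n_{\text{layer}}).
$$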
LoRA finetuning of a 3B model seems to mostly work now. Bigger models probably work as well; they just need more RAM. |
@xaedes Is there any way today to use ggml with a quantized GPT-J model (using ggml) and a LoRA adapter trained using Hugging Face? |
@xaedes thanks a lot for the great work! The question is: do I need to build an inference graph - which is basically just an encoder -> latent space - without re-initializing the model, so that I use the model's weights and biases? I am asking because it is unclear to me how inference is done in your code. Thanks! |
I am a noob. Can you describe how I can fine-tune it with your program? Is it possible? Maybe some articles?