finetune.cpp command-line arg #13873
base: master

Conversation
perhaps no need to review until i have an actual SGD impl in a follow-on, @JohannesGaessler - but a few general questions about contributing:
Better to keep that change as-is, since it takes time to get more feedback/approval.
Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.
My opinion is that people serious about training should be writing a program rather than use a command line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.
I don't remember whether those args were put in by me when I copypasted code or by Georgi when he later refactored it but I myself definitely did not make an intentional choice to use these exact arguments.
I don't know, sorry.
None of the previous perplexity-specific arguments are needed.
For adding an SGD optimizer, add a new ggml op like […].
yes, will do. should the actual SGD impl be a subsequent pull req (or several, e.g. starting first w/ just a CPU impl), or do you want it all in one pull req?
Either way would be fine with me, as long as there are at no point broken or unfinished features on master.
Force-pushed from e752031 to e689af8
Looking forward to the next PR(s).
add to ggml-opt: a learning rate (adamw alpha) cmdline arg, and an optimizer enum defaulting to adamw, including string->id mapping, preparatory to work to support SGD

these are in common args a set of optimizer options active only for the new FINETUNE example (but we drop all the previous finetune.cpp PERPLEXITY options, which we're told are unused/accidental)

perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h, as proposed in ggml-org#13835
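For readers unfamiliar with the pattern, here is a minimal sketch of the kind of optimizer enum plus string->id mapping described above. All names are hypothetical and chosen for illustration; they are not the identifiers the PR actually adds to ggml-opt.h or common args.

```cpp
// Sketch only: illustrative names, not the PR's actual API.
#include <cstring>

enum optimizer_type_sketch {
    OPTIMIZER_ADAMW_SKETCH,
    OPTIMIZER_SGD_SKETCH,
};

// Map a command-line string such as "adamw" or "sgd" to the enum,
// defaulting to AdamW when the name is not recognized.
static optimizer_type_sketch optimizer_from_name(const char * name) {
    if (name != nullptr && std::strcmp(name, "sgd") == 0) {
        return OPTIMIZER_SGD_SKETCH;
    }
    return OPTIMIZER_ADAMW_SKETCH; // default
}
```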
you should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (using git-clang-format). if there's something undesirable we could figure out what in the format config does it
Don't autoformat code en masse unless it's done in a dedicated PR; it makes it unnecessarily difficult to track what was actually changed in a PR.
Sorry, I didn't read the […] part.
Force-pushed from 7534bbf to 48a16bf
Hi @WilliamTambellini @JohannesGaessler I think this is usable now, inviting code nitpicks etc :)
Second (actual usable SGD) commit is 48a16bf (also shown above).
This mixes up two different changes: the CLI change/renaming and SGD. It needs to be split into 2 PRs.
@slaren ?
common/arg.cpp (outdated)
-std::set<std::string> executables = {
-    "llama-batched",
-    "llama-batched-bench",
-    "llama-bench",
-    "llama-cli",
-    "llama-convert-llama2c-to-ggml",
-    "llama-cvector-generator",
-    "llama-embedding",
-    "llama-eval-callback",
-    "llama-export-lora",
-    "llama-gen-docs",
-    "llama-gguf",
-    "llama-gguf-hash",
-    "llama-gguf-split",
-    "llama-gritlm",
-    "llama-imatrix",
-    "llama-infill",
-    "llama-mtmd-cli",
-    "llama-llava-clip-quantize-cli",
-    "llama-lookahead",
-    "llama-lookup",
-    "llama-lookup-create",
-    "llama-lookup-merge",
-    "llama-lookup-stats",
-    "llama-parallel",
-    "llama-passkey",
-    "llama-perplexity",
-    "llama-q8dot",
-    "llama-quantize",
-    "llama-qwen2vl-cli",
-    "llama-retrieval",
-    "llama-run",
-    "llama-save-load-state",
-    "llama-server",
-    "llama-simple",
-    "llama-simple-chat",
-    "llama-speculative",
-    "llama-speculative-simple",
-    "llama-tokenize",
-    "llama-tts",
-    "llama-vdot"
-};
+std::set<std::string> executables = { "llama-batched",
+                                      "llama-batched-bench",
+                                      "llama-bench",
+                                      "llama-cli",
+                                      "llama-convert-llama2c-to-ggml",
+                                      "llama-cvector-generator",
+                                      "llama-embedding",
+                                      "llama-eval-callback",
+                                      "llama-export-lora",
+                                      "llama-finetune",
+                                      "llama-gen-docs",
+                                      "llama-gguf",
+                                      "llama-gguf-hash",
+                                      "llama-gguf-split",
+                                      "llama-gritlm",
+                                      "llama-imatrix",
+                                      "llama-infill",
+                                      "llama-mtmd-cli",
+                                      "llama-llava-clip-quantize-cli",
+                                      "llama-lookahead",
+                                      "llama-lookup",
+                                      "llama-lookup-create",
+                                      "llama-lookup-merge",
+                                      "llama-lookup-stats",
+                                      "llama-parallel",
+                                      "llama-passkey",
+                                      "llama-perplexity",
+                                      "llama-q8dot",
+                                      "llama-quantize",
+                                      "llama-qwen2vl-cli",
+                                      "llama-retrieval",
+                                      "llama-run",
+                                      "llama-save-load-state",
+                                      "llama-server",
+                                      "llama-simple",
+                                      "llama-simple-chat",
+                                      "llama-speculative",
+                                      "llama-speculative-simple",
+                                      "llama-tokenize",
+                                      "llama-tts",
+                                      "llama-vdot" };
Please revert this formatting change.
i will try. recall that when i add something to a line, it gets clang-format applied using the project style file
Then please fix your environment, no one is forcing you to do that.
@@ -770,7 +814,7 @@ void ggml_opt_eval(ggml_opt_context_t opt_ctx, ggml_opt_result_t result) {
    // beta1, beta2 after applying warmup
    const float beta1h = 1.0f/(1.0f - powf(opt_pars.adamw.beta1, opt_ctx->iter));
    const float beta2h = 1.0f/(1.0f - powf(opt_pars.adamw.beta2, opt_ctx->iter));

    const float keep = 1.0f - opt_pars.adamw.alpha * opt_pars.adamw.wd;
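For context, a scalar sketch of where a factor like keep fits into a decoupled-weight-decay (AdamW-style) update; the function and variable names are illustrative, not the actual ggml kernel.

```cpp
// Illustrative scalar AdamW-style step with decoupled weight decay.
// Names are hypothetical; the real update is the ggml_opt_step_adamw op.
#include <cmath>

void adamw_step_sketch(float & w, float & m, float & v, float g,
                       float alpha, float beta1, float beta2, float eps, float wd,
                       float beta1h, float beta2h /* 1/(1 - beta^t) bias corrections */) {
    m = beta1 * m + (1.0f - beta1) * g;      // first moment
    v = beta2 * v + (1.0f - beta2) * g * g;  // second moment
    const float mhat = m * beta1h;
    const float vhat = v * beta2h;
    const float keep = 1.0f - alpha * wd;    // decoupled weight decay factor
    w = w * keep - alpha * mhat / (sqrtf(vhat) + eps);
}
```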
Optimizer steps are going to be I/O bound, and optimizing compute is not going to make a meaningful difference for the runtime of the steps; for the runtime of the total program it's completely negligible. So please revert this change; I think the other variant is easier to understand.
I agree that it's not likely to matter, but: 1. it's per parameter per epoch (ok, does seem unimportant now that I think further); 2. i'm not confident cuda CC optimizes this and was hoping to learn more - it would seem possible that w/o this we're repeatedly loading two floats instead of one; and mostly 3. this exactly follows the precedent established for beta1h and beta2h, which are stored in the tensor just as i stored this quantity.
Anyway, totally willing, just curious what you think about the existing practice of saving beta1h and beta2h in light of this opinion that we're not compute bound.
i checked it out - doesn't seem to change runtime noticeably as you predicted
My biggest concern with the code is the amount of effort needed to maintain it, particularly when it comes to debugging and asserting that the code on master works correctly. It is quite likely that I will at some point be in a situation where a user reports bad training results and I will not know whether that is due to a bug in ggml or due to bad hyperparameters or something similar. So it is very important to me that the data layout is consistent across multiple levels.
The correct way to implement the micro-optimization of pre-computing a parameter derived from the human-interpretable parameters is as follows:
- Pass the human-interpretable parameters to ggml_opt_step_adamw/ggml_opt_step_sgd.
- In the CUDA host code, pre-compute some derived parameters from the human-interpretable parameters.
- Change the CUDA device code to accept the derived parameters instead.
The way CUDA works is that the CPU schedules the GPU kernels in a CUDA stream and then waits for said stream to finish all kernels. Scheduling the kernels is of course much faster and it doesn't matter how fast you are as long as you are fast enough to keep the GPU busy. So adding a bit of overhead to the scheduling has essentially no impact on the runtime of a CUDA program even if you do it once per CUDA kernel launch instead of once per epoch.
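A hedged sketch of the host-side pre-computation pattern described above, using entirely hypothetical names rather than the real ggml-cuda interface: the host derives the per-launch constants once, and the device code only ever sees the derived values.

```cpp
// Sketch only: hypothetical names, not the actual ggml-cuda host/device split.
#include <cmath>
#include <cstdint>

struct adamw_derived_params_sketch {
    float alpha;   // learning rate, still needed for the step itself
    float eps;
    float keep;    // 1 - alpha*wd, pre-computed on the host
    float beta1h;  // 1/(1 - beta1^t)
    float beta2h;  // 1/(1 - beta2^t)
};

// Host side: derive kernel parameters from the human-interpretable ones
// right before scheduling the kernel; this cost is per launch, not per weight.
adamw_derived_params_sketch derive_params_sketch(float alpha, float beta1, float beta2,
                                                 float eps, float wd, int64_t iter) {
    adamw_derived_params_sketch p;
    p.alpha  = alpha;
    p.eps    = eps;
    p.keep   = 1.0f - alpha * wd;
    p.beta1h = 1.0f / (1.0f - powf(beta1, (float) iter));
    p.beta2h = 1.0f / (1.0f - powf(beta2, (float) iter));
    return p;
}
// The device code would then accept these derived fields (plus whatever else the
// moment updates need) instead of re-deriving them per parameter.
```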
Thanks for explaining all that, the bottom line for me is that you were right and the micro-optimization has no visible benefit in this case.
Force-pushed from 914f336 to d8c6dd2
ok, per request we are back to calling get_opt_pars(ud) twice per epoch - shouldn't be noticeable and i apologize for the churn
Force-pushed from 62f86f9 to 51867fe
support finetune arg -opt SGD (or sgd).

llama 3.2-1b-F32 result: observed 11gb gpu ram (45 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (getting the right learning rate for SGD is trickier than for adamw - too high and you overshoot+oscillate, too low and you waste compute slowly approaching convergence.) SGD (or adamw) quickly reach 99%+ train accuracy.

note: objective loss may not be directly comparable between adamw and sgd - check perplexity or accuracy, or consider relative improvements for convergence.

also, note that logical batch size > physical batch (gradient accumulation) seems unsupported for optimization (limited to physical, unlike in ppx - also limited to ctx-size). training quality/convergence could be improved by implementing it (at the cost of some memory, but you can make that up by using a much smaller physical batch for a net memory savings). presumably it's the physical batch that should be limited to ctx-size? see llama_context::opt_epoch

new finetune args: -wd 1e-9 to enable weight decay in sgd or adamw, and -epochs N for max epochs (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct - no noticeable perf benefit. cache the computed per-epoch optimizer opts (formerly they were computed twice per epoch).

add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. make ggml_opt_init aware of the optimization method; since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch, but would need to use adamw for the first (unconfirmed - no arg to set such a policy yet)

100 lines of wikipedia train:
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val: ... loss=3.91926±nan acc=58.40±2.18%

on more training data (500 lines), additional catastrophic forgetting before train reaches 99.9% accuracy:
train: data=0000140/0000140 loss=0.02611±0.00077 acc=99.82±0.02% t=00:00:45
val: data=0000008/0000008 loss=4.11112±0.22526 acc=46.36±0.78%

increasing batch+ctx sizes to 1536 (double what fits in memory for adamw) gets apparently better validation, but that could be an artifact of continuing training from previous weights, i.e. what's train vs val probably depends on batch size. also amusing - faster due to larger batch even though larger context would be slower?:
train: data=0000045/0000045 loss=0.01722±0.00103 acc=99.90±0.01% t=00:00:40
val: data=0000003/0000003 loss=1.96829±1.09488 acc=72.44±0.66%
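For reference, a scalar sketch of the SGD-with-weight-decay step being described (illustrative only; the actual implementation is the new GGML_OPT_OPTIMIZER_SGD path). The memory savings come from not needing the AdamW m and v moment tensors.

```cpp
// Illustrative plain-SGD step with optional decoupled weight decay.
// No first/second-moment (m, v) state is kept, unlike AdamW.
void sgd_step_sketch(float & w, float g, float alpha, float wd) {
    w = w * (1.0f - alpha * wd) - alpha * g;
}
```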
Unless it's something very minor, please let me resolve conversations since that is how I like to track the TODOs (from my side) in a PR.
Do not autoformat files in the same PR where you make functional changes. It creates a lot of unnecessary work for maintainers. As I said, please fix your environment to avoid doing this.
copy re: resolve.
as i said, the intention is to autoformat only the new code i add. if i accidentally changed other lines and they were affected, i'm happy to revert
.clang-format (outdated)
Why are you changing this file?
i was attempting to comply with your request to not change the formatting of a long initializer list of string literals. in my view the best way to reduce formatting make-work is to have the .clang-format match the codebase better. can revert
examples/training/finetune.cpp (outdated)
if (optimizer_params.optimizer == GGML_OPT_OPTIMIZER_SGD) {
    double was = (double) optimizer_params.common.alpha;
    double by  = 1e2;
    double to  = was * by;
    LOG_INF("sgd multiplying -lr by %.3g (no momentum) from -lr: %.2g to %.2g\n", by, was, to);
    optimizer_params.common.alpha = to;
}
Don't apply some arbitrary multiplier to the learning rate. If you want to use different defaults for AdamW and SGD, the correct way to do it would be to leave it at some placeholder value and to then replace it based on optimizer type.
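A small sketch of that placeholder approach; the names and default values here are hypothetical, not what the PR or common/arg.cpp actually uses.

```cpp
// Sketch: pick a per-optimizer default only if the user did not pass -lr.
// The sentinel and the default values are illustrative assumptions.
constexpr float LR_UNSET_SKETCH = -1.0f; // placeholder meaning "not set on the command line"

float resolve_learning_rate_sketch(float lr_arg, bool use_sgd) {
    if (lr_arg != LR_UNSET_SKETCH) {
        return lr_arg;                  // an explicit -lr always wins
    }
    return use_sgd ? 1e-3f : 1e-5f;     // separate per-optimizer defaults (illustrative values)
}
```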
do you want me to distinguish default vs explicit -lr so we can have sensible defaults for both methods, or are you just wanting this removed, period?
i removed it - users will just have to figure out an appropriate -lr for the method themselves
ggml/include/ggml-opt.h (outdated)
 struct {
     float alpha; // learning rate
-    float beta1;
-    float beta2;
+    float beta1; // adamw
+    float beta2; // adamw
     float eps;   // epsilon for numerical stability
-    float wd;    // weight decay for AdamW, use 0.0f to disable
+    float wd;    // weight decay for SGD or AdamW, use 0.0f to disable
 } adamw;
I think it's more important to use the exact same data layout for the structs in ggml-opt.cpp and the tensors for the corresponding operations defined in ggml.h than to deduplicate a few of the parameters.
Force-pushed from 1bc35c3 to 77d786f
re: 'I think it's more important to use the exact same data layout for the structs in ggml-opt.cpp and the tensors for the corresponding operations defined in ggml.h than to deduplicate a few of the parameters.'

There is something slightly strange IMO in relying on "at position 4 you will find the weight decay, just as it is in this ggml-opt.h adamw struct". The part that feels off specifically is the use of the raw constant 4 in the backends, as opposed to some kind of cast to a struct * that makes the requirement legible (see, we just memcpy, it has to be the same layout), or the use of a named integer constant/enum i_adamw_wd = 4, at which point the layout shouldn't matter.

To keep the layout correspondence you want, it seems i have to drop the 'common' idea for keeping adamw params out of sgd. I'm happy to do this (common was in response to what i thought was your request) - please confirm.
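For illustration only, a sketch of the named-index idea being proposed; the index values and names are hypothetical, not what ggml actually uses.

```cpp
// Hypothetical named indices into a flat float parameter tensor,
// so backends never hard-code "position 4" directly.
enum adamw_param_index_sketch {
    I_ADAMW_ALPHA_SKETCH = 0,
    I_ADAMW_BETA1_SKETCH = 1,
    I_ADAMW_BETA2_SKETCH = 2,
    I_ADAMW_EPS_SKETCH   = 3,
    I_ADAMW_WD_SKETCH    = 4,
};

// A backend would then read params[I_ADAMW_WD_SKETCH] instead of params[4],
// while the memory stays a plain array of floats that is easy to inspect in a debugger.
```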
It would be completely fine for you to also define structs that provide a safer way to retrieve the optimizer parameters. The scenario that is relevant for me is this: I have a debugger open and want to inspect the contents of some tensor. It is more convenient for me when I can inspect the memory of the tensor by simply casting it to float vs. having to cast it to some specific struct. If the memory layout is inconsistent across multiple points in the program that adds an additional thing that I have to keep track of during debugging (which is usually the most time-consuming part of my work).
Yes, please do that.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
any update? the memory savings are good and the behavior is what you'd expect from SGD
add to ggml-opt: a learning rate (adamw alpha) cmdline arg, and an optimizer enum defaulting to adamw, preparatory to work to support SGD

these are in common args a set of optimizer options active only for the new FINETUNE example (which includes all the previous finetune.cpp PERPLEXITY options as a precaution)

perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h, as proposed in #13835