finetune: SGD optimizer, more CLI args #13873

graehl · 2025-05-28T20:26:00Z

Add a learning rate (AdamW alpha) command-line arg to ggml-opt, plus an optimizer enum defaulting to AdamW, in preparation for SGD support.

These are added to the common args as a set of optimizer options active only for the new FINETUNE example (which, as a precaution, also includes all the PERPLEXITY options previously used by finetune.cpp).

Perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly in the args. If desired, we can instead add just the learning rate and optimizer type to a struct independent of ggml-opt.h.

As proposed in #13835.

graehl · 2025-05-28T20:31:41Z

perhaps no need to review until i have an actual SGD impl in a follow-on, @JohannesGaessler - but a few general questions about contributing:

* is it ok to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?
* is what i've got here the expected way to add example-specific command line arguments? for finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.
* were the PERPLEXITY args which i blindly added to the new FINETUNE example actually doing anything interesting? perhaps some should be dropped from finetune.
* could you direct me to a .clang-format style file that might save me from accidentally re-indenting? i know i can set up clang-format to operate only on regions i've already changed ...

WilliamTambellini

You had better keep that change as it is for now, to allow time for more feedback/approval.

JohannesGaessler · 2025-05-28T21:08:35Z

> is it ok to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?

Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.

> is what i've got here the expected way to add example-specific command line arguments? for finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.

My opinion is that people serious about training should be writing a program rather than use a command line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.

> were the PERPLEXITY args which i blindly added to the new FINETUNE example actually doing anything interesting? perhaps some should be dropped from finetune.

I don't remember whether those args were put in by me when I copypasted code or by Georgi when he later refactored it but I myself definitely did not make an intentional choice to use these exact arguments.

> could you direct me to a .clang-format style file that might save me from accidentally re-indenting? i know i can set up clang-format to operate only on regions i've already changed ...

I don't know, sorry.

WilliamTambellini · 2025-05-28T21:12:28Z

@ggerganov

JohannesGaessler

None of the previous perplexity-specific arguments are needed.

common/arg.cpp

common/common.h

ggml/include/ggml-opt.h

JohannesGaessler · 2025-05-28T21:26:17Z

To add an SGD optimizer, add a new ggml op such as OPT_STEP_SGD. Add a CPU implementation as a fallback for any backend without its own implementation. Add a CUDA implementation, since that is (I assume) the backend you intend to use in production. Add a test to tests/test-backend-ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SGD optimizer step, and make the allocation of the optimizer momenta conditional on the optimizer type.

graehl · 2025-05-29T16:15:21Z

> To add an SGD optimizer, add a new ggml op such as OPT_STEP_SGD. Add a CPU implementation as a fallback for any backend without its own implementation. Add a CUDA implementation, since that is (I assume) the backend you intend to use in production. Add a test to tests/test-backend-ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SGD optimizer step, and make the allocation of the optimizer momenta conditional on the optimizer type.

yes, will do. should the actual SGD impl be a subsequent pull req (or several, e.g. starting first w/ just CPU impl) or do you want it all in one pull req?

JohannesGaessler · 2025-05-29T16:34:10Z

Either way would be fine with me as long as there are at no point broken or unfinished features on master.

matiaslin

Looking forward to the next PR(s).

graehl · 2025-05-29T18:56:23Z

you should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (using git-clang-format). if there's something undesirable, we could figure out what in the format config causes it

JohannesGaessler · 2025-05-29T19:20:55Z

Don't autoformat code en masse unless it's done in a dedicated PR; it makes it unnecessarily difficult to track what was actually changed in a PR.

JohannesGaessler · 2025-05-29T19:25:03Z

Sorry, I didn't read the "only on lines changed in the PR" part.

graehl · 2025-05-30T16:59:33Z

Hi @WilliamTambellini @JohannesGaessler I think this is usable now, inviting code nitpicks etc :)
pretty new to the github interface honestly so let me know if this needs to be two separate PRs one for each commit or if it's reasonable to just review both commits here (obv. better to merge separately, first doesn't break any behavior, second impacts the finetune cmdline default learning rate but that should hurt no one)

graehl · 2025-05-30T17:01:47Z

Second (actual usable SGD) commit is 48a16bf (also shows above here)

WilliamTambellini

This mixes two different changes: the CLI changes/renaming and SGD. It needs to be split into 2 PRs.
@slaren ?

common/arg.cpp

examples/training/finetune.cpp

ggml/include/ggml-opt.h

ggml/src/ggml-opt.cpp

tests/test-backend-ops.cpp

graehl · 2025-08-05T15:06:58Z

> > Don't think there's anything I can currently do (please be specific if I'm mistaken, I'm new).

> Rebase YOUR branch onto master (then force-push to your branch), look at 0cc4m's changes, and cherry-pick 0cc4m's commits (or rebase his changes onto yours).

Thanks for spelling this out, that was easy - didn't squash so we can keep occam's contrib. separate but it's all rebased and you should see it here.

JohannesGaessler · 2025-08-06T08:04:32Z

> I'm not aware of anything I can do on my end to get this merged (is someone waiting on me that I'm unaware of?).

As I said, please use the human-readable parameters, and only the human-readable parameters, as the ones being passed to ggml_opt_step_sgd. If you are short on time I can take over and finish up this PR (or someone else can if they want to).

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune)

(using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching:

train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf):

train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs)
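To make the -lr-half / -lr-halvings description above concrete, here is a minimal sketch of one way such a schedule could be computed. It assumes the learning rate is halved once per half-life (measured in epochs) and that the number of halvings is capped; the function name `decayed_lr`, its parameters, and the exact formula are assumptions for illustration and may differ from what finetune.cpp actually does.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical schedule: halve the learning rate every `half_life` epochs,
// but never apply more than `max_halvings` halvings in total, so the
// terminal rate is lr0 / 2^max_halvings.
float decayed_lr(float lr0, float epoch, float half_life, float max_halvings) {
    if (half_life <= 0.0f) {
        return lr0; // treat a non-positive half-life as "decay disabled"
    }
    const float halvings = std::min(epoch / half_life, max_halvings);
    return lr0 * std::exp2(-halvings);
}
```

With `max_halvings = 3`, the rate under this sketch never falls below 1/8 of the initial value, which matches the "-lr-halvings 3" example in the commit message. Using `std::exp2` here is a choice made for the sketch; it avoids a general-purpose power function.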

graehl · 2025-08-07T02:52:53Z

> > I'm not aware of anything I can do on my end to get this merged (is someone waiting on me that I'm unaware of?).

> As I said, please use the human-readable parameters, and only the human-readable parameters, as the ones being passed to ggml_opt_step_sgd. If you are short on time I can take over and finish up this PR (or someone else can if they want to).

This is fine and done now, but I cannot be confident the vulkan end of things is correct after the change (I just haven't read up on how the vulkan API works, at all).

0cc4m · 2025-08-07T06:41:40Z

You can change what you want, once things are ready I'll do a proper review of the Vulkan parts and make sure they are okay.

JohannesGaessler · 2025-08-07T09:09:47Z

From my end I would consider this PR now essentially good to merge. So unless there is something else that is left to do I will make some cosmetic changes and rely on @0cc4m to fix Vulkan if necessary. After that I will approve and merge.

graehl · 2025-08-07T18:17:47Z

> You can change what you want, once things are ready I'll do a proper review of the Vulkan parts and make sure they are okay.

I didn't change anything at all in vulkan - it's all Greek to me :) Do take a look. Perhaps the tests weren't really running on vulkan (I had disabled them since I didn't have an impl). The change is that element [1] of the op params tensor is now sgd.wd instead of 1 - sgd.wd*sgd.alpha (element [0] is still just sgd.alpha).

0cc4m · 2025-08-07T19:40:55Z

Yeah, no worries. Here's a diff that does that change on the Vulkan shader, and removes two unnecessary preprocessor steps.

diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
index 3d5e1d98f..6426dedee 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
@@ -1,9 +1,6 @@
 #version 450
 
 #include "generic_head.comp"
-#include "types.comp"
-
-#extension GL_EXT_control_flow_attributes : enable
 
 layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;
 
@@ -19,7 +16,7 @@ void main() {
     }
 
     const float alpha = data_params[0];
-    const float keep  = data_params[1];
+    const float keep = 1.f - alpha * data_params[1];
 
     data_x[i] = data_x[i] * keep - alpha * data_grad[i];
 }

If you apply that the CI should pass again.
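For readers who do not work with Vulkan shaders, the update performed by the patched shader can be sketched as a plain C++ reference loop. The function name `sgd_step_reference` and the use of `std::vector` are assumptions for this illustration; the arithmetic mirrors the shader lines in the diff above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference version of the SGD update with weight decay:
//   x <- x * (1 - alpha * wd) - alpha * grad
// `alpha` is the learning rate and `wd` the weight decay, i.e. the two
// human-readable values stored in the op params ([0] and [1]).
void sgd_step_reference(std::vector<float> & x,
                        const std::vector<float> & grad,
                        float alpha, float wd) {
    // Derived here from alpha and wd rather than being passed in precomputed.
    const float keep = 1.0f - alpha * wd;
    for (std::size_t i = 0; i < x.size() && i < grad.size(); ++i) {
        x[i] = x[i] * keep - alpha * grad[i];
    }
}
```

Passing `wd = 0` reduces this to plain gradient descent, `x <- x - alpha * grad`.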

JohannesGaessler · 2025-08-11T22:22:54Z

I made some changes and pushed them to johannesgaessler/finelayer. The changes are almost entirely cosmetic, the biggest functional change is that I'm using ggml_format_name in ggml-opt.cpp instead of manual string manipulation (as I should have done from the start). Training of LLaMA 3.2 1b is currently not working but that is not caused by this PR.

graehl · 2025-08-13T01:19:23Z

I believe I successfully applied both. If anything else can be done to get this merged let me know.

0cc4m · 2025-08-13T07:55:39Z

Please fix the build issue in the pipelines.

JohannesGaessler · 2025-08-13T08:02:12Z

The build failures are my fault. I don't know why, but for some reason std::powf is not available everywhere. I pushed a fix to my repository which just uses std::pow again; it's not performance relevant. As I alluded to in an added comment, the code should be refactored to use logf and expf anyways.

0cc4m

The Vulkan part is fine.

JohannesGaessler · 2025-08-14T10:04:52Z

Thanks for the work and the persistence, everyone. For bookkeeping I changed the title/commit message to also mention SGD.

ggerganov · 2025-08-14T11:29:20Z

I think this PR broke the SYCL build:

https://github.com/ggml-org/ci/blob/results/llama.cpp/22/67133881dd1425bd60d541c04a0e2bfd5a9090/ggml-6-x86-sycl/stdall#L688

Maybe just need to update the "supports" function

WilliamTambellini · 2025-08-14T16:29:07Z

Tks @JohannesGaessler

* examples/finetune -opt SGD (stochastic gradient descent) memory opt
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

graehl requested a review from JohannesGaessler as a code owner May 28, 2025 20:26

github-actions bot added the examples and ggml labels May 28, 2025

WilliamTambellini approved these changes May 28, 2025

JohannesGaessler reviewed May 28, 2025

graehl force-pushed the finelayer branch 2 times, most recently from e752031 to e689af8 Compare May 29, 2025 17:07

matiaslin reviewed May 29, 2025

graehl force-pushed the finelayer branch from e689af8 to aa59aa3 Compare May 29, 2025 18:42

graehl force-pushed the finelayer branch from 3f6b262 to b3be58d Compare May 30, 2025 08:04

github-actions bot added the build, testing, and Nvidia GPU labels May 30, 2025

graehl force-pushed the finelayer branch 3 times, most recently from 7534bbf to 48a16bf Compare May 30, 2025 16:57

WilliamTambellini suggested changes May 30, 2025

JohannesGaessler reviewed May 30, 2025

graehl force-pushed the finelayer branch from 48a16bf to 96c3988 Compare May 30, 2025 18:59

graehl and others added 2 commits August 6, 2025 19:32

Vulkan: Implement GGML_OP_OPT_STEP_SGD

12d5b75

graehl force-pushed the finelayer branch from c172f59 to 19e7409 Compare August 7, 2025 02:34

tests: Fix OPT_STEP_SGD test-backend-ops

71ffb4b

graehl force-pushed the finelayer branch from 19e7409 to 189504e Compare August 7, 2025 02:50

SGD op param store weight-decay and not 1-alpha*wd

6efae31

graehl force-pushed the finelayer branch from 189504e to 6efae31 Compare August 7, 2025 02:52

JohannesGaessler and others added 2 commits August 12, 2025 18:18

minor + cosmetic changes

283401d

fix vulkan sgd

bb6d2e7

try CI fix

51f11bb

0cc4m approved these changes Aug 14, 2025

JohannesGaessler merged commit 5cdb27e into ggml-org:master Aug 14, 2025
47 checks passed

JohannesGaessler changed the title ~~finetune.cpp command-line arg~~ finetune: SGD optimizer, more CLI args Aug 14, 2025

JohannesGaessler mentioned this pull request Aug 14, 2025

test-opt: fix backend support check #15317

Merged

netrunnereve mentioned this pull request Aug 21, 2025

Misc. bug: Vulkan test-opt test_gradient_accumulation failures on AMD #15491

Closed
