Add `effective_batch_size` to auto-adjust gradient accumulation #3533
Conversation
"The effective batch size is the total number of samples used to compute a single gradient update " | ||
"to the model weights. This differs from `batch_size` by taking `gradient_accumulation_steps` and number " | ||
"of training worker processes into account. In practice, " | ||
"`effective_batch_size = batch_size * gradient_accumulation_steps * num_workers`. " | ||
"If 'auto', the effective batch size is derivied implicitly from `batch_size`, but if set explicitly, then " | ||
"one of `batch_size` or `gradient_accumulation_steps` must be set to something other than 'auto', and " | ||
"consequently will be set following the formula given above." | ||
), |
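A minimal sketch of how `gradient_accumulation_steps` could be back-solved from an explicit `effective_batch_size` using the formula above (the helper name and the ceiling/rounding behavior are assumptions for illustration, not Ludwig's actual implementation):

```python
import math


def derive_grad_accum_steps(effective_batch_size: int, batch_size: int, num_workers: int) -> int:
    """Hypothetical helper: back-solve gradient_accumulation_steps from
    effective_batch_size = batch_size * gradient_accumulation_steps * num_workers."""
    # Round up so the realized effective batch size is at least the requested one,
    # and never drop below a single accumulation step.
    return max(1, math.ceil(effective_batch_size / (batch_size * num_workers)))


# Example: per-worker batch size of 4 on 2 workers, targeting an effective batch size of 64.
assert derive_grad_accum_steps(64, batch_size=4, num_workers=2) == 8
```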
Nice, this is a very clear description of how this works
if not self.config_obj.trainer.can_tune_batch_size():
    # Models like GBMs don't have batch sizes to be tuned
    return
Is it worth logging a message here to indicate this very clearly?
This code path always gets executed regardless of user config, so I wouldn't add a message. Would likely confuse the user.
LGTM! This is an awesome change.
One thing I wanted to confirm: what is the exact behavior when `batch_size`, `effective_batch_size`, and `gradient_accumulation_steps` are all set to `auto`? Is it that we don't update `batch_size` or `gradient_accumulation_steps` on the first call to `self.config_obj.trainer.update_batch_size_grad_accum(num_workers)`, then actually perform batch size tuning (since `batch_size` is `auto`) to set it to a concrete number, and now that we have both the number of workers and the batch size, we recompute the batch size (which doesn't change) but update gradient accumulation steps to 1 while leaving the effective batch size as `auto`? And in the case where `effective_batch_size` is not `auto`, we calculate `gradient_accumulation_steps` because we have `batch_size`, the number of workers, and the desired effective batch size?
@arnavgarg1 if everything is set to
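Purely as an illustration of the resolution order the question above describes (the `TrainerParams` stand-in and `resolve_batch_params` helper below are assumptions made for the sketch, not Ludwig's actual schema class or the confirmed behavior of `update_batch_size_grad_accum`):

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TrainerParams:
    # Stand-in for the relevant trainer fields; not Ludwig's actual schema class.
    batch_size: Union[int, str] = "auto"
    effective_batch_size: Union[int, str] = "auto"
    gradient_accumulation_steps: Union[int, str] = "auto"


def resolve_batch_params(params: TrainerParams, num_workers: int, tuned_batch_size: int) -> None:
    """Illustrative resolution order following the reading in the question above."""
    if params.batch_size == "auto":
        # Batch size tuning has already produced a concrete per-worker number.
        params.batch_size = tuned_batch_size

    if params.effective_batch_size == "auto":
        # Nothing requested explicitly: fall back to a single accumulation step.
        params.gradient_accumulation_steps = 1
    else:
        # Back-solve accumulation steps so that
        # batch_size * gradient_accumulation_steps * num_workers ~= effective_batch_size.
        per_step = params.batch_size * num_workers
        params.gradient_accumulation_steps = max(1, params.effective_batch_size // per_step)
```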
For LLM fine-tuning, it's often the case that the batch size per GPU (`batch_size` in Ludwig) is very small (1 or 2), but the ideal batch size for model convergence is 32 or so. In these cases, we want to use gradient accumulation to compensate for the low batch size. Additionally, adding training workers with DeepSpeed further increases the effective batch size, meaning the user needs to do a lot of quick math to figure out how to set `gradient_accumulation_steps`.

This PR adds a new trainer param called `effective_batch_size` that auto-adjusts gradient accumulation based on this value.
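To make the quick math concrete (the numbers here are illustrative, not taken from the PR): with `batch_size: 2`, 4 DeepSpeed training workers, and a desired `effective_batch_size: 32`, the trainer would need `gradient_accumulation_steps = 32 / (2 * 4) = 4`, which is exactly the calculation this param now performs for the user.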