Add effective_batch_size to auto-adjust gradient accumulation #3533

Merged: 29 commits merged into master on Aug 23, 2023

Conversation

tgaddair
Collaborator

For LLM fine-tuning, it's often the case that the batch size per GPU (batch_size in Ludwig) is very small (1 or 2), but the ideal batch size for model convergence is 32 or so. In these cases, we want to use gradient accumulation to compensate for the low batch size. Additionally, adding training workers with DeepSpeed further increases the effective batch size, meaning the user needs to do a lot of quick math to figure out how to set the gradient_accumulation_steps.

This PR adds a new trainer param called effective_batch_size that auto-adjusts gradient accumulation based on this value.
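To make the "quick math" concrete, here is a small sketch of the formula quoted in the parameter description (`effective_batch_size = batch_size * gradient_accumulation_steps * num_workers`). The helper name `compute_effective_batch_size` and the specific numbers are illustrative assumptions, not code from this PR.

```python
def compute_effective_batch_size(
    batch_size: int, gradient_accumulation_steps: int, num_workers: int
) -> int:
    """Samples contributing to one weight update across all workers."""
    return batch_size * gradient_accumulation_steps * num_workers


# One GPU, per-GPU batch size 2: 16 accumulation steps reach a target of 32.
print(compute_effective_batch_size(2, 16, 1))  # 32

# Scaling to 4 workers without touching the config quadruples the value.
print(compute_effective_batch_size(2, 16, 4))  # 128

# To stay at 32 on 4 workers, gradient_accumulation_steps must drop to 4.
print(compute_effective_batch_size(2, 4, 4))  # 32
```

This is the bookkeeping a user would otherwise redo by hand each time the worker count changes.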

@github-actions

github-actions bot commented Aug 15, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   1h 6m 47s ⏱️ - 2m 38s
34 tests ±0  29 ✔️ ±0    5 💤 ±0  0 ±0 
88 runs  ±0  72 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 49d264a. ± Comparison against base commit 090918d.

♻️ This comment has been updated with latest results.

Comment on lines +193 to +200
"The effective batch size is the total number of samples used to compute a single gradient update "
"to the model weights. This differs from `batch_size` by taking `gradient_accumulation_steps` and number "
"of training worker processes into account. In practice, "
"`effective_batch_size = batch_size * gradient_accumulation_steps * num_workers`. "
"If 'auto', the effective batch size is derived implicitly from `batch_size`, but if set explicitly, then "
"one of `batch_size` or `gradient_accumulation_steps` must be set to something other than 'auto', and "
"consequently will be set following the formula given above."
),
Contributor

Nice, this is a very clear description of how this works
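As a rough illustration of the rule in that description, the sketch below fills in whichever of `batch_size` or `gradient_accumulation_steps` is left as 'auto' once `effective_batch_size` is given explicitly. The function name `solve_for_missing`, the `ValueError` when both are 'auto', and the floor-with-minimum-of-1 rounding are all assumptions made for the example; the PR's actual implementation may differ.

```python
from typing import Tuple, Union

AUTO = "auto"
IntOrAuto = Union[int, str]


def solve_for_missing(
    effective_batch_size: int,
    batch_size: IntOrAuto,
    gradient_accumulation_steps: IntOrAuto,
    num_workers: int,
) -> Tuple[IntOrAuto, IntOrAuto]:
    """Fill in whichever of the two settings is 'auto' using the formula."""
    if batch_size == AUTO and gradient_accumulation_steps == AUTO:
        # Assumption: mirror the description's "one of ... must be set".
        raise ValueError("set batch_size or gradient_accumulation_steps explicitly")
    if gradient_accumulation_steps == AUTO:
        # Assumption: round down, but never below one accumulation step.
        gradient_accumulation_steps = max(
            effective_batch_size // (batch_size * num_workers), 1
        )
    elif batch_size == AUTO:
        batch_size = max(
            effective_batch_size // (gradient_accumulation_steps * num_workers), 1
        )
    return batch_size, gradient_accumulation_steps


print(solve_for_missing(32, 2, AUTO, num_workers=4))  # (2, 4)
print(solve_for_missing(32, AUTO, 8, num_workers=2))  # (2, 8)
```

When both values are already explicit, the sketch returns them unchanged and does not check that they multiply out to the requested effective batch size.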

Comment on lines +799 to +801
if not self.config_obj.trainer.can_tune_batch_size():
    # Models like GBMs don't have batch sizes to be tuned
    return
Contributor

Is it worth logging a message here to indicate this very clearly?

Collaborator Author

This code path always gets executed regardless of user config, so I wouldn't add a message. Would likely confuse the user.

Contributor

@arnavgarg1 arnavgarg1 left a comment

LGTM! This is an awesome change.

One thing I wanted to confirm: what is the exact behavior when batch_size, effective_batch_size, and gradient_accumulation_steps are all set to auto? Is it that we don't update batch_size and gradient_accumulation_steps on the first call to self.config_obj.trainer.update_batch_size_grad_accum(num_workers), then run batch size tuning (since batch_size is auto) to set it to an actual number, and then, now that we have the number of workers and the batch size, recompute the batch size (which doesn't change) and set gradient_accumulation_steps to 1 while leaving effective_batch_size as auto? And in the case where effective_batch_size is not auto, do we calculate gradient_accumulation_steps because we have batch_size, num_workers, and the desired effective batch size?

@tgaddair
Collaborator Author

@arnavgarg1 if everything is set to auto (the default), then it's equivalent to the current behavior where we just set the batch size to maximize GPU utilization and then set gradient accumulation to 1.
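The all-auto flow discussed in this exchange can be sketched as two update passes around a tuning step. Everything below is a hypothetical stand-in: `TrainerSettings`, its `update` method, and `tune_batch_size` are invented for illustration and are not Ludwig's classes or the PR's `update_batch_size_grad_accum` implementation. The branch for an explicit `effective_batch_size` is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Union

AUTO = "auto"


@dataclass
class TrainerSettings:
    """Hypothetical stand-in for a trainer config; all fields default to 'auto'."""

    batch_size: Union[int, str] = AUTO
    gradient_accumulation_steps: Union[int, str] = AUTO
    effective_batch_size: Union[int, str] = AUTO

    def update(self, num_workers: int) -> None:
        if self.batch_size == AUTO:
            # Nothing to derive until tuning has picked a batch size.
            return
        if self.gradient_accumulation_steps == AUTO:
            if self.effective_batch_size == AUTO:
                # Default described in the thread: no accumulation.
                self.gradient_accumulation_steps = 1
            else:
                # Assumption: solve the formula, rounding down to at least 1.
                self.gradient_accumulation_steps = max(
                    self.effective_batch_size // (self.batch_size * num_workers), 1
                )


def tune_batch_size() -> int:
    """Placeholder for batch size tuning; returns a fixed value here."""
    return 8


settings = TrainerSettings()
settings.update(num_workers=4)           # first pass: batch_size unknown, no-op
settings.batch_size = tune_batch_size()  # tuning fills in a concrete batch size
settings.update(num_workers=4)           # second pass: accumulation becomes 1
print(settings.batch_size, settings.gradient_accumulation_steps)
```

In this sketch `effective_batch_size` stays 'auto' throughout, matching the reply that the default is equivalent to tuning the batch size and using a single accumulation step.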

@tgaddair tgaddair merged commit 8d4c96b into master Aug 23, 2023
@tgaddair tgaddair deleted the total-bs branch August 23, 2023 16:21