Support logging_ratio, save_ratio, and eval_ratio (like for warmup_ratio) #23171

Closed
konstantinjdobler opened this issue May 5, 2023 · 6 comments · Fixed by #23235

@konstantinjdobler
Contributor

konstantinjdobler commented May 5, 2023

Feature request

I would love it if TrainingArguments and the Hugging Face Trainer supported logging_ratio, save_ratio, and eval_ratio arguments (complementing logging_steps, save_steps, and eval_steps). If a *_ratio argument is set to e.g. 0.1, logging/saving/eval would be done every 0.1 * total_training_steps. This is already the case for warmup_ratio and warmup_steps.

Motivation

When dealing with many different tasks and datasets, it can be frustrating to have to calculate different appropriate logging_steps etc. for each individual dataset. This proposal would enable a unified, simple and concise way to solve this problem.

Your contribution

I realize this might not be trivial to fully integrate, but hopefully, we can take warmup_steps and warmup_ratio as a reference. Depending on how deep the required changes are, I can also submit a PR (with some pointers on what to look out for).

@sgugger
Collaborator

sgugger commented May 5, 2023

We already have 96 training arguments though, and that would make three more for all users to learn :-/

@konstantinjdobler
Contributor Author

konstantinjdobler commented May 5, 2023

Another option would be to allow float inputs for logging_steps, save_steps, and eval_steps and interpret all inputs < 1.0 as a ratio (and throw an error if an input > 1.0 is not an integer).

But yes, there is a tradeoff with adding too much complexity.
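
A minimal sketch of that interpretation (to_absolute_steps and its behavior are illustrative assumptions, not actual Trainer code):

def to_absolute_steps(value, max_steps):
    """Interpret values in (0, 1) as a fraction of max_steps; otherwise require an integer."""
    if 0 < value < 1:
        # fractional values are a ratio of the total number of training steps
        return max(int(value * max_steps), 1)
    if value != int(value):
        raise ValueError(f"A step interval larger than 1 must be an integer, got {value}.")
    return int(value)

print(to_absolute_steps(0.1, 12_000))  # -> 1200 (log/save/eval every 1200 steps)
print(to_absolute_steps(500, 12_000))  # -> 500 (plain integer interval, unchanged)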

@sgugger
Collaborator

sgugger commented May 5, 2023

I would prefer that solution actually, even if the naming is not perfect.

@konstantinjdobler
Contributor Author

Going over the code a bit, it seems like we would have to wait until after this if statement in _inner_training_loop to set the correct values based on the max_steps that gets calculated there, plus add some additional guards in the __post_init__ of TrainingArguments.

if has_length(train_dataloader):
    len_dataloader = len(train_dataloader)
    num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
    num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1)
    num_examples = self.num_examples(train_dataloader)
    if args.max_steps > 0:
        max_steps = args.max_steps
        num_train_epochs = args.max_steps // num_update_steps_per_epoch + int(
            args.max_steps % num_update_steps_per_epoch > 0
        )
        # May be slightly incorrect if the last batch in the training dataloader has a smaller size but it's
        # the best we can do.
        num_train_samples = args.max_steps * total_train_batch_size
    else:
        max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)
        num_train_epochs = math.ceil(args.num_train_epochs)
        num_train_samples = self.num_examples(train_dataloader) * args.num_train_epochs
elif args.max_steps > 0:  # Rely on max_steps when dataloader does not have a working size
    max_steps = args.max_steps
    # Setting a very large number of epochs so we go as many times as necessary over the iterator.
    num_train_epochs = sys.maxsize
    num_update_steps_per_epoch = max_steps
    num_examples = total_train_batch_size * args.max_steps
    num_train_samples = args.max_steps * total_train_batch_size
else:
    raise ValueError(
        "args.max_steps must be set to a positive value if dataloader does not have a length, was"
        f" {args.max_steps}"
    )
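
A minimal sketch of the follow-up step, assuming the fractional intervals are resolved right after max_steps is known; overwriting args in place is purely illustrative and not necessarily the final design:

import math

# Sketch only: once max_steps is computed above, turn fractional step
# intervals (ratios in (0, 1)) into absolute step counts.
for attr in ("logging_steps", "eval_steps", "save_steps"):
    value = getattr(args, attr)
    if value is not None and 0 < value < 1:
        setattr(args, attr, max(math.ceil(value * max_steps), 1))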

At first glance, it looks like the setting of when things get logged/saved/evaluated in DefaultFlowCallback should work out-of-the-box with this change.
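
For context, the step-based check in DefaultFlowCallback is roughly a modulo test against the configured interval (simplified paraphrase from memory, not the exact source), so it keeps working once the ratio has been converted to an absolute step count:

# Simplified paraphrase of the interval check in DefaultFlowCallback.on_step_end;
# names follow the transformers callback API but the exact code may differ.
if args.logging_strategy == IntervalStrategy.STEPS and state.global_step % args.logging_steps == 0:
    control.should_log = True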
I'm willing to contribute the changes once I find some time. Does the general plan sound reasonable?

@sgugger
Collaborator

sgugger commented May 6, 2023

Yes it does. I'll be looking forward to your PR!

@lava18

lava18 commented Feb 5, 2024

Can someone clarify whether this ratio means a percentage of the training steps per epoch or of the total training steps? If the latter, how do we know in advance the total number of epochs (or total number of training steps) that the model is going to train for?
