
[trainer] a consistent way to limit the number of items #9801

Closed
stas00 opened this issue Jan 26, 2021 · 30 comments

@stas00
Contributor

stas00 commented Jan 26, 2021

🚀 Feature request

We have:

  1. finetune_trainer.py has
    n_train: Optional[int] = field(default=-1, metadata={"help": "# training examples. -1 means use all."})
    n_val: Optional[int] = field(default=-1, metadata={"help": "# validation examples. -1 means use all."})
    n_test: Optional[int] = field(default=-1, metadata={"help": "# test examples. -1 means use all."})
  2. some other run_* scripts use --n_obs
  3. --max_steps in the main Trainer, which works only on the train_dataset; there is no way to limit the number of items in eval_dataset

Requests/Questions:

  1. How does one use --max_steps if one needs to use a different number of items for train and eval?
  2. Can we have a consistent way across examples to do this same thing?

Thank you.

@sgugger

stas00 changed the title from "[trainer] ability to limit the number of items" to "[trainer] a consistent way to limit the number of items" on Jan 26, 2021
@sgugger
Collaborator

sgugger commented Jan 26, 2021

Mmm, which scripts use n_obs? I don't remember seeing this one in the officially maintained examples.

--max_steps is different from n_train/n_val/n_test: --max_steps runs training for max_steps steps, using the full training set. --n_train restricts the training set to its first n_train samples. The first has its place inside Trainer for obvious reasons; the second is part of the preprocessing of the training (or eval/test) dataset, so I don't think it has its place in Trainer.
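
For illustration, a minimal sketch of the two approaches (assuming train_dataset is a datasets.Dataset; the dataset name and the numbers are arbitrary assumptions):

from datasets import load_dataset
from transformers import TrainingArguments

train_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# n_train-style limiting: truncate the dataset itself during preprocessing
train_dataset = train_dataset.select(range(1000))

# max_steps-style limiting: cap the number of optimizer steps, while the full dataset stays available
training_args = TrainingArguments(output_dir="output", max_steps=500)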

As for a consistent way to do this in all examples, it doesn't really matter in non-seq2seq scripts, as their evaluation runs quite fast. I imagine those arguments were originally introduced in the seq2seq script because its evaluation is super long. We can add them to other scripts on an as-needed basis, but I haven't felt the need to do this.

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Mmm, which scripts use n_obs? I don't remember seeing this one in the officially maintained examples.

all of the seq2seq/run_*.py scripts

--max_steps is different from n_train/n_val/n_test: --max_steps runs training for max_steps steps, using the full training set. --n_train restricts the training set to its first n_train samples. The first has its place inside Trainer for obvious reasons; the second is part of the preprocessing of the training (or eval/test) dataset, so I don't think it has its place in Trainer.

Right, so this confusion leads to an incorrect benchmark. That's what I suspected last night, but it was too late to check.
#9371 (comment)

We need a way to truncate the dataset to an identical size so that we can compare, say, a 1-GPU vs. 2-GPU benchmark on the same total number of input samples.

So how do we currently do that with other scripts that aren't finetune_trainer.py?

As for a consistent way to do this in all examples, it doesn't really matter in non-seq2seq scripts, as their evaluation runs quite fast. I imagine those arguments were originally introduced in the seq2seq script because its evaluation is super long. We can add them to other scripts on an as-needed basis, but I haven't felt the need to do this.

Fast? Try run_clm.py with gpt2 on wiki: it takes multiple hours.
e.g. see: #9371 (comment)

@sgugger
Collaborator

sgugger commented Jan 26, 2021

all of the seq2seq/run_*.py scripts

Those are not officially maintained examples, except for the new run_seq2seq. No one has really touched them since Sam left and they are in need of cleanup ;-)

Fast? Try run_clm.py with gpt2 on wiki: it takes multiple hours. e.g. see: #9371 (comment)

You are pointing to a comment that does not contain any evaluation, so I stand by what I said. Evaluation on wikitext-2 runs in a couple of seconds.

We need a way to truncate the dataset to an identical size so that we can compare, say, a 1-GPU vs. 2-GPU benchmark on the same total number of input samples.

Like I said, if it's needed it can be added.

So how do we currently do that with other scripts that aren't finetune_trainer.py?

By opening a PR adding this ;-)

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Thank you for clarifying which is which, @sgugger

OK, so what should we call a new flag in the HF Trainer that would be equivalent to --n_train? Or should we use the same name?

Do you suggest it should be train-specific?

@sgugger
Collaborator

sgugger commented Jan 26, 2021

I think it should be in the scripts, not the Trainer, as it's part of the preprocessing. I don't think it should be train-specific; we can cover eval/test like in the finetune_trainer script.

@stas00
Contributor Author

stas00 commented Jan 26, 2021

But then we have to change all the scripts. Why not have an option to truncate the dataset at the Trainer level and solve it at once for all scripts?

@sgugger
Collaborator

sgugger commented Jan 26, 2021

Because it doesn't have much to do with the Trainer itself, IMO. It's like putting all the tokenization arguments of all the scripts in the Trainer; it doesn't really make sense, as the Trainer is supposed to take over after the data preprocessing.

Let's see if @LysandreJik and @patrickvonplaten think differently maybe?

@stas00
Contributor Author

stas00 commented Jan 26, 2021

That makes sense. Then perhaps we could have a Trainer subclass that all scripts can tap into?

Also, may I suggest that --max_steps is an ambiguous argument, as it tells the user nothing about whether it is per GPU or for the whole run?

@sgugger
Collaborator

sgugger commented Jan 26, 2021

The documentation says it is the number of training steps. I don't see how the number of GPUs intervenes here, as a training step is the full combination of forward, backward (perhaps multiple times if gradient accumulation is activated) and optimizer step.

One training step can consume a different number of training samples depending on the number of GPUs, but also on the batch size, gradient accumulation steps, etc. This information is logged at the beginning of training (logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}") in Trainer.train).
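
For illustration only, a rough sketch of that arithmetic (not taken verbatim from the Trainer source; all values below are made-up assumptions):

# approximate number of samples consumed when training stops after max_steps optimizer steps
per_device_train_batch_size = 8
n_gpu = 2
gradient_accumulation_steps = 4
max_steps = 500

total_train_batch_size = per_device_train_batch_size * n_gpu * gradient_accumulation_steps  # 64
samples_seen = max_steps * total_train_batch_size  # 32000, so it changes with the number of GPUs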

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Right, so what you're saying is that --max_steps is just the wrong tool for the truncating job and we need an explicit --use-that-many-total-train-records.

Honestly, I have been staring at all these different trainer options for a long time now and I still get confused about which is which, and which are impacted by the number of GPUs and which aren't. Every time this happens I have to go through the source code to see how it's used, and then I get it. To me, some of these arg names are hard to make sense of in a multi-GPU vs. single-GPU environment.

  • --per_device_train_batch_size is loud and clear.
  • --max_steps is not.

I propose we use total and per_device prefixes for any cl arg that behaves differently depending on the number of GPUs.

@sgugger
Collaborator

sgugger commented Jan 26, 2021

The problem is that this would then be a breaking change. I'm not necessarily super fond of the name max_steps myself, but I'm not sure it's worth going through the trouble of a deprecation cycle for this one.

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Do you think it's actually used a lot?

I agree with avoiding breaking changes, but since we are trying to make the API intuitive, such changes will, in the long run, benefit a much larger community than the annoyance they'd cause to those who use the current names right now.

I think the main issue we have here is that all these rename proposals happen ad hoc. Instead, I think it'd make sense for a group of us to sit down, review all the cl args and do a single adjustment. Surely, this won't guarantee that in the future we won't find we missed something, but it's definitely better than doing it a little bit at a time, which is much more annoying.

In some previous projects we also had a back-compat mode for such things, which, once enabled, supported a whole bunch of old behaviors until the user was ready to shift to the new code. Surely a rename of a cl arg could easily be supported by such a feature. So here, instead of a deprecation cycle per item, the approach would be to keep anything old around, but only if it's loaded from a helper module, so that the main code remains clean of deprecated things. This was in a different programming environment, though, so I will have to think about how to do the same here.

@sgugger
Collaborator

sgugger commented Jan 26, 2021

Note that this is not just a cl arg rename, since TrainingArguments is also a public class that users may very well use directly in their code (you need to instantiate one each time you use a Trainer). We can certainly have a discussion around the arguments and decide which ones we want to rename, though it should be in a separate issue. We're starting to derail this one ;-)

And from the issues, I'd say that half the users use num_train_epochs and half use max_steps to control the length of their training, so it is used a lot.

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Thank you for flagging that we are diverging from the topic at hand, @sgugger
As you suggested I opened a new one: #9821

And thank you for confirming that these are used a lot.

@stas00
Contributor Author

stas00 commented Jan 26, 2021

Because it doesn't have much to do with the Trainer itself, IMO. It's like putting all the tokenization arguments of all the scripts in the Trainer; it doesn't really make sense, as the Trainer is supposed to take over after the data preprocessing.

Let's see if @LysandreJik and @patrickvonplaten think differently maybe?

So, for the benefit of reviewers, and to bring us back to the focus of this issue: I proposed having a cl arg that truncates the dataset (train, and possibly the other splits) to a total number of samples (total, not per GPU), consistently across all example scripts.

@sgugger correctly suggested that perhaps this shouldn't belong in the Trainer, and I then suggested that perhaps there should be a subclass that applies such small tweaks consistently across all example scripts, rather than manually replicating the same code, which often leads to the scripts diverging.

Plus, @sgugger points out that the examples/seq2seq/run_*.py scripts haven't yet been converted to the new approach.

@patrickvonplaten
Contributor

patrickvonplaten commented Jan 29, 2021

I always thought that max_steps defines the total number of weight update steps (which is then not really influenced by other parameters such as the number of GPUs or gradient_accumulation_steps or whatever). To me it defines: "How often do I want to update my weights?" Or am I wrong here? I think the name is clear and does not need to be changed; the documentation could be updated with a sentence that makes clear that max_steps = number of weight updates. Also, I use this arg quite often when training and think it's important to keep.

I agree with @sgugger here that a --max_num_train_samples arg (or whatever the name) should not go into the Trainer, but should be added to all example scripts. It's actually incredibly easy to do this with datasets:

from datasets import load_dataset
ds = load_dataset("crime_and_punish", split="train")
ds = ds.select(range(arg.max_num_train_samples))  # keep only the first max_num_train_samples examples

I'm totally fine with having this as another cl arg for the scripts, but I don't think it's the responsibility of the Trainer.

@patil-suraj
Contributor

patil-suraj commented Jan 29, 2021

I agree with Sylvain and Patrick about max_steps.

And for controlling the number of examples, this should go in the scripts rather than the Trainer, as we do all the preprocessing in the scripts. We could add two arguments to DataTrainingArguments in every script.
--max_train_samples = number of training examples
--max_val_samples = number of validation examples

These args are already there in the new run_seq2seq.py script.
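
For illustration, a minimal sketch of what such fields might look like, modeled on the finetune_trainer.py fields quoted at the top of this issue (the exact names and defaults used in run_seq2seq.py may differ):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataTrainingArguments:
    max_train_samples: Optional[int] = field(
        default=None, metadata={"help": "Truncate the training set to this many examples. None means use all."}
    )
    max_val_samples: Optional[int] = field(
        default=None, metadata={"help": "Truncate the validation set to this many examples. None means use all."}
    )

# later in the script, after the dataset is loaded:
# if data_args.max_train_samples is not None:
#     train_dataset = train_dataset.select(range(data_args.max_train_samples))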

@stas00
Contributor Author

stas00 commented Jan 29, 2021

Thank you for your input, guys. Your suggestions work for me.

We could add two arguments to DataTrainingArguments in every script.
--max_train_samples = number of training examples
--max_val_samples = number of validation examples

These args are already there in the new run_seq2seq.py script.

but not in other run_*.py scripts.

And then we have test too, at least in finetune_trainer.py.

I proposed having a Trainer subclass that implements this for all scripts vs. repeating the same cl arg definition and code anew in every script (and forgetting to sync some of them) - could you please address that?

The other slight confusion across some scripts is val vs. eval - it's inconsistent: some reports say val, others eval. train/val/test are splits and are orthogonal to train/evaluate/predict, and while the two line up for train, the rest are just confusing, since you can run predict on the val split and evaluate on the test split. Should we discuss this in a separate issue?

@sgugger
Collaborator

sgugger commented Jan 29, 2021

I proposed having a Trainer subclass that implements this for all scripts vs. repeating the same cl arg definition and code anew in every script (and forgetting to sync some of them) - could you please address that?

I don't think this is a good idea personally. The goal of the scripts is to provide examples for our users. Having examples that don't use the main object of the library is counterproductive. It's one other instance where we have to bear the burden of duplicate code to make the user experience easier IMO.

The other slight confusion across some scripts is val vs. eval - it's inconsistent: some reports say val, others eval. train/val/test are splits and are orthogonal to train/evaluate/predict, and while the two line up for train, the rest are just confusing, since you can run predict on the val split and evaluate on the test split. Should we discuss this in a separate issue?

I think this is mostly finetune_trainer (and maybe run_seq2seq, since I may have copied some names) not using the same terminology as the other scripts. So those two scripts should get aligned with the rest on this matter. Again, let's keep the examples simple (I feel like I'm repeating this all day long, but they are just examples; we cannot have scripts that solve every use case, and trying to do so makes them hard to understand for our users) and match train/eval/test with what is done (training/evaluation/predict).

@stas00
Contributor Author

stas00 commented Jan 29, 2021

I proposed having a Trainer subclass that implements this for all scripts vs. repeating the same cl arg definition and code anew in every script (and forgetting to sync some of them) - could you please address that?

I don't think this is a good idea personally. The goal of the scripts is to provide examples for our users. Having examples that don't use the main object of the library is counterproductive. It's one other instance where we have to bear the burden of duplicate code to make the user experience easier IMO.

You're correct. I didn't think of that.

So we have a conflict here between example scripts and them being used for more than that.

I, for one, need a solid set of scripts to do:

  1. integration validation
  2. benchmarking

In the absence of these I have been heavily relying on the example scripts. And this is probably where the conflict is.

So I keep on bringing this up: should we have a set of scripts that are not examples, but real production workhorses, and treat them as such? Perhaps they could have much less functionality, but do it consistently across different domains and keep things simple?

Perhaps, instead of run_(foo|bar|tar).py, it could be one script that can tap into any of these domains with a simple, identical set of cl args, where all we change is the model name and most other args stay almost the same.

The other slight confusion across some scripts is val vs. eval - it's inconsistent: some reports say val, others eval. train/val/test are splits and are orthogonal to train/evaluate/predict, and while the two line up for train, the rest are just confusing, since you can run predict on the val split and evaluate on the test split. Should we discuss this in a separate issue?

I think this is mostly finetune_trainer (and maybe run_seq2seq, since I may have copied some names) not using the same terminology as the other scripts. So those two scripts should get aligned with the rest on this matter. Again, let's keep the examples simple (I feel like I'm repeating this all day long, but they are just examples; we cannot have scripts that solve every use case, and trying to do so makes them hard to understand for our users) and match train/eval/test with what is done (training/evaluation/predict).

You're absolutely correct, please see my response in the comment above.

@sgugger
Collaborator

sgugger commented Jan 29, 2021

So I keep on bringing this up: should we have a set of scripts that are not examples, but real production workhorses, and treat them as such? Perhaps they could have much less functionality, but do it consistently across different domains and keep things simple?

If the basic examples do not suffice, then yes, definitely.

@stas00
Contributor Author

stas00 commented Jan 29, 2021

But we are walking in circles. If these are examples and they are treated as examples, then they aren't tools to be relied upon. I hope you can see the irony...

I need a solid tool that will not change its API, so that I can do all the benchmarks with it, go back to benchmarks from 6 months or a year ago, and be able to re-run and re-check them.

@sgugger
Collaborator

sgugger commented Jan 29, 2021

I'm not sure why you say we are walking in circles. I just said yes to having benchmark-specific scripts if the examples do not have all the functionality you need.

@stas00
Contributor Author

stas00 commented Jan 29, 2021

I see what you mean. But you asked a tricky question: can I figure out how to use the example scripts to meet my needs? Mostly yes. But then every time I ask for something that ensures consistency, you say the audience is wrong, that these are meant for users. And I say, yes, of course, you're right. And we end up nowhere. Do you see where the circle is?

Ideally there should be just one benchmarking tool that can handle any model (or at least the majority of them) and support the different tasks; it probably won't need all the possible flags the various scripts have. If that makes sense.

I was using finetune_trainer.py for many things, but then a user asks to validate/benchmark/integrate a model not supported by that script, so I go into that subdomain in examples and things aren't the same there. And I know we are trying to make the example scripts consistent, but as this issue shows, I know for a fact that when the same feature is copied manually across scripts, they are bound to become inconsistent. At least that's the experience with transformers so far.

Complaints and expressions of frustration aside, perhaps we could start with the one script you think is the best model, move it out of examples, and start transforming it to support a multitude of tasks/models/features? Would that be a good way to move forward?

@sgugger
Collaborator

sgugger commented Jan 29, 2021

The issue is derailing a bit, as I think adding max_train_samples etc. to all scripts has been validated (and it is useful to quickly test that the example runs on the user's data).

If you want to look at a benchmarking script, I think a good starting point is run_glue for fine-tuning on text classification and run_mlm for language modeling. Those are more for BERT-like models than seq2seq models, however. finetune_trainer is slated for deprecation, and once run_seq2seq has all its features, it can be the one good script to build on for all things seq2seq.

@stas00
Contributor Author

stas00 commented Jan 29, 2021

The issue is derailing a bit, as I think adding max_train_samples etc. to all scripts has been validated (and it is useful to quickly test that the example runs on the user's data).

Excellent!

If you want to look at a benchmarking script, I think a good starting point is run_glue for fine-tuning on text classification and run_mlm for language modeling. Those are more for BERT-like models than seq2seq models, however. finetune_trainer is slated for deprecation, and once run_seq2seq has all its features, it can be the one good script to build on for all things seq2seq.

I feel I'm not managing to successfully communicate the need here. I will let it go for now.

@github-actions

github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and been closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.

@stas00
Contributor Author

stas00 commented Mar 6, 2021

This is getting resolved by #10551

@LeopoldACC

I always thought that max_steps defines the total number of weight update steps (which is then not really influenced by other parameters such as the number of GPUs or gradient_accumulation_steps or whatever). To me it defines: "How often do I want to update my weights?" Or am I wrong here? I think the name is clear and does not need to be changed; the documentation could be updated with a sentence that makes clear that max_steps = number of weight updates. Also, I use this arg quite often when training and think it's important to keep.

I agree with @sgugger here that a --max_num_train_samples arg (or whatever the name) should not go into the Trainer, but should be added to all example scripts. It's actually incredibly easy to do this with datasets:

from datasets import load_dataset
ds = load_dataset("crime_and_punish", split="train")
ds = ds.select(range(arg.max_num_train_samples))  # keep only the first max_num_train_samples examples

I'm totally fine with having this as another cl arg for the scripts, but I don't think it's the responsibility of the Trainer.

Hi, I want to use the crime_and_punish dataset to evaluate the Reformer model. Which task script should I use?

@stas00
Contributor Author

stas00 commented Apr 13, 2021

@LeopoldACC, it looks like you posted your question in a very unrelated discussion. Please try https://discuss.huggingface.co/. Thank you.
