[trainer] a consistent way to limit the number of items #9801
Comments
Mmm, which scripts use --n_obs?
As for a consistent way to do this in all examples, it doesn't really matter in the non-seq2seq scripts, as their evaluation runs quite fast. I imagine those arguments were introduced in the seq2seq scripts originally because their evaluation is super long. We can add them to the other scripts on an as-needed basis, but I haven't felt the need to do this.
All right, so this confusion leads to an incorrect benchmark - that's what I thought last night, but it was too late to see it. We need a way to truncate the dataset to an identical size and then compare, say, a 1-gpu vs. 2-gpu benchmark on the same total number of input objects. So how do we currently do that with the other scripts, the ones that aren't seq2seq? Fast? Try the comment I pointed to.
Those are not officially maintained examples, except for the new one.
You are pointing to a comment that does not contain any evaluation, so I stand by what I said: evaluation on wikitext-2 runs in a couple of seconds.
Like I said, if it's needed it can be added - by opening a PR adding this ;-)
Thank you for clarifying which is which, @sgugger. OK, so what should we call a new flag in the HF Trainer that would be an equivalent of --n_train? Or should we use the same name? Do you suggest it should be train-specific?
I think it should be in the scripts, not the Trainer, as it's part of the preprocessing. I don't think it should be train-specific; we can do eval/test like in the finetune_trainer script.
But then we have to change all the scripts. Why not have an option to truncate the dataset at the Trainer level and solve it at once for all scripts?
Because it doesn't have much to do with the Trainer itself, IMO. It's like putting all the scripts' tokenization arguments in the Trainer: it doesn't really make sense, as the Trainer is supposed to take the lead after the data preprocessing. Let's see if @LysandreJik and @patrickvonplaten think differently, maybe?
This makes sense. Then perhaps have a Trainer subclass that all scripts can tap into? Also, may I suggest that the --max_steps documentation clarify how it interacts with the number of gpus?
The documentation says number of training steps. I don't see how the number of GPUs intervenes here, as a training step is the full combination of forward, backward (perhaps multiple times if gradient accumulation is activated) and optimizer step. One training step can consume a different number of training samples depending on the number of GPUs, but also depending on the batch size, gradient accumulation steps, etc. This information is logged at the beginning of training.
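To make the distinction concrete, here is a small illustration (mine, not Trainer code; all numbers are made up) of why --max_steps does not pin down the total number of samples seen:

```python
# Illustration only (not Trainer code): why --max_steps does not pin down the
# total number of training samples seen. All numbers are made up.
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 1000  # optimizer steps, the quantity --max_steps limits

for n_gpus in (1, 2):
    samples_per_step = per_device_train_batch_size * n_gpus * gradient_accumulation_steps
    total_samples = max_steps * samples_per_step
    print(f"{n_gpus} gpu(s): {samples_per_step} samples/step -> {total_samples} samples total")
```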
Right, so what you're saying is that --max_steps counts optimizer steps, which stays the same regardless of the number of gpus.

Honestly, I have been staring at all these different trainer options for a long time now and I still get confused about which is which, and which are impacted by the number of gpus and which aren't. Every time this happens I have to go through the source code to see how each one is used, and then I get it. To me, some of these arg names are hard to make sense of in the multi-gpu vs. single-gpu environment.

I propose we use a more explicit name for it.
The problem is that this then is a breaking change. I'm not necessarily super fond of the name either.
Do you think it's actually used a lot? I agree with avoiding breaking changes, but since we are trying to make the API intuitive, such changes will, in the long run, benefit a much larger community than the annoyance they'd cause to those who use it right now.

I think the main issue we have here is that all these rename proposals happen dynamically. Instead, I think it'd make sense for a group of us to sit down, review all the cl args, and do a single adjustment. Surely this won't guarantee that we won't find something we missed in the future, but it's definitely better than doing it a little bit at a time, which is much more annoying.

In some previous projects we also had a back-compat mode for such things, which, once enabled, supported a whole bunch of old ways until the user was ready to make the shift to the new code. Surely a rename of a cl arg could easily be supported by such a feature. So here, instead of a deprecation cycle per item, the approach is to keep anything old around, but only if it's loaded from a helper module, so that the main code remains clean of deprecated things. This was in a different programming environment where it was developed, so I will have to think about how to do the same here.
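For concreteness, here is a hypothetical sketch of that helper-module idea; nothing like this exists in transformers, and the flag names in the mapping are purely illustrative:

```python
# legacy_args.py - hypothetical back-compat helper, NOT part of transformers.
# Deprecated CL arg spellings keep working only for scripts that import and
# call this helper; the main code base stays free of deprecated names.
import sys
import warnings

# Illustrative mapping of deprecated flag -> current flag (names are made up here).
_DEPRECATED_FLAGS = {
    "--n_train": "--max_train_samples",
    "--n_val": "--max_val_samples",
}


def remap_deprecated_cl_args(argv=None):
    """Rewrite deprecated flags in place so the arg parser only sees the new names."""
    argv = sys.argv if argv is None else argv
    for i, arg in enumerate(argv):
        flag = arg.split("=", 1)[0]
        if flag in _DEPRECATED_FLAGS:
            new_flag = _DEPRECATED_FLAGS[flag]
            warnings.warn(f"{flag} is deprecated, use {new_flag} instead", FutureWarning)
            argv[i] = new_flag + arg[len(flag):]
    return argv
```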
Note that this is not just a CL arg rename, since the same argument is also set programmatically in TrainingArguments. And from the issues, I'd say that half the users use it.
So, for the benefit of reviewers, and to bring us back to the focus of this Issue: I proposed to have a cl arg that will truncate the dataset (train, others?) to a given total size across all example scripts. @sgugger correctly suggested that perhaps this shouldn't belong in the Trainer, and then I suggested that perhaps there should be a subclass that does such nice little tweaks consistently across all example scripts, rather than manually replicating the same code, which often leads to scripts diverging. Plus, @sgugger points out the concerns above about --max_steps and about breaking changes.
I always thought of --max_steps as counting optimizer steps. I agree with @sgugger here that limiting the number of samples belongs in the scripts' data preprocessing. I think a simple

ds = load_dataset("crime_and_punish", split="train")
ds = ds.select(range(arg.max_num_train_samples))

should do the trick. I'm totally fine with having this as another cl arg for the scripts, but I don't think it's the responsibility of the Trainer.
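As a follow-up to the snippet above, a minimal sketch of how such a cl arg could be wired into a script (my illustration, not code from the repo; the flag name --max_train_samples is assumed here, not an existing API):

```python
# Minimal sketch of exposing dataset truncation as a script-level cl arg.
# The flag name --max_train_samples is illustrative.
import argparse

from datasets import load_dataset

parser = argparse.ArgumentParser()
parser.add_argument("--max_train_samples", type=int, default=None,
                    help="Truncate the training set to this many items (debug/benchmark).")
args = parser.parse_args()

ds = load_dataset("crime_and_punish", split="train")
if args.max_train_samples is not None:
    # datasets.Dataset.select(range(n)) keeps only the first n items
    ds = ds.select(range(args.max_train_samples))
```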
I agree with Sylvain and Patrick about --max_steps. And for controlling the number of examples, this should go in the scripts rather than the Trainer. These args are already there in the new seq2seq script.
Thank you for your input, guys. Your suggestions work for me.

But they are not there in the other scripts, and then we have divergence. I proposed to have a Trainer subclass that implements this for all scripts, vs. repeating the same cl arg definition and code anew in every script (and forgetting to sync some) - could you please address that? The other slight confusion across some scripts is that they don't use the same arg names for this.
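To make the proposal concrete, a hypothetical sketch of such a subclass - not an actual transformers API, and the argument names are illustrative:

```python
# Hypothetical sketch of the subclass idea above - not an actual transformers API.
# The point is that every example script could share the truncation logic
# instead of re-implementing it.
from transformers import Trainer


class ExampleTrainer(Trainer):
    def __init__(self, *args, max_train_samples=None, max_eval_samples=None, **kwargs):
        super().__init__(*args, **kwargs)
        # datasets.Dataset.select(range(n)) keeps only the first n items
        if max_train_samples is not None and self.train_dataset is not None:
            self.train_dataset = self.train_dataset.select(range(max_train_samples))
        if max_eval_samples is not None and self.eval_dataset is not None:
            self.eval_dataset = self.eval_dataset.select(range(max_eval_samples))
```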
I don't think this is a good idea, personally. The goal of the scripts is to provide examples for our users. Having examples that don't use the main object of the library is counterproductive. It's one other instance where we have to bear the burden of duplicate code to make the user experience easier, IMO.

As for the naming, I think this is mostly historical.
You're correct. I didn't think of that. So we have a conflict here between example scripts and them being used for more than that. I, for one, need a solid set of scripts to do things like benchmarking and re-validating past results.

In the absence of these I have been heavily relying on the example scripts, and this is probably where the conflict is. So I keep on bringing this up: should we have a set of scripts that are not examples but real production workhorses, and treat them as such? Perhaps they can have much less functionality, but do it consistently across different domains, and keep it simple. Perhaps they could even live somewhere other than the examples.
You're absolutely correct; please see my response in the comment above.
If the basic examples do not suffice, then yes, definitely.
But we are walking in circles. If these are examples and they are treated as examples, then they aren't tools to be relied upon. I hope you can see the irony... I need a solid tool that will not change its API, so I can start doing all the benchmarks with it and then go back to benchmarks from 6 months or a year ago, re-run them, and re-check.
I'm not sure why you say we are walking in circles. I just said yes to having benchmark-specific scripts if the examples do not have all the functionality you need.
I see what you mean. But you asked a tricky question: can I figure out how to use the example scripts to meet my needs? Mostly yes. But then every time I ask for something that ensures consistency, you say the audience is wrong - it should be for users. And I say, yes, of course, you're right. And we end up nowhere. Do you see where the circle is?

Ideally there should be just one benchmarking tool that can handle any model (or at least the majority of them) and support the different tasks, and it probably won't need all the possible flags the various scripts have. If that makes sense. I have been using finetune_trainer.py for this until now.

Complaints and expressions of frustration aside - perhaps we could start with one script that you think is a good model, make it a non-example, and start transforming it to support a multitude of tasks/models/features? Would that be a good way to move forward?
The issue is derailing a bit, as I think we already agreed on adding the new cl arg to the scripts. If you want to look at a benchmarking script, I think a good starting point is the new seq2seq script.
Excellent!
I feel I'm not managing to successfully communicate the need here. I will let it go for now.
This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions. If you think this still needs to be addressed, please comment on this thread.
This is getting resolved by #10551.
Hi, I want to use the crime_and_punish dataset to do evaluation on the Reformer model - which task code should I use?
@LeopoldACC, it looks like you posted your question in a very unrelated discussion. Please try https://discuss.huggingface.co/. Thank you.
🚀 Feature request

We have:

- finetune_trainer.py has --n_train (and matching eval/test limits)
- some run_* scripts use --n_obs
- --max_steps in the main trainer, which works only on the train_dataset - there is no ability to limit items on the eval_dataset

Requests/Questions:

1. Can we have a consistent way to limit the number of items across all scripts?
2. Is --max_steps enough, and what if one needs to use a different number of items for train and eval?

Thank you.

@sgugger