[Trainer] add --max_train_samples --max_val_samples --max_test_samples #10437
Comments
Yes, that would be a nice refactor.
@bhadreshpsavani, would you like to try this slightly more complex task? Step 1 is to take the existing truncation code out of run_seq2seq.py and move it into Trainer. Step 2 - no step: every other script should just work with these now-Trainer-level command-line args. And then later it'd be great to have the metrics updated with the actual number of samples run, like it's done manually right now in run_seq2seq.py.
Ya @stas00, I'd like to work on it.
Awesome! Please don't hesitate to ask questions if you run into any uncertainties. Thank you!
I agree with the proposed solution here. But we pre-process the datasets in the scripts before passing them to the Trainer.
Hi @patil-suraj, the script currently does:

```python
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
```

But ya, it will select samples from the processed dataset only.
Ah, in that case we sadly can't really add this in the Trainer itself.
Hi @sgugger, should we then keep this logic in the example scripts and just add the three arguments to each of them?
Yes, I think it's the best solution.
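Just to make the script-level approach concrete, here is a minimal sketch of how the three arguments could be declared and applied after preprocessing. The dataclass name, field help texts, and the helper function are illustrative assumptions, not the exact code in the example scripts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataTrainingArguments:
    # Illustrative fields only; the real example scripts may name and document them differently.
    max_train_samples: Optional[int] = field(
        default=None, metadata={"help": "Truncate the training set to this many samples (for quick testing)."}
    )
    max_val_samples: Optional[int] = field(
        default=None, metadata={"help": "Truncate the validation set to this many samples."}
    )
    max_test_samples: Optional[int] = field(
        default=None, metadata={"help": "Truncate the test set to this many samples."}
    )


def maybe_truncate(dataset, max_samples):
    """Select the first `max_samples` examples of an already pre-processed dataset, if a limit was given."""
    if max_samples is not None:
        dataset = dataset.select(range(max_samples))
    return dataset
```

Each script would then call maybe_truncate(train_dataset, data_args.max_train_samples) (and the same for the validation and test sets) right after preprocessing, which is essentially the pattern quoted above.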
Hi @stas00, I also want to ask: in many of the other scripts we are only doing train and eval, not test/predict. Should we also create a separate issue for that? We might not be adding --max_test_samples to those scripts.
The test thing should definitely have its own separate issue.
Good analysis, @bhadreshpsavani!
That was an invalid porting of the original. As you can see, the original aggregated all the metrics and returned them.
I think all these blocks that return metrics from the scripts are questionable. Alternatively, we could have the trainer also store all the metrics, not only on disk but internally, and then the last command of each script could just return the trainer's stored metrics.
@sgugger, what do you think - should we just not return anything from the example scripts at all?
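To illustrate the alternative floated above, here is a rough sketch of a trainer keeping its metrics in memory as well as on disk; the mixin and attribute names are hypothetical, not existing Trainer API.

```python
class MetricsRecordingMixin:
    """Hypothetical mixin: keep every metrics dict in memory in addition to writing it to disk."""

    def __init__(self):
        self.stored_metrics = {}

    def record_metrics(self, split, metrics):
        # Remember a copy per split ("train", "eval", "test") so a script's last line
        # could simply be `return trainer.stored_metrics` instead of aggregating dicts by hand.
        self.stored_metrics[split] = dict(metrics)
```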
Another good catch. Probably for now just skip it.
Filed: #10482. It might be easier to sort out test/predict first then, as it'd make the copy-n-paste of all three command-line arg flags easier. But either way works.
The metrics returned are mainly for the tests AFAIK, so we can remove that behavior if the tests are all adapted to load the file where the metrics are stored.
OK, let's sync everything then to remove the inconsistent metric returns. Just please be aware that the example tests currently use the returned metrics (transformers/examples/test_examples.py, lines 100 to 102 in b013842).
So instead we should write a wrapper to load the metrics from the filesystem and test that.
Just to make sure my mentioning of a wrapper wasn't ambiguous: for models and examples we are trying to be as explicit as possible to help users understand what is going on in the code - so we avoid refactoring there and duplicate code where needed, unless we can make something a functional method in Trainer so that all the noise can be abstracted away, especially for code that's really just formatting. For tests it's software engineering as normal: refactoring is important, as it helps minimize hard-to-maintain code and avoid errors. So there, let's not duplicate any code like reading the JSON file from the filesystem.
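As a sketch of the wrapper being discussed, the example tests could share a small helper like the one below; the file name all_results.json is an assumption about where the combined metrics end up, not a guarantee.

```python
import json
import os


def get_results(output_dir):
    # Load the metrics an example run saved to disk (assumed to be a single JSON file
    # in the run's output directory) so the test can assert on them.
    results_file = os.path.join(output_dir, "all_results.json")
    if not os.path.isfile(results_file):
        raise ValueError(f"Can't find {results_file}")
    with open(results_file, "r") as f:
        return json.load(f)
```

Each test would then call get_results(tmp_dir) after running the script, instead of inspecting the script's return value.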
Hi @stas00, in the trainer we can write code for loading the metrics, but to access the trainer in the tests … Another thing: if we use … Sorry for asking multiple questions; once these things are clear, the implementation and testing will not take much time.
You have the output dir in the test (transformers/examples/test_examples.py, line 78 in 805c520), so you just load the saved metrics file from it (transformers/examples/test_examples.py, line 101 in 805c520). That's it - you have the metrics to test on the following line ;)
I already changed the code to save the metrics to disk - see transformers/src/transformers/trainer_pt_utils.py, lines 651 to 661 in 805c520.
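For context, here is a simplified, standalone sketch of roughly what that save_metrics helper does (the real method lives on the Trainer and gets the output directory from its arguments; this is not the actual implementation):

```python
import json
import os


def save_metrics(output_dir, split, metrics):
    # Write the metrics dict for a given split ("train", "eval", "test")
    # to <output_dir>/<split>_results.json so tests and users can read it back later.
    path = os.path.join(output_dir, f"{split}_results.json")
    with open(path, "w") as f:
        json.dump(metrics, f, indent=4, sort_keys=True)
```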
On the contrary, please don't hesitate to ask any questions. It takes quite some time to find one's way in this complex, massive code base.
Hi @stas00, the two lines below in run_ner.py don't seem very meaningful, since …

```python
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)
```
Why do you suggest it's not meaningful?
oooh, I didn't notice it!
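For context, the test/predict section of these example scripts typically looks roughly like the sketch below (a simplification, not the exact run_ner.py code), which shows where the metrics dict being logged and saved comes from:

```python
def do_predict(trainer, test_dataset):
    # `trainer` is a transformers.Trainer and `test_dataset` a preprocessed dataset,
    # both assumed to have been built earlier in the script.
    predictions, labels, metrics = trainer.predict(test_dataset)

    # The metrics returned by predict (loss plus whatever compute_metrics adds) are
    # exactly what gets printed and written to disk by the two lines discussed above.
    trainer.log_metrics("test", metrics)
    trainer.save_metrics("test", metrics)
    return predictions, labels, metrics
```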
I have made the changes to add the three arguments to the PyTorch-based scripts, and it's working as expected. I also modified … For the TensorFlow-based scripts, I am facing issues while running the scripts even in Colab, directly from the master branch without any changes; I created an issue for that. We have four run_tf_*.py files: … Based on the error in the third file, do we need to add test scripts for these TensorFlow files? Currently, we only have PyTorch-based scripts in the example tests.
Please don't touch the TF examples as they have not been cleaned up and will change in the near future. And yes, none of the TF examples are currently tested.
As we were planning to add --max_train_samples --max_val_samples --max_test_samples to all examples (#10423), I thought: is there any reason why we don't expand the Trainer to handle that? It surely would be useful to be able to truncate the dataset at the Trainer level to enable quick testing.
Another plus is that the metrics could then automatically include the actual number of samples run, rather than each example computing it manually as is done at the moment.
That way this functionality would be built in and the examples would get it for free.
TODO:
1. Add --max_train_samples --max_val_samples --max_test_samples to Trainer and remove the then-unneeded code in run_seq2seq.py (transformers/examples/seq2seq/run_seq2seq.py, line 590 in aca6288).
2. Have the metrics include the actual number of samples run, so that all scripts automatically get this metric reported. Most likely it should be done here: transformers/src/transformers/trainer_utils.py, line 224 in aca6288 (see the sketch below).
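A rough sketch of what both TODO items could look like; the argument container and the extended speed-metrics helper below are assumptions based on this proposal, not merged code:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TruncationArguments:
    # Hypothetical TrainingArguments-style fields for the proposed flags.
    max_train_samples: Optional[int] = field(default=None)
    max_val_samples: Optional[int] = field(default=None)
    max_test_samples: Optional[int] = field(default=None)


def speed_metrics_with_samples(split: str, start_time: float, end_time: float, num_samples: int) -> Dict[str, float]:
    # Hypothetical extension of the speed-metrics helper: also report how many
    # samples were actually run, so every script gets "<split>_samples" for free.
    runtime = max(end_time - start_time, 1e-12)
    return {
        f"{split}_runtime": round(runtime, 4),
        f"{split}_samples": num_samples,
        f"{split}_samples_per_second": round(num_samples / runtime, 3),
    }
```

The dataset truncation itself would still have to happen where the datasets are built - either inside the Trainer's dataloader construction or, as the comments above conclude, in the scripts, because preprocessing happens before the Trainer ever sees the data.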
@sgugger