
[Trainer] add --max_train_samples --max_val_samples --max_test_samples #10437

Closed
stas00 opened this issue Feb 27, 2021 · 23 comments · Fixed by #10551

stas00 (Contributor) commented Feb 27, 2021

As we were planning to add --max_train_samples --max_val_samples --max_test_samples to all examples (#10423), I thought: is there any reason why we don't extend the Trainer to handle that?

It would surely be useful to be able to truncate the dataset at the Trainer level to enable quick testing.

Another plus is that the metrics could then automatically include the actual number of samples run, rather than each example script computing it by hand as is done at the moment.

That way this functionality would be built-in and examples will get it for free.

TODO:

  1. port --max_train_samples --max_val_samples --max_test_samples to Trainer and remove the then unneeded code in run_seq2seq.py
  2. extend metrics to report the number of samples as it's done now in:

metrics["train_samples"] = min(max_train_samples, len(train_dataset))

so that all scripts automatically get this metric reported. Most likely it should be done here:

def speed_metrics(split, start_time, num_samples=None):
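To make this concrete, here is a rough sketch of what the extended helper could look like (the runtime and samples_per_second keys follow what speed_metrics reports today, if memory serves; the {split}_samples key and the exact naming are just the proposal, not final):

import time

def speed_metrics(split, start_time, num_samples=None):
    # measure wall time since start_time and report it under the given split prefix
    runtime = time.time() - start_time
    result = {f"{split}_runtime": round(runtime, 4)}
    if num_samples is not None:
        # the new part: report the actual number of samples run, so every
        # example script gets this metric for free
        result[f"{split}_samples"] = num_samples
        result[f"{split}_samples_per_second"] = round(num_samples / runtime, 3)
    return result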

@sgugger

sgugger (Collaborator) commented Feb 27, 2021

Yes, that would be a nice refactor in Trainer! I believe this can be done when we create the dataloaders, to keep the original datasets untouched.

stas00 (Contributor, Author) commented Feb 28, 2021

@bhadreshpsavani, would you like to try this slightly more complex task?

Step 1 is to take run_seq2seq.py and move the functionality that handles --max_train_samples --max_val_samples --max_test_samples (args and logic) to training_args.py and trainer.py respectively, and to ensure it works the same way. Please see @sgugger's note above on where to move the logic to; a rough sketch of the arguments follows below.
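A hedged sketch of what those argument definitions could look like once moved (field names mirror the proposed flags; whether they end up in TrainingArguments or stay in the scripts' own argument dataclasses is exactly what this step has to settle):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingArguments:
    # sketch only, trimmed to just the three new truncation knobs,
    # to be parsed by HfArgumentParser like the existing fields
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={"help": "If set, truncate the training set to this many samples (for quick debugging)."},
    )
    max_val_samples: Optional[int] = field(
        default=None,
        metadata={"help": "If set, truncate the validation set to this many samples."},
    )
    max_test_samples: Optional[int] = field(
        default=None,
        metadata={"help": "If set, truncate the test set to this many samples."},
    )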

Step 2 - there is no step 2: every other script should then just work with these now-Trainer-level command line args.

And then later it'd be great to have the metrics updated with the actual number of samples run, as is done manually right now in run_seq2seq.py - I added the full details to the OP.

bhadreshpsavani (Contributor) commented:

Yes @stas00, I can do that.

stas00 (Contributor, Author) commented Feb 28, 2021

Awesome! Please don't hesitate to ask questions if you run into any uncertainties.

Thank you!

patil-suraj (Contributor) commented:

I agree with the proposed solution here. But we pre-process the datasets in the scripts before passing them to the Trainer. So if I just want to use, say, 100 validation examples, the script would unnecessarily process the whole dataset and the Trainer would then drop the extra examples.

bhadreshpsavani (Contributor) commented Mar 1, 2021

Hi @patil-suraj,
You are right about that.
I was thinking of adding the code below here:

if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

But yes, it will only select samples from the already-processed dataset.
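To avoid the wasted preprocessing, the truncation would have to run before the map/tokenization step - a rough sketch (preprocess_function and the map arguments are placeholders here, not the exact run_seq2seq.py code):

if data_args.max_train_samples is not None:
    # truncate first, so only the requested samples get tokenized
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)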

sgugger (Collaborator) commented Mar 1, 2021

Ah, in that case we sadly can't really add this in the Trainer, as it would be inefficient. Thinking about it more, having the functionality in the Trainer would also require us to support all kinds of datasets (even iterable datasets), which is going to be very painful, so I think we should just copy the code into all the scripts.

bhadreshpsavani (Contributor) commented Mar 1, 2021

Hi @sgugger,
Shall we add these arguments --max_train_samples --max_val_samples --max_test_samples and the corresponding code to all the scripts, as implemented in run_seq2seq.py?

sgugger (Collaborator) commented Mar 1, 2021

Yes, I think it's the best solution.

bhadreshpsavani (Contributor) commented Mar 2, 2021

Hi @stas00,
I was going through the code of run_seq2seq and trying to make the changes in the other scripts.
I came across result = {}. We never store anything inside this dictionary and we return it at the end as an empty dictionary. Is it necessary?

In other scripts like run_qa.py we use results instead of metrics in the eval and test sections. Should we unify this behaviour across the scripts? I mean, either use metrics as in run_seq2seq or use results as in the other scripts.

I also want to ask: in many scripts we only do train and eval, not test/predict. Should we create a separate issue for that? We might not add --max_test_samples to those scripts since they don't do testing.

sgugger (Collaborator) commented Mar 2, 2021

The test thing should definitely have its separate issue (and --max_test_samples can be added when the test/predict is added for those scripts).

stas00 (Contributor, Author) commented Mar 2, 2021

Good analysis, @bhadreshpsavani!

I was going through the code of run_seq2seq and trying to make the changes in the other scripts.
I came across result = {}. We never store anything inside this dictionary and we return it at the end as an empty dictionary. Is it necessary?

That was an invalid port of the original. As you can see, the original aggregated all the metrics and returned them.

In other scripts like run_qa.py we use results instead of metrics in the eval and test sections. Should we unify this behaviour across the scripts? I mean, either use metrics as in run_seq2seq or use results as in the other scripts.

I think all these return metrics from main were mainly used for testing. But since we save all the metrics to disk, this isn't necessarily needed, as the data is already accessible - changing this may impact some tests, which would need to be adjusted to use the metric dumps.

Alternatively, we could have the trainer also store all the metrics not only on disk but internally, so that the last command of main in all scripts could be:

return trainer.metrics()
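(A hypothetical sketch of what that could mean - the _all_metrics attribute and the metrics() accessor are made-up names, not the existing Trainer API:)

class Trainer:
    def __init__(self, *args, **kwargs):
        # accumulate every metrics dict that gets saved, in addition to the on-disk dumps
        self._all_metrics = {}

    def save_metrics(self, split, metrics, combined=True):
        self._all_metrics.update(metrics)
        # ... existing logic that writes <split>_results.json / all_results.json ...

    def metrics(self):
        return self._all_metrics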

@sgugger, what do you think - should we just not return anything from main in all example scripts, or always return all metrics and then tweak the trainer to store all the metrics internally and have a method to return them?

I also want to ask: in many scripts we only do train and eval, not test/predict. Should we create a separate issue for that? We might not add --max_test_samples to those scripts since they don't do testing.

Another good catch. Probably for now just skip --max_test_samples in those scripts, but meanwhile I raised your question here: #10482

stas00 (Contributor, Author) commented Mar 2, 2021

The test thing should definitely have its separate issue (and --max_test_samples can be added when the test/predict is added for those scripts).

Filed: #10482

It might be easier to sort out test/predict first then, as it'd make the copy-n-paste of all 3 cl arg flags easier. But either way works.

sgugger (Collaborator) commented Mar 2, 2021

The metrics returned are mainly for the tests AFAIK, so we can remove that behavior if the tests are all adapted to load the file where the metrics are stored.

stas00 (Contributor, Author) commented Mar 2, 2021

OK, let's sync everything then and remove the inconsistent return metrics. Just please be aware that the example tests in examples/test_examples.py will need to be adapted; e.g. currently they rely on the return value from main:

with patch.object(sys, "argv", testargs):
    result = run_glue.main()
    self.assertGreaterEqual(result["eval_accuracy"], 0.75)

So instead we should write a wrapper that loads the metrics from the filesystem and test against that.
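Something along these lines, as a sketch only - get_results is a hypothetical helper name, the output dir is whatever tmp dir the test passes to the script via --output_dir, and I'm assuming the combined metrics get dumped to something like all_results.json there:

import json
import os

def get_results(output_dir):
    # hypothetical helper: read back the metrics the Trainer dumped to disk
    with open(os.path.join(output_dir, "all_results.json")) as f:
        return json.load(f)

with patch.object(sys, "argv", testargs):
    run_glue.main()
result = get_results(tmp_dir)
self.assertGreaterEqual(result["eval_accuracy"], 0.75)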

stas00 (Contributor, Author) commented Mar 2, 2021

Just to make sure my mention of a wrapper wasn't ambiguous:

For models and examples we try to be as explicit as possible to help users understand what is going on in the code - so we avoid refactoring and accept duplicated code where it is needed, unless we can make something a functional method in Trainer so that all the noise can be abstracted away, especially for code that's really just formatting.

For tests it's software engineering as usual: refactoring is important, as it helps minimize hard-to-maintain code and avoid errors. So there let's not duplicate any code, like reading the json file from the filesystem.

bhadreshpsavani (Contributor) commented Mar 4, 2021

Hi @stas00,
I could not figure out the code or implementation for a wrapper that loads all the metrics for testing in test_examples.py. We store the metrics in the file system based on the output_dir argument, which is accessible to the trainer object, but I don't know how to access the trainer object for an individual script in test_examples.py.

In the trainer we can write code for loading the metrics, but how to access the trainer from test_examples.py is what I couldn't figure out.

Another thing: if we use all_metrics = {} to store all the metrics of train, test, and validation, we can save it once as all_results.json like the legacy code, right?

Sorry for asking so many questions. Once these things are clear, the implementation and testing will not take much time.

stas00 (Contributor, Author) commented Mar 4, 2021

Hi @stas00,
I could not figure out the code or implementation for a wrapper that loads all the metrics for testing in test_examples.py. We store the metrics in the file system based on the output_dir argument, which is accessible to the trainer object, but I don't know how to access the trainer object for an individual script in test_examples.py.

You have the output_dir as tmp_dir here:

tmp_dir = self.get_auto_remove_tmp_dir()

so you just load the f"{tmp_dir}/all_results.json" file right after this line:

result = run_glue.main()

That's it - you have the metrics to test on the following line ;)

Another thing: if we use all_metrics = {} to store all the metrics of train, test, and validation, we can save it once as all_results.json like the legacy code, right?

I already changed the code to save all_results.json auto-magically in trainer.save_metrics - make sure you rebase your branch:

if combined:
    path = os.path.join(self.args.output_dir, "all_results.json")
    if os.path.exists(path):
        with open(path, "r") as f:
            all_metrics = json.load(f)
    else:
        all_metrics = {}
    all_metrics.update(metrics)
    with open(path, "w") as f:
        json.dump(all_metrics, f, indent=4, sort_keys=True)
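So in the example scripts the calls stay as simple as this (a sketch; the variable names are illustrative):

trainer.log_metrics("train", train_metrics)
trainer.save_metrics("train", train_metrics)  # writes train_results.json and folds the metrics into all_results.json
trainer.save_metrics("eval", eval_metrics)    # merges the eval metrics into the same all_results.json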

Sorry for asking so many questions. Once these things are clear, the implementation and testing will not take much time.

On the contrary, please don't hesitate to ask any questions. It takes quite some time to find one's way in this complex massive code base.

bhadreshpsavani (Contributor) commented:

Hi @stas00,

The two lines below in run_ner.py don't seem very meaningful, since metrics does not contain any details from the test/prediction step:

trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)

stas00 (Contributor, Author) commented Mar 4, 2021

Why do you suggest it's not meaningful?

metrics gets set here:

predictions, labels, metrics = trainer.predict(test_dataset)
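i.e. the full flow in the script is roughly (a sketch; the max_test_samples accounting is the part this issue adds, and data_args is the script's usual argument container):

predictions, labels, metrics = trainer.predict(test_dataset)

max_test_samples = (
    data_args.max_test_samples if data_args.max_test_samples is not None else len(test_dataset)
)
metrics["test_samples"] = min(max_test_samples, len(test_dataset))

trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)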

bhadreshpsavani (Contributor) commented Mar 5, 2021

Oh, I didn't notice that!
I thought it was taking the earlier metrics.
Thanks!

bhadreshpsavani (Contributor) commented Mar 5, 2021

Hello @stas00 and @sgugger,

I have made the changes adding the three arguments to the PyTorch-based scripts, and they work as expected. I also modified test_examples.py accordingly.

For the TensorFlow-based scripts, I am facing issues while running them even in Colab, directly from the master branch without any changes. I created an issue for the same.

We have four run_tf_*.py files:

  1. run_tf_multiple_choice.py (Reported Issue)
  2. run_tf_squad.py (Reported Issue)
  3. run_tf_glue.py (Got AttributeError: 'TFTrainer' object has no attribute 'log_metrics')
  4. run_tf_text_classification.py

Based on the error in the third file, trainer.py is only for PyTorch-based models and trainer_tf.py is for TF-based models. In that case we need to write save_metrics() and log_metrics() for trainer_tf.py, right? In the last pull request I could not test the changes for the TF scripts, but I will fix that mistake in this PR.

Do we need to add tests for these TensorFlow files? Currently we only have the PyTorch-based scripts in test_examples.py.

sgugger (Collaborator) commented Mar 5, 2021

Please don't touch the TF examples as they have not been cleaned up and will change in the near future. And yes, none of the TF examples are currently tested.
