Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added max_sample_ arguments #10551

Merged
merged 8 commits into from
Mar 8, 2021
Merged

Added max_sample_ arguments #10551

merged 8 commits into from
Mar 8, 2021

Conversation

bhadreshpsavani
Copy link
Contributor

@bhadreshpsavani bhadreshpsavani commented Mar 5, 2021

What does this PR do?

Fixes #10437 #10423

Before submitting

Notes:

All the PyTorch-based examples except the below two files will have support for the arguments by adding these changes.

  1. The same changes can be implemented for run_mlm_flax.py but since I couldn't test the changes I didn't make changes to that file.
  2. run_generation.py
  • I have reverted the code changes for three TF-based examples since it was giving an error and we want to keep it as it is.
  • Test/Predict code addition is still pending. I will do it next.

review:

@stas00 @sgugger

Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, @bhadreshpsavani!

There are a few small tweak requests that I left in the comments.

Thank you!

num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_eval:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if training_args.do_eval:
if training_args.do_eval:

please add a new line between the ifs so that they don't mesh together (same in all other scripts).

if os.path.exists(path):
with open(path, "r") as f:
results = json.load(f)
return results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are expecting this to always work, then perhaps:

else:
    raise ValueError(f"can't find {path}")

otherwise result["eval_accuracy"] will complain about a missing key, which is not the problem.

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for diving into this. There is a slight problem with the language modeling examples and the QA examples: for both, the number of samples in the dataset is actually changed in the preprocessing, so we must take more care to have the right number of samples in the final dataset.

In particular, I don't think we can avoid preprocessing the whole dataset in the language modeling examples.

Comment on lines 350 to 353
def preprocess_function(examples):
examples = tokenizer(examples[text_column_name])
return group_texts(examples)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work grouped like this as the group_texts function relies on the length of tokenized samples. Moreover, in this case, the number of samples is actually the number of elements in the lm_datasets (in the version before your PR) as group_texts changes the number of examples.

Therefore, all the preprocessing should be left as is here and the number of samples selected at the end.

Comment on lines 401 to 405
if data_args.max_train_samples is not None:
train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before. The selecting should be done after the preprocessing.

num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_train:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on this script too.

train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prepare_train_features function will create multiple entries for each example. So we should do a second select after the preprocessing.

eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for validation.

train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in the run_qa script, we should do a second select after preprocessing.

eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for validation.

@stas00
Copy link
Contributor

stas00 commented Mar 5, 2021

Thank you for having a closer look that I did, @sgugger.

Ideally we should have tests that would have caught this

@bhadreshpsavani
Copy link
Contributor Author

Hi @stas00,

How can we add test cases for this testing? If we check max_train_samples and max_valid_samples from metrics and add assert statement that might be possible.

@stas00
Copy link
Contributor

stas00 commented Mar 6, 2021

How can we add test cases for this testing? If we check max_train_samples and max_valid_samples from metrics and add assert statement that might be possible.

Yes, that's exactly the idea

@bhadreshpsavani
Copy link
Contributor Author

Hi @stas00,

What should I do if I got this error while using git,

$ git push origin argument-addition
To https://github.com/bhadreshpsavani/transformers.git
 ! [rejected]            argument-addition -> argument-addition (non-fast-forward)
error: failed to push some refs to 'https://github.com/bhadreshpsavani/transformers.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

@bhadreshpsavani
Copy link
Contributor Author

I found that I need to use this command git push -f origin argument-addition with fast-forward flag.
Thanks, @stas00 I used your rebase script. It's cool! I did it the first time!

@stas00
Copy link
Contributor

stas00 commented Mar 6, 2021

Some unrelated to your work CI tests were failing so I rebased your PR branch to master, and then they passed. You may have not noticed that.

So you needed to do git pull before continuing pushing. and if you already made some changes and git pull doesn't work because an update was made in files that you locally modify, you normally do:

git stash
git pull
git stash pop

and deal with merge conflicts if any emerge.

In general force-pushing should only be reserved for when a bad mistake was made and you need to undo some damage.

So your force-pushing undid the changes I pushed. But since you then rebased it's the same as what I did. No damage done in this situation.

But please be careful in the future and first understand why you think of doing force pushing.

@bhadreshpsavani
Copy link
Contributor Author

Okay @stas00,
I will be careful while using force push I will use stash.
Now I understood

@stas00 stas00 added the Examples Which is related to examples in general label Mar 7, 2021
@bhadreshpsavani bhadreshpsavani requested a review from sgugger March 8, 2021 05:07
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments! I have a few more and then it should be ready to be merged :-)

Comment on lines 368 to 373
train_dataset = tokenized_datasets["train"].map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this map can be dome has before (deleted lines 349 to 354 in the diff) since it's the same for training and validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sgugger,

so we should do it like below

lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
    )

and we simply select samples for train and validation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It avoids duplicating the same code this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this for almost all the examples, I thought preprocessing will be done only if it will be required.
Shall I do these changes for all the examples or mentioned here only?

Copy link
Collaborator

@sgugger sgugger Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the other examples, you are doing the select before doing the map (to avoid preprocessing all the dataset) so it's not possible to group all the preprocessing together.I think it only applies to the three scripts in language_modeling.

Comment on lines 378 to 383
train_dataset = tokenized_datasets["train"].map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Comment on lines 386 to 387
if data_args.max_train_samples is not None:
train_dataset = train_dataset.select(range(data_args.max_train_samples))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still do this before the map: the map adds samples but do not reduce their numbers. So in this example, we can speed up preprocessing by doing the train_dataset = train_dataset.select(range(data_args.max_train_samples)) before the map so we preprocess at most max_train_samples examples, and then once more after the map to make sure we have the right number of examples.

Comment on lines 441 to 442
if data_args.max_val_samples is not None:
eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here for the validation set.

Comment on lines 399 to 401
if data_args.max_train_samples is not None:
train_dataset = train_dataset.select(range(data_args.max_train_samples))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in run_qa

@bhadreshpsavani
Copy link
Contributor Author

Hello @stas00 and @sgugger,
I have made the suggested changes
Please let me know if any other changes are required
Thanks

@sgugger
Copy link
Collaborator

sgugger commented Mar 8, 2021

@LysandreJik I think this is ready for final review and merge if you're happy with it.

@sgugger sgugger requested a review from LysandreJik March 8, 2021 18:25
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

@LysandreJik LysandreJik merged commit dfd16af into huggingface:master Mar 8, 2021
@bhadreshpsavani bhadreshpsavani deleted the argument-addition branch March 8, 2021 21:58
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* reverted changes of logging and saving metrics

* added max_sample arguments

* fixed code

* white space diff

* reformetting code

* reformatted code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Examples Which is related to examples in general
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Trainer] add --max_train_samples --max_val_samples --max_test_samples
4 participants