Added max_sample_ arguments #10551
Conversation
Excellent work, @bhadreshpsavani!
There are a few small tweak requests that I left in the comments.
Thank you!
```python
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_eval:
```
Suggested change:

```python
)

if training_args.do_eval:
```
Please add a new line between the `if` blocks so that they don't mesh together (same in all other scripts).
```python
if os.path.exists(path):
    with open(path, "r") as f:
        results = json.load(f)
return results
```
If we are expecting this to always work, then perhaps:

```python
else:
    raise ValueError(f"can't find {path}")
```

Otherwise `result["eval_accuracy"]` will complain about a missing key, which is not the real problem.
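A minimal sketch of the full pattern being suggested, assuming a results-loading helper like the one in the snippet above (the function name and surroundings are illustrative, not the exact code in the repo):

```python
import json
import os

def get_results(path):
    # Fail loudly if the results file is missing, instead of letting a
    # later KeyError on "eval_accuracy" obscure the real cause.
    if os.path.exists(path):
        with open(path, "r") as f:
            results = json.load(f)
    else:
        raise ValueError(f"can't find {path}")
    return results
```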
Thanks for diving into this. There is a slight problem with the language modeling examples and the QA examples: for both, the number of samples in the dataset is actually changed in the preprocessing, so we must take more care to have the right number of samples in the final dataset.
In particular, I don't think we can avoid preprocessing the whole dataset in the language modeling examples.
```python
def preprocess_function(examples):
    examples = tokenizer(examples[text_column_name])
    return group_texts(examples)
```
This will not work grouped like this, as the `group_texts` function relies on the length of the tokenized samples. Moreover, in this case, the number of samples is actually the number of elements in `lm_datasets` (in the version before your PR), since `group_texts` changes the number of examples.
Therefore, all the preprocessing should be left as is here and the number of samples selected at the end.
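For context, `group_texts` in these scripts concatenates all the tokenized texts and re-chunks them into fixed-size blocks, so the number of output examples has no fixed relation to the number of input samples. A simplified sketch (the hard-coded block size and the omission of label handling are simplifications):

```python
block_size = 128  # illustrative; the real scripts derive this from the tokenizer/config

def group_texts(examples):
    # Concatenate every list-valued field (input_ids, attention_mask, ...).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail so every chunk is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    # Re-chunk: how many examples come out depends on the total token
    # count, not on how many input samples went in.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```

This is why selecting `max_train_samples` rows before this map does not bound the size of the final dataset.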
```python
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
Same comment as before. The selecting should be done after the preprocessing.
```python
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_train:
```
Same comment on this script too.
```python
train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
```
The `prepare_train_features` function will create multiple entries for each example, so we should do a second select after the preprocessing.
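Concretely, right after the `map` call in the quoted snippet, a second truncation along these lines (a sketch re-using the script's names) would cap the number of features:

```python
# prepare_train_features can split a long context into several features,
# so re-truncate after the map to guarantee the requested count.
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
```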
```python
eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
```
Same for validation.
```python
train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
```
Same as in the run_qa script, we should do a second select after preprocessing.
```python
eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
```
Same for validation.
Thank you for having a closer look than I did, @sgugger. Ideally we should have tests that would have caught this.
Hi @stas00, how can we add test cases for this? If we check …
Yes, that's exactly the idea.
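A rough sketch of the kind of check such a test could make, using a tiny dummy dataset built with the `datasets` library (the test name and setup are hypothetical; the actual tests would exercise the example scripts themselves):

```python
from datasets import Dataset

def test_max_train_samples_truncates_dataset():
    # Build a 100-row dummy dataset and verify that selecting
    # max_train_samples rows yields exactly that many examples.
    max_train_samples = 5
    dummy = Dataset.from_dict({"text": [f"sample {i}" for i in range(100)]})
    truncated = dummy.select(range(max_train_samples))
    assert len(truncated) == max_train_samples
```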
Hi @stas00, what should I do if I get this error while using git: …
I found that I need to use this command: …
Some CI tests unrelated to your work were failing, so I rebased your PR branch on master, and then they passed. You may not have noticed that. So you needed to do … and deal with merge conflicts if any emerged. In general, force-pushing should be reserved for when a bad mistake was made and you need to undo some damage. Your force-push undid the changes I pushed, but since you then rebased, it's the same as what I did, so no damage was done in this situation. But please be careful in the future and first understand why you think you need to force-push.
Okay @stas00, …
Thanks for addressing my comments! I have a few more and then it should be ready to be merged :-)
```python
train_dataset = tokenized_datasets["train"].map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
I think this map can be done as before (deleted lines 349 to 354 in the diff) since it's the same for training and validation.
Hi @sgugger,
so we should do it like below:

```python
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
```

and we simply select samples for train and validation?
Yes. It avoids duplicating the same code this way.
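With the shared map above, each split then reduces to a simple select (a sketch mirroring the snippet quoted earlier in this thread, not the final diff):

```python
# Both splits come from the same lm_datasets, so the grouping map
# runs only once.
train_dataset = lm_datasets["train"]
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

eval_dataset = lm_datasets["validation"]
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
```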
I did this for almost all the examples; I thought preprocessing would only be done if it was required.
Shall I make these changes for all the examples or only for the ones mentioned here?
For the other examples, you are doing the select before doing the map (to avoid preprocessing the whole dataset), so it's not possible to group all the preprocessing together. I think it only applies to the three scripts in language_modeling.
```python
train_dataset = tokenized_datasets["train"].map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
Same here.
```python
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
```
We should still do this before the map: the map adds samples but does not reduce their number. So in this example, we can speed up preprocessing by doing `train_dataset = train_dataset.select(range(data_args.max_train_samples))` before the map, so that we preprocess at most `max_train_samples` examples, and then once more after the map to make sure we have the right number of examples.
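Put together, the pattern being described is select, map, then select again (a sketch assuming the names used in this script, with `prepare_train_features` per the earlier discussion):

```python
# Cap the raw examples first so the map preprocesses at most
# max_train_samples rows.
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

train_dataset = train_dataset.map(
    prepare_train_features,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)

# The map can only add entries, never remove them, so truncate once
# more to end up with exactly max_train_samples examples.
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
```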
```python
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
```
Same comment here for the validation set.
```python
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
```
Same comment as in `run_qa`.
@LysandreJik I think this is ready for final review and merge if you're happy with it.
Great, LGTM!
* reverted changes of logging and saving metrics
* added max_sample arguments
* fixed code
* white space diff
* reformetting code
* reformatted code
What does this PR do?
Fixes #10437 #10423
Notes:
All the PyTorch-based examples will have support for the arguments with these changes, except the two files below:

* run_mlm_flax.py (since I couldn't test the changes, I didn't make changes to that file)
* run_generation.py
Review: @stas00 @sgugger