
[examples] add main_process_first context manager to datasets map calls #12363

Closed
stas00 opened this issue Jun 25, 2021 · 7 comments · Fixed by #12367


@stas00
Contributor

stas00 commented Jun 25, 2021

We need to replicate this addition, modeled in run_translation.py in #12351, across all other PyTorch examples.

The actual changes for the model example (just run_translation.py) are here:
https://github.com/huggingface/transformers/pull/12351/files#diff-09777f56cee1060a535a72ce99a6c96cdb7f330c8cc3f9dcca442b3f7768237a

Here is a time-saver:

```shell
find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(train_dataset = train_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="train dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(eval_dataset = eval_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="validation dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(predict_dataset = predict_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="prediction dataset map pre-processing"):\n$p$t] } }' {} \;

git checkout examples/pytorch/translation/run_translation.py

make fixup
```
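The pattern the perl commands wrap around each `map` call can be illustrated with a simplified, self-contained sketch. This is not the actual transformers implementation of `main_process_first` (names like `DummyBarrier` and the `is_main`/`barrier` parameters are made up for this demo); it only shows the idea that the main process runs the dataset preprocessing first while the other ranks wait at a barrier and then reuse the cached result:

```python
from contextlib import contextmanager

class DummyBarrier:
    """Single-process stand-in for torch.distributed.barrier(), for illustration only."""
    def __init__(self):
        self.calls = 0
    def wait(self):
        self.calls += 1

@contextmanager
def main_process_first(is_main, barrier, desc=""):
    # Non-main ranks block here until the main process has finished the body,
    # so only one process performs the (cacheable) dataset preprocessing.
    if not is_main:
        barrier.wait()
    try:
        yield
    finally:
        if is_main:
            # The main process is done; release the waiting ranks.
            barrier.wait()

barrier = DummyBarrier()
events = []
with main_process_first(is_main=True, barrier=barrier, desc="train dataset map pre-processing"):
    events.append("map ran on main process")
```

In the real examples, the body of the `with` block is the `train_dataset.map(...)` call, and the actual context manager lives on `TrainingArguments`.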

I noticed that other scripts may have additional datasets.map calls, which get automatically rewritten by the commands above, so please review the changes to see if the desc argument needs to be adjusted. We want the context manager around all of these calls, and it's possible the perl rewrites missed some.

  • this template needs the same change as well:
    templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py
    You can do it via perl, manually, or any other way that works for you.

And please validate that the scripts still work, either by running:

```shell
RUN_SLOW=1 pytest examples/pytorch/test_examples.py
```

or by running each script manually as explained in its corresponding README.md file.

This issue is open to all and should be very simple to complete; the main effort is the validation.

And thank you for your contribution!

@bhadreshpsavani
Contributor

Can I take this? It won't take much time for me.

@stas00
Contributor Author

stas00 commented Jun 26, 2021

Yes, thank you, @bhadreshpsavani

@bhadreshpsavani
Contributor

bhadreshpsavani commented Jun 26, 2021

Hi @stas00 and @sgugger,
In the earlier PR, I wanted to ask one thing about the code below:

```python
print(f"Saving predictions to {prediction_file}.")
with open(prediction_file, "w") as writer:
    writer.write(json.dumps(all_predictions, indent=4) + "\n")
print(f"Saving nbest_preds to {nbest_file}.")
with open(nbest_file, "w") as writer:
    writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if version_2_with_negative:
    print(f"Saving null_odds to {null_odds_file}.")
    with open(null_odds_file, "w") as writer:
        writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
```

Shall we use logger.info() instead of print(), like we did in the code below?

```python
logger.info(f"Saving predictions to {prediction_file}.")
with open(prediction_file, "w") as writer:
    writer.write(json.dumps(all_predictions, indent=4) + "\n")
logger.info(f"Saving nbest_preds to {nbest_file}.")
with open(nbest_file, "w") as writer:
    writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if version_2_with_negative:
    logger.info(f"Saving null_odds to {null_odds_file}.")
    with open(null_odds_file, "w") as writer:
        writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
```

Or is it intentionally written like this?

Because of this, when we run the run_qa_beam_search.py script we get prints like the below for the train, eval, and test stages even when we pass --log_level error:

```
Saving predictions to /tmp/debug_squad/predict_predictions.json.                                                        | 0/5 [00:00<?, ?it/s]
Saving nbest_preds to /tmp/debug_squad/predict_nbest_predictions.json.
Saving null_odds to /tmp/debug_squad/predict_null_odds.json.
```
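The underlying difference is that print() writes unconditionally, while logger.info() is filtered by the logger's configured level, which is what --log_level error controls. A minimal, self-contained sketch (the logger name and messages are made up for this demo):

```python
import io
import logging

# Route a dedicated logger to an in-memory stream so its output can be inspected.
logger = logging.getLogger("qa_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)  # analogous to passing --log_level error

logger.info("Saving predictions to predictions.json.")  # filtered out at ERROR level
logger.error("Something went wrong.")                   # passes the filter

captured = stream.getvalue()
```

A print() call in the same script would bypass this filtering entirely, which is why the messages above appear despite --log_level error.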

@stas00
Contributor Author

stas00 commented Jun 26, 2021

good catch, @bhadreshpsavani! logger.info() please as you suggested.

Please feel free to make a separate PR if you don't want to mix this with this particular change.

@bhadreshpsavani
Contributor

bhadreshpsavani commented Jun 26, 2021

Hi @stas00 and @sgugger,
There is a minor thing: at this line

```python
predict_dataset.remove_columns_("label")
```

we are getting

```
examples/pytorch/text-classification/run_glue.py:530: FutureWarning: remove_columns_ is deprecated and will be removed in the next major version of datasets. Use Dataset.remove_columns instead.
  predict_dataset.remove_columns_("label")
```

The fix is

```python
predict_dataset.remove_columns("label")
```

Shall we change it? It is also present at one more line below.

@stas00
Contributor Author

stas00 commented Jun 26, 2021

Yes, except you now need to assign the return value, since this is no longer an in-place edit. Therefore in both places it will now be:

```python
x = x.remove_columns("label")
```

with the right x, of course.

Thank you for fixing it.

reference: https://huggingface.co/docs/datasets/processing.html#removing-one-or-several-columns-remove-columns
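The difference can be sketched with a toy stand-in for datasets.Dataset (the ToyDataset class below is hypothetical, made up purely for illustration; see the link above for the real API): remove_columns returns a new object instead of mutating the dataset, so the result must be reassigned.

```python
class ToyDataset:
    """Toy stand-in for datasets.Dataset, only to show the assignment pattern."""
    def __init__(self, columns):
        self.columns = dict(columns)

    def remove_columns(self, name):
        # Not in-place: build and return a new dataset without the given column.
        remaining = {k: v for k, v in self.columns.items() if k != name}
        return ToyDataset(remaining)

predict_dataset = ToyDataset({"sentence": ["a", "b"], "label": [0, 1]})
predict_dataset = predict_dataset.remove_columns("label")  # reassignment is required
```

Without the reassignment, the deprecated-style call would silently leave the original dataset (and its "label" column) untouched.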

@bhadreshpsavani
Contributor

I have committed changes in the open PR for the fix of this warning!
