
[examples] add main_process_first context manager to datasets map calls #12363

Closed
stas00 opened this issue Jun 25, 2021 · 7 comments · Fixed by #12367


@stas00
Contributor

stas00 commented Jun 25, 2021

We need to replicate this addition, modeled in run_translation.py in #12351, across all other PyTorch examples.

The actual changes for the model example (just run_translation.py) are here:
https://github.com/huggingface/transformers/pull/12351/files#diff-09777f56cee1060a535a72ce99a6c96cdb7f330c8cc3f9dcca442b3f7768237a

Here is a time-saver:

```shell
find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(train_dataset = train_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="train dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(eval_dataset = eval_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="validation dataset map pre-processing"):\n$p$t] } }' {} \;

find examples/pytorch -type f -exec perl -0777 -pi -e 's|^(\s+)(predict_dataset = predict_dataset.map\(.*?\))|x($1, $2)|msge; BEGIN {sub x {($p, $t) = @_ ; $t =~ s/^/    /msg; return qq[${p}with training_args.main_process_first(desc="prediction dataset map pre-processing"):\n$p$t] } }' {} \;

git checkout examples/pytorch/translation/run_translation.py

make fixup
```
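The pattern the perl commands wrap around each `map` call can be illustrated with a simplified, self-contained sketch. This is not the actual transformers implementation of `main_process_first` (names like `DummyBarrier` and the `is_main`/`barrier` parameters are made up for this demo); it only shows the idea that the main process runs the dataset preprocessing first while the other ranks wait at a barrier and then reuse the cached result:

```python
from contextlib import contextmanager

class DummyBarrier:
    """Single-process stand-in for torch.distributed.barrier(), for illustration only."""
    def __init__(self):
        self.calls = 0
    def wait(self):
        self.calls += 1

@contextmanager
def main_process_first(is_main, barrier, desc=""):
    # Non-main ranks block here until the main process has finished the body,
    # so only one process performs the (cacheable) dataset preprocessing.
    if not is_main:
        barrier.wait()
    try:
        yield
    finally:
        if is_main:
            # The main process is done; release the waiting ranks.
            barrier.wait()

barrier = DummyBarrier()
events = []
with main_process_first(is_main=True, barrier=barrier, desc="train dataset map pre-processing"):
    events.append("map ran on main process")
```

In the real examples, the body of the `with` block is the `train_dataset.map(...)` call, and the actual context manager lives on `TrainingArguments`.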

I noticed that other scripts may have additional datasets.map calls, which get automatically rewritten by the commands above, so please review the changes to see if the desc argument needs to be adjusted. We want the context manager around all of these calls, and it's possible the perl rewrites missed some.

  • this template needs the same change as well:
    templates/adding_a_new_example_script/{{cookiecutter.directory_name}}/run_{{cookiecutter.example_shortcut}}.py
    You can do it via perl, manually, or any other way that works for you.

And please validate that the scripts still work, either by running:

```shell
RUN_SLOW=1 pytest examples/pytorch/test_examples.py
```

or by running each script manually as explained in its corresponding README.md file.

This issue is open to all and should be very simple to complete; the main effort is the validation.

And thank you for your contribution!

@bhadreshpsavani
Contributor

Can I take this? It won't take much time for me.

@stas00
Contributor Author

stas00 commented Jun 26, 2021

Yes, thank you, @bhadreshpsavani

@bhadreshpsavani
Contributor

bhadreshpsavani commented Jun 26, 2021

Hi @stas00 and @sgugger,
In the earlier PR, I wanted to ask one thing about the code below:

```python
print(f"Saving predictions to {prediction_file}.")
with open(prediction_file, "w") as writer:
    writer.write(json.dumps(all_predictions, indent=4) + "\n")
print(f"Saving nbest_preds to {nbest_file}.")
with open(nbest_file, "w") as writer:
    writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if version_2_with_negative:
    print(f"Saving null_odds to {null_odds_file}.")
    with open(null_odds_file, "w") as writer:
        writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
```

Shall we use logger.info() instead of print(), like we did in the code below?

```python
logger.info(f"Saving predictions to {prediction_file}.")
with open(prediction_file, "w") as writer:
    writer.write(json.dumps(all_predictions, indent=4) + "\n")
logger.info(f"Saving nbest_preds to {nbest_file}.")
with open(nbest_file, "w") as writer:
    writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if version_2_with_negative:
    logger.info(f"Saving null_odds to {null_odds_file}.")
    with open(null_odds_file, "w") as writer:
        writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
```

Or is it intentionally written like this?

Because of this, when we run the run_qa_beam_search.py script we get prints like the below for the train, eval, and test stages even when we pass --log_level error:

```
Saving predictions to /tmp/debug_squad/predict_predictions.json.                                                        | 0/5 [00:00<?, ?it/s]
Saving nbest_preds to /tmp/debug_squad/predict_nbest_predictions.json.
Saving null_odds to /tmp/debug_squad/predict_null_odds.json.
```
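The underlying difference is that print() writes unconditionally, while logger.info() is filtered by the logger's configured level, which is what --log_level error controls. A minimal, self-contained sketch (the logger name and messages are made up for this demo):

```python
import io
import logging

# Route a dedicated logger to an in-memory stream so its output can be inspected.
logger = logging.getLogger("qa_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)  # analogous to passing --log_level error

logger.info("Saving predictions to predictions.json.")  # filtered out at ERROR level
logger.error("Something went wrong.")                   # passes the filter

captured = stream.getvalue()
```

A print() call in the same script would bypass this filtering entirely, which is why the messages above appear despite --log_level error.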

@stas00
Contributor Author

stas00 commented Jun 26, 2021

good catch, @bhadreshpsavani! logger.info() please as you suggested.

Please feel free to make a separate PR if you don't want to mix this with this particular change.

@bhadreshpsavani
Contributor

bhadreshpsavani commented Jun 26, 2021

Hi @stas00 and @sgugger,
There is a minor thing: at this line

```python
predict_dataset.remove_columns_("label")
```

we are getting

```
examples/pytorch/text-classification/run_glue.py:530: FutureWarning: remove_columns_ is deprecated and will be removed in the next major version of datasets. Use Dataset.remove_columns instead.
  predict_dataset.remove_columns_("label")
```

The fix is

```python
predict_dataset.remove_columns("label")
```

Shall we change it? It is also present at one more line below.

@stas00
Contributor Author

stas00 commented Jun 26, 2021

Yes, except you now need to assign the return value, since this is no longer an in-place edit. Therefore in both places it will now be:

```python
x = x.remove_columns("label")
```

with the right x, of course.

Thank you for fixing it.

reference: https://huggingface.co/docs/datasets/processing.html#removing-one-or-several-columns-remove-columns
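The difference can be sketched with a toy stand-in for datasets.Dataset (the ToyDataset class below is hypothetical, made up purely for illustration; see the link above for the real API): remove_columns returns a new object instead of mutating the dataset, so the result must be reassigned.

```python
class ToyDataset:
    """Toy stand-in for datasets.Dataset, only to show the assignment pattern."""
    def __init__(self, columns):
        self.columns = dict(columns)

    def remove_columns(self, name):
        # Not in-place: build and return a new dataset without the given column.
        remaining = {k: v for k, v in self.columns.items() if k != name}
        return ToyDataset(remaining)

predict_dataset = ToyDataset({"sentence": ["a", "b"], "label": [0, 1]})
predict_dataset = predict_dataset.remove_columns("label")  # reassignment is required
```

Without the reassignment, the deprecated-style call would silently leave the original dataset (and its "label" column) untouched.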

@bhadreshpsavani
Contributor

I have committed changes in the open PR for the fix of this warning!
