Use barriers to reduce duplicate work/resources #9

jysohn23 · 2020-04-01T00:50:01Z

No description provided.

taylanbil

much cleaner this way, thanks Daniel.

taylanbil · 2020-04-01T17:12:27Z

transformers/modeling_utils.py

+            if xm.is_master_ordinal():
+                model_to_save.config.save_pretrained(save_directory)
+            # xm.save takes care of saving only from master
+            xm.save(model_to_save.state_dict(), output_model_file)


Idea for the future: you could use the checkpoint tagger later, if they mark which checkpoint is the best, etc.

As far as I know, they don't have such a tagger, but let me see if they do. Good idea; we could even upstream one if they don't. Thanks.
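For readers following the hunk above, here is a minimal sketch of the "every process calls save, only the master writes" behavior that the code comment attributes to `xm.save`. It is a plain-Python stand-in, not the torch_xla implementation: `save_master_only`, `demo`, the JSON format, and the file name are all hypothetical, and the ordinal is passed in by hand instead of being read from the runtime.

```python
import json
import os
import tempfile


def save_master_only(state, path, ordinal):
    """Write `state` to `path` only when called by ordinal 0.

    Hypothetical stand-in for the master-only behavior described in the
    diff comment; returns True if this caller performed the write.
    """
    if ordinal != 0:
        return False
    with open(path, "w") as f:
        json.dump(state, f)
    return True


def demo(num_workers=4):
    """Call the save helper once per simulated worker and read the file back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model_state.json")
        wrote = [save_master_only({"step": 10}, path, o) for o in range(num_workers)]
        with open(path) as f:
            saved = json.load(f)
    return wrote, saved
```

Calling `demo()` shows that all workers can share the same call site while only one of them touches the filesystem, which is why the diff guards `config.save_pretrained` explicitly but leaves the `xm.save` call unguarded.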

taylanbil · 2020-04-01T17:14:26Z

examples/run_glue_tpu.py

@@ -341,6 +355,9 @@ def main(args):
    label_list = processor.get_labels()
    num_labels = len(label_list)

+    if not xm.is_master_ordinal():
+        xm.rendezvous('download_only_once')  # Make sure only the first process in distributed training will download model & vocab


I'm assuming no tensor work is done in the methods between the two `download_only_once` rendezvous calls.

Yep, correct.
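The barrier pattern discussed in this thread can be sketched without a TPU. The example below is an illustration under stated assumptions, not the code in `run_glue_tpu.py`: `threading.Barrier` stands in for `xm.rendezvous('download_only_once')`, threads stand in for the per-core processes, ordinal 0 plays the master, and `simulate_download_once`, `from_pretrained`, and the in-memory `cache` are hypothetical names invented for the sketch.

```python
import threading


def simulate_download_once(num_workers=4):
    """Simulate 'only the master downloads, everyone else reads the cache'."""
    barrier = threading.Barrier(num_workers)  # stand-in for the rendezvous
    cache = {}      # stand-in for the on-disk model/vocab cache
    downloads = []  # ordinals that actually performed a "download"
    results = [None] * num_workers

    def from_pretrained(ordinal):
        # Hypothetical loader: "downloads" only on a cache miss.
        if "model" not in cache:
            downloads.append(ordinal)
            cache["model"] = "weights"
        return cache["model"]

    def worker(ordinal):
        is_master = ordinal == 0
        if not is_master:
            # Non-master workers block here until the master arrives.
            barrier.wait()
        results[ordinal] = from_pretrained(ordinal)
        if is_master:
            # The master arrives only after populating the cache,
            # which releases the waiting workers.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, downloads
```

In this sketch every worker ends up with the same cached value while `downloads` contains only ordinal 0. The barrier is entered exactly once per worker, before the load for non-masters and after it for the master, which matches the assumption confirmed above that nothing else needs to synchronize between the two rendezvous calls.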

* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (pytorch-tpu#3)
  TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested.
* Cleanup TPU bits from run_glue.py
  TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (pytorch-tpu#4)
  Since for gradient accumulation we're accumulating on batches from `ParallelLoader` instance which on next() marks the step itself.
* Resolve R/W conflicts from multiprocessing (pytorch-tpu#5)
* Add XLNet in list of models for `run_glue_tpu.py` (pytorch-tpu#6)
* Add RoBERTa to list of models in TPU GLUE (pytorch-tpu#7)
* Add RoBERTa and DistilBert to list of models in TPU GLUE (pytorch-tpu#8)
* Use barriers to reduce duplicate work/resources (pytorch-tpu#9)
* Shard eval dataset and aggregate eval metrics (pytorch-tpu#10)
* Shard eval dataset and aggregate eval metrics
  Also, instead of calling `eval_loss.item()` every time do summation with tensors on device.
* Change defaultdict to float
* Reduce the pred, label tensors instead of metrics
  As brought up during review, some metrics like f1 cannot be aggregated via averaging. GLUE task metrics depend largely on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those instead.
* Only use tb_writer from master (pytorch-tpu#11)
* Apply huggingface black code formatting
* Style
* Remove `--do_lower_case` as example uses cased
* Add option to specify tensorboard logdir
  This is needed for our testing framework which checks regressions against key metrics written by the summary writer.
* Using configuration for `xla_device`
* Prefix TPU specific comments.
* num_cores clarification and namespace eval metrics
* Cache features file under `args.cache_dir`
  Instead of under `args.data_dir`. This is needed as our test infra uses data_dir with a read-only filesystem.
* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

* Cohere Model Release (#1)
  Cohere Model Release
* Remove unnecessary files and code (#2)
  Some cleanup
* Delete cohere-model directory (#3)
* Make Fix (#5)
* Pr fixes (#6)
* fixes for pr
* pr fixes for the format
* pr fixes for the format
* src/transformers/models/auto/tokenization_auto.py
* Tokenizer test (#8)
* tokenizer test
* format fix
* Adding Docs and other minor changes (#7)
* Add modeling tests (#9)
* Smol Fix (#11)
* tokenization tests are fixed
* format fixes
* fix pr doc tests
* fix pr doc tests
* fix pr doc tests
* fix pr style check
* small changes in cohere.md
* FIX: Address final comments for transformers integration (#13)
* fix modeling final nits and add proper test file
* for now leave empty tests
* add integration test
* push new test
* fix modeling cohere (#14)
* Update chat templates to use the new API (#15)

---------

Co-authored-by: ahmetustun <ahmetustun89@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

Use barriers to reduce duplicate work/resources

32a34d2

jysohn23 requested a review from taylanbil April 1, 2020 00:50

taylanbil approved these changes Apr 1, 2020

jysohn23 merged commit 6d17e91 into pytorch-tpu:tpu Apr 1, 2020

jysohn23 deleted the tpu branch April 1, 2020 17:22
