Add run_glue_tpu.py that trains models on TPUs #3702
Conversation
Initial commit to get GLUE (BERT) on TPU
TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested.
For gradient accumulation we accumulate over batches from the `ParallelLoader` instance, which already marks the step itself on `next()`.
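For context, here is a minimal sketch of what gradient accumulation over a torch_xla `ParallelLoader` can look like; the function name, hyperparameters, and the assumption that the model returns its loss as the first output are illustrative, not the actual script:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def train_one_epoch(model, optimizer, dataloader, device,
                    gradient_accumulation_steps=2, max_grad_norm=1.0):
    # TPU: ParallelLoader prefetches batches onto the device in the background;
    # each item yielded by the per-device loader is one batch step.
    loader = pl.ParallelLoader(dataloader, [device]).per_device_loader(device)
    model.train()
    for step, batch in enumerate(loader):
        loss = model(**batch)[0] / gradient_accumulation_steps
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            # TPU: xm.optimizer_step applies the update and triggers the XLA step.
            xm.optimizer_step(optimizer)
            optimizer.zero_grad()
```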
* Shard the eval dataset and aggregate eval metrics. Also, instead of calling `eval_loss.item()` every time, do the summation with tensors on device.
* Change defaultdict to float.
* Reduce the pred and label tensors instead of the metrics. As brought up during review, some metrics like F1 cannot be aggregated via averaging. GLUE task metrics depend largely on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those.
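A rough sketch of that aggregation, assuming each core holds its predictions and labels as 1-D numpy arrays (the function and tag names are illustrative):

```python
import numpy as np
import torch_xla.core.xla_model as xm


def gather_eval_outputs(preds, labels):
    # Concatenate every core's predictions and labels so metrics such as F1
    # are computed over the full eval set instead of being averaged per core.
    all_preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
    all_labels = xm.mesh_reduce("eval_labels", labels, np.concatenate)
    return all_preds, all_labels
```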
examples/README.md (outdated)
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --do_lower_case \
I know it's present in other example scripts (and should be changed), but should we keep the `--do_lower_case` option with cased models?
Good point. Removed.
src/transformers/modeling_utils.py (outdated)
    def save_pretrained(self, save_directory, xla_device=False):
        """ Save a model and its configuration file to a directory, so that it
        can be re-loaded using the `:func:`~transformers.PreTrainedModel.from_pretrained`` class method.

        Arguments:
            save_directory: directory to which to save.
            xla_device: True if saving after training on TPU/XLA.
This makes me a bit uncomfortable. I'm pretty sure the best option would be to save it in the model configuration instead of adding an argument to `save_pretrained`. I'll look into what can be done to have this clean.
Your changes have now been merged and I like the way it looks now better 😄
This is needed for our testing framework, which checks regressions against key metrics written by the summary writer.
Using configuration for `xla_device`
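Roughly, the configuration-based approach looks like the sketch below; the function body and weights file name are illustrative rather than the exact merged code:

```python
import os

import torch


def save_pretrained(model, save_directory):
    os.makedirs(save_directory, exist_ok=True)
    model.config.save_pretrained(save_directory)
    state_dict = model.state_dict()
    weights_path = os.path.join(save_directory, "pytorch_model.bin")
    if getattr(model.config, "xla_device", False):
        import torch_xla.core.xla_model as xm

        # TPU: only the master ordinal writes, and xm.save moves tensors to CPU
        # before serializing them.
        if xm.is_master_ordinal():
            xm.save(state_dict, weights_path)
    else:
        torch.save(state_dict, weights_path)
```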
Looks good to me with a couple of small changes.
examples/run_glue_tpu.py (outdated)

    import numpy as np
    import torch
    import torch_xla.core.xla_model as xm
I don't like this naming for non-standard modules. Can you refer to these by their full names?
In all our examples we use `xm`, `xmp`, `met`, and `pl` consistently, so I'd rather keep it this way for the sake of consistency. I similarly see `np` throughout Huggingface, but if you feel strongly, I'm happy to call them by their full names instead.
I see. If that's the XLA standard I guess it's okay. (I feel like numpy is a special case).
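For reference, these are the torch_xla modules behind the aliases in question:

```python
import torch_xla.core.xla_model as xm                    # core XLA device and step utilities
import torch_xla.debug.metrics as met                    # runtime metrics report
import torch_xla.distributed.parallel_loader as pl       # background data loading onto TPU
import torch_xla.distributed.xla_multiprocessing as xmp  # multi-core process spawning
```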
examples/run_glue_tpu.py (outdated)
    set_seed(args.seed)  # Added here for reproducibility (even between python 2 and 3)
    for epoch in train_iterator:
        # Get TPU parallel loader which sends data to TPU in background.
        train_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
Can you add a "TPU" comment like this over each of the non-standard lines? I think that would be helpful for learning from this code.
Good idea! Done.
examples/run_glue_tpu.py (outdated)
            loss.backward()

            if (step + 1) % args.gradient_accumulation_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
I was under the impression that this line was very slow on TPU?
I think you're referring to the comment made here: https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md. I've sent out this PR to clarify.
tl;dr: we've patched `torch.nn.utils.clip_grad_norm_` so that it's not slow.
Looks like our comments crossed haha
Ah good to know.
examples/run_glue_tpu.py (outdated)
    )

    # TPU Parameters
    parser.add_argument("--num_cores", default=8, type=int, help="Number of TPU cores to use.")
On colab you can only use 1 or 8 here right? Is that true generally?
Yeah, it's not Colab-specific; at the moment you have to use either 1 or all TPU cores.
Gotcha, maybe we could put that comment in the doc string just so users don't get confused. (I found the error messages a bit cryptic when I did this wrong.)
Sounds good, clarified on the CLI flag.
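A hedged sketch of how the clarified flag and the per-core process spawning might look; the `_mp_fn` entry point and help text are illustrative:

```python
import argparse

import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index, args):
    # Each spawned process drives one TPU core; `index` is the process ordinal.
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--num_cores",
        default=8,
        type=int,
        help="Number of TPU cores to use (1 for a single core, 8 for a whole TPU; other values are currently unsupported).",
    )
    args = parser.parse_args()
    # TPU: spawn one training process per core; nprocs must be 1 or the full core count.
    xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
```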
Codecov Report
@@            Coverage Diff             @@
##           master    #3702      +/-   ##
==========================================
- Coverage   78.06%   78.03%   -0.03%
==========================================
  Files         100      100
  Lines       17134    17144      +10
==========================================
+ Hits        13375    13378       +3
- Misses       3759     3766       +7
Continue to review full report at Codecov.
Instead of under `args.data_dir`. This is needed as our test infra uses a `data_dir` on a read-only filesystem.
LGTM! Won't move the script to `examples/glue/` yet as I know @julien-c is working on scripts and I don't want to create a complicated rebase.
Hmm, don't worry about me, this shouldn't be a problem. Feel free to merge if ready, @LysandreJik!
@jysohn23 I'm trying to run a variant of
I tried reducing the batch size to 1 and running on a single core; both led to the same error. Full log: shorturl.at/iswxR