use torch native amp #3128
Conversation
Force-pushed from 8e099f8 to efbdef2
Update: running AMP on CPU is possible; however, I experienced extreme slowdowns compared to non-AMP on CPU, so I removed AMP from the testing scripts again.
Force-pushed from ed70a5d to a5839b2
Force-pushed from 4d375a7 to ff6806d
After feedback from @dchaplinsky I also upgraded the language model trainer. I have verified that AMP works; however, I haven't gathered information about the speed boost for language modelling.
Force-pushed from ff6806d to 92f0ae6
Force-pushed from 0d7693f to aa0f67f
@helpmefindaname have you checked if this gives speedups? At least the following script, trained on cuda:0 on my local machine, becomes slower with AMP enabled:

```python
import flair
from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import TokenClassifier
from flair.trainers import ModelTrainer

flair.set_seed(123)

# set this to True or False
use_amp = True

# get downsampled corpus
corpus = CONLL_03(in_memory=False).downsample(0.05)

# make label dictionary
label_dict = corpus.make_label_dictionary("ner")

# init embeddings
embeddings = TransformerWordEmbeddings("distilbert-base-uncased", fine_tune=True)

# init simple tagger
tagger = TokenClassifier(
    embeddings=embeddings,
    label_dictionary=label_dict,
    label_type="ner",
)

# train model
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    f"resources/taggers/test_tagger_{use_amp}-chunk_4",
    monitor_test=True,
    shuffle=False,
    max_epochs=1,
    use_amp=use_amp,
)
```

Without AMP I get 400 samples/sec, but with AMP it is slower.

Additionally, it throws the following warning:

```
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
```
That is interesting, I get:

with amp:

without amp:

So AMP is faster for me, at almost the opposite ratio.

About the warning: with that fix applied to the AMP code path, I get:

So in that specific case the fix performs worse; however, I don't think that generalizes to models with more training epochs, etc.
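For context, a minimal sketch (not the trainer code from this PR; model, optimizer, and scheduler are arbitrary placeholders) of the ordering that avoids this warning when combining `torch.cuda.amp` with a scheduler: step the scaled optimizer first, and only advance the scheduler when the optimizer step was not skipped due to inf/NaN gradients.

```python
import torch

# placeholder model/optimizer/scheduler, just to illustrate the call order
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 10, device="cuda")).sum()
    scaler.scale(loss).backward()

    scale_before = scaler.get_scale()
    scaler.step(optimizer)   # may skip optimizer.step() if infs/NaNs were found
    scaler.update()

    # only step the scheduler if the optimizer actually stepped; a reduced scale
    # after update() means the step was skipped, which triggers the warning above
    if scaler.get_scale() >= scale_before:
        scheduler.step()
```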
Force-pushed from 094bfcc to 6739861
Force-pushed from 6739861 to abcc759
@helpmefindaname thanks for adding this! (Locally, I am still seeing slowdowns when using AMP, but as discussed offline, there may be a CPU/GPU tradeoff that is causing this.)
nvidia-apex AMP has been deprecated for some time now, since PyTorch ships `torch.cuda.amp` as of PyTorch 1.6. This PR upgrades the usage to the newer version, so setting `use_amp=True` on `trainer.train` or `trainer.fine_tune` works out of the box (see the sketch after the timings below). Also, `use_amp` will be used in all tests that train a model, so the tests should be faster (waiting for the pipeline to finish to evaluate this).

I did 2 training runs:
- Transformer (distilbert): using `use_amp` reduces the time from 440 s/epoch to 217 s/epoch.
- Flair embeddings + word embeddings (without a BiLSTM layer): using `use_amp` reduces the time for the first epoch (no embedding storage) from 221 s/epoch to 70 s/epoch.
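As an illustration of the new flag (the corpus, embeddings, label type, and output path are arbitrary placeholders, not the benchmarked setup), `use_amp` can be passed to `trainer.train` the same way as to `trainer.fine_tune` in the script earlier in this thread:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import WordEmbeddings
from flair.models import TokenClassifier
from flair.trainers import ModelTrainer

# small illustrative setup
corpus = UD_ENGLISH().downsample(0.1)
label_dict = corpus.make_label_dictionary("upos")

tagger = TokenClassifier(
    embeddings=WordEmbeddings("glove"),
    label_dictionary=label_dict,
    label_type="upos",
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/amp_example",
    max_epochs=1,
    use_amp=True,  # enables the torch.cuda.amp path added in this PR
)
```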
This PR also fixes recreation of the types to be aligned with the layers, so you can speed up inference by using `tagger.half()`.
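A minimal half-precision inference sketch (the model path and sentence are placeholders; assumes a CUDA device and an already trained tagger):

```python
import torch
from flair.data import Sentence
from flair.models import TokenClassifier

# placeholder path; any trained tagger works here
tagger = TokenClassifier.load("resources/taggers/amp_example/final-model.pt")
tagger = tagger.half().to("cuda")  # cast all weights to float16 for faster inference

sentence = Sentence("George Washington went to Washington .")
with torch.no_grad():
    tagger.predict(sentence)
print(sentence.to_tagged_string())
```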
PyTorch generally selects the dtype to cast to depending on your device. However, you can select the dtype yourself by using `torch.set_autocast_cpu_dtype` or `torch.set_autocast_gpu_dtype`. Note that CPU autocast currently only supports `torch.bfloat16`.
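For illustration, a sketch of overriding the autocast dtype (assumes a CUDA device with bfloat16 support; the tensors are arbitrary):

```python
import torch

# override the dtype autocast casts to; by default PyTorch picks one per device
torch.set_autocast_gpu_dtype(torch.bfloat16)  # GPU default would be float16
torch.set_autocast_cpu_dtype(torch.bfloat16)  # CPU autocast only supports bfloat16

with torch.autocast(device_type="cuda"):
    x = torch.randn(4, 4, device="cuda")
    y = torch.mm(x, x)
    print(y.dtype)  # torch.bfloat16 after the override
```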
However, be aware that using AMP reduces the accuracy of gradients and can therefore lead to a lower score or higher loss.