-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Conversation
@@ -66,6 +68,7 @@ def __init__( | |||
local_rank: int = 0, | |||
world_size: int = 1, | |||
num_gradient_accumulation_steps: int = 1, | |||
opt_level: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this to the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added a comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about amp_level
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, they also call this arg opt_level
in apex
(see here). Is there a good argument for calling it something different in AllenNLP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, cause amp is the main thing in apex. As far as I understand, opt_level
means "option level". In AllenNLP there are many options. So, amp_level
would mean "automatic mixed precision level".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's actually "optimization level". Still, I'd do amp_opt_level
, cause there are many different options in Trainer
and it can get confusing. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean towards sticking with opt_level
here, or perhaps amp_opt_level
. Switching the name of a parameter makes it harder for someone familiar with apex to know what's going on here.
allennlp/training/trainer.py
Outdated
@@ -32,6 +32,8 @@ | |||
from allennlp.training.tensorboard_writer import TensorboardWriter | |||
from allennlp.training.trainer_base import TrainerBase | |||
|
|||
from apex import amp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with just adding this to our list of dependencies, unless it's super heavy. Though it should be above, in the same block as the torch
imports.
Also, can you add a test with a not-None
opt_level
, showing that this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, moved apex import to torch block. Put it at the top because its convention for them to be alphabetically sorted (I think?)
Here is the install process for apex. I will leave it up to you as to whether this is too heavy or not!
Sure, will work on adding a test either tonight or tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can add it to our setup.py
, and if someone wants to install their own optimized version, that's fine too? The point is we want to not crash when someone just does pip install allennlp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that seems perfectly reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I need a little time to get my head around how the testing works in AllenNLP. If it is very easy for one of the AllenNLP devs to add a test for when opt_level is not None
I could also pass the buck so this can get merged sooner.
allennlp/training/trainer.py
Outdated
return training_util.sparse_clip_norm(parameters_to_clip, self._grad_norm) | ||
return None | ||
else: | ||
return training_util.rescale_gradients(self.model, self._grad_norm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option here is to add a use_amp
flag to rescale_gradients
. Oh, except that then also requires passing in the optimizer... Yeah, I think what you have here is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly my thinking. Otherwise, we could have a second util function named something like rescale_gradients_amp
which accepts an optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding that function would be overkill. If we want to refactor things, we should have rescale_gradients
take a list of parameters. But then we can just remove it and call sparse_clip_norm
directly. I'm somewhat in favor of that option, as it would simplify the above code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I refactored Trainer.rescale_gradients
to use training_utils.sparse_clip_norm
directly. Otherwise the function is the same.
opt_level : `str`, optional, (default = `None`) | ||
Each opt_level establishes a set of properties that govern Amp’s implementation of pure or mixed | ||
precision training. Must be a choice of `"O0"`, `"O1"`, `"O2"`, or `"O3"`. | ||
See the Apex [documentation](https://nvidia.github.io/apex/amp.html#opt-levels-and-properties) for | ||
more details. If `None`, Amp is not used. Defaults to `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matt-gardner How's this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JohnGiorgi this PR is great, thanks a lot!
Firstly, you asked how to run the tests. You can do that by:
pip install -r dev-requirements.txt
- Unit tests:
pytest allennlp/tests/path/to/test
.-v -s
flags make it verbose. - Auto formatter:
black allennlp
reformats python code automatically - just commit the changes it makes. - Lint:
flake8 allennlp
- Type Checking:
bash scripts/mypy.sh
(we have some specific mypy settings, so we have a script which runs it automatically.)
It would be good if you could add a test like this, which just checks that the trainer can run with one of the non-None optimisation levels. You can also add a decorator to the test which checks if apex is installed (see discussion below).
Some things we might need to resolve:
-
Apex does not offer a pip installable package, which means either we need to install from github, e.g
pip install git+https://github.com/NVIDIA/apex.git@master
insetup.py
or we need to make apex optional. I am in favour of the second option. This would mean moving the apex import inside the check foropt_level
being none, and adding the github install todev-requirements.txt
, so that we can at least run a test when it is present (we install dev-requirements when testing, but actually using allennlp must not be dependent on them). -
Checkpointing the loss scaling seems to require some additional changes. I don't want this to block the addition of apex to allennlp, but we should at least open an issue for it afterward if we can't properly restore a model without it.
# Required for automatic mixed precision (AMP) training | ||
git+https://github.com/NVIDIA/apex.git@master | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, as discussed, I have added apex as a dev requirement
|
||
@pytest.mark.skipif(not torch.cuda.is_available(), reason="No CUDA device registered.") | ||
@pytest.mark.skipif(importlib.util.find_spec("apex") is None, reason="Apex is not installed.") | ||
def test_trainer_can_run_amp(self): | ||
from apex import amp | ||
self.model.cuda() | ||
trainer = Trainer( | ||
self.model, self.optimizer, self.data_loader, num_epochs=2, cuda_device=0, opt_level="O1" | ||
) | ||
metrics = trainer.train() | ||
assert "peak_cpu_memory_MB" in metrics | ||
assert isinstance(metrics["peak_cpu_memory_MB"], float) | ||
assert metrics["peak_cpu_memory_MB"] > 0 | ||
assert "peak_gpu_0_memory_MB" in metrics | ||
assert isinstance(metrics["peak_gpu_0_memory_MB"], int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test to make sure the model runs with amp
. There are two skipif
checks, one to make sure a cuda device is available and one to check if apex
is importable (I wasn't sure if it made more sense to combine the checks in one decorator or have one check per decorator, went with the latter).
@DeNeutoy Thanks a lot for the guidance. That made it easy to add a test. Also added One thing: If I put Good catch on the checkpointing thing. I guess it is up to AllenNLP devs, I could open another PR after this is merged to get checkpointing with an amp-enabled model working OR try to add it to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine to leave the checkpointing, but open an issue/follow up PR - LGTM with the suggested change for managing the optional dependency above 👍
allennlp/training/trainer.py
Outdated
@@ -7,6 +7,7 @@ | |||
import traceback | |||
from typing import Dict, List, Optional, Tuple, Union, Any | |||
|
|||
from apex import amp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest this:
try:
from apex import amp
except ImportError:
amp = None
and then below, check that amp is not None
and raise a ConfigurationError
if it is and the user has passed an opt level. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
As mentioned in #3851, there are some things around in the codebase (e.g., some metrics) that IMHO wouldn't play well with FP16 and I have had problems in the past, such as having some |
allennlp/training/trainer.py
Outdated
if amp is not None: | ||
raise ConfigurationError( | ||
("Apex not installed but opt_level was provided. Please install NVIDIA's Apex to enable" | ||
" automatic mixed precision (AMP) training. See: https://github.com/NVIDIA/apex.") | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed! -- Whoops, caught a bug. It is now if amp is None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JohnGiorgi One last thing and then this is good to go, thanks a lot!
opt_level="O1", | ||
) | ||
metrics = trainer.train() | ||
assert "peak_cpu_memory_MB" in metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can delete these asserts, as they aren't related to this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -121,6 +122,26 @@ def test_passing_trainer_multiple_gpus_raises_error(self): | |||
self.model, self.optimizer, self.data_loader, num_epochs=2, cuda_device=[0, 1], | |||
) | |||
|
|||
@pytest.mark.skipif(not torch.cuda.is_available(), reason="No CUDA device registered.") | |||
@pytest.mark.skipif(importlib.util.find_spec("apex") is None, reason="Apex is not installed.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This importlib
thing is doing something funny to our CI, can you just change it to how we imported it in the trainer itself?
try:
import apex
accept:
apex = None
...
@pytest.mark.skipif(apex is None, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
I saw the GPU tests failed after this PR. Maybe there's something to review? Can the bulldozer run the GPU tests and report if they fail? |
Overview
This PR enables automatic mixed precision training with NVIDIA's Apex which has been much discussed in #2149 and is one of the action items on the roadmap.
The modifications were very straightforward. Here is the high-level:
opt_level
toTrainer
. This coincides with the opt levels of Apex.Trainer
, ifopt_level is not None
, wrapself.model
andself.optimizer
withamp.initialize
.rescale_gradients
needs to be modified slightly when using Apex. See here for more info._train_epoch
,if opt_level is not None
, callloss.backward()
in theamp.scale_loss
context manager.I got all of this straight from the Apex docs.
Benchmark
I tested it on my model and found a ~40% speedup in time per epoch during training. It also allowed me to train with a 40% larger mini-batch. The majority of my models parameters exist in a pre-trained transformer (
distilroberta-base
from Transformers).Simple benchmark on a 16GB-V100 using a random 1K documents from wikitext-103:
opt_level
"O0"
"O1"
"O1"
*Maximum batch size that fits in memory at this
opt_level
.Obviously, mileage will vary but this gives a flavour of the magnitude of speedup you can get with Apex.
Things I need help with
ImportError
if the user does not haveamp
installed. This is probably not what we want.rescale_gradients
. There is likely a cleaner way to do this.