Fix accelerate bug, document, add scripts #947
Conversation
Partly resolves #944

There is an issue with using skorch in a multi-GPU setting with accelerate. After some searching, it turns out there were two problems:

1. skorch did not call `accelerator.gather_for_metrics`, which resulted in `y_pred` not having the correct size. For more on this, consult the [accelerate docs](https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation).
2. accelerate has an issue with being deepcopied, which happens for instance when using `GridSearchCV`. The problem is that some references get messed up, resulting in the `GradientState` of the `Accelerator` instance and of the dataloader diverging. Therefore, the accelerator did not "know" when the last batch was encountered and was thus unable to remove the dummy samples added for multi-GPU inference.

The fix for 1. is provided in this PR. For 2., there is no solution in skorch, but a possible (maybe hacky) fix is suggested in the docs. The fix consists of writing a custom `Accelerator` class that overrides `__deepcopy__` to just return `self`. I don't know enough about accelerate internals to determine whether this is a safe solution or whether it can cause more issues down the line, but it resolves the issue.

Since reproducing this bug requires a multi-GPU setup and running the scripts with the accelerate launcher (`accelerate launch <script.py>`), it cannot be covered by normal unit tests. Instead, this PR adds two scripts to reproduce the issue. With the appropriate hardware, they can be used to check the solution.
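For context, a minimal sketch of what fix 1. boils down to, loosely following the accelerate docs linked above; the toy model and dataloader are purely illustrative and not the actual skorch internals:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# toy model and data, purely for illustration
model = nn.Linear(10, 2)
eval_loader = DataLoader(TensorDataset(torch.randn(50, 10)), batch_size=16)

accelerator = Accelerator()
model, eval_loader = accelerator.prepare(model, eval_loader)

model.eval()
all_preds = []
for (x,) in eval_loader:
    with torch.no_grad():
        y_pred = model(x)
    # gather predictions from all processes; gather_for_metrics also strips
    # the dummy samples accelerate adds to pad the last batch, so the final
    # y_pred has the correct size
    all_preds.append(accelerator.gather_for_metrics(y_pred))

y_pred = torch.cat(all_preds)
```

A script like this would be run with the accelerate launcher, e.g. `accelerate launch <script.py>`, on a machine with multiple GPUs.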
```python
def __deepcopy__(self, memo):
    return self
```
I'm trying to come up with a way this can break, but it seems okay. From my understanding, here is how I see the behavior:

- `GridSearchCV.fit` will `clone` the `NeuralNet`, and the PyTorch Module gets deep copied while the `Accelerator` is left in place.
- During `initialize`, the Accelerator will be reconfigured to point to the new Module (if there was a way to "reset the Accelerator", that would be nice too)
- Profit?

Maybe passing a stateful Accelerator object into `AccelerateMixin` was a bad design choice. For `NeuralNet`, I usually prefer passing in the Module class and not an instance. For `AccelerateMixin`, we can allow `accelerator__mixed_precision="fp16"`.
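To make the suggestion concrete, here is a rough sketch of how that could look from the user side, assuming `AccelerateMixin` from `skorch.hf`. Passing the `Accelerator` class plus dunder parameters is a hypothetical API proposed in this comment, not something `AccelerateMixin` supports today, and the module is just a toy example:

```python
from torch import nn
from accelerate import Accelerator
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin

class MyModule(nn.Module):
    # toy module, purely for illustration
    def __init__(self, hidden=10):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, X):
        return self.seq(X)

class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with accelerate support."""

# hypothetical API: pass the Accelerator *class* and configure it through
# dunder parameters, mirroring how module=MyModule, module__hidden=... works;
# currently, AccelerateMixin expects an already created Accelerator instance
net = AcceleratedNet(
    MyModule,
    accelerator=Accelerator,
    accelerator__mixed_precision="fp16",
    device=None,
)
```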
> From my understanding, here is how I see the behavior: [...]

Sorry, I don't quite understand: is this behavior without the `__deepcopy__` patch or with it?

> If there was a way to "reset the Accelerator" that would be nice too

AFAIK, there is no public way to do that (see here). The `Accelerator` was always intended to be run in a script that has a single training loop; re-using it is normally out of scope. The devs told me there should only be one instance, and it should be created as early as possible in the script.

> Maybe passing a stateful Accelerator object into `AccelerateMixin` was a bad design choice.

Yes, possibly the pattern you suggest should have been the default, although it would not solve the issue of there only ever being a single `Accelerator` instance. In general, though, I believe we could add the suggested pattern without breaking BC, allowing users to pass either instances or classes, like with `module`.
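For comparison, a minimal sketch of the two forms that `module` already accepts today, which is the pattern alluded to above; to my knowledge both work in skorch, but treat this as illustrative:

```python
from torch import nn
from skorch import NeuralNetClassifier

class MyModule(nn.Module):
    # toy module, purely for illustration
    def __init__(self, hidden=10):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, X):
        return self.seq(X)

# pass the class and configure it through dunder parameters ...
net_from_class = NeuralNetClassifier(MyModule, module__hidden=20)

# ... or pass an already instantiated module
net_from_instance = NeuralNetClassifier(MyModule(hidden=20))
```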
> Sorry, I don't quite understand, is this behavior without the deepcopy patch or with?

With the `__deepcopy__` patch.
Got it. I created an issue, #960, for tracking what we discussed about instantiating the `Accelerator`.
This PR documents a `deepcopy` workaround and fixes up distributed evaluation for Accelerate. If the `deepcopy` workaround has no issues, it can be included directly into `AccelerateMixin` later.

Unfortunately, I am not able to test multi-GPU right now, but the code looks okay.

LGTM
Preparation for release of version 0.13.0

Release text:

The new skorch release is here and it has some changes that will be exciting for some users.

- First of all, you may have heard of the [PyTorch 2.0 release](https://pytorch.org/get-started/pytorch-2.0/), which includes the option to compile the PyTorch module for better runtime performance. This skorch release allows you to pass `compile=True` when initializing the net to enable compilation.
- Support for training on multiple GPUs with the help of the [`accelerate`](https://huggingface.co/docs/accelerate/index) package has been improved by fixing some bugs and providing a dedicated [history class](https://skorch.readthedocs.io/en/latest/user/history.html#distributed-history). Our documentation contains more information on [what to consider when training on multiple GPUs](https://skorch.readthedocs.io/en/latest/user/huggingface.html#caution-when-using-a-multi-gpu-setup).
- If you have ever been frustrated with your neural net not training properly, you know how hard it can be to discover the underlying issue. Using the new [`SkorchDoctor`](https://skorch.readthedocs.io/en/latest/helper.html#skorch.helper.SkorchDoctor) class will simplify the diagnosis of underlying issues. Take a look at the accompanying [notebook](https://nbviewer.org/github/skorch-dev/skorch/blob/master/notebooks/Skorch_Doctor.ipynb).

Apart from that, a few bugs have been fixed and the included notebooks have been updated to properly install requirements on Google Colab.

We are grateful for external contributors, many thanks to:

- Kshiteej K (kshitij12345)
- Muhammad Abdullah (abdulasiraj)
- Royi (RoyiAvital)
- Sawradip Saha (sawradip)
- y10ab1 (y10ab1)

Find below the list of all changes since v0.12.1:

### Added

- Add support for compiled PyTorch modules using the `torch.compile` function, introduced in the [PyTorch 2.0 release](https://pytorch.org/get-started/pytorch-2.0/), which can greatly improve performance on new GPU architectures; to use it, initialize your net with the `compile=True` argument, further compilation arguments can be specified using the dunder notation, e.g. `compile__dynamic=True`
- Add a class [`DistributedHistory`](https://skorch.readthedocs.io/en/latest/history.html#skorch.history.DistributedHistory) which should be used when training in a multi-GPU setting (#955)
- `SkorchDoctor`: A helper class that assists in understanding and debugging the neural net training, see [this notebook](https://nbviewer.org/github/skorch-dev/skorch/blob/master/notebooks/Skorch_Doctor.ipynb) (#912)
- When using `AccelerateMixin`, it is now possible to prevent unwrapping of the modules by setting `unwrap_after_train=True` (#963)

### Fixed

- Fixed install command to work with recent changes in Google Colab (#928)
- Fixed a couple of bugs related to using non-default modules and criteria (#927)
- Fixed a bug when using `AccelerateMixin` in a multi-GPU setup (#947)
- `_get_param_names` returns a list instead of a generator so that subsequent error messages return useful information instead of a generator `repr` string (#925)
- Fixed a bug that caused modules to not be sufficiently unwrapped at the end of training when using `AccelerateMixin`, which could prevent them from being pickleable (#963)
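As a quick illustration of the new `compile` option mentioned above (a minimal sketch with a toy module; the argument values are just examples):

```python
from torch import nn
from skorch import NeuralNetClassifier

class MyModule(nn.Module):
    # toy module, purely for illustration
    def __init__(self, hidden=10):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, X):
        return self.seq(X)

# compile=True wraps the module with torch.compile (requires PyTorch >= 2.0);
# further compilation arguments are passed with the dunder notation
net = NeuralNetClassifier(
    MyModule,
    compile=True,
    compile__dynamic=True,
)
```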