RuntimeError and speed loss with opt_level = O1, O2 or O3 #373
Comments
Hi @adrienchaton. The issue in your first point seems to point to a CPU operation. Could you give us some information on the shapes you are using for these operations? A code snippet to debug and profile your use case would be helpful. :)
Hi @ptrblck

As described in the PyTorch discussion, there is some slightly odd behavior when calling torch.hann_window. I was already setting torch.backends.cudnn.benchmark=True in my code before getting to know apex.amp, but I did not know about the specific shape considerations explained by @mcarilli. I modified the architecture parameters to get closer to the GEMM specifications: all conv1d channel counts are now multiples of 8, except the single input channel of the first convolution and the single output channel of the last convolution (fixed by the single-channel signals I process), and every linear input/output size is now a multiple of 8. In the speed comparison I ran, the model's layers are all conv1d and linear, plus some non-linear activations and 1d batch norms.

Making the batch size a multiple of 8 is more complicated for my use case. The model trains on signals of variable length, which are therefore sliced into a variable number of sub-elements. I shuffle the training samples as signals, not as slices of signals, then I build the mini-batch from the slices of the signals selected for that mini-batch. This means the mini-batches have variable size, which might not be a multiple of 8. For the same reason I do not use the PyTorch DataLoader, because I cannot shape my training/test sets into a single set of tensors, so I do not use workers and pin_memory when iterating over the mini-batches.

However, I ran a series of 10 epochs of the same training with different opt_levels, which I call APEX_O 0/1/2/3 (O3 with keep_batchnorm_fp32=True), and get the following:

APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
APEX_O1 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
APEX_O2 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
APEX_O3 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207

According to this, and with the updated parameters, both mixed-precision modes still seem slower on average, and pure FP16 does not optimize stably, though that was probably expected. Here is some pseudo-code of what happens during training:
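Roughly this (slice_signal, train_notes, model, optimizer and the hyper-parameters stand in for my actual code; the losses are simplified to a single MSE term):

```python
from apex import amp
import torch

# sketch of the training loop; everything named here is a placeholder for the real code
for epoch in range(n_epochs):
    order = torch.randperm(len(train_notes))                 # shuffle signals, not slices
    for start in range(0, len(order), signals_per_batch):
        batch_ids = order[start:start + signals_per_batch].tolist()
        # gather the slices of every selected signal; the number of slices varies per signal,
        # so the mini-batch size is variable and not necessarily a multiple of 8
        mb_slices = torch.cat([slice_signal(train_notes[note_id][0]) for note_id in batch_ids]).to(device)
        mb_gen = model(mb_slices)                             # conv1d / linear reconstruction
        loss = torch.nn.functional.mse_loss(mb_gen, mb_slices)   # e.g. the waveform MSE term
        optimizer.zero_grad()
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
```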
A main complication is that the signals train_notes[note_id][0] have variable lengths, so I cannot fit them into PyTorch's DataLoader ... maybe there are some tricks for that which I don't know? Or some recommended ways to handle training elements of variable size efficiently? Thanks for your time and reading!

PS: afterwards I also need to be able to use mb_slices and mb_gen, together with the variable number of slices per signal, to reassemble the individual input signals (the train_notes[note_id][0]) from mb_slices along with their corresponding reconstructions from mb_gen (something like the torch.split sketch below could work).
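If I keep the list of slice counts per signal while building the mini-batch, mapping the flat batch back to the individual signals should be straightforward (the counts below are just an example):

```python
# hypothetical per-signal slice counts recorded while building the mini-batch
slices_per_signal = [3, 5, 2]
inputs_per_signal = torch.split(mb_slices, slices_per_signal, dim=0)   # original slices per signal
recons_per_signal = torch.split(mb_gen, slices_per_signal, dim=0)      # reconstructions per signal
```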
In addition to that, I have some more warnings/errors which could maybe be causing the amp optimization to not work properly? After the optimization is set up, I get the following warning even though I ran the install as recommended. The install completes, but with an intermediate error (that I had not spotted before; I tried re-installing).
You are missing ./ at the end of the install command.
@ngimel thank you for pointing this out; the first time I read it I only saw . and not ./ so I didn't get the meaning. I tried the install command with ./ at the end; it goes to the end but also raises an error:

ERROR: Command "/fast-2/adrien/virtualenv_p3/bin/python3 -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-req-build-rzct3jf4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"

Running the code with apex.amp again still gives the warning. I hardly understand the install error; do you have any idea about it, please?
Can you uninstall first before reinstalling, just to be sure?
In general it's hard to predict what exactly your network's bottleneck will be. Some potentially useful points:
@mcarilli Thank you for your insights! When I pip installed apex into my virtualenv, it did not set up the environment path to it, so I first tried importing apex by adding the clone to sys.path (roughly the snippet at the end of this comment), which means the pip uninstall apex command does not find apex installed. I deleted the installation directory manually and tried re-cloning/installing, but got the same error during installation; it continues and finishes, but seems restricted to the Python-only build. (The servers I am using run CUDA 9.1.85 and cuDNN 7.1.3.)

My next step for tomorrow was to cProfile the FP32 code to see what I could improve, probably a lot, as the data handling is a bit off the beaten track of the usual, efficient PyTorch dataloaders. I will do that, but if you have some ideas on how to fix the apex installation, I would be very interested to test apex.amp with the C++ backend; with a bit of prior code improvement it could all-in-all get much faster! And ultimately, if the C++ backend installs, would it make sense to follow your apex profiling guide to try to gain even more speed from the mixed-precision or FP16 version of the code?

Thanks also for pointing out that torch.backends.cudnn.benchmark=True is not always the right choice. I thought it was always beneficial in the long run (that over a 40-60h training the benchmarking always pays off), but with this variable mini-batch size maybe it is not? I am unsure of what cudnn sees as new size parameters; the convolutions are unchanged throughout training, same channels in/out, same kernels, etc.
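The import workaround looked roughly like this (the clone path is a placeholder):

```python
import sys
sys.path.append("/path/to/cloned/apex")   # pointing at the git clone, not at site-packages
import apex
from apex import amp
```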
Input (data) sizes do count as new size parameters from the perspective of convolutions, so I expect your case will benefit from disabling cudnn benchmarking (torch.backends.cudnn.benchmark=False). You should not simply add the cloned repo to sys.path and import it from there; Apex must be imported from wherever it is installed on your system. You can check how your script attempted to import Apex by including the following lines in your script:
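```python
import apex
print(apex.__file__)   # shows which apex installation was actually imported
```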
The print statement should show a path to one of your environment's Python library directories, NOT the path to the cloned repo.
@mcarilli Thank you for the details about torch.backends.cudnn.benchmark, I will try disabling it in this case.

Regarding the installation: running an additional python setup.py install after the pip install creates apex in python3.5/site-packages as apex-0.1-py3.5.egg; then I can import apex without pointing to the cloned repo but to site-packages. However, with this import from the Python environment I still get the warning Warning: multi_tensor_applier fused unscale kernel is unavailable ....

Now I can pip uninstall apex. I tried cloning/reinstalling everything, but it did not fix the import with the C++ backend, which still seems missing according to the warning. And if I try python setup.py install --cpp_ext --cuda_ext then I get the following error:

torch.version = 1.1.0
Compiling cuda extensions with
Traceback (most recent call last):

This may be the reason the backend doesn't work, right? Should I reinstall PyTorch first? Or try updating cuDNN?
And I tried running python setup.py install --cpp_ext --cuda_ext with the CUDA version check in setup.py commented out. It seems to allow the installation with only one warning, but still, when initializing amp I get the warning that the C++ backend installation is not working. I guess I need to build PyTorch from source against the matching CUDA, so let's go!
I am having issues with installing from source; could anyone help, please? @ptrblck @ngimel @mcarilli

I already installed PyTorch from source on my local OSX laptop without issues and it runs great with a Thunderbolt eGPU, NVIDIA web drivers, CUDA and cuDNN. Our servers are Linux and we do not use conda. Then I clone and build, but in every case I end up with the following error:
I tried to first export CUDAHOSTCXX='/usr/bin/gcc-5', but the install or build still gives the same error. Is there no other way than updating CUDA to 9.2? Thanks!
I don't think you need to build PyTorch from source on your servers. I think it will probably be OK to use the existing PyTorch installation. When you installed after commenting out those lines of the Apex setup.py, I think it's possible you simply had a conflicting Python-only install hanging around somewhere.
Running in a Docker container can also help avoid environment issues, since we are able to explicitly test the install in containers.
Sorry, in between the posts I forgot to write the flags, but it was python setup.py install with --cuda_ext --cpp_ext already. I pip uninstalled apex fully, I also reinstalled PyTorch with pip (after uninstalling pytorch and torchvision), then I cloned apex again, commented out lines 49-55 (the torch CUDA vs bare-metal check) and ran the install inside the clone directory. Tomorrow afternoon we stop one server and update CUDA before trying the commented-out setup again; the error report is quite huge now. I will let you know how it goes tomorrow; we want to put one test server on CUDA 10 and cuDNN >= 7.3 and try mixed precision on it.
Before we update this machine, I tried running the install from scratch again, making sure everything about apex was pip uninstalled first, directories deleted and cloned again, and lines 49-55 of setup.py commented out. Running python setup.py install --cpp_ext --cuda_ext gives the following errors and does not complete the installation. Running pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ with the same lines commented out also gives the following errors but finishes the installation; however, when running apex after this installation it still gives the warning that the backend is not available. I don't know if it is of any help, but just in case, here are the error reports.
Did you have a chance to try installing apex in a Docker container as suggested by @mcarilli? The first log file seems to point to a GCC version error, while the second log file doesn't seem to indicate any errors.
I already had a GCC error when trying to build PyTorch from source on the server (to have matching CUDA versions), which said I should either upgrade to CUDA >= 9.2 or use gcc-5 rather than 6 (but the server doesn't have gcc-5 and the IT guys said we would rather upgrade CUDA). I looked into Docker; I have never used it and have not opened an NGC account yet. I do not already have any Docker container, so option 2 is not for me? Or is it rather what you would recommend? And about option 1, the Dockerfile installs the latest Apex on top of an existing image.
Image refers to the base container, which would be the PyTorch base image in this case. The container will isolate the pytorch and apex installation, and provide a clean and fresh Ubuntu inside of it. You don't have to run the Docker container inside a virtual environment.
Thank you @ptrblck for the explanations. So if I make that Docker container, then inside it I will install my own drivers and also the libraries I need (for instance, the ones I pip installed inside my virtualenv), as if I had a new machine? That could be an optimal solution for trying out apex without needing to change the current servers' main systems. I am proposing this to the IT service, we will see if that's possible, thanks!
You would have to install the drivers on your bare metal (not inside the container). If you follow the Docker install guide from @mcarilli, your container will already come with a working PyTorch and Apex built on top of it.
You might not need to reinstall Cuda, or the bare-metal driver. It might just be a matter of picking the right container. For example, if you have cuda 9.2 installed on your servers, you can use a pytorch 1.0, cuda 9.2 devel container from https://hub.docker.com/r/pytorch/pytorch/tags.
Finally it's importing correctly! When I try to run with the default settings of opt_level O1 or O2, very early in the first epoch I get the following:

File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/scaler.py", line 193, in update_scale

I guess it needs some practice to use it well. I also looked for this RuntimeError in the issues. In this case the code runs on a single GPU (Titan V) and is optimized with Adam. With opt_level O0 there are no issues. About this I am considering something: what about running a few initial epochs in pure FP32 and, once the model is better initialized, continuing the training with O1 or O2 to train faster (sketched below)? Basically, the initial loss gradients of my model are quite large and variable (init dependent), but after a couple of epochs they get smoother and are probably less prone to extra instabilities from the mixed precision. Anything else I could try or check, please?
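What I have in mind is roughly this (warmup_epochs and train_one_epoch are placeholders, and I am assuming amp can be initialized after some plain FP32 epochs):

```python
# warm start: plain FP32 for a few epochs, then switch to mixed precision
for epoch in range(warmup_epochs):
    train_one_epoch(model, optimizer, use_amp=False)    # regular loss.backward()

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
for epoch in range(warmup_epochs, n_epochs):
    train_one_epoch(model, optimizer, use_amp=True)     # amp.scale_loss(...) around backward()
```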
How early in the first epoch does this occur? Immediately (on the first iteration) or after several dozen iterations? Immediately would imply some functionality bug. After several dozen iterations would imply some numerical issue. |
At the first iteration, at the scaled_loss.backward().

Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:

Then there is no warning, but maybe still some bugs? What could I check, please? It does the same with the default opt_level O2.
And if I run the same with opt_level O0 and O3 (keep_batchnorm_fp32=True), it ends up as:

*** APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
*** APEX_O3 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207

So O3 runs and produces some NaNs, but that is more or less expected.
For the illegal memory access issue, can you please run with the environment variable CUDA_LAUNCH_BLOCKING set to 1 (CUDA_LAUNCH_BLOCKING=1 python my_script.py ...)? This will probably give a more informative error message.
I re-used https://github.com/pytorch/examples/mnist; this one runs with every opt_level (for O3 I just use the defaults, and anyway the example doesn't use batchnorm). I ran time python fp16_main.py >/dev/null for each opt_level:

O0 ends with
O1 ends with
O2 ends with
O3 ends with

System-wise it only seems to get slower, but at least it runs without the CUDA error. Below I give you the report with CUDA_LAUNCH_BLOCKING=1; thanks for the advice and help!

start iteration 0
About numerical instabilities vs. a bug: I am not sure, since the PyTorch MNIST example runs with every opt_level. But here I tried to run my code with 10 epochs in FP32 and then initializing amp, and the backward on the scaled_loss directly gives the CUDA error.

Here it optimizes correctly (gradient descent) in FP32:

*** APEX_O1 - EPOCH #9 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 184
*** APEX_O1 - EPOCH #10 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
*** APEX_O1 - EPOCH #11 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 230

Here, at the start of epoch #11, I switch to mixed precision and get the error from the first iteration of epoch #11:

Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
It's a bit hard to understand what's wrong or not. On the same server, slots 0 and 1 are both equipped with a Titan V. Slot 0 in the first iteration says ... (same virtual env on the same machine). So, using only slot 0 and running more iterations per epoch, after 10 epochs the timing ends up as:

*** APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 729
*** APEX_O1 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 729

Tomorrow, when the IT service is open again, I will ask for nvprof.
Hello,
I discovered your apex tools for integrating mixed-precision and FP16 training in PyTorch, which is a great idea to develop! Our servers are mainly equipped with TITAN V cards, hence I was really looking forward to trying them out at their fastest. Software versions are pytorch 1.1.0, cuda 9.1.85 and cudnn 7.1.3. As I had never tried this before, I used the more straightforward apex.amp to compare FP32 training with opt_level = O1, O2 or O3 (the last with keep_batchnorm_fp32=True, as my models use batchnorm).
It is a rather large codebase, so I report here my main questions/issues, but if needed I can provide more details and try to put together some reproducible cases.
#1 The training script I wanted to run with amp uses torch.stft for computing spectral losses.
If I compute these losses, I get
RuntimeError: arange_out not supported on CPUType for Half
which points to the stft operation. Is it correct that my script should not use spectral operations such as torch.stft if it is to be optimized in mixed precision? Or is there a fix/workaround for that, please?
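One workaround I could imagine (a sketch only, assuming the spectral losses can stay in FP32; the function and parameter names are placeholders) is to cast the waveform back to float and build the window explicitly on the GPU before calling torch.stft; maybe amp.register_float_function(torch, 'stft') before amp.initialize would be another option?

```python
import torch

def spectral_frames(x, n_fft=1024, hop=256):
    # x may be FP16 under O1/O2; hann_window and stft are kept in FP32 here
    window = torch.hann_window(n_fft, device=x.device, dtype=torch.float32)
    return torch.stft(x.float(), n_fft, hop_length=hop, window=window)
```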
#2 I tried to run the comparison optimizing only time-domain losses (e.g. waveform MSE instead of spectral reconstruction), so that the code runs without error for every opt_level, but then opt_level = O1, O2 and O3 were all slower than opt_level = O0 (or than running my original FP32 training). Obviously I expected the speed gain to depend on the code, the operations involved, the batch sizes, etc., but I did not expect a slowdown.
For this I only used amp.initialize and amp.scale_loss (as recommended in the first example at https://nvidia.github.io/apex/amp.html). I train generative models composed mainly of conv1d, batchnorm1d and linear layers. Everything is feed-forward, no softmax or classification. What could I check to understand whether I can hope for a speed gain in my application, please?
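Concretely, my training step follows that example, roughly like this (model, optimizer, criterion and the mini-batch iteration are my own code):

```python
from apex import amp

# model and optimizer are built as usual (conv1d / batchnorm1d / linear model, Adam)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for mb in minibatches:                  # variable-size mini-batches of signal slices
    optimizer.zero_grad()
    loss = criterion(model(mb), mb)     # e.g. waveform MSE
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```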
Good luck developing the mixed-precision training, it has a lot of potential if made more integrated into existing tools!