Fixed colab dependency versions #76

Merged
merged 1 commit into from
May 25, 2020

AlphaGit
Contributor

@AlphaGit AlphaGit commented May 13, 2020

Fixes #44.

Detailed changes

  • Minor updates on instructions to run the colab file
  • Made the colab file install torch 1.4.0+cu100, torchvision 0.5.0+cu100 and scipy 1.1.0 (thanks @CyFeng16)
  • Made the colab file clone the code from the official repo (this can now be done since the main changes are merged in it)
  • Removed warning messages when running interpolation (thanks @CyFeng16)

@betegon

betegon commented May 13, 2020

Hi @AlphaGit
I have used your Colab notebook with a Tesla P100-PCIE-16GB GPU (driver 418.67, 16280 MiB), which is supposed to work with 720p, but it doesn't.

The error is obviously an out of memory error:

RuntimeError: CUDA out of memory. Tried to allocate 1.09 GiB (GPU 0; 15.90 GiB total capacity; 13.56 GiB already allocated; 465.75 MiB free; 14.74 GiB reserved in total by PyTorch)

Is there a way to make it run on that hardware? Is it possible to change any parameter of the network to make it fit? Can it run on the CPU (even if it takes a long time)?

Also, the same question applies on how to run it for 1080p.

Thanks a lot for your work and PR,

Kind regards.

@AlphaGit
Contributor Author

Hey there @betegon!

No, unfortunately I’m not aware of a good way to make that happen. I know that some pieces of software based on DAIN (like GRisk/DainApp) just split each frame into smaller tiles and run a separate interpolation on each tile.

That works, but it mainly circumvents the problem. I guess you could reproduce the same approach with ffmpeg commands.

Aside from that, I would look into performing some network pruning on DAIN’s stored models, but that would be a sizeable effort on its own. If you actually did this, you’d not only solve the memory problem, but it’d also run significantly faster.
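
As a rough illustration of the tiling idea, the sketch below computes overlapping tile rectangles for a frame and turns each one into an ffmpeg `crop` filter string (`crop=w:h:x:y`). The 2x2 grid, the 16-pixel overlap and the helper names `tile_boxes` and `ffmpeg_crop_filter` are assumptions made for this example; they are not taken from DainApp or from this repository.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def tile_boxes(width: int, height: int, cols: int, rows: int,
               overlap: int = 0) -> List[Box]:
    """Split a width x height frame into cols x rows tiles.

    Each tile is grown by `overlap` pixels on every side (clamped to the
    frame) so that seams can be cropped away when the tiles are stitched.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Core tile edges; integer division keeps the grid gap-free.
            x0, x1 = width * c // cols, width * (c + 1) // cols
            y0, y1 = height * r // rows, height * (r + 1) // rows
            # Grow the tile by the overlap without leaving the frame.
            x0, y0 = max(0, x0 - overlap), max(0, y0 - overlap)
            x1, y1 = min(width, x1 + overlap), min(height, y1 + overlap)
            boxes.append((x0, y0, x1 - x0, y1 - y0))
    return boxes


def ffmpeg_crop_filter(box: Box) -> str:
    """Format a box as an ffmpeg crop filter: crop=out_w:out_h:x:y."""
    x, y, w, h = box
    return f"crop={w}:{h}:{x}:{y}"


# Hypothetical example: a 720p frame split into four overlapping tiles.
boxes = tile_boxes(1280, 720, cols=2, rows=2, overlap=16)
filters = [ffmpeg_crop_filter(b) for b in boxes]
for f in filters:
    print(f)
```

Each filter string could be passed to ffmpeg with `-vf` to extract one tile per frame. The tiles would then be interpolated independently and stitched back together, discarding the overlap. Whether a given tile size fits in memory still has to be found by experiment.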

@lbourdois

Hi @AlphaGit

I have used your Colab notebook with a Tesla K80 GPU (driver 418.67, 11441 MiB) on a 480p video (https://www.youtube.com/watch?v=gWemAUjHo4U).

Everything works correctly up to this cell:

Interpolation

%shell mkdir -p '{FRAME_OUTPUT_DIR}'
%cd /content/DAIN

!python -W ignore colab_interpolate.py --netName DAIN_slowmotion --time_step {fps/TARGET_FPS} --start_frame 1 --end_frame {pngs_generated_count} --frame_input_dir '{FRAME_INPUT_DIR}' --frame_output_dir '{FRAME_OUTPUT_DIR}'

where I get the following error:

/content/DAIN
revise the unique id to a random numer 58004
Namespace(SAVED_MODEL=None, alpha=[0.0, 1.0], arg='./model_weights/58004-Sun-May-24-10:21/args.txt', batch_size=1, channels=3, ctx_lr_coe=1.0, datasetName='Vimeo_90K_interp', datasetPath='', dataset_split=97, debug=False, depth_lr_coe=0.001, dtype=<class 'torch.cuda.FloatTensor'>, end_frame=2703, epsilon=1e-06, factor=0.2, filter_lr_coe=1.0, filter_size=4, flow_lr_coe=0.01, force=False, frame_input_dir='/content/DAIN/input_frames', frame_output_dir='/content/DAIN/output_frames', log='./model_weights/58004-Sun-May-24-10:21/log.txt', lr=0.002, netName='DAIN_slowmotion', no_date=False, numEpoch=100, occ_lr_coe=1.0, patience=5, rectify_lr=0.001, save_path='./model_weights/58004-Sun-May-24-10:21', save_which=1, seed=1, start_frame=1, time_step=0.4166666666666667, uid=None, use_cuda=True, use_cudnn=1, weight_decay=0, workers=8)
cudnn is used
Interpolate 1 frames
error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
Warning: Legacy autograd function with non-static forward method is deprecated and will be removed in 1.3. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function) (THPFunction_do_forward at /pytorch/torch/csrc/autograd/python_function.cpp:622)
Traceback (most recent call last):
File "colab_interpolate.py", line 112, in <module>
y_s, offset, filter = model(torch.stack((X0, X1),dim = 0))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/networks/DAIN_slowmotion.py", line 148, in forward
self.forward_flownets(self.flownets, cur_offset_input, time_offsets=time_offsets),
File "/content/DAIN/networks/DAIN_slowmotion.py", line 212, in forward_flownets
temp = model(input) # this is a single direction motion results, but not a bidirectional one
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/PWCNet/PWCNet.py", line 221, in forward
corr6 = self.corr(c16, c26)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 59, in forward
result = CorrelationFunction(self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)(input1, input2)
File "/content/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 27, in forward
self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)
RuntimeError: CUDA call failed (correlation_forward_cuda at correlation_cuda.cc:80)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f32e5c61193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: correlation_forward_cuda(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, int, int, int, int, int) + 0x628 (0x7f32e219eb38 in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x1bd4a (0x7f32e21aed4a in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x18890 (0x7f32e21ab890 in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: python3() [0x50a635]

frame #7: python3() [0x594931]
frame #9: THPFunction_do_forward(THPFunction*, _object*) + 0x4ac (0x7f332ec72d4c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #11: python3() [0x54a941]
frame #13: python3() [0x50a5c3]
frame #16: python3() [0x594931]
frame #19: python3() [0x507d64]
frame #21: python3() [0x594931]
frame #22: python3() [0x54a941]
frame #24: python3() [0x50a5c3]
frame #26: python3() [0x507d64]
frame #28: python3() [0x594931]
frame #31: python3() [0x507d64]
frame #33: python3() [0x594931]
frame #34: python3() [0x54a941]
frame #36: python3() [0x50a5c3]
frame #38: python3() [0x507d64]
frame #39: python3() [0x509a90]
frame #40: python3() [0x50a48d]
frame #42: python3() [0x507d64]
frame #44: python3() [0x594931]
frame #47: python3() [0x507d64]
frame #49: python3() [0x594931]
frame #50: python3() [0x54a941]
frame #52: python3() [0x50a5c3]
frame #54: python3() [0x507d64]
frame #56: python3() [0x634c82]
frame #61: __libc_start_main + 0xe7 (0x7f3339ea6b97 in /lib/x86_64-linux-gnu/libc.so.6)

Do you think this means that there is something to fix, or does it come from the GPU (the Colab notebook says that it has not been tested on the K80)?

Thanks a lot for your work
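
For readers puzzling over the `time_step=0.4166666666666667` value in the log above, here is a minimal sketch of how a time step could translate into the number of frames synthesized between two input frames. It assumes the script places intermediate frames at `k * time_step` for `k = 1 .. int(1 / time_step) - 1`, which is consistent with the "Interpolate 1 frames" line in the log but is not confirmed here. The 25 fps source and 60 fps target are also assumptions; they merely reproduce the logged ratio.

```python
def intermediate_offsets(time_step: float) -> list:
    """Time offsets in (0, 1) at which frames are synthesized between
    two consecutive input frames (assumed scheme, see lead-in)."""
    return [k * time_step for k in range(1, int(1.0 / time_step))]


# Assumed values that reproduce the logged time_step of 0.41666...
source_fps = 25.0
target_fps = 60.0
time_step = source_fps / target_fps

offsets = intermediate_offsets(time_step)
print(f"time_step={time_step:.4f} -> {len(offsets)} intermediate frame(s)")
```

Under this assumption a non-integer ratio such as 60/25 still yields only one intermediate frame per input pair, so the number of generated frames does not scale exactly with the requested frame rate.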

@AlphaGit
Contributor Author

Hi @lbourdois! I have been investigating a bit and I suspect that the error you mention might not be related to the changes in this PR.

Unfortunately, I cannot be entirely sure because Google Colab won't give me a K80, so I cannot verify my hypothesis. However, I'll tell you how you can do it and you can tell me if this approach worked. If it works, feel free to submit a new PR with the changes. If it doesn't, I think you could open a new issue and we can investigate further.

The error seems to be related to the NVIDIA drivers and the actual hardware that is running the compiled CUDA code. According to this thread, this issue happens on a K80 when the CUDA code was not compiled for that GPU's architecture.

NVIDIA/flownet2-pytorch#86 (comment)

If that's the case, you can find the exact compute capability for K80s on this page (it's 3.7), so you would need to modify the setup.py files in the repository to include:

    '-gencode', 'arch=compute_37,code=sm_37',

This is how I would do it in your case, so you don't have to deal with cloning the repo and modifying the Colab notebook:

  1. Run the notebook until you can reproduce the same error you got with a Tesla K80.
  2. Restart the runtime (but don't disconnect).
  3. Run the notebook again and make sure the nvidia-smi command reports that you are still using the Tesla K80.
  4. Keep running until the repository is cloned, but don't run the code that compiles the CUDA modules (you will need to split the cell in two).
  5. After the repository is cloned, browse the files in Google Colab and look for these files:
    • my_package/DepthFlowProjection/setup.py
    • my_package/FilterInterpolation/setup.py
    • my_package/FlowProjection/setup.py
    • my_package/InterpolationCh/setup.py
    • my_package/MinDepthFlowProjection/setup.py
    • my_package/SeparableConv/setup.py
    • my_package/SeparableConvFlow/setup.py
    • PWCNet/correlation_package_pytorch1_0/setup.py
  6. Ensure that all of them have the following line in nvcc_args:
    '-gencode', 'arch=compute_37,code=sm_37',
  7. Keep running the notebook.

If it works, that was it and you have the solution. If it doesn't, then this isn't it and we'll need further investigation.
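
Editing eight files by hand is error-prone, so the manual step could be scripted along these lines. This is only a sketch: it assumes each setup.py contains a line of the form `nvcc_args = [` on its own, and the helper names `gencode_line` and `add_gencode` are invented for this example. It works on the file's text, so it can be tried on a string before touching any real file.

```python
import re


def gencode_line(major: int, minor: int) -> str:
    """Build the nvcc_args entry for one compute capability, e.g. 3.7."""
    cc = f"{major}{minor}"
    return f"'-gencode', 'arch=compute_{cc},code=sm_{cc}',"


def add_gencode(setup_source: str, major: int, minor: int) -> str:
    """Insert a -gencode entry right after `nvcc_args = [`.

    Returns the text unchanged if an active (uncommented) entry for the
    same architecture is already present.
    """
    line = gencode_line(major, minor)
    wanted = line.rstrip(",")
    for existing in setup_source.splitlines():
        if existing.strip().rstrip(",") == wanted:
            return setup_source
    match = re.search(r"^[ \t]*nvcc_args[ \t]*=[ \t]*\[[ \t]*$",
                      setup_source, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no `nvcc_args = [` line found")
    pos = match.end()
    return setup_source[:pos] + "\n    " + line + setup_source[pos:]


# Hypothetical fragment standing in for one of the setup.py files.
sample = """nvcc_args = [
    '-gencode', 'arch=compute_60,code=sm_60',
    '-gencode', 'arch=compute_61,code=sm_61'
]
"""
patched = add_gencode(sample, 3, 7)
print(patched)
```

To apply it for real, one would read each listed setup.py, pass its contents through `add_gencode`, and write the result back before running the compilation cell.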

@lbourdois

@AlphaGit
Between my previous message and your answer, I went back to the notebook and got a P100. The code worked perfectly! Thanks for your work on this Colab :)

For the problem with the K80, I kept restarting the Colab for an hour to follow your instructions, but I didn't get a K80 again.
I'll try again next weekend. Maybe a comment should be added to the Colab saying that there might be problems with the K80 and pointing to your answer for more details.

@baowenbo baowenbo merged commit d69e455 into baowenbo:master May 25, 2020
@alphayome

@AlphaGit hi, you speak Spanish, right? Nice to meet you, and thank you very much for your contributions.
A question: could you please share the link to the latest error-free tutorial for DAIN on Google Colab? I have seen several (some on the front page, others here, etc.) and I no longer know which one is the latest. I have gone through some of the DAIN steps, but it only exports 30 fps even though I set it to 60.

Thank you very much for your help. Best regards.

@AlphaGit
Contributor Author

AlphaGit commented May 31, 2020

@alphayome Hi! Yes, I speak Spanish. Happy to help. :)

Instead of using a link, I recommend downloading the .ipynb file and uploading it to Google Colab. The problem with links is that they go stale very easily when more than one person is working on the file, since we are not all working on the same copy.

What you say makes me think that we should put some kind of version number in that file. I will probably do that alongside some other change.

@alphayome

@AlphaGit thanks for the reply.
Right, I meant the .ipynb file. The thing is that I have seen several versions. Where can I find the latest version?

@AlphaGit
Contributor Author

@alphayome On the master branch of this repository; that should always be the authoritative version.

https://github.com/baowenbo/DAIN/blob/master/Colab_DAIN.ipynb

@AlphaGit AlphaGit deleted the fixed-colab-dependency-versions branch May 31, 2020 23:01
@AlphaGit
Contributor Author

@AlphaGit The short answer is that nobody knows, haha. There is no clear way to determine what Google is going to give you.

I have no problem lending you a hand with whatever you need, but I would prefer not to do it in this repository unless it is a problem with the code. That is to avoid generating noise for the original owners. I am afraid that too many updates would push them to stop paying attention here, which would be unfortunate for everyone.

You can contact me privately at alphagma@gmail.com; I will gladly give you a hand with whatever project you are working on. Cheers!

@alphayome

Understood, @AlphaGit, I will write to you on Gmail.
Thank you very much for offering to help, and apologies to everyone else.

Regarding this thread, I ran the CUDA step and

Streaming output truncated to the last 5000 lines.

@alphayome

@AlphaGit I was looking at the solution you gave lbourdois about modifying the setup files in

my_package/DepthFlowProjection/setup.py
my_package/FilterInterpolation/setup.py
my_package/FlowProjection/setup.py
my_package/InterpolationCh/setup.py
my_package/MinDepthFlowProjection/setup.py
my_package/SeparableConv/setup.py
my_package/SeparableConvFlow/setup.py
PWCNet/correlation_package_pytorch1_0/setup.py
Ensure that all of them have the following line in nvcc_args

Do I have to change all the lines so that they are the same, as in what I attach below?

```python
#!/usr/bin/env python3
import os
import torch

from setuptools import setup, find_packages
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

cxx_args = ['-std=c++11']

nvcc_args = [
    '-gencode', 'arch=compute_37,code=sm_37',
    '-gencode', 'arch=compute_37,code=sm_37',
    '-gencode', 'arch=compute_37,code=sm_37',
    '-gencode', 'arch=compute_37,code=sm_37'
    # '-gencode', 'arch=compute_37,code=sm_37',
    # '-gencode', 'arch=compute_37,code=compute_37'
]

setup(
    name='depthflowprojection_cuda',
    ext_modules=[
        CUDAExtension('depthflowprojection_cuda', [
            'depthflowprojection_cuda.cc',
            'depthflowprojection_cuda_kernel.cu'
        ], extra_compile_args={'cxx': cxx_args, 'nvcc': nvcc_args})
    ],
    cmdclass={
        'build_ext': BuildExtension
    })
```

Thanks for your help. (I replied here because it relates to this thread.)
Regards

@AlphaGit
Contributor Author

AlphaGit commented Jun 1, 2020

Do I need to change all the lines so that they are equal?

No, you just need to add them. Make sure you use the right compute_xx and sm_xx values for the GPU model that you got in Colab.
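
A small lookup can make "the right compute_xx and sm_xx" concrete for the GPUs mentioned in this thread. The compute capabilities below (K80: 3.7, P100: 6.0, T4: 7.5) come from NVIDIA's published tables. The table is deliberately limited to those three cards, and the function name `gencode_args_for` is made up for this sketch.

```python
# Compute capabilities of the GPUs named in this thread.
COMPUTE_CAPABILITY = {
    "Tesla K80": (3, 7),
    "Tesla P100": (6, 0),
    "Tesla T4": (7, 5),
}


def gencode_args_for(gpu_name: str) -> list:
    """Return the nvcc -gencode arguments for a GPU name as reported
    by nvidia-smi (for example 'Tesla P100-PCIE-16GB')."""
    for known, (major, minor) in COMPUTE_CAPABILITY.items():
        if gpu_name.startswith(known):
            cc = f"{major}{minor}"
            return ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    raise KeyError(f"unknown GPU model: {gpu_name!r}")


for name in ("Tesla K80", "Tesla P100-PCIE-16GB", "Tesla T4"):
    print(name, gencode_args_for(name))
```

When PyTorch is already installed in the runtime, `torch.cuda.get_device_capability()` returns the same `(major, minor)` pair for the active device, which avoids maintaining a table by hand.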

@AlphaGit
Contributor Author

AlphaGit commented Jun 3, 2020

@lbourdois Hey there! I actually got a Tesla T4 and was able to test our hypothesis. Yes, it works! I will soon be sending a patch to address the missing Colab GPU model kernels.

Development

Successfully merging this pull request may close these issues.

Google Colab CUDA error
5 participants