CUDA Error on second epoch #28

Open
nbardy opened this issue Nov 14, 2023 · 1 comment

nbardy commented Nov 14, 2023

Seeing an unknown CUDA error on the second epoch. Will try to debug more tomorrow.

Traceback (most recent call last):
  File "/home/paperspace/git/DRLX/train_aesthetics.py", line 12, in <module>
    trainer.train(pipe, Aesthetics())
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in train
    if self.config.train.total_samples is not None:
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in <listcomp>
    if self.config.train.total_samples is not None:
  File "/home/paperspace/.pyenv/versions/3.9.17/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/git/DRLX/src/drlx/denoisers/ldm_unet.py", line 125, in postprocess
    images = images.detach().cpu().permute(0,2,3,1).numpy()
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
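
Because CUDA errors are reported asynchronously, the `.cpu()` call in `postprocess` may only be the first synchronization point after the kernel that actually failed. A minimal sketch of the debugging step the error message suggests is below. It assumes the variable must be set before CUDA is initialized, so it is set before `torch` is imported; the helper name `enable_sync_cuda_errors` is hypothetical and not part of DRLX.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    With this set, kernel launches run synchronously, so a failing
    kernel should be reported at its own call site rather than at a
    later, unrelated API call.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the training script, before torch
# (and anything that imports torch) is loaded.
enable_sync_cuda_errors()

# import torch  # import only after the variable is set
# ... then build the pipeline and call trainer.train(...) as before
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python train_aesthetics.py`. Expect training to run noticeably slower in this mode, so it is best used only while reproducing the failure.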


nbardy commented Nov 14, 2023

(base) paperspace@psy0glj6t:~$ nvidia-smi
Unable to determine the device handle for GPU0000:00:05.0: Unknown Error

The error also seems to have borked the GPU badly enough that it needs a restart.
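
A small health check along these lines could be run before each training attempt, to tell a wedged GPU apart from a bug in the training code. This is only a sketch: the function names are hypothetical, and matching on the "Unable to determine the device handle" text is a heuristic based on the output above, not a documented `nvidia-smi` contract.

```python
import shutil
import subprocess


def gpu_looks_healthy(returncode, output):
    """Heuristic: nvidia-smi exited cleanly and printed no device-handle error."""
    return returncode == 0 and "Unable to determine the device handle" not in output


def check_gpu(timeout_s=30):
    """Run nvidia-smi if it is on PATH.

    Returns True or False for the heuristic above, or None when
    nvidia-smi is not installed on this machine.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung nvidia-smi is treated as an unhealthy GPU.
        return False
    return gpu_looks_healthy(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    print("GPU healthy:", check_gpu())
```

If the check returns False right after a crash like the one above, the error is more likely a driver or hardware fault than something specific to the second epoch.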
