
RuntimeError: CUDA error: an illegal memory access was encountered #209

Closed
thackl opened this issue Dec 10, 2021 · 24 comments

@thackl

thackl commented Dec 10, 2021

I'm encountering the following problem with one of my runs. A different run finished successfully. The error reproducibly occurs for this run after 90605 reads, so there seems to be a specific issue related to the data. Any ideas?

> loading model dna_r10.3@v3.3
> outputting unaligned fastq
> calling: 90605 reads [29:06, 60.17 reads/s]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
    self.run()
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

> calling: 90605 reads [29:20, 60.17 reads/s]
@Flower9618

same here!

@mattloose

Hi - I'm also getting this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/usr/local/lib/python3.6/dist-packages/koi_cuda111-0.0.5-py3.6-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This happens reproducibly while processing the run, suggesting that it is a specific read or file. Is there any way to log this better to work out the issue?

@vellamike
Collaborator

@mattloose @Flower9618 @thackl Would it be possible to share data and command to reproduce this issue?

As the error suggests, could you also run with export CUDA_LAUNCH_BLOCKING=1? This will help diagnose the problem.
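
For example, a minimal invocation would look something like this (the model name and data path are placeholders, not taken from your actual run):

# force synchronous CUDA kernel launches so the failing call is reported at the right point
export CUDA_LAUNCH_BLOCKING=1
bonito basecaller dna_r10.3@v3.3 /path/to/reads/ > basecalls.fastq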

@iiSeymour iiSeymour added the bug Something isn't working label Feb 1, 2022
@mattloose

Hello!

Sure - the command was:

bonito basecaller dna_r9.4.1_e8_sup@v3.3 <path/to/data>/ --recursive --reference /mnt/refs/hg38.mmi --alignment-threads 16 --modified-bases 5HmC 5mC > Results_basecalls_with_mods.bam

I can also run with the suggested flag.

I can share the data but will do that through a different channel!

Cheers

Matt

@vellamike
Collaborator

vellamike commented Feb 1, 2022

I can share the data but will do that through a different channel!

Yes please do - Mike dot Vella at Nanoporetech dot com

@mattloose

Just an update - running with CUDA_LAUNCH_BLOCKING=1 doesn't appear to help much:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/usr/local/lib/python3.6/dist-packages/koi_cuda111-0.0.5-py3.6-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered

@mattloose

One comment is that this is running on an HPC in a shared GPU environment, although the job should have a single card assigned to it.
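
(For what it's worth, one way to make sure the process only sees a single card is to pin it before launching bonito; the device index and paths below are placeholders rather than the actual cluster assignment:)

# expose only one GPU (index 0 here) to the CUDA runtime
export CUDA_VISIBLE_DEVICES=0
bonito basecaller dna_r9.4.1_e8_sup@v3.3 /path/to/data/ --recursive > basecalls.fastq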

@mattloose

Does look like a read issue though - the error is happening at exactly the same point each time:

trial 1:

calling: 704906 reads [3:55:05, 101.88 reads/s]Exception in thread Thread-7:

trial 2:

calling: 704906 reads [3:58:44, 79.87 reads/s]

@iiSeymour
Member

Okay, this looked like it was a scaling issue and I think 6e91a9d should sort it.

@vellamike
Collaborator

vellamike commented Feb 4, 2022

@mattloose let us know if the scaling change has resolved your issue, in the meantime I'm investigating a lower-level fix (which should have prevented the illegal memory access in the first place).

@mattloose

Can confirm that the scaling change has resolved this issue...

@iiSeymour
Member

Fixed in v0.5.1.

@mattloose

I spoke too soon.

It is still crashing with the same error message, but now on a later file:

Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/crf/basecall.py", line 67, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/usr/local/lib/python3.8/dist-packages/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm considering re-batching the files and running on smaller subsets.
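
A rough sketch of what I have in mind, assuming a flat directory of fast5 files (paths and batch size are placeholders, and the reference/modified-bases options are left out for brevity):

# move the fast5s into sub-directories of 500 files each
cd /path/to/data
i=0
for f in *.fast5; do
    d=batch_$((i / 500)); mkdir -p "$d"; mv "$f" "$d/"; i=$((i + 1))
done
# basecall each subset separately so a failing file is easier to isolate
for d in batch_*; do
    bonito basecaller dna_r9.4.1_e8_sup@v3.3 "$d" > "$d.fastq"
done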

@vellamike
Collaborator

This could be a different error - could you send me the files and instructions to reproduce? (let's pick this up over email).

@mattloose

Yep - will do.

@iiSeymour iiSeymour reopened this Feb 12, 2022
@iiSeymour
Member

I had missed applying the same scaling fix on the short-read scaling path - resolved in 3187198.

@teenjes

teenjes commented Mar 9, 2022

I've been having the same issue. I trained a model based on one of the pre-existing models and have been unable to basecall all my data successfully using it. So far I've identified two separate fast5 files where I also receive the error below:

Traceback (most recent call last):
  File "/apps/python3/3.8.5/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_koi-0.0.7-py3.8-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm working with fungal genomes, but as with the people above, the issue has been recurring at the same file each time.

@vellamike
Collaborator

vellamike commented Mar 9, 2022 via email

@teenjes

teenjes commented Mar 9, 2022

I have guppy 6.0.1 and have pulled the latest updates of Bonito, so I should have the most recent version; however, this does not appear to have solved the issue.

@thackl
Author

thackl commented Mar 18, 2022

I'm also still experiencing the issue. After upgrading from Python 3.6 to 3.9 I installed release v0.5.1 (with 6e91a9d) and also tried the latest master, 2bbea4c (which includes the short-read fix 3187198). The error is still the same.

I'd also share the data via email if that would help

@thackl
Author

thackl commented Mar 18, 2022

OK, scratch that. I made a mistake deploying the GitHub version. With 3187198 everything works.

@teenjes it might be that you've only pulled the latest released version (v0.5.1) and not the more current GitHub version, which includes the latest fix for this issue?
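
For reference, installing straight from that commit can be done along these lines (the clone location is arbitrary, and pip install . assumes the repo's usual Python packaging):

# install bonito from the commit containing the short-read scaling fix
git clone https://github.com/nanoporetech/bonito.git
cd bonito
git checkout 3187198
pip install --upgrade pip
pip install .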

@teenjes

teenjes commented Mar 21, 2022

I've pulled the latest version and even set HEAD specifically to 3187198, but I continue to run into this issue:

> calling: 3112 reads [03:00, 14.02 reads/s]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/apps/python3/3.8.5/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_koi-0.0.7-py3.8-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@vellamike
Collaborator

Apologies this is taking a while to resolve - we are working on it and will keep this thread updated.

@vellamike
Collaborator

@teenjes can you send me your model and data?
