
RuntimeError: CUDA error: an illegal memory access was encountered #209

Closed
thackl opened this issue Dec 10, 2021 · 24 comments

@thackl

thackl commented Dec 10, 2021

I'm encountering the following problem with one of my runs. A different run finished successfully. The error reproducibly occurs for this run after 90605 reads, so there seems to be a specific issue related to the data. Any ideas?

> loading model dna_r10.3@v3.3
> outputting unaligned fastq
> calling: 90605 reads [29:06, 60.17 reads/s]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner
    self.run()
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/nfs/bmm/thackl/software/bonito-env/lib64/python3.6/site-packages/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

> calling: 90605 reads [29:20, 60.17 reads/s]
@Flower9618

same here!

@mattloose

Hi - I'm also getting this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/usr/local/lib/python3.6/dist-packages/koi_cuda111-0.0.5-py3.6-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This happens reproducibly while processing the run, suggesting that it is a specific read or file. Is there any way to log this better to work out the issue?

@vellamike
Collaborator

@mattloose @Flower9618 @thackl Would it be possible to share data and command to reproduce this issue?

As the error suggests, could you also run with export CUDA_LAUNCH_BLOCKING=1? This will help diagnose the problem.
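
For example, a minimal invocation would look something like this (the model name and data path are placeholders, not taken from your actual run):

# force synchronous CUDA kernel launches so the failing call is reported at the right point
export CUDA_LAUNCH_BLOCKING=1
bonito basecaller dna_r10.3@v3.3 /path/to/reads/ > basecalls.fastq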

@iiSeymour iiSeymour added the bug Something isn't working label Feb 1, 2022
@mattloose

Hello!

Sure - the command was:

bonito basecaller dna_r9.4.1_e8_sup@v3.3 <path/to/data>/ --recursive --reference /mnt/refs/hg38.mmi --alignment-threads 16 --modified-bases 5HmC 5mC > Results_basecalls_with_mods.bam

I can also run with the suggested flag.

I can share the data but will do that through a different channel!

Cheers

Matt

@vellamike
Collaborator

vellamike commented Feb 1, 2022

I can share the data but will do that through a different channel!

Yes please do - Mike dot Vella at Nanoporetech dot com

@mattloose

Just an update - running with CUDA_LAUNCH_BLOCKING=1 doesn't appear to help much:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.6/dist-packages/ont_bonito_cuda111-0.5.0-py3.6.egg/bonito/crf/basecall.py", line 37, in compute_scores
    scale=scale, offset=offset, blank_score=blank_score
  File "/usr/local/lib/python3.6/dist-packages/koi_cuda111-0.0.5-py3.6-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered

@mattloose

One comment is that this is running on an HPC in a shared GPU environment, although the job should have a single card assigned to it.
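
(For what it's worth, one way to make sure the process only sees a single card is to pin it before launching bonito; the device index and paths below are placeholders rather than the actual cluster assignment:)

# expose only one GPU (index 0 here) to the CUDA runtime
export CUDA_VISIBLE_DEVICES=0
bonito basecaller dna_r9.4.1_e8_sup@v3.3 /path/to/data/ --recursive > basecalls.fastq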

@mattloose

Does look like a read issue though - the error is happening at exactly the same point each time:

trial 1:

calling: 704906 reads [3:55:05, 101.88 reads/s]Exception in thread Thread-7:

trial 2:

calling: 704906 reads [3:58:44, 79.87 reads/s]

@iiSeymour
Member

Okay, this looked like it was a scaling issue and I think 6e91a9d should sort it.

@vellamike
Collaborator

vellamike commented Feb 4, 2022

@mattloose let us know if the scaling change has resolved your issue, in the meantime I'm investigating a lower-level fix (which should have prevented the illegal memory access in the first place).

@mattloose

Can confirm that the scaling change has resolved this issue...

@iiSeymour
Member

Fixed in v0.5.1.

@mattloose

I spoke too soon.

It is still crashing with the same error message, but now on a later file:

Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/crf/basecall.py", line 67, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/usr/local/lib/python3.8/dist-packages/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm considering re-batching the files and running on smaller subsets.
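
A rough sketch of what I have in mind, assuming a flat directory of fast5 files (paths and batch size are placeholders, and the reference/modified-bases options are left out for brevity):

# move the fast5s into sub-directories of 500 files each
cd /path/to/data
i=0
for f in *.fast5; do
    d=batch_$((i / 500)); mkdir -p "$d"; mv "$f" "$d/"; i=$((i + 1))
done
# basecall each subset separately so a failing file is easier to isolate
for d in batch_*; do
    bonito basecaller dna_r9.4.1_e8_sup@v3.3 "$d" > "$d.fastq"
done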

@vellamike
Collaborator

This could be a different error - could you send me the files and instructions to reproduce? (let's pick this up over email).

@mattloose

Yep - will do.

@iiSeymour iiSeymour reopened this Feb 12, 2022
@iiSeymour
Member

I had missed applying the same scaling fix on the short-read scaling path - resolved in 3187198.

@teenjes

teenjes commented Mar 9, 2022

I've been having the same issue. I trained a model based on one of the pre-existing models and have been unable to basecall all my data successfully using it. So far I've identified two separate fast5 files where I also receive the error below:

Traceback (most recent call last):
  File "/apps/python3/3.8.5/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/g/data/xf3/te4331/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_koi-0.0.7-py3.8-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm working with fungal genomes, but as with the people above, the issue has been recurring at the same file each time.

@vellamike
Collaborator

vellamike commented Mar 9, 2022 via email

@teenjes

teenjes commented Mar 9, 2022

I have guppy 6.0.1 and have pulled the latest updates of Bonito, so I should have the most recent version; however, this does not appear to have solved the issue.

@thackl
Author

thackl commented Mar 18, 2022

I'm also still experiencing the issue. After upgrading from Python 3.6 to 3.9 I installed release v0.5.1 (with 6e91a9d) and also tried the latest master, 2bbea4c (which includes the short-read fix 3187198). The error is still the same.

I'd also share the data via email if that would help

@thackl
Author

thackl commented Mar 18, 2022

OK, scratch that. I made a mistake deploying the GitHub version. With 3187198 everything works.

@teenjes it might be that you've only pulled the latest released version (v0.5.1) and not the more current GitHub version, which includes the latest fix for this issue?
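
For reference, installing straight from that commit can be done along these lines (the clone location is arbitrary, and pip install . assumes the repo's usual Python packaging):

# install bonito from the commit containing the short-read scaling fix
git clone https://github.com/nanoporetech/bonito.git
cd bonito
git checkout 3187198
pip install --upgrade pip
pip install .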

@teenjes

teenjes commented Mar 21, 2022

I've pulled the latest version and even set HEAD specifically to 3187198, but I continue to run into this issue:

> calling: 3112 reads [03:00, 14.02 reads/s]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/apps/python3/3.8.5/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 69, in <genexpr>
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_bonito_cuda113-0.5.0-py3.8.egg/bonito/crf/basecall.py", line 35, in compute_scores
    sequence, qstring, moves = beam_search(
  File "/usr/bonito-0.5.0/venv3/lib/python3.8/site-packages/ont_koi-0.0.7-py3.8-linux-x86_64.egg/koi/decode.py", line 58, in beam_search
    moves = moves.data.reshape(N, -1).cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@vellamike
Collaborator

Apologies this is taking a while to resolve - we are working on it and will keep this thread updated.

@vellamike
Collaborator

@teenjes can you send me your model and data?
