
Operations involved while basecalling #127

Closed
akankshabaranwal opened this issue Mar 9, 2021 · 5 comments
Comments

@akankshabaranwal

I am someone completely new to this field and I am trying to understand all the operations involved in basecalling.

  1. Is the uniqueness of Bonito that it is based on Quartznet? Apart from the model used for training, what would differentiate Bonito from another basecaller, e.g. Guppy/Scrappie?
  2. Is there some preprocessing or postprocessing of the signals as part of the basecaller? The third model here: https://github.com/nanoporetech/bonito/tree/master/bonito/models/configs is based on LSTMs and is very different from the Quartznet architecture. Also, the number of output features for this config (v3) is 5120, whereas for the v1 and v2 configs it is 5.
  3. I assumed that the output of 5 features (for v1 and v2) gives the probability of each base (A, T, G, C, and none), which then gets translated into the sequence. Is that correct? How does the tool interpret the 5120 output features (of v3) to eventually give the A, T, G, C sequence of the fastq file?
  4. Is there a way I could get pretrained models for the v1 and v2 configs? The one I downloaded directly is based on the v3 config.

Thank you. Let me know if this is not the correct place to ask these questions. If they are too naive, it would be helpful if someone could point me to resources to understand this better.

@ktan8

ktan8 commented Mar 12, 2021

For Q4, I think there are some paths hardcoded into the "download" function of Bonito (https://github.com/nanoporetech/bonito/blob/master/bonito/cli/download.py).

r9_models = [
    "n8c07gc9ro09zt0ivgcoeuz6krnwsnf6.zip",  # dna_r9.4.1@v1
    "nas0uhf46fd1lh2jndhx2a54a9vvhxp4.zip",  # dna_r9.4.1@v2
    "1wodp3ur4jhvqvu5leowfg6lrw54jxp2.zip",  # dna_r9.4.1@v3
    "uetgwsnb8yfqvuyoka8p09mxilgskqc7.zip",  # dna_r9.4.1@v3.1
    "47t2y48zw4waly25lmzx6sagf4bbbqqz.zip",  # dna_r9.4.1@v3.2
    "arqi4qwcj9btsd6bbjsnlbai0s6dg8yd.zip",
]

You might be able to download the older pretrained models by looking through the code and the paths.

@akankshabaranwal
Author

Yes, thank you @ktan8, I was able to get the pretrained models for the other configs.
It would be great if someone could help with the other three questions.

@iiSeymour iiSeymour self-assigned this Mar 12, 2021
@iiSeymour iiSeymour added the question Further information is requested label Mar 12, 2021
@iiSeymour
Member

  1. Bonito is a research project providing model training and basecalling, which aims to push method development forward. Guppy is our production inference engine and is the target runtime where successful research models end up.

  2. There is some simple normalization done to preprocess the signal, all of which happens in the constructor of the Read object here.

  3. For the v{1,2} models the output is a distribution over {N, A, C, G, T}, and a sequence is constructed by either a greedy or a beam search, see fast-ctc-decode. There is some more info on v3 here.

  4. As @ktan8 pointed out, all previous pre-trained models are available to download with bonito download --models --all.
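On point 2 above: a common form of signal normalization in nanopore pipelines is med/MAD scaling. A minimal sketch of the idea, not Bonito's exact code (the function name and scaling constant here are illustrative):

```python
from statistics import median

def normalise_signal(raw):
    """Scale a raw current trace to zero median and unit spread.

    Sketch of a typical med/MAD normalization; Bonito's actual
    preprocessing lives in its Read constructor.
    """
    med = median(raw)
    mad = median(abs(x - med) for x in raw)  # median absolute deviation
    # 1.4826 * MAD approximates the standard deviation for Gaussian data
    return [(x - med) / (1.4826 * mad) for x in raw]

# e.g. normalise_signal([480.0, 502.0, 491.0, 530.0, 470.0])
```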

HTH
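To make the greedy case in point 3 concrete: given a (T, 5) matrix of per-timestep probabilities over {N, A, C, G, T}, greedy CTC decoding takes the argmax at each step, collapses consecutive repeats, and drops the blank (N). A minimal pure-Python sketch (fast-ctc-decode implements this, plus beam search, in Rust):

```python
ALPHABET = "NACGT"  # index 0 is the CTC blank

def greedy_ctc_decode(probs):
    """Collapse per-timestep distributions into a base sequence.

    probs: list of length-5 probability rows, one per model output step.
    """
    path = [max(range(5), key=row.__getitem__) for row in probs]
    seq, prev = [], None
    for idx in path:
        if idx != prev and idx != 0:  # collapse repeats, skip blank
            seq.append(ALPHABET[idx])
        prev = idx
    return "".join(seq)

# e.g. an argmax path of N A A C N C decodes to "ACC"
```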

@akankshabaranwal
Author

Thank you for pointing me to the files and older issue.
Keeping this issue open for now in case I need further help in understanding.

@akankshabaranwal
Author

Hi @ktan8 @iiSeymour thank you for answering my questions before. I have a few more questions. It would be great if you could help answer these as well:

  1. In common notation, Bonito basecaller refers to the Quartznet-based config (e.g. Fast-Bonito, "Nanopore basecalling on the Edge").
    But in the configs provided (configs ont bonito), the v3 and v3.1 configs are LSTM-based and are not derived from the Quartznet model. Is there any paper which describes the model from which these configs have been derived? Is Guppy based on these LSTM configs?
  2. In both cases (Quartznet vs. LSTM), the read is divided into chunks before inference. But when producing the final output sequence, are the inferred outputs of these chunks decoded independently, or are they combined back into a single sequence over the full read length with CTC/CRF decoding applied accordingly?
  3. There are parameters like chunk size and overlap which seem tunable. Are the default values (4000 chunk size and 500 overlap for the LSTM-based config) the ones which give the best accuracy?
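A rough sketch of the chunking described in questions 2 and 3, using the 4000/500 defaults mentioned above (the stitching of per-chunk network outputs before final decoding is elided, and the function name is illustrative, not Bonito's API):

```python
def chunk_signal(signal, chunksize=4000, overlap=500):
    """Split a signal into overlapping fixed-size chunks.

    Consecutive chunks share `overlap` samples so the network sees
    full context at chunk boundaries; the overlapping outputs are
    later stitched back together before decoding.
    """
    stride = chunksize - overlap
    return [
        signal[start:start + chunksize]
        for start in range(0, max(len(signal) - overlap, 1), stride)
    ]
```

With a 10000-sample signal this yields three chunks starting at samples 0, 3500, and 7000, each sharing 500 samples with its neighbour.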
