
Operations involved while basecalling #127

Closed
akankshabaranwal opened this issue Mar 9, 2021 · 5 comments
Comments

@akankshabaranwal

I am someone completely new to this field and I am trying to understand all the operations involved in basecalling.

  1. Is the uniqueness of Bonito that it is based on Quartznet? Apart from the model used for training, what would differentiate Bonito from another basecaller, e.g. Guppy/Scrappie?
  2. Is there some preprocessing or postprocessing of the signals as part of the basecaller? The third model here: https://github.com/nanoporetech/bonito/tree/master/bonito/models/configs is based on LSTMs and is very different from the Quartznet architecture. Also, the number of output features for this config (v3) is 5120, whereas for the v1 and v2 configs it is 5.
  3. I assumed that the output of 5 features (for v1 and v2) gives the probability of each base (A, T, G, C, and none), which then gets translated into the sequence. Is that correct? How does the tool interpret the 5120 output features (of v3) to eventually give the A, T, G, C sequence of the fastq file?
  4. Is there a way I could get pretrained models for the v1 and v2 configs? The one I downloaded directly is based on the v3 config.

Thank you. Let me know if this is not the correct place to ask these questions. If they are too naive, it would be helpful if someone could point me to resources to understand this better.

@ktan8

ktan8 commented Mar 12, 2021

For Q4, I think there are some paths hardcoded into the "download" function of Bonito (https://github.com/nanoporetech/bonito/blob/master/bonito/cli/download.py).

r9_models = [
    "n8c07gc9ro09zt0ivgcoeuz6krnwsnf6.zip",  # dna_r9.4.1@v1
    "nas0uhf46fd1lh2jndhx2a54a9vvhxp4.zip",  # dna_r9.4.1@v2
    "1wodp3ur4jhvqvu5leowfg6lrw54jxp2.zip",  # dna_r9.4.1@v3
    "uetgwsnb8yfqvuyoka8p09mxilgskqc7.zip",  # dna_r9.4.1@v3.1
    "47t2y48zw4waly25lmzx6sagf4bbbqqz.zip",  # dna_r9.4.1@v3.2
    "arqi4qwcj9btsd6bbjsnlbai0s6dg8yd.zip",
]

You might be able to download the older pretrained models by looking through the code and the paths.

@akankshabaranwal
Author

Yes, thank you @ktan8, I was able to get the pretrained models for the other configs.
It would be great if someone could help with the other three questions.

@iiSeymour iiSeymour self-assigned this Mar 12, 2021
@iiSeymour iiSeymour added the question Further information is requested label Mar 12, 2021
@iiSeymour
Member

  1. Bonito is a research project providing model training and basecalling, which aims to push method development forward. Guppy is our production inference engine and is the target runtime where successful research models end up.

  2. There is some simple normalization done to preprocess the signal, all of which happens in the constructor of the Read object here.

  3. For the v{1,2} models the output is a distribution over {N, A, C, G, T}, and a sequence is constructed by either a greedy or a beam search, see fast-ctc-decode. There is some more info on v3 here.

  4. As @ktan8 pointed out, all previous pre-trained models are available to download with bonito download --models --all.
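On point 2 above: a common form of signal normalization in nanopore pipelines is med/MAD scaling. A minimal sketch of the idea, not Bonito's exact code (the function name and scaling constant here are illustrative):

```python
from statistics import median

def normalise_signal(raw):
    """Scale a raw current trace to zero median and unit spread.

    Sketch of a typical med/MAD normalization; Bonito's actual
    preprocessing lives in its Read constructor.
    """
    med = median(raw)
    mad = median(abs(x - med) for x in raw)  # median absolute deviation
    # 1.4826 * MAD approximates the standard deviation for Gaussian data
    return [(x - med) / (1.4826 * mad) for x in raw]

# e.g. normalise_signal([480.0, 502.0, 491.0, 530.0, 470.0])
```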

HTH
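To make the greedy case in point 3 concrete: given a (T, 5) matrix of per-timestep probabilities over {N, A, C, G, T}, greedy CTC decoding takes the argmax at each step, collapses consecutive repeats, and drops the blank (N). A minimal pure-Python sketch (fast-ctc-decode implements this, plus beam search, in Rust):

```python
ALPHABET = "NACGT"  # index 0 is the CTC blank

def greedy_ctc_decode(probs):
    """Collapse per-timestep distributions into a base sequence.

    probs: list of length-5 probability rows, one per model output step.
    """
    path = [max(range(5), key=row.__getitem__) for row in probs]
    seq, prev = [], None
    for idx in path:
        if idx != prev and idx != 0:  # collapse repeats, skip blank
            seq.append(ALPHABET[idx])
        prev = idx
    return "".join(seq)

# e.g. an argmax path of N A A C N C decodes to "ACC"
```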

@akankshabaranwal
Author

Thank you for pointing me to the files and older issue.
Keeping this issue open for now in case I need further help in understanding.

@akankshabaranwal
Author

Hi @ktan8 @iiSeymour thank you for answering my questions before. I have a few more questions. It would be great if you could help answer these as well:

  1. In common notation, Bonito basecaller refers to the Quartznet-based config (e.g. Fast-Bonito, "Nanopore basecalling on the Edge").
    But in the configs provided (configs ont bonito), the v3 and v3.1 configs are LSTM-based and are not derived from the Quartznet model. Is there any paper which describes the model from which these configs have been derived? Is Guppy based on these LSTM configs?
  2. In both cases (Quartznet vs. LSTM), the read is divided into chunks before inference. But when producing the final output sequence, are the inferred outputs of these chunks decoded independently, or are they combined back into a single sequence over the full read length with CTC/CRF decoding applied accordingly?
  3. There are parameters like chunk size and overlap which seem tunable. Are the default values (4000 chunk size and 500 overlap for the LSTM-based config) the ones which give the best accuracy?
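A rough sketch of the chunking described in questions 2 and 3, using the 4000/500 defaults mentioned above (the stitching of per-chunk network outputs before final decoding is elided, and the function name is illustrative, not Bonito's API):

```python
def chunk_signal(signal, chunksize=4000, overlap=500):
    """Split a signal into overlapping fixed-size chunks.

    Consecutive chunks share `overlap` samples so the network sees
    full context at chunk boundaries; the overlapping outputs are
    later stitched back together before decoding.
    """
    stride = chunksize - overlap
    return [
        signal[start:start + chunksize]
        for start in range(0, max(len(signal) - overlap, 1), stride)
    ]
```

With a 10000-sample signal this yields three chunks starting at samples 0, 3500, and 7000, each sharing 500 samples with its neighbour.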
