Lowered barcode recognition of bonito basecalled data with "too smart" custom-trained model #26

Closed
menickname opened this issue Apr 14, 2020 · 6 comments

@menickname

Dear @iiSeymour

I have successfully trained a species-specific bonito model reaching assembly Q-scores of up to 45 for de novo assembly (which is wonderful and exactly what I hoped to achieve with bonito!). However, when performing qcat demultiplexing of my data, significantly more reads (30% to 60%) were classified as "none" compared to demultiplexing of guppy or standard-model (dna_r9.4.1) bonito basecalled reads.

As you previously guided me through training the bonito model in https://github.com/nanoporetech/bonito/issues/22, I was wondering if you have suggestions on how to get this barcode information back when basecalling with my newly trained model. It sounds like my own model has become too smart (?!) after its training.

Thank you in advance.
Best regards.
Nick Vereecke

@iiSeymour self-assigned this Apr 14, 2020
@iiSeymour added the enhancement label Apr 14, 2020
@iiSeymour
Member

iiSeymour commented Apr 14, 2020

Hey @menickname

That is a really great result!

I think the classification issue you are seeing is a known effect and hopefully something we can fix pretty easily. When the production models are prepared for guppy, the --input_per_read_params option to prepare_mapped_reads.py is used to provide a start and end location in signal space for trimming. So what you are seeing is the result of a boundary effect.

A simple fix would be to use a fixed offset in bonito/convert.py to trim the reads.

@@ -73,6 +73,7 @@ def main(args):
     off_the_end_ref = 0
     off_the_end_sig = 0
     min_run_count = 0
+    read_too_short = 0
     homopolymer_boundary = 0
 
     total_reads = num_reads(args.chunkify_file)
@@ -91,11 +92,15 @@ def main(args):
         read_idx += 1
 
         squiggle_duration = len(samples)
-        sequence_length = len(reference) - 1
+        sequence_length = len(reference) - args.offset - 1
+
+        if sequence_length < args.max_seq_len + args.offset:
+            read_too_short += 1
+            continue
 
         # first chunk
-        seq_starts = 0
-        seq_ends = np.random.randint(args.min_seq_len, args.max_seq_len)
+        seq_starts = args.offset
+        seq_ends = seq_starts + np.random.randint(args.min_seq_len, args.max_seq_len)
 
         repick = int((args.max_seq_len - args.min_seq_len) / 2)
         while boundary(reference[seq_starts:seq_ends]) and repick:
@@ -185,6 +190,7 @@ def main(args):
     print("Reason for skipping:")
     print("  - off the end (signal)          ", off_the_end_sig)
     print("  - off the end (sequence)        ", off_the_end_ref)
+    print("  - read too short (sequence)     ", read_too_short)
     print("  - homopolymer chunk boundary    ", homopolymer_boundary)
     print("  - longest run too short         ", min_run_count)
     print("  - minimum number of bases       ", min_bases)
@@ -230,6 +236,7 @@ def argparser():
     parser.add_argument("--seed", default=25, type=int)
     parser.add_argument("--chunks", default=10000000, type=int)
     parser.add_argument("--validation-chunks", default=1000, type=int)
+    parser.add_argument("--offset", default=200, type=int)
     parser.add_argument("--min-run", default=5, type=int)
     parser.add_argument("--min-seq-len", default=200, type=int)
     parser.add_argument("--max-seq-len", default=400, type=int)

Could you try creating the bonito .npy training files again with the above patch and retraining?

Best,

Chris.

@menickname
Author

Dear @iiSeymour

Great, I am on it!

I just ran some tests to find the optimal training parameters. The number of epochs makes a big difference. However, I think I can still improve things slightly further by tweaking other parameters besides the epoch count. Do you think a different chunk size and/or validation split would be worth exploring? Which one would most likely improve training? And what are the default training parameters?

Thank you in advance.
Best regards,
Nick Vereecke

@iiSeymour
Member

@menickname see the updated patch in the comment above (it had a bug in it before).

@iiSeymour
Member

On optimal training parameters -

  • Increasing the number of epochs should lead to a better model; longer training usually helps. You can look at the training.csv file in the model-dir to see how the model is improving over time (see the sketch after this list). With the latest release, you can continue training a model by reusing the same model-dir.

  • It's worth exploring how the reads are chunked; it definitely has an effect, although I haven't settled on an optimal strategy myself yet.

  • Increasing the validation split only gives you more confidence in a model; it doesn't feed back into how the model is trained.
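
As a rough way to inspect training.csv, here is a hypothetical sketch; the path and column names depend on your setup and bonito version, so check the CSV header first:

```python
# Hypothetical sketch: inspect training progress recorded in the model-dir's training.csv.
# Column names differ between bonito versions, so check df.columns before plotting.
import pandas as pd

df = pd.read_csv("model-dir/training.csv")   # path to your own model-dir
print(df.columns.tolist())                   # see which loss/metric columns are recorded
print(df.tail())                             # most recent training/validation metrics
# e.g. to visualise any loss columns (assuming matplotlib is installed):
# df.plot(y=[c for c in df.columns if "loss" in c.lower()])
```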

An unreleased change that I've had really good success with is switching the activation function from ReLU to Swish.

@@ -2,14 +2,23 @@
 Bonito Model template
 """
 
+import torch
 import torch.nn as nn
 from torch.nn import ReLU, LeakyReLU
 from torch.nn import Module, ModuleList, Sequential, Conv1d, BatchNorm1d, Dropout
 
 
+class Swish(Module):
+    def __init__(self):
+        super().__init__()
+    def forward(self, x):
+        return x * torch.sigmoid(x)
+
+
 activations = {
     "relu": ReLU,
     "leaky_relu": LeakyReLU,
+    "swish": Swish,
 }

If you want to try it, apply the patch above to bonito/model.py, create a new model config, e.g. quartznet5x5-swish.toml, and change activation = "relu" to activation = "swish" (and don't forget to use the new config when training).
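
Not part of the patch, but as a quick sanity check that the new activation behaves as expected (assuming a recent PyTorch build, where the same function is available built in as silu):

```python
# Quick sanity check of the Swish activation from the patch above.
# Assumes a recent PyTorch release, where swish is available built in as F.silu.
import torch
import torch.nn.functional as F

class Swish(torch.nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

x = torch.randn(4, 8)
print(torch.allclose(Swish()(x), F.silu(x)))  # True: swish(x) == x * sigmoid(x) == silu(x)
```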

@menickname
Author

Dear @iiSeymour

With my dataset, it seems continued training does not necessarily result in a better model: performance peaked at a certain epoch and then declined again when trained for longer. I have now used a fixed chunk size, which gave me very nice results.

I will look into the Swish change once I have the results from the patched convert.py; I have just started a new training run with the newly created .npy files.

I will keep you posted on that.
Best regards,
Nick Vereecke

@iiSeymour
Member

Demultiplexing performance should be on par with guppy in v0.3.6.
