Lowered barcode recognition of bonito basecalled data with "too smart" custom-trained model #26

Closed
menickname opened this issue Apr 14, 2020 · 6 comments

@menickname

Dear @iiSeymour

I have successfully trained a species-specific bonito model reaching assembly Q-scores of up to 45 for de novo assembly (which is wonderful and exactly what I hoped to achieve with bonito!). However, when performing qcat demultiplexing of my data, significantly more reads (30% to 60%) were classified as "none" compared to demultiplexing of guppy or standard-model (dna_r9.4.1) bonito basecalled reads.

As you previously guided me through training the bonito model in https://github.com/nanoporetech/bonito/issues/22, I was wondering if you have suggestions on how to get this barcode information back when basecalling with my newly trained model. It sounds like my own model has become too smart (?!) after its training.

Thank you in advance.
Best regards.
Nick Vereecke

@iiSeymour self-assigned this Apr 14, 2020
@iiSeymour added the enhancement label Apr 14, 2020
@iiSeymour
Member

iiSeymour commented Apr 14, 2020

Hey @menickname

That is a really great result!

I think the classification issue you are seeing is a known effect and hopefully something we can fix pretty easily. When the production models are prepared for guppy, the --input_per_read_params option to prepare_mapped_reads.py is used to provide a start and end location in signal space for trimming. So what you are seeing is the result of a boundary effect.

A simple fix would be to use a fixed offset in bonito/convert.py to trim the reads.

@@ -73,6 +73,7 @@ def main(args):
     off_the_end_ref = 0
     off_the_end_sig = 0
     min_run_count = 0
+    read_too_short = 0
     homopolymer_boundary = 0
 
     total_reads = num_reads(args.chunkify_file)
@@ -91,11 +92,15 @@ def main(args):
         read_idx += 1
 
         squiggle_duration = len(samples)
-        sequence_length = len(reference) - 1
+        sequence_length = len(reference) - args.offset - 1
+
+        if sequence_length < args.max_seq_len + args.offset:
+            read_too_short += 1
+            continue
 
         # first chunk
-        seq_starts = 0
-        seq_ends = np.random.randint(args.min_seq_len, args.max_seq_len)
+        seq_starts = args.offset
+        seq_ends = seq_starts + np.random.randint(args.min_seq_len, args.max_seq_len)
 
         repick = int((args.max_seq_len - args.min_seq_len) / 2)
         while boundary(reference[seq_starts:seq_ends]) and repick:
@@ -185,6 +190,7 @@ def main(args):
     print("Reason for skipping:")
     print("  - off the end (signal)          ", off_the_end_sig)
     print("  - off the end (sequence)        ", off_the_end_ref)
+    print("  - read too short (sequence)     ", read_too_short)
     print("  - homopolymer chunk boundary    ", homopolymer_boundary)
     print("  - longest run too short         ", min_run_count)
     print("  - minimum number of bases       ", min_bases)
@@ -230,6 +236,7 @@ def argparser():
     parser.add_argument("--seed", default=25, type=int)
     parser.add_argument("--chunks", default=10000000, type=int)
     parser.add_argument("--validation-chunks", default=1000, type=int)
+    parser.add_argument("--offset", default=200, type=int)
     parser.add_argument("--min-run", default=5, type=int)
     parser.add_argument("--min-seq-len", default=200, type=int)
     parser.add_argument("--max-seq-len", default=400, type=int)

Could you try creating the bonito .npy training files again with the above patch and retraining?

Best,

Chris.

@menickname
Author

Dear @iiSeymour

Great, I am on it!

I just ran some tests to find the optimal training parameters. The number of epochs makes a big difference. However, I think I can still improve things slightly further by tweaking other parameters besides the epoch count. Do you think a different chunk size and/or validation split would be worth exploring? Which one would most likely improve training? And what are the default training parameters?

Thank you in advance.
Best regards,
Nick Vereecke

@iiSeymour
Member

@menickname see the updated patch in the comment above (it had a bug in it before).

@iiSeymour
Member

On optimal training parameters -

  • Increasing the number of epochs should lead to a better model; longer training usually helps. You can look at the training.csv file in the model-dir to see how the model is improving over time (see the sketch after this list). With the latest release, you can continue training a model by reusing the same model-dir.

  • It's worth exploring how the reads are chunked; it definitely has an effect, although I haven't settled on an optimal strategy myself yet.

  • Increasing the validation split only gives you more confidence in a model; it doesn't feed back into how the model is trained.
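
As a rough way to inspect training.csv, here is a hypothetical sketch; the path and column names depend on your setup and bonito version, so check the CSV header first:

```python
# Hypothetical sketch: inspect training progress recorded in the model-dir's training.csv.
# Column names differ between bonito versions, so check df.columns before plotting.
import pandas as pd

df = pd.read_csv("model-dir/training.csv")   # path to your own model-dir
print(df.columns.tolist())                   # see which loss/metric columns are recorded
print(df.tail())                             # most recent training/validation metrics
# e.g. to visualise any loss columns (assuming matplotlib is installed):
# df.plot(y=[c for c in df.columns if "loss" in c.lower()])
```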

An unreleased change that I've had really good success with is switching the activation function from ReLU to Swish.

@@ -2,14 +2,23 @@
 Bonito Model template
 """
 
+import torch
 import torch.nn as nn
 from torch.nn import ReLU, LeakyReLU
 from torch.nn import Module, ModuleList, Sequential, Conv1d, BatchNorm1d, Dropout
 
 
+class Swish(Module):
+    def __init__(self):
+        super().__init__()
+    def forward(self, x):
+        return x * torch.sigmoid(x)
+
+
 activations = {
     "relu": ReLU,
     "leaky_relu": LeakyReLU,
+    "swish": Swish,
 }

If you want to try it, apply the patch above to bonito/model.py, create a new model config, e.g. quartznet5x5-swish.toml, and change activation = "relu" to activation = "swish" (and don't forget to use the new config when training).
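
Not part of the patch, but as a quick sanity check that the new activation behaves as expected (assuming a recent PyTorch build, where the same function is available built in as silu):

```python
# Quick sanity check of the Swish activation from the patch above.
# Assumes a recent PyTorch release, where swish is available built in as F.silu.
import torch
import torch.nn.functional as F

class Swish(torch.nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

x = torch.randn(4, 8)
print(torch.allclose(Swish()(x), F.silu(x)))  # True: swish(x) == x * sigmoid(x) == silu(x)
```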

@menickname
Author

Dear @iiSeymour

With my dataset, it seems continued training does not necessarily result in a better model: performance peaked at a certain epoch and then declined again when trained for longer. I have now used a fixed chunk size, which gave me very nice results.

I will look into the Swish change once I have the results from the patched convert.py; I have just started a new training run with the newly created .npy files.

I will keep you posted on that.
Best regards,
Nick Vereecke

@iiSeymour
Member

Demultiplexing performance should be on par with guppy in v0.3.6.
