Lowered barcode recognition of bonito basecalled data with "too smart" custom-trained model #26
Dear @iiSeymour

I have successfully trained a species-specific bonito model, reaching assembly Q-scores of up to 45 for de novo assembly (which is wonderful, and exactly what I hoped to achieve with bonito!). However, when performing qcat demultiplexing of my data, significantly more reads (30% to 60%) were classified as "none" compared to demultiplexing of reads basecalled with guppy or with the standard bonito model (dna_r9.4.1).

As you guided me through training the bonito model previously in https://github.com/nanoporetech/bonito/issues/22, I was wondering if you have any suggestions for getting my newly trained model to handle this barcode information as well. It sounds like my own model has become too smart (?!) after its training.

Thank you in advance.

Best regards,
Nick Vereecke
Hey @menickname

That is a really great result! I think the classification issue you are seeing is a known effect and hopefully something we can fix pretty easily. When the production models are prepared for guppy, the start of each read is excluded from the training data, so the model is never trained on the adapter and barcode signal at the beginning of a read. A simple fix would be to use a fixed offset in `convert.py`:

```diff
@@ -73,6 +73,7 @@ def main(args):
     off_the_end_ref = 0
     off_the_end_sig = 0
     min_run_count = 0
+    read_too_short = 0
     homopolymer_boundary = 0

     total_reads = num_reads(args.chunkify_file)
@@ -91,11 +92,15 @@ def main(args):
         read_idx += 1

         squiggle_duration = len(samples)
-        sequence_length = len(reference) - 1
+        sequence_length = len(reference) - args.offset - 1
+
+        if sequence_length < args.max_seq_len + args.offset:
+            read_too_short += 1
+            continue

         # first chunk
-        seq_starts = 0
-        seq_ends = np.random.randint(args.min_seq_len, args.max_seq_len)
+        seq_starts = args.offset
+        seq_ends = seq_starts + np.random.randint(args.min_seq_len, args.max_seq_len)

         repick = int((args.max_seq_len - args.min_seq_len) / 2)
         while boundary(reference[seq_starts:seq_ends]) and repick:
@@ -185,6 +190,7 @@ def main(args):
     print("Reason for skipping:")
     print(" - off the end (signal)        ", off_the_end_sig)
     print(" - off the end (sequence)      ", off_the_end_ref)
+    print(" - read too short (sequence)   ", read_too_short)
     print(" - homopolymer chunk boundary  ", homopolymer_boundary)
     print(" - longest run too short       ", min_run_count)
     print(" - minimum number of bases     ", min_bases)
@@ -230,6 +236,7 @@ def argparser():
     parser.add_argument("--seed", default=25, type=int)
     parser.add_argument("--chunks", default=10000000, type=int)
     parser.add_argument("--validation-chunks", default=1000, type=int)
+    parser.add_argument("--offset", default=200, type=int)
     parser.add_argument("--min-run", default=5, type=int)
     parser.add_argument("--min-seq-len", default=200, type=int)
     parser.add_argument("--max-seq-len", default=400, type=int)
```

Could you try recreating your training data with this patch applied?

Best, Chris.
Dear @iiSeymour

Great, I am on it! I just ran some tests to find the most optimal training parameters, and the number of epochs makes a big difference. However, I think I can still improve things slightly by tweaking other parameters besides the epoch count. Do you think a different number of chunks and/or a different validation split would be a way to go? Which would most likely improve training? And what are the default training parameters? Thank you in advance.
@menickname see the updated patch in the comment above (it had a bug in it before).
On optimal training parameters: an unreleased change that I've had really good success with is switching the activation function from `ReLU` to `Swish`:

```diff
@@ -2,14 +2,23 @@
 Bonito Model template
 """

+import torch
 import torch.nn as nn

 from torch.nn import ReLU, LeakyReLU
 from torch.nn import Module, ModuleList, Sequential, Conv1d, BatchNorm1d, Dropout


+class Swish(Module):
+    def __init__(self):
+        super().__init__()
+
+    def forward(self, x):
+        return x * torch.sigmoid(x)
+
+
 activations = {
     "relu": ReLU,
     "leaky_relu": LeakyReLU,
+    "swish": Swish,
 }
```

If you want to try it, apply the patch above to `bonito/model.py` and set the activation to `swish` in your model config.
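For context, Swish is simply `x * sigmoid(x)` (the same function later shipped in PyTorch as `torch.nn.SiLU`). A quick standalone sanity check of the module above:

```python
import torch
from torch.nn import Module

class Swish(Module):
    # Same definition as in the patch: x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)

x = torch.linspace(-4, 4, 9, requires_grad=True)
y = Swish()(x)
print(y)  # smooth and non-monotonic: dips slightly below zero for small negative x

# Unlike ReLU, the gradient is non-zero for negative inputs,
# which is often credited for better trainability.
y.sum().backward()
print(x.grad)
```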
Dear @iiSeymour

Using my dataset, it seems continued training does not necessarily produce a better model: performance peaked at an optimal number of epochs and then decreased again when training ran too long. Using a fixed chunk size has definitely given me nice results. I will look into the Swish change once I have the results from the patched convert.py; I just started a new training run with the newly created .npy files and will keep you posted.
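The degradation described above is classic overfitting, and a generic way to stop at the optimal epoch is to track validation loss and keep the best checkpoint. A minimal sketch of that pattern (hypothetical `train_one_epoch`/`validate` callables, not bonito's actual training loop):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=50, patience=5):
    """Keep the weights from the epoch with the lowest validation loss
    and stop once it has not improved for `patience` epochs in a row."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)
    return model
```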
Demultiplexing performance should be on par with guppy in v0.3.6.