Consensus sequences #22

Open
sihanbu opened this issue Oct 18, 2024 · 1 comment
Labels
question Further information is requested

Comments


sihanbu commented Oct 18, 2024

Hello,

I'm new to whole genome assembly using ONT reads. I have two questions.

Suppose the consensus sequences identified during the inference are not adapters but are important to the project (e.g., specific sequences I'm studying). In this case, how would I ensure they are not mistakenly inferred and trimmed?

I’m curious about why unknown adapters (consensus sequences) appear during ONT sequencing. Since the adapter sequences used are already known, shouldn't we be able to trim them off directly? Where do these unknown adapters originate from?

Thank you for your assistance!

Best,
Sihan

qbonenfant added the "question" label (Further information is requested) on Oct 18, 2024
@qbonenfant
Collaborator

Hi
About your first question: since it can be quite hard to tell common patterns apart from adapters, we added an option to exclude a list of k-mers from the counting phase. This is not perfect, but it will work fine if you need to prevent trimming of a specific sequence. Look for the "forbid_kmer" option in the configuration file.
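To illustrate the idea of excluding k-mers from the counting phase, here is a minimal Python sketch. It is not Porechop_ABI's actual code, and it does not show the syntax of the "forbid_kmer" option. The function names, the toy value of k, and the choice to also forbid reverse-complement k-mers are all assumptions made for the example.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def forbidden_kmers(protected_seqs, k):
    """Collect the k-mers of sequences that must not be counted.

    Assumption: both strands are forbidden. The real tool may behave differently.
    """
    forbidden = set()
    for seq in protected_seqs:
        for strand in (seq, reverse_complement(seq)):
            forbidden.update(kmers(strand, k))
    return forbidden

def count_kmers(read_ends, k, forbidden=frozenset()):
    """Count k-mers in read ends, skipping any forbidden k-mer."""
    counts = Counter()
    for end in read_ends:
        for kmer in kmers(end, k):
            if kmer not in forbidden:
                counts[kmer] += 1
    return counts

# Toy data (hypothetical sequences, k chosen small for readability).
protected = ["ACGTAC"]
read_ends = ["ACGTACGG", "TTACGTAC"]
blocked = forbidden_kmers(protected, k=4)
counts = count_kmers(read_ends, k=4, forbidden=blocked)
```

With the forbidden set in place, k-mers from the protected sequence never enter the counts, so they cannot contribute to an inferred adapter. The set returned by `forbidden_kmers` is also the kind of list you would need to prepare for a sequence you want to protect.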

Now, why would "unknown" adapters appear during ONT sequencing?
The answer is quite simple: Oxford Nanopore Technologies does not publicly disclose the adapter sequences, or at least not outside of the ONT community, from what I have seen.

The only known database of ONT adapters when we published our paper was the original Porechop database (adapters.py), curated by Ryan Wick and other contributors. This database has not been maintained since 2018, so any newer adapter is basically unknown.

It seems ONT is doing this on purpose, since recent ONT basecallers (Guppy, Dorado, and others) are supposed to trim the reads during basecalling. Because these tools are based on neural networks, their trimming step is basically a black box for us. This makes them pretty difficult to trust, and their effectiveness is hard to evaluate without the adapter sequences to compare against.

Our study showed (at least for Guppy) that residual (known) adapter sequences can be found in public datasets processed by ONT basecallers. This is why tools such as Porechop_ABI are needed to clean datasets, or at least for quality control.
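As a rough illustration of that kind of quality control, the sketch below estimates the fraction of reads that still carry a known adapter near their start. It is a simplified stand-in, not the method from our paper: it uses a Hamming-distance scan, which ignores the insertions and deletions that are common in ONT reads, whereas real tools rely on alignment. The adapter string, the window size, and the mismatch threshold are placeholder assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def has_adapter(read, adapter, window=100, max_mismatches=2):
    """True if adapter occurs in the first `window` bases with few mismatches."""
    region = read[:window]
    n = len(adapter)
    return any(
        hamming(region[i:i + n], adapter) <= max_mismatches
        for i in range(len(region) - n + 1)
    )

def residual_fraction(reads, adapter, **kwargs):
    """Fraction of reads whose start still contains the adapter."""
    if not reads:
        return 0.0
    hits = sum(has_adapter(read, adapter, **kwargs) for read in reads)
    return hits / len(reads)

# Placeholder sequences for illustration only (not real ONT adapters).
adapter = "ACGTACGTAC"
reads = [
    "TTACGTACGTACGGGG",  # exact copy of the adapter at offset 2
    "TTTTTTTTTTTTTTTT",  # no adapter
    "ACGTTCGTACAAAAAA",  # adapter with one mismatch at the start
]
fraction = residual_fraction(reads, adapter)
```

A fraction well above zero on reads that a basecaller has supposedly trimmed would suggest that residual adapters remain and that an extra cleaning step is worth running.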

Disclaimer
I have been out of bioinformatics research for 2 years now, and even though I keep reading papers from time to time, you should take my statements with a grain of salt. ONT may have changed its policy recently (and I may be unaware of this), or maybe their basecallers are perfect now? Who knows? Not me, for sure.
