-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combining reads post duplex basecalling for assembly #25
Comments
I have the same question. Anyone have a suggestion? |
same here |
We have no definitive recommendation on this topic. I can provide only suggestions. The flye assembler has options for "corrected reads" and for "trusted contigs". These definitions are most apt for duplex leads. Simplex reads should be provided as raw nanopore reads to Flye. Providing the constituent simplex reads of duplex reads should be avoided on grounds of double counting. |
I asked the Nanopore support team for recommendation and here's what they said: You can either use simplex or duplex. But avoid using both to remove data redundancy. In case, if your coverage is low by using only duplex reads and you want include simplex reads as well, remove the corresponding forward and reverse reads of duplex reads from simplex and merge with the duplex reads. ONT do not have specific tool for this function. However, you can use "Seqtk" (https://github.com/lh3/seqtk ). First generate a set of single strand read ids (this will be all read ids in the simplex basecalled file minus the ids in pair_ids_filtered.txt). Then use the read ids to extract the corresponding simplex reads from simplex basecalled file (use Seqtk - Subseq function). Then merge them with the duplex base called reads. |
@surabhiranavat Thanks for the idea. I used seqkit instead and I'll cross-post what I did from the same question on ONT's community site.
###Combine all the simplex fastqs into one file. ###Use seqkit to extract all the names from the simplex run. ###Replace the reads in the simplex with the duplex reads. In that last line I'll explain what's going on. The {} block just means execute both of these commands in order and then send their output together to the next command via the '|' (pipe). The 'sed' command processes the pair_ids_filtered.txt file to put every read ID on its own line. Those read IDs are piped as input to seqkit grep which matches the IDs in the "file" (named "-") which is the input from the previous command. The -v inverts the search, so we're saying to seqkit, "Here's a bunch of read IDs from the paired_ids_filtered.txt, look through the simplex fastq file and match them to their associated sequences and spit out every read that doesn't match them. Now we've generated a file that has all the simplex reads minus the ones that exist in the duplex read set. Those are sent outside the {} block along with the next command which cats all the fastq files from 'duplex-guppy/../fastq_pass/*fastq.gz' to the next command via '|' (pipe). We gzip the reads in memory and output them to a new file, combined.fastq.gz. This method avoids writing any temporary files to the disk and it's a pretty quick hack. No need to run guppy a third time with a list of ids anymore, that's wasteful. Obviously the best solution would be to invoke guppy with a flag that would enable duplex calling followed by simplex calling for any reads that weren't called by duplex, but that would require an update to guppy. |
This may be a naïve question as I'm new to ONT sequencing and analyses. What is the best approach to combine reads post simplex and duplex basecalling for genome assembly? I have a low number of duplex reads, and reads from simplex basecalling. I also have the split reads using split_on_adapter (which I need to assess).
Thanks,
Surabhi
The text was updated successfully, but these errors were encountered: