Combining reads post duplex basecalling for assembly #25

surabhiranavat · 2022-11-04T16:31:06Z

This may be a naïve question as I'm new to ONT sequencing and analyses. What is the best approach to combine reads post simplex and duplex basecalling for genome assembly? I have a low number of duplex reads, and reads from simplex basecalling. I also have the split reads using split_on_adapter (which I need to assess).

Thanks,
Surabhi

dithiii · 2022-11-12T18:56:36Z

I have the same question. Anyone have a suggestion?

CharlotteAnne · 2022-11-14T02:45:40Z

same here

cjw85 · 2022-11-14T07:39:25Z

We have no definitive recommendation on this topic. I can provide only suggestions.

The flye assembler has options for "corrected reads" and for "trusted contigs". These definitions are most apt for duplex leads. Simplex reads should be provided as raw nanopore reads to Flye. Providing the constituent simplex reads of duplex reads should be avoided on grounds of double counting.

surabhiranavat · 2022-11-14T09:48:16Z

I asked the Nanopore support team for recommendation and here's what they said: You can either use simplex or duplex. But avoid using both to remove data redundancy. In case, if your coverage is low by using only duplex reads and you want include simplex reads as well, remove the corresponding forward and reverse reads of duplex reads from simplex and merge with the duplex reads. ONT do not have specific tool for this function. However, you can use "Seqtk" (https://github.com/lh3/seqtk ). First generate a set of single strand read ids (this will be all read ids in the simplex basecalled file minus the ids in pair_ids_filtered.txt). Then use the read ids to extract the corresponding simplex reads from simplex basecalled file (use Seqtk - Subseq function). Then merge them with the duplex base called reads.

dithiii · 2022-11-15T05:08:06Z

@surabhiranavat Thanks for the idea. I used seqkit instead and I'll cross-post what I did from the same question on ONT's community site.

Run guppy to call simplex
Run guppy again to call duplex (this really should be an option so you only have to run guppy once).
Now you have two output directories, simplex/.../fastq_pass and duplex/.../fastq_pass. Next run duplex_tools from the nanopore github page as described in the duplex tutorial to generate 'pair_ids_filtered.txt'. Then do the following:

###Combine all the simplex fastqs into one file.
cat simplex-guppy/*.fastq.gz > simplex.fastq.gz

###Use seqkit to extract all the names from the simplex run.
seqkit seq --name simplex_sup.fastq.gz > simplex_ids_txt

###Replace the reads in the simplex with the duplex reads.
{ sed 's/ /\n/' pair_ids_filtered.txt | seqkit grep -v -f - simplex_sup.fastq.gz ; zcat duplex-guppy/../fastq_pass/*.fastq.gz ; } | gzip - > combined.fastq.gz

In that last line I'll explain what's going on. The {} block just means execute both of these commands in order and then send their output together to the next command via the '|' (pipe). The 'sed' command processes the pair_ids_filtered.txt file to put every read ID on its own line. Those read IDs are piped as input to seqkit grep which matches the IDs in the "file" (named "-") which is the input from the previous command. The -v inverts the search, so we're saying to seqkit, "Here's a bunch of read IDs from the paired_ids_filtered.txt, look through the simplex fastq file and match them to their associated sequences and spit out every read that doesn't match them. Now we've generated a file that has all the simplex reads minus the ones that exist in the duplex read set. Those are sent outside the {} block along with the next command which cats all the fastq files from 'duplex-guppy/../fastq_pass/*fastq.gz' to the next command via '|' (pipe). We gzip the reads in memory and output them to a new file, combined.fastq.gz. This method avoids writing any temporary files to the disk and it's a pretty quick hack. No need to run guppy a third time with a list of ids anymore, that's wasteful.

Obviously the best solution would be to invoke guppy with a flag that would enable duplex calling followed by simplex calling for any reads that weren't called by duplex, but that would require an update to guppy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combining reads post duplex basecalling for assembly #25

Combining reads post duplex basecalling for assembly #25

surabhiranavat commented Nov 4, 2022

dithiii commented Nov 12, 2022

CharlotteAnne commented Nov 14, 2022

cjw85 commented Nov 14, 2022

surabhiranavat commented Nov 14, 2022

dithiii commented Nov 15, 2022 •

edited

Loading

Combining reads post duplex basecalling for assembly #25

Combining reads post duplex basecalling for assembly #25

Comments

surabhiranavat commented Nov 4, 2022

dithiii commented Nov 12, 2022

CharlotteAnne commented Nov 14, 2022

cjw85 commented Nov 14, 2022

surabhiranavat commented Nov 14, 2022

dithiii commented Nov 15, 2022 • edited Loading

dithiii commented Nov 15, 2022 •

edited

Loading