Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combining reads post duplex basecalling for assembly #25

Open
surabhiranavat opened this issue Nov 4, 2022 · 5 comments
Open

Combining reads post duplex basecalling for assembly #25

surabhiranavat opened this issue Nov 4, 2022 · 5 comments

Comments

@surabhiranavat
Copy link

This may be a naïve question as I'm new to ONT sequencing and analyses. What is the best approach to combine reads post simplex and duplex basecalling for genome assembly? I have a low number of duplex reads, and reads from simplex basecalling. I also have the split reads using split_on_adapter (which I need to assess).

Thanks,
Surabhi

@dithiii
Copy link

dithiii commented Nov 12, 2022

I have the same question. Anyone have a suggestion?

@CharlotteAnne
Copy link

same here

@cjw85
Copy link
Member

cjw85 commented Nov 14, 2022

We have no definitive recommendation on this topic. I can provide only suggestions.

The flye assembler has options for "corrected reads" and for "trusted contigs". These definitions are most apt for duplex leads. Simplex reads should be provided as raw nanopore reads to Flye. Providing the constituent simplex reads of duplex reads should be avoided on grounds of double counting.

@surabhiranavat
Copy link
Author

I asked the Nanopore support team for recommendation and here's what they said: You can either use simplex or duplex. But avoid using both to remove data redundancy. In case, if your coverage is low by using only duplex reads and you want include simplex reads as well, remove the corresponding forward and reverse reads of duplex reads from simplex and merge with the duplex reads. ONT do not have specific tool for this function. However, you can use "Seqtk" (https://github.com/lh3/seqtk ). First generate a set of single strand read ids (this will be all read ids in the simplex basecalled file minus the ids in pair_ids_filtered.txt). Then use the read ids to extract the corresponding simplex reads from simplex basecalled file (use Seqtk - Subseq function). Then merge them with the duplex base called reads.

@dithiii
Copy link

dithiii commented Nov 15, 2022

@surabhiranavat Thanks for the idea. I used seqkit instead and I'll cross-post what I did from the same question on ONT's community site.

  1. Run guppy to call simplex
  2. Run guppy again to call duplex (this really should be an option so you only have to run guppy once).
    Now you have two output directories, simplex/.../fastq_pass and duplex/.../fastq_pass. Next run duplex_tools from the nanopore github page as described in the duplex tutorial to generate 'pair_ids_filtered.txt'. Then do the following:

###Combine all the simplex fastqs into one file.
cat simplex-guppy/*.fastq.gz > simplex.fastq.gz

###Use seqkit to extract all the names from the simplex run.
seqkit seq --name simplex_sup.fastq.gz > simplex_ids_txt

###Replace the reads in the simplex with the duplex reads.
{ sed 's/ /\n/' pair_ids_filtered.txt | seqkit grep -v -f - simplex_sup.fastq.gz ; zcat duplex-guppy/../fastq_pass/*.fastq.gz ; } | gzip - > combined.fastq.gz

In that last line I'll explain what's going on. The {} block just means execute both of these commands in order and then send their output together to the next command via the '|' (pipe). The 'sed' command processes the pair_ids_filtered.txt file to put every read ID on its own line. Those read IDs are piped as input to seqkit grep which matches the IDs in the "file" (named "-") which is the input from the previous command. The -v inverts the search, so we're saying to seqkit, "Here's a bunch of read IDs from the paired_ids_filtered.txt, look through the simplex fastq file and match them to their associated sequences and spit out every read that doesn't match them. Now we've generated a file that has all the simplex reads minus the ones that exist in the duplex read set. Those are sent outside the {} block along with the next command which cats all the fastq files from 'duplex-guppy/../fastq_pass/*fastq.gz' to the next command via '|' (pipe). We gzip the reads in memory and output them to a new file, combined.fastq.gz. This method avoids writing any temporary files to the disk and it's a pretty quick hack. No need to run guppy a third time with a list of ids anymore, that's wasteful.

Obviously the best solution would be to invoke guppy with a flag that would enable duplex calling followed by simplex calling for any reads that weren't called by duplex, but that would require an update to guppy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants