Low number of good pairs after filtering #52

Open
myxotheles opened this issue Sep 11, 2023 · 1 comment

myxotheles commented Sep 11, 2023

Hi,

I am trying to set up duplex sequencing, but I get a very low number of 'good pairs' after filtering and, consequently, a very low number of called duplex reads. For example:

Total Reads                        28801102
Read pairs (n)                     10901142
Paired (%)                         75
Good pairs                         1151139
Good pairs (%)                     4
BAM Duplex reads                   1002726
Percentage of original reads (%)   3.48
Mapped (%)                         94

So in this example, I am left with only 4% of the original reads.
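
The percentages can be cross-checked from the raw counts. The sketch below is only an illustration: the thread does not say which tool produced the metrics, so the formulas (each percentage taken against total reads, and each pair counted as two reads) are assumptions, and the variable names are hypothetical.

```python
# Counts copied from the summary above.
total_reads = 28_801_102
read_pairs = 10_901_142
good_pairs = 1_151_139
duplex_reads = 1_002_726


def pct(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Assumption: a pair accounts for two simplex reads (template + complement).
paired_pct = pct(2 * read_pairs, total_reads)

# Assumption: these two are reported relative to the total read count.
good_pairs_pct = pct(good_pairs, total_reads)
duplex_pct = pct(duplex_reads, total_reads)

# Where the loss happens: candidate pairs -> good pairs -> duplex reads.
good_of_candidates_pct = pct(good_pairs, read_pairs)
duplex_of_good_pct = pct(duplex_reads, good_pairs)

print(f"Paired reads / total reads:   {paired_pct:.2f}%")
print(f"Good pairs / total reads:     {good_pairs_pct:.2f}%")
print(f"Duplex reads / total reads:   {duplex_pct:.2f}%")
print(f"Good pairs / candidate pairs: {good_of_candidates_pct:.2f}%")
print(f"Duplex reads / good pairs:    {duplex_of_good_pct:.2f}%")
```

Under these assumptions, the filtering step keeps roughly one in ten candidate pairs, while most of the pairs that pass the filter go on to yield a duplex read. The drop therefore happens mainly at filter_pairs rather than at duplex basecalling.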

I am using the basic usage as recommended:

duplex_tools pairs_from_summary $output_dir/sequencing_summary.txt $output_dir

duplex_tools filter_pairs $output_dir/pair_ids.txt $output_dir

nanopore_guppy guppy_basecaller_duplex \
        --input_path $input_dir \
        -r --save_path $duplex_dir \
        --device auto \
        --config $model \
        --duplex_pairing_mode from_pair_list \
        --duplex_pairing_file $output_dir/pair_ids_filtered.txt \
        --align_ref $ref \
        --bam_out

Questions:

1. Why do I get so few good pairs and, subsequently, so few duplex reads? A yield of 4% is not very useful.
2. Should I skip the filtering step and run the second guppy run with pair_ids.txt instead?

Lastly, the duplex basecalling workflow could benefit from simplification. The Dorado usage looks good, but I am getting errors, so it's not working for me at the moment. It would be great if the guppy workflow could be simplified!

@ollenordesjo
Contributor

Hi @myxotheles,

Apologies for the late reply. We're phasing out duplex-tools in favour of the batteries-included duplex workflow in dorado.

Sorry to hear you're running into issues. It would be very helpful to know which errors you are getting with dorado, as that is the method we currently recommend.

Just a couple of sanity checks for the run and dataset:

  • Was the flow cell a high-duplex flow cell?
  • What is the read length of the sample?
  • Is the sample native human or something else?
  • For the basecalling, were both the pass and fail reads included in the input dir?

Lastly, which tool do the summary metrics you're reporting come from?
