promethion good pairs: 0 #44

Open
bef22 opened this issue Apr 18, 2023 · 3 comments

@bef22

bef22 commented Apr 18, 2023

Hi
I'm using duplex_tools filter_pairs (duplex_tools version 0.3.2) on FASTQ files from a PromethION run, and out of 2759916 duplex pairs none are reported as good. I found the issue where installing into a new virtual environment fixed this problem, but that didn't work for me. I also gunzipped all the FASTQ files and still no good pairs are reported.
The PromethION run was created with Guppy 6.4.6 on an R10 flow cell. Any ideas what else I could try?
Bettina

filter_pairs_minLen1000_gunzip.log

@ollenordesjo
Contributor

Hi @bef22! Thanks for the question.

Can you try this again with the additional flag --debug and see if there's a specific reason why reads are skipped? It may be the case that you need to adjust the length settings.

There are four subtly different reasons a read may be skipped, so it would be good to know which one applies here.

https://git.oxfordnanolabs.local/research/duplex-tools/-/blob/dev/duplex_tools/filter_pairs.py#L236

If you send the command you used together with a short description of the folder/file structure, it may also help in the next step.

Thanks!

@bef22
Author

bef22 commented Apr 19, 2023

Thanks for your suggestion. I have now traced the problem, which could be a bug or me misunderstanding the options. I was originally running this:
duplex_tools filter_pairs --min_length 1000 pair_ids.txt pathTo/fastq_pass
This gave me no good pairs, with the message "seq1 or seq2 not in requested length range", even though I know that I have read pairs in which both reads are >1kb long.

I then ran this, as you suggested:
duplex_tools filter_pairs --debug pair_ids.txt pathTo/fastq_pass
This reported "Aligning 2759916 pairs" and I did get "Good pairs: 1045256".

So I thought that I might have to specify both --min_length and --max_length and tried:
duplex_tools filter_pairs --debug --min_length 1000 --max_length 1000000 pair_ids.txt pathTo/fastq_pass
This again failed to give any good pairs.

The last few rows of the debug report are:
[14:09:30 - AlignPairs] Skipped 0ca6731f-8fa9-4423-b162-a71ccf24aafd: sequence missing.
[14:09:30 - AlignPairs] Skipped Pandas(Index=2759909, first='5ef6ee2a-f3bb-4d03-99c0-b175f3c9b1ba', second='9fde27d0-e3fe-436c-b86b-4ecbb127e970'), seq1 or seq2 not in requested length range
[14:09:30 - AlignPairs] Skipped Pandas(Index=2759910, first='0fa786fc-8c5e-5001-a123-e447fdf1a275', second='c7866df3-49e6-5da4-b741-229d08705590'), seq1 or seq2 not in requested length range
[14:09:30 - AlignPairs] Skipped Pandas(Index=2759911, first='e7f8063a-fd7c-53b7-b51d-bca798b9791d', second='ac9ec84f-856d-5af3-a89f-877a394f6bfd'), seq1 or seq2 not in requested length range
[14:09:30 - AlignPairs] Skipped 5eda2b6c-6e63-583d-b673-5e713efc23df: sequence missing.
[14:09:30 - AlignPairs] Skipped Pandas(Index=2759913, first='7aca931a-3032-500c-a1c9-adcd10718047', second='ed5f1f35-7f8b-5656-ad1b-d7fa8433bf61'), seq1 or seq2 not in requested length range
[14:09:30 - AlignPairs] Skipped Pandas(Index=2759914, first='0deda2e7-b49a-50a7-86aa-dec7b9f0a613', second='f5531828-1b4b-5dc5-8dba-d5ca8b1b0b6a'), seq1 or seq2 not in requested length range
[14:09:30 - AlignPairs] Skipped 37738eb2-ca22-5eb3-80fb-455bba5fba29: sequence missing.
[14:09:30 - AlignPairs] Good pairs: 0
[14:09:30 - AlignPairs] defaultdict(<class 'int'>, {'skipped': 2759916, 'read1 missing': 56998, 'read0 missing': 179962, 'good': 0})
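
For a log of this size, it may help to tally the skip reasons rather than read the tail. The following is a minimal sketch, not part of duplex_tools: the function name `tally_skip_reasons` is hypothetical, and the two patterns are derived only from the message formats quoted above, so any other skip message would go uncounted.

```python
import re
from collections import Counter

# Patterns based only on the two debug message shapes quoted in this thread
# (assumption: other skip messages, if any, use a different shape).
PAIR_SKIP = re.compile(r"Skipped Pandas\(.*?\), (?P<reason>.+)$")
READ_SKIP = re.compile(r"Skipped (?P<read_id>[0-9a-f-]{36}): (?P<reason>.+?)\.?$")


def tally_skip_reasons(lines):
    """Count how often each skip reason appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        match = PAIR_SKIP.search(line) or READ_SKIP.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts
```

Passing an open file handle for the saved debug output to `tally_skip_reasons` would give one count per distinct reason, which can be compared against the summary dictionary printed at the end of the run.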

I don't have to filter by size at this stage so could continue with all good pairs, but I would like to understand if I used the --min_length argument correctly.
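
One way to confirm independently that pairs with both reads >1kb exist is to measure the read lengths outside duplex_tools. This is a rough sketch under several assumptions: pair_ids.txt is taken to hold two whitespace-separated read IDs per line, the FASTQ files are taken to use four lines per record, and the length bounds are treated as inclusive. All function names are hypothetical and none of this reflects how filter_pairs itself applies --min_length.

```python
import gzip


def read_lengths(fastq_path):
    """Yield (read_id, sequence_length) from a FASTQ file with 4 lines per record."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.startswith("@"):
                continue  # skip blank or unexpected lines
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield (header[1:].split() or [""])[0], len(sequence)


def load_pairs(pair_ids_path):
    """Read (first, second) read IDs, one pair per line (assumed file layout)."""
    with open(pair_ids_path) as handle:
        rows = (line.split() for line in handle)
        return [tuple(fields) for fields in rows if len(fields) == 2]


def count_pairs_in_range(pairs, lengths, min_length=0, max_length=float("inf")):
    """Classify each pair by whether both reads fall within the inclusive bounds."""
    counts = {"in range": 0, "out of range": 0, "missing": 0}
    for first, second in pairs:
        if first not in lengths or second not in lengths:
            counts["missing"] += 1
        elif all(min_length <= lengths[r] <= max_length for r in (first, second)):
            counts["in range"] += 1
        else:
            counts["out of range"] += 1
    return counts
```

Building a dictionary with `dict(read_lengths(path))` for each file under fastq_pass, merging them, and passing the result with `load_pairs("pair_ids.txt")` and `min_length=1000` to `count_pairs_in_range` would show how many pairs should survive a 1000-base lower bound.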

Thanks for your help.

Bettina

@ollenordesjo
Contributor

Hi @bef22, sorry for taking a while to respond. Is there any chance you can print out the length of the sequences (or even the sequences themselves) at this location in the code?

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/filter_pairs.py#L237

It may be easiest to add another logger.debug(... line for printing this information.
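
As an illustration of the kind of debug line meant here, the sketch below logs both lengths together with the bounds and their types before the comparison. It is a hypothetical stand-in, not the code in filter_pairs.py: the function name `lengths_in_range`, the logger name, and the inclusive comparison are all assumptions.

```python
import logging

# Assumption: the logger name mirrors the "AlignPairs" prefix seen in the log output.
logger = logging.getLogger("AlignPairs")


def lengths_in_range(seq1, seq2, min_length, max_length):
    """Hypothetical length check that logs what it compares before deciding."""
    len1, len2 = len(seq1), len(seq2)
    # Logging the bounds and their types shows what values the check actually received.
    logger.debug(
        "len(seq1)=%d len(seq2)=%d min_length=%r (%s) max_length=%r (%s)",
        len1, len2,
        min_length, type(min_length).__name__,
        max_length, type(max_length).__name__,
    )
    return min_length <= len1 <= max_length and min_length <= len2 <= max_length
```

With --debug enabled, a line like this would appear next to each "not in requested length range" message, so the logged lengths can be compared directly with the requested range.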

Cheers
