Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairs.txt file empty, but pairs_from_bam/pair_ids.txt not empty #35

Open
avilella opened this issue Feb 6, 2023 · 1 comment
Open

Comments

@avilella
Copy link

avilella commented Feb 6, 2023

See below:

LS4:/data2/RUN/RUN091/20230203_1010_MN41147_ANY366_3647ace0/fastq_all$ duplex_tools pair *.sam > pairs.txt                                                                                      [10:53:04 - FindPairs] Duplex tools version: 0.3.0 
[10:53:04 - FindPairs] Creating seqsummary from bam                                                                                                                                                      24324it [00:01, 13652.28it/s]  
[10:53:05 - FindPairs] Calculating metrics.                                                                                                                                                              [10:53:05 - FindPairs] No alignment information found for validation.
[10:53:05 - FindPairs] Classifying pairs.                                                                                                                                                                [10:53:05 - FindPairs] 24242 pairs after filtering on duration between reads. (max 200000 s)
[10:53:05 - FindPairs] 4852 pairs after filtering on relative sequence length difference. (max 10.0% difference)                                                                                         [10:53:05 - FindPairs] 4851 pairs after absolute sequence length filtering. (max 5000 bp)
[10:53:05 - FindPairs] 4495 pairs after qscore filtering. (min qscore = 6)                                                                                                                               [10:53:05 - FindPairs] Found 4495 pairs within 24324 reads. (37.0% of reads are part of a pair).
[10:53:05 - FindPairs] Values above 100% are allowed since reads can be either template or complement                                                                                             
[10:53:05 - FindPairs] Writing files into pairs_from_bam directory                                                                                                                                       
[10:53:05 - FiltPairs] Duplex tools version: 0.3.0   
[10:53:05 - FiltPairs] Filtering Parameters:                                                        
        nbases_per_strand_to_align=250                                                              
        align_threshold=0.6
[10:53:05 - FiltPairs] Alignment Parameters:                                                        
        score_match=2                                                                               
        score_mismatch=-1                                                                                                                                                                                        penalty_open=4                                                                                                                                                                                   
        penalty_extend:1                                                                                                                                                                                 [10:53:07 - ReadFastq] Processed 0/1 input fastq/bam files.                                                                                                                                              
[10:53:07 - ReadFastq] Found 100.0% of required reads.
[10:53:07 - FiltPairs] Starting alignments.                                                         
[10:53:07 - AlignPairs] Aligning 4495 pairs                                                         
[10:53:07 - AlignPairs] Good pairs: 1799
[10:53:09 - Pair] Initial reads: 24324
[10:53:09 - Pair] Created pairs: 1799
[10:53:09 - Pair] Paired reads:  3598
[10:53:09 - Pair] Approximate duplex rate for fast5_all.sup.v4.0.0.wmoves.sam: 14.79%

The results are:

[          0 Feb  6 10:53]  ./pairs.txt
[       4096 Feb  6 10:53]  ./pairs_from_bam
[     332630 Feb  6 10:53]  ./pairs_from_bam/pair_ids.txt
[    1858079 Feb  6 10:53]  ./pairs_from_bam/pair_stats.txt
[     133126 Feb  6 10:53]  ./pairs_from_bam/pair_ids_filtered.txt
[     358563 Feb  6 10:53]  ./pairs_from_bam/pair_ids_scored.csv
[    2661432 Feb  6 10:53]  ./pairs_from_bam/read_segments.pkl

Could it be that the way the command was run, it outputs the pairs in the pairs_from_bam/pair_ids.txt file, rather than STDOUT?

Thanks

@ollenordesjo
Copy link
Contributor

Hi @avilella, sorry for the delay in replying

Yes, that's correct, by default it will go to what the --output_dir is set to. The reason for this is to keep the associated files together (both the paired ids without the filtering, and also the filtered ids). Note that you would typically want to use the filtered pair ids for downstream analysis. If there are any reads missing, it's always possible to reconstruct expected pairs from either the pair_stats file (which has some additional metrics), or by the pair_ids.txt file (if you are ok with some false positives).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants