Major drop quality score with --trim adapters #53

Open
felipebatalini opened this issue Nov 13, 2024 · 2 comments
Labels
question Further information is requested

Comments


felipebatalini commented Nov 13, 2024

Why are my q scores dropping so much with --trim adapters?

We are using FLO-MIN114 and R10 chemistry for a cDNA library derived from human RNA.
We noticed a high percentage (>50%) of unusable reads detected by pychopper with the wf-transcriptome workflow, and then identified that it could be improved to <10% if we trimmed the adapters (therefore keeping primers).

However, I was surprised to see a significant drop in the quality scores when I turn on --trim adapters:
nextflow run epi2me-labs/wf-basecalling \
    -profile singularity \
    --sample_name $sample_name \
    --input $pod5_dir \
    --dorado_ext pod5 \
    --basecaller_cfg dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
    --qscore_filter 10 \
    --basecaller_args "--trim adapters" \
    --output_fmt fastq \
    --out_dir $results_folder

While it makes sense for pychopper to work better with the primers present, I can't understand why the basecalling quality drops so much. In the example below, I show the different q scores from the same sample.
[Figure: BC10, without trimming parameters]
[Figure: BC10, with --trim adapters on]

I appreciate any help to understand this!
Felipe

felipebatalini added the question label Nov 13, 2024
cjw85 (Contributor) commented Nov 21, 2024

This is because the phred scores for bases in adapter, barcode, and primer regions are typically suppressed compared to bases further into reads. By default, dorado trims all of these components, so when the workflow computes the read quality score from the remaining bases, the value is higher than when --trim adapters is enabled and barcode and primer sequences are left in place.

Dorado itself reports read quality scores after dropping the first 60 per-base quality scores (see e.g. CRFModelConfig.cpp#L41). The workflow component responsible for the data behind these graphs does not do this, as it has no knowledge of whether the basecall has been pretrimmed of adapters, barcodes, and primers.
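
To illustrate the effect described above, here is a minimal Python sketch. It assumes the read quality score is the mean of the per-base error probabilities converted back to phred scale, which is a common convention but is not confirmed in this thread for either tool. The function name `mean_qscore`, the `skip` parameter, the fallback for short reads, and the example read are all hypothetical and chosen only for illustration.

```python
import math


def mean_qscore(quals, skip=0):
    """Mean read Q score from per-base phred scores.

    Averages in error-probability space, then converts back to phred.
    `skip` ignores that many leading bases; as an assumption of this
    sketch, reads no longer than `skip` fall back to using every base.
    """
    if not quals:
        raise ValueError("read has no quality scores")
    scores = quals[skip:] if len(quals) > skip else quals
    mean_err = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_err)


# Hypothetical read: 60 low-confidence leading bases (Q5), e.g. adapter or
# primer sequence left in place, followed by 940 bases at Q20.
read = [5] * 60 + [20] * 940

all_bases = mean_qscore(read)            # every base counted
skip_first_60 = mean_qscore(read, 60)    # leading 60 bases ignored

print(f"all bases:     Q{all_bases:.1f}")
print(f"skip first 60: Q{skip_first_60:.1f}")
```

With these made-up numbers, the same read scores roughly Q15.5 when every base is counted and Q20 when the first 60 bases are ignored, even though the bulk of the read is identical in both cases.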

felipebatalini (Author) commented

@cjw85 Thanks for your answer. So, in terms of the overall assessment of our sequencing quality, it seems that the top graph is more representative of the true basecalling quality, and that the drop in Q score is an artifact caused by the adapters or primers (in this case) rather than by poor sequencing quality. We are not multiplexing, so there are no barcodes. Would you agree with this assessment?
