Mis-classifying Sanger+33 FASTQ as Solexa+64 #24

tseemann · 2016-02-12T05:03:14Z

We have downloaded some Illumina PE reads from SRA and we got the CONTRADICT_FASTQ error.

Both R1 and R2 were in Sanger+33 quality format. However, we found that the first read in R1 has the quality symbol K, which is Phred 42. Illumina qualities usually stop at 40, but they can be higher (e.g. in Moleculo sequencing), as described here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

I think you need to adjust the thresholds in the code below so that SANGER_FASTQ tolerates higher Q values. Maybe change 74 to 80?

    if(chr < 59){
        format_new = SANGER_FASTQ;
        break;
    }
    if(chr > 74){
        format_new = SOLEXA_FASTQ;
        break;
    }
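
To make the effect of the threshold concrete, here is a minimal sketch, not skewer's actual code. It wraps the same two comparisons in a standalone function and turns the upper bound into a parameter. The names `QualFormat`, `guessFormat` and `upperBound` are hypothetical, and the surrounding loop is an assumption about how the fragment above is used; only the constants 59 and 74 come from the quoted code.

```cpp
#include <string>

// Hypothetical names for illustration; not taken from the skewer source.
enum class QualFormat { Unknown, Sanger, Solexa };

// Scan one quality string and stop at the first decisive character,
// mirroring the two comparisons in the quoted fragment.
QualFormat guessFormat(const std::string& qual, int upperBound) {
    for (unsigned char chr : qual) {
        // The lowest Solexa+64 character is ';' (59), so anything below it
        // can only come from a +33 encoding.
        if (chr < 59) return QualFormat::Sanger;
        // Anything above the bound is taken as evidence for +64.
        if (chr > upperBound) return QualFormat::Solexa;
    }
    return QualFormat::Unknown;  // every character fell in the ambiguous range
}
```

With `upperBound` set to 74, a quality string such as `AAAFFK` is classified as Solexa as soon as K (ASCII 75) is reached. With 80, the same string stays undecided, so a later read containing a character below 59 can still settle it as Sanger.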


maciejmotyka · 2022-03-30T20:42:02Z

I know that this software is not maintained anymore, but it's still in use in some pipelines, so maybe my comment will help somebody debug.

If the situation described above produces this error message:

Error: the FASTQ quality formats of input files are different

The solution is to determine the encoding yourself by examining the .fastq files and then specify it manually with the -f flag:

-f, --format Format of FASTQ quality value: sanger|solexa|auto; (auto)

In my case the first record in the first file was:

@SRR10266853.1 1 length=76
NACACTCCTGCCGGCTGGTCTTGGCCGCTGCCGTCCCTGCAGGCCTGAGCTGGGGGGCTTCGGCCACACTCGGAAC
+
#AAFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKFFKKKFKFFKKKK

skewer saw the # symbol (ASCII 35, below the cutoff of 59) and correctly decided it's Sanger/Illumina 1.8+ encoding.

First record in the second file was:

@SRR10266853.1 1 length=74
CTCAGACAACGACAGCACAGAGAACGAGGCCCCAGAGCCGAGGGAGAGGGTTCCGAGTGTGGCCGAAGCCCCCC
+
AAAFFKKKKKKKKKKKKKAFKKKKKKKKKKKKFFFKFKKKA7AFKKKKFK,AKKKFF7FAFK7FKFAFFKKKKK

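Here skewer proceeded until it saw K (ASCII 75, above the cutoff of 74) and decided it's Solexa/Illumina 1.3+/Illumina 1.5+ encoding, while we can clearly see that it's Illumina 1.8+: further along, the same quality string contains 7 and the comma, both below 59, which cannot occur in a +64 encoding.
HiSeq 3000/4000 and the X series can produce scores that include K.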

markziemann mentioned this issue Feb 12, 2023

skewer problem markziemann/dee2#99

Closed
