Mis-classifying Sanger+33 FASTQ as Solexa+64 #24

Open
tseemann opened this issue Feb 12, 2016 · 1 comment

@tseemann

We have downloaded some Illumina PE reads from SRA and we got the CONTRADICT_FASTQ error.

Both R1 and R2 were in Sanger+33 quality format. However, we found that the first read in R1 has a quality symbol K, which is Phred 42. Usually Illumina qualities stop at 40, but they can be higher (e.g. in Moleculo sequencing), as described here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

I think you need to adjust the thresholds in the code below to be more flexible about how high a Q value you allow for SANGER_FASTQ. Maybe change 74 to 80?

                if(chr < 59){
                    format_new = SANGER_FASTQ;
                    break;
                }
                if(chr > 74){
                    format_new = SOLEXA_FASTQ;
                    break;
                }
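
To make the arithmetic concrete, here is a minimal sketch of the same early-exit check with the upper threshold passed in as a parameter. It is not skewer's actual code: the `classify` and `phred33` helpers and the `UNKNOWN_FASTQ` value are hypothetical, and I am assuming the snippet above runs once per quality character until one of the two branches fires.

```cpp
#include <string>

// SANGER_FASTQ and SOLEXA_FASTQ are the names from the snippet above;
// UNKNOWN_FASTQ is a hypothetical placeholder for "no decision yet".
enum QualityFormat { UNKNOWN_FASTQ, SANGER_FASTQ, SOLEXA_FASTQ };

// Phred score of a quality character under the +33 offset.
int phred33(char c) { return c - 33; }

// Hypothetical helper: walk one quality string and stop at the first
// character that falls outside the ambiguous band [59, upper].
QualityFormat classify(const std::string& qual, int upper) {
    for (unsigned char chr : qual) {
        if (chr < 59) return SANGER_FASTQ;     // too low for any +64 encoding
        if (chr > upper) return SOLEXA_FASTQ;  // treated as too high for +33
    }
    return UNKNOWN_FASTQ;  // every character was in the ambiguous band
}
```

With `upper = 74`, a string that starts `AAAFFK` is called Solexa as soon as `K` (ASCII 75, Phred 42 under +33) is reached. With `upper = 80` the scan continues until a character below 59 settles it as Sanger. Whether 80 is the right cut-off is an open question, since it also widens the band in which genuine +64 data stays undecided.
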
@maciejmotyka

maciejmotyka commented Mar 30, 2022

I know that this software is not maintained anymore, but it's still in use in some pipelines, so maybe my comment will help somebody debug.

If the situation described above produces the error message:

Error: the FASTQ quality formats of input files are different

The solution is to determine the encoding yourself by examining the .fastq files and then specify it manually using the -f flag:

-f, --format Format of FASTQ quality value: sanger|solexa|auto; (auto)

In my case the first record in the first file was:

@SRR10266853.1 1 length=76
NACACTCCTGCCGGCTGGTCTTGGCCGCTGCCGTCCCTGCAGGCCTGAGCTGGGGGGCTTCGGCCACACTCGGAAC
+
#AAFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKFFKKKFKFFKKKK

skewer saw the # symbol and decided it's Sanger/Illumina 1.8+ encoding (correctly).

First record in the second file was:

@SRR10266853.1 1 length=74
CTCAGACAACGACAGCACAGAGAACGAGGCCCCAGAGCCGAGGGAGAGGGTTCCGAGTGTGGCCGAAGCCCCCC
+
AAAFFKKKKKKKKKKKKKAFKKKKKKKKKKKKFFFKFKKKA7AFKKKKFK,AKKKFF7FAFK7FKFAFFKKKKK

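Here skewer proceeded until it saw K (ASCII 75, above the 74 threshold) and decided it's Solexa/Illumina 1.3+/Illumina 1.5+ encoding, while we can clearly see that it's Illumina 1.8+: characters such as 7 and , further along the line are below anything a +64 encoding can produce.
HiSeq 3000/4000 and the X series can produce scores which include K.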
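
If you want to check the encoding yourself before choosing a value for -f, one option is to look at the full range of quality characters rather than stopping at the first suspicious one. Below is a rough sketch of that idea. The `quality_range` and `guess_format` helpers are my own hypothetical names and are not part of skewer, and the upper cut-off of 75 (`K`) is an assumption based on the reads in this thread, not a rule from any specification.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Lowest and highest ASCII codes seen across a set of quality strings.
struct QualRange { int lo; int hi; };

QualRange quality_range(const std::vector<std::string>& quals) {
    QualRange r{255, 0};
    for (const auto& q : quals) {
        for (unsigned char c : q) {
            r.lo = std::min<int>(r.lo, c);
            r.hi = std::max<int>(r.hi, c);
        }
    }
    return r;
}

// Heuristic guess using the value names accepted by skewer's -f flag.
// Anything below ';' (59) cannot come from a +64 encoding, so that test
// runs first; the check against 'K' (75) is only an assumed heuristic.
const char* guess_format(const QualRange& r) {
    if (r.lo < 59) return "sanger";
    if (r.hi > 75) return "solexa";
    return "ambiguous";
}
```

For the second record above, the lowest character is `,` (44) and the highest is `K` (75), so this sketch would report sanger even though the line starts with `AAAFFK`. In practice you would feed it the quality lines of many reads, not just the first one.
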
