hisat2 treats fastq input files as fastq.gz files, resulting in zero mapping rate #1373

Closed
MingChen0919 opened this issue Jun 20, 2017 · 18 comments · Fixed by galaxyproject/galaxy#4224

Comments

@MingChen0919

Here is my job command line from running galaxy hisat2:

ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_834.dat' input_f.fastq.gz && ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_835.dat' input_r.fastq.gz && ln -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_840.dat' genome.fa && hisat2-build -p ${GALAXY_SLOTS:-1} genome.fa genome && hisat2 -p ${GALAXY_SLOTS:-1} -x 'genome' -1 'input_f.fastq.gz' -2 'input_r.fastq.gz' --secondary | samtools sort - -@ ${GALAXY_SLOTS:-1} -l 6 -o '/opt/galaxy/galaxy-app/database/datasets/000/dataset_855.dat'

My read files are fastq files, not fastq.gz files, but hisat2 treats them as fastq.gz files. If I run the job directly from the command line, I get this error message:

gzip: input_f.fastq.gz: not in gzip format

gzip: input_r.fastq.gz: not in gzip format

The job was able to get executed after removing the .gz extensions from linked file names.
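The gzip error above can be reproduced independently of hisat2 by checking a file's first two bytes rather than trusting its name. The following is a minimal sketch, not part of the hisat2 wrapper; the helper name `is_gzipped` is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes (RFC 1952)


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes.

    Hypothetical helper for illustration: it looks only at the file
    content, so a misleading `.gz` suffix on a symlink does not matter.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

Calling `is_gzipped("input_f.fastq.gz")` on an uncompressed FASTQ linked under a `.gz` name would return `False`, which matches the "not in gzip format" message.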

@mvdbeek
Member

mvdbeek commented Jun 20, 2017

Can you make sure that your datatype is not set to fastq.gz or fastqsanger.gz?
The hisat wrapper links files in based on their datatypes, so if this is wrongly set to fastq.gz or fastqsanger.gz you will get the error you are seeing.

@MingChen0919
Author

Thanks for your reply @mvdbeek. The datatype is fastq. I don't even know how to convert it to fastq.gz or fastqsanger.gz in Galaxy. I also tried uploading a fastq.gz file to run the job, but the uploaded file was automatically converted to fastq in Galaxy.

@raphenya

@mvdbeek I have the same issue as @MingChen0919

@mvdbeek
Member

mvdbeek commented Jun 21, 2017

> The job was able to get executed after removing the .gz extensions from linked file names.

I'm not quite sure what you mean by that. It should be sufficient to make sure that the datatype is set to fastq or fastqsanger (without the .gz) if your files are not compressed.

@MoHeydarian made an excellent video on how to do that here.

If that still does not work it may be helpful to contact your local galaxy administrator or to provide some small sample data that we can test.

@bgruening
Member

Not sure if this is related but I have a similar one as well here: bgruening/galaxytools#598

Is this a Galaxy bug?

@nsoranzo
Member

@MingChen0919 Which Galaxy server are you using? And what version of Galaxy, if it's not a public instance?

@MingChen0919
Author

@mvdbeek what I mean is that:

the original command line generated by galaxy was

ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_834.dat' input_f.fastq.gz && ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_835.dat' input_r.fastq.gz && ln -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_840.dat' genome.fa && hisat2-build -p ${GALAXY_SLOTS:-1} genome.fa genome && hisat2 -p ${GALAXY_SLOTS:-1} -x 'genome' -1 'input_f.fastq.gz' -2 'input_r.fastq.gz' --secondary | samtools sort - -@ ${GALAXY_SLOTS:-1} -l 6 -o '/opt/galaxy/galaxy-app/database/datasets/000/dataset_855.dat'

I could run the alignment job and got the correct result by running the modified command line below directly from the terminal.

ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_834.dat' input_f.fastq && ln -f -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_835.dat' input_r.fastq && ln -s '/opt/galaxy/galaxy-app/database/datasets/000/dataset_840.dat' genome.fa && hisat2-build -p ${GALAXY_SLOTS:-1} genome.fa genome && hisat2 -p ${GALAXY_SLOTS:-1} -x 'genome' -1 'input_f.fastq' -2 'input_r.fastq' --secondary | samtools sort - -@ ${GALAXY_SLOTS:-1} -l 6 -o '/opt/galaxy/galaxy-app/database/datasets/000/dataset_855.dat'

I am pretty sure my read files are in fastq format, and I also set the datatype explicitly to make sure it is fastq or fastqsanger. I also tried the fastq groomer tool.

@bgruening It didn't give me any error message, but the overall mapping rate was zero, which I know is impossible. By checking the job command line, I found that my fastq read files were linked to .fastq.gz files.

@nsoranzo I am using a galaxy image on jetstream. The galaxy version is 16.07.

@nsoranzo
Member

The fact that both this bug and bgruening/galaxytools#598 were experienced on Galaxy pre-17.01 (where compressed FASTQ formats were introduced) seems to indicate that the tools may not work correctly under Galaxy versions which lack these datatypes.

@raphenya What Galaxy version are you using?

@mvdbeek
Member

mvdbeek commented Jun 22, 2017

Yeah, I wonder if an unknown datatype will map to Data ?! Pre 17.01 you'd effectively be asking hda.is_of_type('data') when you actually want to ask for hda.is_of_type('fastq.gz').

Urgh, so if we change this back to checking for the extension we should be OK, or alternatively we ship the compressed datatypes with the tools that need them.
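The "check the extension" alternative can be sketched as follows. The real wrapper is a Cheetah template, so this Python version is only an illustration: `link_name`, `prefix`, and `ext` are hypothetical names, and the sketch assumes the dataset's extension is available as a plain string such as `fastq`, `fastqsanger`, `fastq.gz`, or `fastqsanger.gz`.

```python
def link_name(prefix, ext):
    """Choose the symlink name for a read file from its extension string.

    Hypothetical helper: comparing the extension string directly avoids
    any datatype registry lookup, so it behaves the same on Galaxy
    versions that do not know the compressed FASTQ datatypes.
    """
    suffix = ".fastq.gz" if ext.endswith(".gz") else ".fastq"
    return prefix + suffix
```

With this approach an uncompressed dataset (`ext == "fastqsanger"`) would be linked as `input_f.fastq`, and only datasets whose extension really ends in `.gz` would get the `.fastq.gz` name.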

@nsoranzo
Member

nsoranzo commented Jun 22, 2017

Actually it seems to be equivalent to hda.is_of_type('txt'):

class DatasetFilenameWrapper( ToolParameterValueWrapper ):
    def is_of_type( self, *exts ):
        # Look up a datatype object for each requested extension
        datatypes = [ self.datatypes_registry.get_datatype_by_extension( e ) for e in exts ]
        return self.dataset.datatype.matches_any( datatypes )


class Registry( object ):
    def get_datatype_by_extension( self, ext ):
        """Returns a datatype based on an extension"""
        try:
            builder = self.datatypes_by_extension[ ext ]
        except KeyError:
            # Unknown extensions silently fall back to Text, i.e. 'txt'
            builder = data.Text()
        return builder

which seems to be obviously a Galaxy bug.

This code has been around since the first registered commit in 2006! galaxyproject/galaxy@f788a34#diff-f6e9dd2399db7b16a8a299cc0292520dR36
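The effect of that fallback can be reproduced outside Galaxy with a small stand-alone model. This is a simplified sketch, not Galaxy's code: the classes `Data`, `Text`, and `Fastq`, the registry contents, and the free function `is_of_type` are stand-ins, and `matches_any` is assumed to be an `isinstance` check against the classes of the target datatypes.

```python
class Data(object):
    def matches_any(self, target_datatypes):
        # Assumed behaviour: True if self is an instance of any target's class
        target_classes = tuple(type(d) for d in target_datatypes)
        return isinstance(self, target_classes)


class Text(Data):
    pass


class Fastq(Text):  # stand-in for an uncompressed FASTQ datatype
    pass


class Registry(object):
    def __init__(self):
        # Models a pre-17.01 registry: no 'fastq.gz' entry
        self.datatypes_by_extension = {"txt": Text(), "fastq": Fastq()}

    def get_datatype_by_extension(self, ext):
        try:
            return self.datatypes_by_extension[ext]
        except KeyError:
            return Text()  # the silent fallback discussed above


def is_of_type(dataset_datatype, registry, *exts):
    datatypes = [registry.get_datatype_by_extension(e) for e in exts]
    return dataset_datatype.matches_any(datatypes)
```

In this model `is_of_type(Fastq(), Registry(), "fastq.gz")` evaluates to `True`, because the unknown extension resolves to `Text` and `Fastq` is a subclass of `Text`. That is the same path by which an uncompressed dataset would be linked as `.fastq.gz`.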

@nsoranzo
Member

Ping @jmchilton, we need your advice here!

@raphenya

@nsoranzo We are running galaxy version 16.04

@dpryan79
Contributor

I had hoped that setting profile="17.01" in the tool xml would have simply prevented installation on older instances :(

@jmchilton
Member

We need more aggressive warnings for incompatible profile versions for sure - there is an open issue for that. I'd say we should also patch is_of_type on older Galaxy versions but I'm not sure that would help anyone at this point - this tool is marked as incompatible with versions that are exhibiting this bug.

If someone wants to ping me post-GCC I could give fixing is_of_type a shot.

@nsoranzo
Member

@jmchilton I'm working on a Galaxy PR for that, I'll add you as reviewer when I open it.

@jmchilton
Member

@nsoranzo Super fantastic - thanks so much!

nsoranzo added a commit to nsoranzo/galaxy that referenced this issue Jun 26, 2017
…nown

Without this fix, the Cheetah expression:

$dataset.is_of_type('unknown_ext')

in a tool command would be equivalent to:

$dataset.is_of_type('txt')

meaning that if the dataset datatype is a subclass of Text, the expression
would evaluate to True without any warning.

xref. galaxyproject/tools-iuc#1373

Also add missing `xml` datatype to
`test/functional/tools/sample_datatypes_conf.xml` which is needed by 3 test
tools.
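
The behaviour this commit message describes can be sketched as below. This is one plausible shape of such a fix under stated assumptions, not the code of the actual pull request: the class and function names are stand-ins, the registry is assumed to return `None` for an unknown extension, and `is_of_type_fixed` is assumed to log a warning and skip it.

```python
import logging

log = logging.getLogger(__name__)


class Data(object):
    def matches_any(self, target_datatypes):
        # Assumed behaviour: isinstance check against the targets' classes
        target_classes = tuple(type(d) for d in target_datatypes)
        return isinstance(self, target_classes)


class Text(Data):
    pass


class Fastq(Text):  # stand-in for an uncompressed FASTQ datatype
    pass


class FixedRegistry(object):
    def __init__(self):
        self.datatypes_by_extension = {"txt": Text(), "fastq": Fastq()}

    def get_datatype_by_extension(self, ext):
        # Unknown extensions yield None instead of falling back to Text
        return self.datatypes_by_extension.get(ext)


def is_of_type_fixed(dataset_datatype, registry, *exts):
    datatypes = []
    for ext in exts:
        datatype = registry.get_datatype_by_extension(ext)
        if datatype is None:
            log.warning("Datatype extension '%s' is unknown, ignoring it", ext)
        else:
            datatypes.append(datatype)
    return dataset_datatype.matches_any(datatypes)
```

In this sketch `is_of_type_fixed(Fastq(), FixedRegistry(), "fastq.gz")` is `False` and emits a warning, while a check against `"fastq"` or `"txt"` still succeeds.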
@nsoranzo
Member

The fix actually still needs to be backported, reopening.

nsoranzo reopened this Jun 26, 2017
@nsoranzo
Member

nsoranzo commented Jul 5, 2017

The fix has been backported to Galaxy releases 16.07 and later, closing.

nsoranzo closed this as completed Jul 5, 2017
peterjc added a commit to peterjc/galaxy_blast that referenced this issue Oct 19, 2018
See galaxyproject/tools-iuc#1373
which was fixed and back-ported to Galaxy 16.07,
galaxyproject/galaxy#4224
galaxyproject/galaxy#4230

This would still break with other non-compressed FASTA
subclasses, but this is intended as a stop-gap until
the last few elderly Galaxy servers in use are updated.