IndexFeatureFile Error to Run Funcotator with Mouse Ensembl GTF #7054

Closed
GATKSupportTeam opened this issue Jan 26, 2021 · 2 comments · Fixed by #7166
@GATKSupportTeam
Collaborator

GATKSupportTeam commented Jan 26, 2021

Description

A user wants to run Funcotator with a mouse sample but has an issue running IndexFeatureFile. The GencodeGTF parser is specific to human data. Funcotator could be useful for users with non-human data if there is a workaround for these errors.

GATK Information

GATK 4.1.9.0
gatk IndexFeatureFile -I gencode.vM25.annotation.gtf
This request was created from a contribution made by T. Li on January 25, 2021 04:37 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360076815852-Error-Running-IndexFeatureFile-on-Ensembl-Mouse-GTF-file-

Error Log

Using GATK jar /gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar  
  
Running:  
  
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar IndexFeatureFile -I gencode/mm10/gencode.vM25.annotation.gtf
  
04:33:13.081 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.9.0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
  
Jan 25, 2021 4:33:13 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
  
INFO: Failed to detect whether we are running on Google Compute Engine.  
  
04:33:13.195 INFO IndexFeatureFile - ------------------------------------------------------------  
  
04:33:13.195 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.1.9.0-SNAPSHOT  
  
04:33:13.195 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
  
04:33:13.195 INFO IndexFeatureFile - Executing as root@b4c480938d0d on Linux v5.4.0-1029-aws amd64  
  
04:33:13.195 INFO IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
  
04:33:13.195 INFO IndexFeatureFile - Start Date/Time: January 25, 2021 4:33:13 AM GMT  
  
04:33:13.195 INFO IndexFeatureFile - ------------------------------------------------------------  
  
04:33:13.195 INFO IndexFeatureFile - ------------------------------------------------------------  
  
04:33:13.196 INFO IndexFeatureFile - HTSJDK Version: 2.23.0  
  
04:33:13.196 INFO IndexFeatureFile - Picard Version: 2.23.3  
  
04:33:13.196 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2

04:33:13.196 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

04:33:13.196 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

04:33:13.196 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
  
04:33:13.196 INFO IndexFeatureFile - Deflater: IntelDeflater  
  
04:33:13.196 INFO IndexFeatureFile - Inflater: IntelInflater  
  
04:33:13.196 INFO IndexFeatureFile - GCS max retries/reopens: 20  
  
04:33:13.196 INFO IndexFeatureFile - Requester pays: disabled  
  
04:33:13.196 INFO IndexFeatureFile - Initializing engine  
  
04:33:13.196 INFO IndexFeatureFile - Done initializing engine  
  
04:33:13.396 INFO FeatureManager - Using codec EnsemblGtfCodec to read file file:///gatk/funcotator-scripts/gencode/mm10/gencode.vM25.annotation.gtf  
  
04:33:13.400 INFO ProgressMeter - Starting traversal  
  
04:33:13.400 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute  
  
04:33:21.040 INFO IndexFeatureFile - Shutting down engine  
  
[January 25, 2021 4:33:21 AM GMT] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.13 minutes.
  
Runtime.totalMemory()=1835532288  
  
java.lang.IllegalArgumentException: Unexpected value: IG_D_pseudogene
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature$GeneTranscriptType.getEnum(GencodeGtfFeature.java:1060)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.<init>(GencodeGtfFeature.java:158)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfGeneFeature.<init>(GencodeGtfGeneFeature.java:19)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfGeneFeature.create(GencodeGtfGeneFeature.java:23)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature$FeatureType$1.create(GencodeGtfFeature.java:760)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfFeature.create(GencodeGtfFeature.java:327)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:138)  
  
at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:23)  
  
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:43)  
  
at org.broadinstitute.hellbender.utils.codecs.ProgressReportingDelegatingCodec.decodeLoc(ProgressReportingDelegatingCodec.java:46)  
  
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:689)  
  
at htsjdk.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:650)  
  
at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:511)  
  
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:446)  
  
at org.broadinstitute.hellbender.tools.IndexFeatureFile.createAppropriateIndexInMemory(IndexFeatureFile.java:118)  
  
at org.broadinstitute.hellbender.tools.IndexFeatureFile.doWork(IndexFeatureFile.java:75)  
  
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)  
  
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)  
  
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)  
  
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)  
  
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)  
  
at org.broadinstitute.hellbender.Main.main(Main.java:289)

(created from Zendesk ticket #100645: https://broadinstitute.zendesk.com/agent/tickets/100645)
gz#100645
@droazen
Contributor

droazen commented Jan 26, 2021

@jonn-smith Any insight into this one?

@jonn-smith
Collaborator

Yeah - there is a limited set of allowed values for GeneTranscriptType, and the one in this particular mouse transcript (IG_D_pseudogene) is not among them. The values for GeneTranscriptType were modeled on the human Gencode files, but the GTF spec allows arbitrary values in that field. I talked with @jamesemery about this, and it ultimately comes down to my preference for Funcotator to fail rather than produce bad or erroneous annotations.

The parsers / codecs will need to be updated to allow for arbitrary values in this field, and the parser should be reviewed for other fields that allow for this as well.

It may be beneficial in the medium-term to switch the codec over to the GFF3 codec that @kachulis wrote.
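To make the failure mode in this comment concrete, here is a minimal sketch of a closed enum lookup that rejects any value it was not written for. This is not the GATK source: the class `ClosedTypeSketch`, the enum `BiotypeSketch`, its three constants, and the helper `tryParse` are all hypothetical stand-ins. Only the method name `getEnum` and the exception message format are taken from the stack trace above.

```java
// Hypothetical illustration only; the real GeneTranscriptType has many more values.
final class ClosedTypeSketch {

    // A closed set of allowed values, modeled on a few human biotypes.
    enum BiotypeSketch {
        PROTEIN_CODING("protein_coding"),
        LINCRNA("lincRNA"),
        IG_D_GENE("IG_D_gene");

        private final String serialized;

        BiotypeSketch(String serialized) {
            this.serialized = serialized;
        }

        // Strict lookup: any string outside the closed set is an error.
        static BiotypeSketch getEnum(String s) {
            for (BiotypeSketch v : values()) {
                if (v.serialized.equals(s)) {
                    return v;
                }
            }
            throw new IllegalArgumentException("Unexpected value: " + s);
        }
    }

    // Wraps the lookup so both outcomes can be printed side by side.
    static String tryParse(String raw) {
        try {
            return "ok: " + BiotypeSketch.getEnum(raw);
        } catch (IllegalArgumentException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("protein_coding"));   // ok: PROTEIN_CODING
        System.out.println(tryParse("IG_D_pseudogene"));  // failed: Unexpected value: IG_D_pseudogene
    }
}
```

Failing fast like this protects against silently wrong annotations, which is the trade-off described above, but it also means a single unrecognized value stops indexing of the whole file.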

jonn-smith added a commit that referenced this issue Mar 26, 2021
- Now the GencodeGtfCodec no longer parses transcriptType and geneType
  into enums. They are now stored as strings. This allows for arbitrary
  values in these fields and will help to future-proof (and species-proof)
  the GTF parser.

- Fixes #7054
jonn-smith added a commit that referenced this issue Mar 29, 2021
jonn-smith added a commit that referenced this issue Mar 30, 2021
jonn-smith added a commit that referenced this issue Apr 3, 2021
* Updated GencodeGtfCodec to be more permissive.

- Now the GencodeGtfCodec no longer parses transcriptType and geneType
  into enums. They are now stored as strings. This allows for arbitrary
  values in these fields and will help to future-proof (and species-proof)
  the GTF parser.

- Fixes #7054
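
As a rough sketch of the idea in the commit message (keeping the type as a plain string so any value is accepted), the snippet below reads the `gene_type` attribute from GTF lines and tallies the values it finds. It is not the code from the fix: `PermissiveTypeSketch`, `attributeValue`, and `tally` are hypothetical names, and the attribute parsing is deliberately naive (it assumes tab-separated columns and no semicolons inside quoted values).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the GencodeGtfCodec implementation.
final class PermissiveTypeSketch {

    // Returns the unquoted value for `key` in a GTF attribute column, or null if absent.
    static String attributeValue(String attributes, String key) {
        for (String field : attributes.split(";")) {
            String f = field.trim();
            if (f.startsWith(key + " ")) {
                return f.substring(key.length()).trim().replace("\"", "");
            }
        }
        return null;
    }

    // Counts gene_type values as plain strings; no value is rejected.
    static Map<String, Integer> tally(Iterable<String> gtfLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : gtfLines) {
            if (line.startsWith("#")) {
                continue; // skip header/comment lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 9) {
                continue; // not a complete GTF record
            }
            String type = attributeValue(cols[8], "gene_type");
            if (type != null) {
                counts.merge(type, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A tally like this could also serve as a quick pre-check on a non-human GTF, listing which type values appear before the file is handed to a stricter parser.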