Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update for funcotator data sources #6660

Merged
merged 9 commits into from
Sep 10, 2020

Conversation

jonn-smith
Copy link
Collaborator

@jonn-smith jonn-smith commented Jun 16, 2020

Includes latest Gencode and an implicit fix for #6564. Had to make some code changes for latest liftover Gencode data(v34 -> hg19).

The associated DS test release correctly annotates data on hg19 and hg38.

Left to do:

  • Update data sources downloader.
  • Update data source version validation code.

Code updates:

  • Now both hg19 and hg38 have the contig names translated to chr__
  • Added 'lncRNA' to GeneTranscriptType.
  • Added "TAGENE" gene tag.
  • Added the MANE_SELECT tag to FeatureTag.
  • Added the STOP_CODON_READTHROUGH tag to FeatureTag.
  • Updated the GTF versions that are parseable.
  • Fixed a parsing error with new versions of gencode and the remap
    positions (for liftover files).
  • Added test for indexing new lifted over gencode GTF.
  • Added Gencode_34 entries to MAF output map.
  • Minor changes to FuncotatorIntegrationTest.java for code syntax.
  • Pointed data source downloader at new data sources URL.
  • Minor updates to workflows to point at new data sources.

Script updates:

  • Updated retrieval scripts for dbSNP and Gencode.
  • Added required field to gencode config file generation.
  • Now gencode retrieval script enforces double hash comments at
    top of gencode GTF files.

Bug Fixes:
Removing erroneous trailing tab in MAF file output.

@gatk-bot
Copy link

gatk-bot commented Jun 16, 2020

Travis reported job failures from build 30644
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30644.11 logs
integration openjdk8 30644.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 14, 2020

Travis reported job failures from build 30955
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30955.12 logs
integration openjdk8 30955.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 14, 2020

Travis reported job failures from build 30957
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30957.12 logs
integration openjdk8 30957.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 14, 2020

Travis reported job failures from build 30961
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30961.12 logs
integration openjdk8 30961.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 14, 2020

Travis reported job failures from build 30964
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30964.12 logs
integration openjdk8 30964.2 logs

@jonn-smith jonn-smith force-pushed the jts_funcotator_datasources_update branch from 2e74cf4 to c769d21 Compare July 16, 2020 19:55
@jonn-smith jonn-smith marked this pull request as ready for review July 16, 2020 19:59
@gatk-bot
Copy link

gatk-bot commented Jul 16, 2020

Travis reported job failures from build 30972
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 30972.1 logs
integration openjdk11 30972.12 logs
integration openjdk8 30972.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 16, 2020

Travis reported job failures from build 30976
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30976.12 logs
integration openjdk8 30976.2 logs

@gatk-bot
Copy link

Travis reported job failures from build 30986
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 30986.1 logs

@fleharty
Copy link
Contributor

@jonn-smith Did you add the fix for the extra tab at the end of the maf in this PR?

@jonn-smith
Copy link
Collaborator Author

@fleharty Yup

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Back to you with my comments

scripts/funcotator/data_sources/getDbSNP.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getDbSNP.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getGencode.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getGencode.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getGencode.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getGencode.sh Outdated Show resolved Hide resolved
scripts/funcotator/testing/testFuncotator.sh Outdated Show resolved Hide resolved
@droazen droazen assigned jonn-smith and unassigned droazen Jul 23, 2020
@Tjitskedv
Copy link

I'm not sure this is the right place, but I saw that gencode V34 was tested on hg19, I have downloaded the funcotator datasource files V1.7 but it raised an error for gencode, I made a topic of it on the GATK forum, but it might be relevant for you to look at.
I use Funcotator V4.1.7
https://gatk.broadinstitute.org/hc/en-us/community/posts/360072132411-Funcotator-datasources-v1-7-gencode-raise-error

Further I think Cosmic.db is not working correctly, I don't see any cosmic fields in my vcf output, while I know some of the mutations are in the Cosmic db and also in the Cosmic.db file provided.

@jonn-smith
Copy link
Collaborator Author

jonn-smith commented Aug 17, 2020

@Tjitskedv Typically we ask people make a new issue in github or create a forum post, rather than commenting on open pull requests.

To answer your question directly - do not use v1.7 datasources with GATK 4.1.7.X or 4.1.8.X. Those data sources are not yet compatible with the Funcotator code in the Master branch. When we release GATK 4.1.9.0 they will work with Funcotator. This is not a bug in the code - it is an artifact of how we have to do data source releases.

See issue #6708 for all the gory details.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you @jonn-smith with another round of comments.

scripts/funcotator/data_sources/getDbSNP.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getDbSNP.sh Outdated Show resolved Hide resolved
scripts/funcotator/data_sources/getDbSNP.sh Outdated Show resolved Hide resolved
scripts/funcotator/testing/testFuncotator.sh Outdated Show resolved Hide resolved
@@ -498,7 +498,7 @@ private ArgumentsBuilder createBaselineArgumentsForFuncotator(final String varia
// DO NOT ADD THIS TO ANY TEST GROUPS!
Copy link
Contributor

@droazen droazen Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to add some kind of Funcotator test to cover the fix for dbsnp. Anything reasonable that fails without the fix and passes with it is fine, provided that it covers the actual files published in the latest datasource release. How about a cloud test (@Test(groups={"cloud"})) that does 4 small queries on the actual hosted dbsnp VCF from the latest Funcotator datasources in GCS: a tiny interval on each of chr1, chrM, chrX, and chrY. The test could just assert that you get a certain expected number of records from each query. You can do this trivially using a FeatureDataSource.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@gatk-bot
Copy link

gatk-bot commented Sep 2, 2020

Travis reported job failures from build 31299
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 31299.1 logs
cloud openjdk11 31299.14 logs

@jonn-smith
Copy link
Collaborator Author

@droazen OK - back to you for another round. I added that dbSNP test like we talked about.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Approved for merge after you address my last few minor comments below.

/** String representing the hg38 version of the homo sapiens reference. */
protected static String FuncotatorReferenceVersionHg38 = "hg38";
@VisibleForTesting
public static String FuncotatorReferenceVersionHg38 = "hg38";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can't be final? Also, generally when something is @VisibleForTesting, the test class is in the same package and so default/package access is sufficient rather than public access.

Copy link
Collaborator Author

@jonn-smith jonn-smith Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. They definitely should be final. Should have done that when I originally implemented this.

Unfortunately they cannot be package-private. DbSnpIntegrationTest (org.broadinstitute.hellbender.tools.funcotator.dataSource) is not in the same package as BaseFuncotatorArgumentCollection (org.broadinstitute.hellbender.tools.funcotator) so needs the extra visibility.

static Path SOMATIC_GCLOUD_DATASOURCES_PATH = IOUtils.getPath(SOMATIC_GCLOUD_DATASOURCES_BASEURL + ".tar.gz");
public static String SOMATIC_GCLOUD_DATASOURCES_BASEURL = BASE_URL + "s";

public static Path SOMATIC_GCLOUD_DATASOURCES_PATH = IOUtils.getPath(SOMATIC_GCLOUD_DATASOURCES_BASEURL + ".tar.gz");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question: can these be final and package access instead of public?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of fields here can be made final, but the same package issue exists here as in the above case.

: DB_SNP_HG19_FILE_PATH;

// 2 - Create a FeatureDataSource from the dbSNP VCF:
final FeatureDataSource<VariantContext> dbSnpDataSource = new FeatureDataSource<>(dbSnpFile.toUri().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open this FeatureDataSource in a try-with-resources block to ensure that it gets closed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Fixed!

Assert.assertEquals(dbSnpVariant.getAlleles().size(), 2);
Assert.assertEquals(dbSnpVariant.getAlleles().get(0), expectedRefAllele, "Variant has incorrect ref allele: " + dbSnpVariant.getAlleles().get(0) + " != " + expectedRefAllele + " [" + interval + " in " + dbSnpFile + "]");
Assert.assertEquals(dbSnpVariant.getAlleles().get(1), expectedAltAllele);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New test looks good otherwise 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Thanks!

// catch (final IOException ex) {
// throw new UserException("Unable to open data sources from gcloud!", ex);
// }
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented-out code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Good catch.

Fixed!

@gatk-bot
Copy link

gatk-bot commented Sep 9, 2020

Travis reported job failures from build 31351
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 31351.12 logs
integration openjdk8 31351.2 logs

@gatk-bot
Copy link

gatk-bot commented Sep 9, 2020

Travis reported job failures from build 31353
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 31353.12 logs
integration openjdk8 31353.2 logs

Code updates:
- Now both hg19 and hg38 have the contig names translated to `chr__`
- Added 'lncRNA' to GeneTranscriptType.
- Added "TAGENE" gene tag.
- Added the MANE_SELECT tag to FeatureTag.
- Added the STOP_CODON_READTHROUGH tag to FeatureTag.
- Updated the GTF versions that are parseable.
- Fixed a parsing error with new versions of gencode and the remap
positions (for liftover files).
- Added test for indexing new lifted over gencode GTF.
- Added Gencode_34 entries to MAF output map.
- Minor changes to FuncotatorIntegrationTest.java for code syntax.
- Pointed data source downloader at new data sources URL.
- Minor updates to workflows to point at new data sources.

Script updates:
- Updated retrieval scripts for dbSNP and Gencode.
- Added required field to gencode config file generation.
- Now gencode retrieval script enforces double hash comments at
top of gencode GTF files.

Bug Fixes:
Removing erroneous trailing tab in MAF file output.

- Fixes #6693
@jonn-smith jonn-smith force-pushed the jts_funcotator_datasources_update branch from bf2e2c8 to be7daea Compare September 10, 2020 14:53
@jonn-smith jonn-smith merged commit 3ff2a5a into master Sep 10, 2020
@jonn-smith jonn-smith deleted the jts_funcotator_datasources_update branch September 10, 2020 21:55
mwalker174 pushed a commit that referenced this pull request Nov 3, 2020
Code updates:
- Fixed issue with dbSNP source data for hg38.
- Now both hg19 and hg38 have the contig names translated to `chr__`
- Added 'lncRNA' to GeneTranscriptType.
- Added "TAGENE" gene tag.
- Added the MANE_SELECT tag to FeatureTag.
- Added the STOP_CODON_READTHROUGH tag to FeatureTag.
- Updated the GTF versions that are parseable.
- Fixed a parsing error with new versions of gencode and the remap
positions (for liftover files).
- Added test for indexing new lifted over gencode GTF.
- Added Gencode_34 entries to MAF output map.
- Minor changes to FuncotatorIntegrationTest.java for code syntax.
- Pointed data source downloader at new data sources URL.
- Minor updates to workflows to point at new data sources.

Script updates:
- Updated retrieval scripts for dbSNP and Gencode.
- Added required field to gencode config file generation.
- Now gencode retrieval script enforces double hash comments at
top of gencode GTF files.

Bug Fixes:
Removing erroneous trailing tab in MAF file output.

- Fixes #6693
@wqrao
Copy link

wqrao commented Dec 15, 2020

hello, I used this code"gatk/4.1.9.0 Funcotator --variant filter_snp.vcf --reference Homo_sapiens_assembly38.fasta --ref-version hg38 --data-sources-path funcotator_dataSources.v1.6.20190124s --output snp.maf --output-file-format MAF". But the results about "dbSNP_RS colum" were still empty. Why was it?

@jonn-smith
Copy link
Collaborator Author

@wqrao Hi - you must use the v1.7 Funcotator data sources to get those annotations. These datasources can only be used with GATK 4.1.9.0 or newer.

@wqrao
Copy link

wqrao commented Dec 16, 2020

@wqrao Hi - you must use the v1.7 Funcotator data sources to get those annotations. These datasources can only be used with GATK 4.1.9.0 or newer.

Thanks,I will try it again.

@wqrao
Copy link

wqrao commented Dec 18, 2020

@wqrao Hi - you must use the v1.7 Funcotator data sources to get those annotations. These datasources can only be used with GATK 4.1.9.0 or newer.

Hi,I have tried this code " java -jar ./gatk-4.1.9.0/gatk Funcotator --variant filter_snp.vcf --reference Homo_sapiens_assembly38.fasta --ref-version hg38 --data-sources-path funcotator_dataSources.v1.7.20200521s --out v17_anno.snp.maf --output-file-format MAF" again,but few seconds later ,this error("Error: Invalid or corrupt jarfile ./gatk-4.1.9.0/gatk") appeared. How to solve it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Funcotator appears to add extra tab at end of line
6 participants