-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update for funcotator data sources #6660
Conversation
2e74cf4
to
c769d21
Compare
@jonn-smith Did you add the fix for the extra tab at the end of the maf in this PR? |
@fleharty Yup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Back to you with my comments
src/test/java/org/broadinstitute/hellbender/tools/IndexFeatureFileIntegrationTest.java
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/tools/funcotator/FuncotatorIntegrationTest.java
Show resolved
Hide resolved
I'm not sure this is the right place, but I saw that gencode V34 was tested on hg19, I have downloaded the funcotator datasource files V1.7 but it raised an error for gencode, I made a topic of it on the GATK forum, but it might be relevant for you to look at. Further I think Cosmic.db is not working correctly, I don't see any cosmic fields in my vcf output, while I know some of the mutations are in the Cosmic db and also in the Cosmic.db file provided. |
@Tjitskedv Typically we ask people make a new issue in github or create a forum post, rather than commenting on open pull requests. To answer your question directly - do not use v1.7 datasources with GATK 4.1.7.X or 4.1.8.X. Those data sources are not yet compatible with the Funcotator code in the Master branch. When we release GATK 4.1.9.0 they will work with Funcotator. This is not a bug in the code - it is an artifact of how we have to do data source releases. See issue #6708 for all the gory details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to you @jonn-smith with another round of comments.
...java/org/broadinstitute/hellbender/tools/funcotator/mafOutput/MafOutputRendererUnitTest.java
Show resolved
Hide resolved
@@ -498,7 +498,7 @@ private ArgumentsBuilder createBaselineArgumentsForFuncotator(final String varia | |||
// DO NOT ADD THIS TO ANY TEST GROUPS! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still need to add some kind of Funcotator
test to cover the fix for dbsnp. Anything reasonable that fails without the fix and passes with it is fine, provided that it covers the actual files published in the latest datasource release. How about a cloud test (@Test(groups={"cloud"})
) that does 4 small queries on the actual hosted dbsnp VCF from the latest Funcotator datasources in GCS: a tiny interval on each of chr1, chrM, chrX, and chrY. The test could just assert that you get a certain expected number of records from each query. You can do this trivially using a FeatureDataSource
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
@droazen OK - back to you for another round. I added that dbSNP test like we talked about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Approved for merge after you address my last few minor comments below.
/** String representing the hg38 version of the homo sapiens reference. */ | ||
protected static String FuncotatorReferenceVersionHg38 = "hg38"; | ||
@VisibleForTesting | ||
public static String FuncotatorReferenceVersionHg38 = "hg38"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can't be final? Also, generally when something is @VisibleForTesting
, the test class is in the same package and so default/package access is sufficient rather than public access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. They definitely should be final. Should have done that when I originally implemented this.
Unfortunately they cannot be package-private. DbSnpIntegrationTest
(org.broadinstitute.hellbender.tools.funcotator.dataSource
) is not in the same package as BaseFuncotatorArgumentCollection
(org.broadinstitute.hellbender.tools.funcotator
) so needs the extra visibility.
static Path SOMATIC_GCLOUD_DATASOURCES_PATH = IOUtils.getPath(SOMATIC_GCLOUD_DATASOURCES_BASEURL + ".tar.gz"); | ||
public static String SOMATIC_GCLOUD_DATASOURCES_BASEURL = BASE_URL + "s"; | ||
|
||
public static Path SOMATIC_GCLOUD_DATASOURCES_PATH = IOUtils.getPath(SOMATIC_GCLOUD_DATASOURCES_BASEURL + ".tar.gz"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question: can these be final and package access instead of public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of fields here can be made final
, but the same package issue exists here as in the above case.
: DB_SNP_HG19_FILE_PATH; | ||
|
||
// 2 - Create a FeatureDataSource from the dbSNP VCF: | ||
final FeatureDataSource<VariantContext> dbSnpDataSource = new FeatureDataSource<>(dbSnpFile.toUri().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open this FeatureDataSource
in a try-with-resources block to ensure that it gets closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Fixed!
Assert.assertEquals(dbSnpVariant.getAlleles().size(), 2); | ||
Assert.assertEquals(dbSnpVariant.getAlleles().get(0), expectedRefAllele, "Variant has incorrect ref allele: " + dbSnpVariant.getAlleles().get(0) + " != " + expectedRefAllele + " [" + interval + " in " + dbSnpFile + "]"); | ||
Assert.assertEquals(dbSnpVariant.getAlleles().get(1), expectedAltAllele); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New test looks good otherwise 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Thanks!
// catch (final IOException ex) { | ||
// throw new UserException("Unable to open data sources from gcloud!", ex); | ||
// } | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Good catch.
Fixed!
Code updates: - Now both hg19 and hg38 have the contig names translated to `chr__` - Added 'lncRNA' to GeneTranscriptType. - Added "TAGENE" gene tag. - Added the MANE_SELECT tag to FeatureTag. - Added the STOP_CODON_READTHROUGH tag to FeatureTag. - Updated the GTF versions that are parseable. - Fixed a parsing error with new versions of gencode and the remap positions (for liftover files). - Added test for indexing new lifted over gencode GTF. - Added Gencode_34 entries to MAF output map. - Minor changes to FuncotatorIntegrationTest.java for code syntax. - Pointed data source downloader at new data sources URL. - Minor updates to workflows to point at new data sources. Script updates: - Updated retrieval scripts for dbSNP and Gencode. - Added required field to gencode config file generation. - Now gencode retrieval script enforces double hash comments at top of gencode GTF files. Bug Fixes: Removing erroneous trailing tab in MAF file output. - Fixes #6693
bf2e2c8
to
be7daea
Compare
Code updates: - Fixed issue with dbSNP source data for hg38. - Now both hg19 and hg38 have the contig names translated to `chr__` - Added 'lncRNA' to GeneTranscriptType. - Added "TAGENE" gene tag. - Added the MANE_SELECT tag to FeatureTag. - Added the STOP_CODON_READTHROUGH tag to FeatureTag. - Updated the GTF versions that are parseable. - Fixed a parsing error with new versions of gencode and the remap positions (for liftover files). - Added test for indexing new lifted over gencode GTF. - Added Gencode_34 entries to MAF output map. - Minor changes to FuncotatorIntegrationTest.java for code syntax. - Pointed data source downloader at new data sources URL. - Minor updates to workflows to point at new data sources. Script updates: - Updated retrieval scripts for dbSNP and Gencode. - Added required field to gencode config file generation. - Now gencode retrieval script enforces double hash comments at top of gencode GTF files. Bug Fixes: Removing erroneous trailing tab in MAF file output. - Fixes #6693
hello, I used this code"gatk/4.1.9.0 Funcotator --variant filter_snp.vcf --reference Homo_sapiens_assembly38.fasta --ref-version hg38 --data-sources-path funcotator_dataSources.v1.6.20190124s --output snp.maf --output-file-format MAF". But the results about "dbSNP_RS colum" were still empty. Why was it? |
@wqrao Hi - you must use the v1.7 Funcotator data sources to get those annotations. These datasources can only be used with GATK 4.1.9.0 or newer. |
Thanks,I will try it again. |
Hi,I have tried this code " java -jar ./gatk-4.1.9.0/gatk Funcotator --variant filter_snp.vcf --reference Homo_sapiens_assembly38.fasta --ref-version hg38 --data-sources-path funcotator_dataSources.v1.7.20200521s --out v17_anno.snp.maf --output-file-format MAF" again,but few seconds later ,this error("Error: Invalid or corrupt jarfile ./gatk-4.1.9.0/gatk") appeared. How to solve it? |
Includes latest Gencode and an implicit fix for #6564. Had to make some code changes for latest liftover Gencode data(v34 -> hg19).
The associated DS test release correctly annotates data on hg19 and hg38.
Left to do:
Code updates:
chr__
positions (for liftover files).
Script updates:
top of gencode GTF files.
Bug Fixes:
Removing erroneous trailing tab in MAF file output.