Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Funcotator Update for Datasource Release V1.8 #8512

Merged
merged 13 commits into from
Oct 11, 2023

Conversation

jamesemery
Copy link
Collaborator

Accompanying this branch is and will be an official new release of the Funcotator Datasource bundles:
gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.8.hg19.20230908g.tar.gz
gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.8.hg19.20230908s.tar.gz
gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.8.hg38.20230908g.tar.gz
gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.8.hg38.20230908s.tar.gz

Note that the format of the datasources bundles has changed somewhat, importantly they are now split into separate hg19 and hg38 bundles to cut down on size. In this branch are:

  • The necessary changes to the FuncotatorDownloaderScript to accomidate the new bundles
  • Changes to the various Funcotator datasource downloader scripts to point to newer releases of bundled sources used in this release
  • A fix for the MAF output renderer to handle Gencodev43 datasources

Fixes #8296

@jamesemery
Copy link
Collaborator Author

@jonn-smith I did not fully tackle the tests for the uplaoder script and I suspect they are pretty broken now... I have spot-checked that they work with all 4 bundles...

@gatk-bot
Copy link

gatk-bot commented Sep 1, 2023

Github actions tests reported job failures from actions build 6053420535
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 6053420535.12 logs
unit 17.0.6+10 6053420535.1 logs

@gatk-bot
Copy link

gatk-bot commented Sep 1, 2023

Github actions tests reported job failures from actions build 6053391476
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 6053391476.12 logs
unit 17.0.6+10 6053391476.1 logs

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I have a few questions / comments. Once tests are passing I'll review them

@@ -18,6 +18,7 @@ set -e

COSMIC_FILE=CosmicCompleteTargetedScreensMutantExport.tsv
OUT_DB_FILE="Cosmic.db"
OUT_TMP_FOLDER="~/tmp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to remove the temp dir?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other way around... i added a tmp folder and i needed to do so because i was running into space issues on-prem

@@ -2,5 +2,5 @@

# Downloads the HGNC data source from the HGNC website.

curl 'https://www.genenames.org/cgi-bin/download?col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_locus_type&col=gd_locus_group&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_name_aliases&col=gd_pub_chrom_map&col=gd_date_mod&col=gd_date_sym_change&col=gd_date_name_change&col=gd_pub_acc_ids&col=gd_enz_ids&col=gd_pub_eg_id&col=gd_pub_ensembl_id&col=gd_pubmed_ids&col=gd_pub_refseq_ids&col=family.id&col=family.name&col=gd_ccds_ids&col=gd_vega_ids&col=md_eg_id&col=md_mim_id&col=md_refseq_id&col=md_prot_id&col=md_ensembl_id&col=md_ucsc_id&status=Approved&status_opt=2&where=&order_by=gd_app_sym_sort&format=text&limit=&hgnc_dbtag=on&submit=submit' > hgnc_download_$(date +%b%d%Y).tsv
curl 'https://www.genenames.org/cgi-bin/download/custom?col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_locus_type&col=gd_locus_group&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_name_aliases&col=gd_pub_chrom_map&col=gd_date_mod&col=gd_date_sym_change&col=gd_date_name_change&col=gd_pub_acc_ids&col=gd_enz_ids&col=gd_pub_eg_id&col=gd_pub_ensembl_id&col=gd_pubmed_ids&col=gd_pub_refseq_ids&col=family.id&col=family.name&col=gd_ccds_ids&col=gd_vega_ids&col=md_eg_id&col=md_mim_id&col=md_refseq_id&col=md_prot_id&col=md_ensembl_id&col=md_ucsc_id&status=Approved&status_opt=2&where=&order_by=gd_app_sym_sort&format=text&limit=&hgnc_dbtag=on&submit=submi' > hgnc_download_$(date +%b%d%Y).tsv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last characters before the > look to be truncated.

Is it really submit=submi?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just checked and that link works... ask the HGNC devs? Evidently both submi and submit work there...

@@ -2094,7 +2096,7 @@ private static List<Event> simpleSplitIntoBiallelics(final VariantContext vc) {
final List<Event> result = new ArrayList<>();

if (vc.isBiallelic()) {
return Collections.singletonList(Event.ofWithoutAttributes(vc));
return Collections.singletonList(Event.ofWithoutAttributes(GATKVariantContextUtils.trimAlleles(vc, true, true)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you need this for testing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is the fix from #8445. Let me rebase and this should disappear from the diff..

@@ -53,7 +53,9 @@ private DataSourceUtils() {}
private static final String MANIFEST_SOURCE_LINE_START = "Source:";
private static final String MANIFEST_ALT_SOURCE_LINE_START = "Alternate Source:";
@VisibleForTesting
static final Pattern VERSION_PATTERN = Pattern.compile(MANIFEST_VERSION_LINE_START + "\\s+(\\d+)\\.(\\d+)\\.(\\d\\d\\d\\d)(\\d\\d)(\\d\\d)(.*)");
static final Pattern OLD_VERSION_PATTERN = Pattern.compile(MANIFEST_VERSION_LINE_START + "\\s+(\\d+)\\.(\\d+)\\.(\\d\\d\\d\\d)(\\d\\d)(\\d\\d)(.*)");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its in the tests... i'm somewhat remiss to get rid of the old syntax entirely since it does correspond to real releases that still exist and pretending every old version had the ref version included seems incorrect...

@jamesemery jamesemery force-pushed the je_fixFuncotatorDatasourcesScripts branch from b502006 to 62abde0 Compare September 11, 2023 18:39
@jamesemery
Copy link
Collaborator Author

@jonn-smith back to you. I fixed the broken test... Should be good for you to do your final datasource testing

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this looks pretty good.

I did those large "integration" tests, and it seems to work properly (modulo the fix in #8539). Feel free to merge this once tests pass. Then you can rebase and merge #8539.

@jamesemery jamesemery merged commit d40a485 into master Oct 11, 2023
@jamesemery jamesemery deleted the je_fixFuncotatorDatasourcesScripts branch October 11, 2023 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update funcotator bundled data sources
3 participants