Update grch37 HGNC, Mutation Assessor and ClinVar #96

Merged
merged 8 commits into from
Jan 21, 2025

Conversation

leexgh
Member

@leexgh leexgh commented Jan 9, 2025

Fixes genome-nexus/genome-nexus#765. GRCh38 updates are in a separate PR.

  • Update HGNC to 2024-10
  • Update ClinVar to 20250106
  • Update Genome Nexus grch37_ensembl92 transcript data
  • Use Mutation Assessor V4 data from S3
  • Upload ClinVar to the S3 bucket, delete the local files, and fetch the data from S3 instead
  • Separate data importing from image building. The database now holds more data than the previous setup can handle, so the image is no longer built with a fully populated MongoDB. Instead, data is imported as a separate step when the Docker container is created.
  • Run tests on pull requests as well.
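
To illustrate the idea of importing at container creation rather than at image build, here is a minimal sketch that only assembles `mongoimport` command lines. The database URI, collection name, and file path are hypothetical placeholders, not the names this repository uses, and the sketch assumes the TSV files have already been downloaded and decompressed.

```python
import shlex


def build_import_commands(mongo_uri, collections):
    """Return one mongoimport argv list per (collection, tsv_path) pair."""
    commands = []
    for collection, tsv_path in collections:
        commands.append([
            "mongoimport",
            "--uri", mongo_uri,
            "--collection", collection,
            "--type", "tsv",
            "--headerline",  # first row of the TSV holds the field names
            "--drop",        # replace any existing copy of the collection
            "--file", tsv_path,
        ])
    return commands


# Hypothetical values for illustration only.
commands = build_import_commands(
    "mongodb://localhost:27017/annotator",
    [("hypothetical.collection", "/data/example.tsv")],
)
for argv in commands:
    print(shlex.join(argv))
```

A container entrypoint could run commands like these once MongoDB is accepting connections, which keeps the image small and moves the data load to startup.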

@leexgh leexgh changed the title Update grch37 version Update grch37 HGNC to 202410 Jan 16, 2025
@leexgh leexgh changed the title Update grch37 HGNC to 202410 Update grch37 HGNC and Mutation Assessor Jan 16, 2025
@leexgh leexgh changed the title Update grch37 HGNC and Mutation Assessor Update grch37 HGNC, Mutation Assessor and ClinVar Jan 16, 2025
@leexgh leexgh requested a review from inodb January 16, 2025 18:04
@@ -1,10 +1,10 @@
name version type id description url
VEP grch37 mirrored vep VEP determines the effect of your variants(SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. https://grch37.ensembl.org/info/docs/tools/vep/index.html
-HGNC 2023-10 mirrored hgnc The resource for approved human gene nomenclature. Genome Nexus uses HGNC gene symbols in annotation http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/
+HGNC 2024-10 mirrored hgnc The resource for approved human gene nomenclature. Genome Nexus uses HGNC gene symbols in annotation http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/
Member

@leexgh for the HGNC version format, can you align with @rmadupuri so it matches the version proposed for cBioPortal? I think this was the proposal:

hgnc_v2023.10.1 similar to the genesets version in the [cBioPortal] database to indicate the HGNC release date of the data used to create the tables

Thanks!

Member Author

Checked with Ramya; changed to hgnc_v2024.10.1
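
For reference, a small sketch of how a monthly HGNC release such as 2024-10 maps onto the agreed label. The helper name is hypothetical, and treating the trailing `.1` as a revision counter is an assumption; the thread does not define what it stands for.

```python
import re


def hgnc_version_label(release, revision=1):
    """Turn a 'YYYY-MM' HGNC release into an 'hgnc_vYYYY.MM.N' label.

    `revision` is an assumed meaning for the final number in the label.
    """
    match = re.fullmatch(r"(\d{4})-(\d{2})", release)
    if match is None:
        raise ValueError(f"expected YYYY-MM, got {release!r}")
    year, month = match.groups()
    return f"hgnc_v{year}.{month}.{revision}"


print(hgnc_version_label("2024-10"))  # hgnc_v2024.10.1
```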

Comment on lines +58 to +61
"https://genome-nexus-static-data.s3.us-east-1.amazonaws.com/mutationassessor_v4_1.tsv.gz"
"https://genome-nexus-static-data.s3.us-east-1.amazonaws.com/mutationassessor_v4_2.tsv.gz"
"https://genome-nexus-static-data.s3.us-east-1.amazonaws.com/mutationassessor_v4_3.tsv.gz"
"https://genome-nexus-static-data.s3.us-east-1.amazonaws.com/mutationassessor_v4_4.tsv.gz"
Member

@leexgh is the process described somewhere for how these files are updated on S3?

Member Author

I'll open a follow-up PR that describes the files on S3 and how to update them.
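
Until that is documented, here is a rough sketch of how the four Mutation Assessor V4 parts listed above could be fetched and decompressed. The URLs follow the pattern in the diff; the function names and the assumption that the parts are numbered 1 to 4 with no other files are illustrative only, and this is not the importer's actual download code.

```python
import gzip
import shutil
import urllib.request

BASE_URL = "https://genome-nexus-static-data.s3.us-east-1.amazonaws.com"


def mutation_assessor_urls(parts=4):
    """Build the URLs for the numbered Mutation Assessor V4 parts."""
    return [
        f"{BASE_URL}/mutationassessor_v4_{i}.tsv.gz"
        for i in range(1, parts + 1)
    ]


def download_and_decompress(url, dest_path):
    """Stream one gzipped part to disk, decompressing on the fly."""
    with urllib.request.urlopen(url) as response:
        with gzip.GzipFile(fileobj=response) as gz, open(dest_path, "wb") as out:
            shutil.copyfileobj(gz, out)


# Only list the URLs here; call download_and_decompress(url, path) to fetch one.
for url in mutation_assessor_urls():
    print(url)
```

Streaming through `gzip.GzipFile` avoids holding a whole part in memory, which matters if the files are large.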

PDB latest external pdb A resource for the three-dimensional structural data of large biological molecules such as proteins and nucleic acids. https://www.rcsb.org/
xrefs latest external xrefs Xref endpoint is a lookup of Ensembl Identifiers and retrieve their external references in other databases http://grch37.rest.ensembl.org/
PFAM 32.0 mirrored pfam A database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. https://www.ebi.ac.uk/interpro/
PTM dbPTM 2019 mirrored ptm A resource for protein post-translational modifications (PTMs). https://awi.cuhk.edu.cn/dbPTM/
Member Author

The PTM URL we have in the Makefile is down; this is their new website. Neither the new nor the previous website lists a version number. I added dbPTM 2019 as the version because that is when we added PTM, so I assume we imported the latest release at that time.

Comment on lines +15 to +16
Polyphen-2 2.2.3, release 405c external polyphen "PolyPhen-2 predicts the effect of an amino acid substitution on the structure and function of a protein using sequence homology, Pfam annotations, 3D structures from PDB where available, and a number of other databases and tools (including DSSP, ncoils etc.)" http://genetics.bwh.harvard.edu/pph2/
Sift 6.2.1 external sift SIFT predicts whether an amino acid substitution is likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/
Member Author

The version numbers here are from Ensembl, since we get PolyPhen and SIFT from VEP: https://useast.ensembl.org/info/genome/variation/prediction/protein_function.html

@leexgh leexgh merged commit 76c89aa into genome-nexus:master Jan 21, 2025
3 checks passed
Successfully merging this pull request may close these issues.

Update HGNC 2024-10 version
2 participants