How VarFish Databases are Built #35

holtgrewe · 2021-06-24T15:43:50Z

holtgrewe
Jun 24, 2021
Maintainer

When asked about this today, I realized that it is not well documented yet. I don't have the time to write proper documentation, so this can serve as a starting point. Besides, actual documentation should be tested, and running the whole process would take a few days because of the large data sets.


holtgrewe · 2022-01-29T10:07:46Z

holtgrewe
Jan 29, 2022
Maintainer Author

The description is top-down because one needs to know a bit about context and perspective to understand the process.

There are two databases. "Two" you ask?

Maybe even three.

- The varfish-annotator tool has an H2 (embedded Java) database with the information needed to annotate files. These data are stored in tables, one for each dataset:
  - Thousand Genomes
  - ExAC
  - gnomAD exomes
  - gnomAD genomes
  - Presence in ClinVar
  - Presence in HGMD Public
- Further, the varfish-annotator tool needs a RefSeq and an Ensembl .ser file for the current genome build, compatible with the Jannovar library embedded in varfish-annotator.
- The database in varfish-server.
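To make the "one table per dataset" idea concrete, here is a minimal sketch. It uses Python's sqlite3 as a stand-in for H2, and the table names, column names, and the helper functions are all hypothetical; the real schema used by varfish-annotator is not described in this post.

```python
import sqlite3

# Hypothetical table names, one per dataset listed above.
DATASETS = [
    "thousand_genomes",
    "exac",
    "gnomad_exomes",
    "gnomad_genomes",
    "clinvar",
    "hgmd_public",
]


def create_tables(conn):
    """Create one table per dataset, keyed by genome build and variant."""
    for name in DATASETS:
        conn.execute(
            f"CREATE TABLE {name} ("
            "genome_build TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
            "PRIMARY KEY (genome_build, chrom, pos, ref, alt))"
        )


def datasets_containing(conn, genome_build, chrom, pos, ref, alt):
    """Return the names of all dataset tables that contain the variant."""
    hits = []
    for name in DATASETS:
        row = conn.execute(
            f"SELECT 1 FROM {name} WHERE genome_build=? AND chrom=? "
            "AND pos=? AND ref=? AND alt=?",
            (genome_build, chrom, pos, ref, alt),
        ).fetchone()
        if row is not None:
            hits.append(name)
    return hits


# Made-up example variant, inserted into two of the tables.
conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.execute("INSERT INTO exac VALUES ('GRCh37', '1', 12345, 'A', 'G')")
conn.execute("INSERT INTO clinvar VALUES ('GRCh37', '1', 12345, 'A', 'G')")
print(datasets_containing(conn, "GRCh37", "1", 12345, "A", "G"))
```

Annotating a file then amounts to one such lookup per variant and dataset.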

How do you build the databases?

- For the H2 database: varfish-annotator has a built-in command that builds the H2 database from the source files in a VarFish data release tarball.
- For the .ser files: nowadays, Jannovar offers prebuilt .ser files for download.
- For the varfish-server database: the built-in command setup.py import_table imports the data from a data release tarball.

"Data Release Tarball"? Where do I get that from?

There is a varfish-db-downloader project that contains a Snakemake workflow to download everything. The files are downloaded and reformatted for import into varfish-server and varfish-annotator. If upstream files move or become unavailable, reproducibility is gone, so you had better keep the downloaded files.
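If you do keep the downloaded files, it may help to record checksums next to them so you can later verify that an archived copy is intact. The following is a minimal sketch, not part of varfish-db-downloader; the function names and the manifest file are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that huge files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(download_dir, manifest_path):
    """Write one '<sha256>  <relative path>' line per file below download_dir."""
    download_dir = Path(download_dir)
    manifest_path = Path(manifest_path)
    lines = []
    for path in sorted(p for p in download_dir.rglob("*") if p.is_file()):
        if path.resolve() == manifest_path.resolve():
            continue  # do not hash a manifest left over from an earlier run
        rel = path.relative_to(download_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest_path.write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

The line layout is the one written by `sha256sum`, so running `sha256sum -c` on the manifest from inside the download directory should be able to check the files.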

About these files shifting location ...

Yes, we feel for you. We have not implemented this for VarFish, but for other cases we have something called cache-filler that downloads files from the internet to an internal server, cubi.cache. The proper way would be to have a caching step that copies the data to some public location from which the files can then be downloaded. The main challenge here is that these files are HUGE. One option would be a public S3 bucket somewhere, with one person holding the "cache filling token" and being responsible for copying data into the cache so that there are no collisions.
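A caching step of this kind could look roughly like the sketch below. It is not the cache-filler tool mentioned above: the function `fill_cache`, the local cache directory, and the lock file (a simple stand-in for the "cache filling token") are all assumptions made for illustration.

```python
import os
import shutil
import urllib.request
from pathlib import Path


def fill_cache(url, cache_dir, name):
    """Download url into cache_dir/name unless it is already cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if target.exists():
        return target  # already cached, upstream is not contacted

    # O_CREAT | O_EXCL raises FileExistsError if another filler holds the lock.
    lock = cache_dir / (name + ".lock")
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        partial = cache_dir / (name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)  # streams, so huge files are fine
        os.replace(partial, target)  # only complete files appear under the final name
    finally:
        os.close(fd)
        os.remove(lock)
    return target
```

Once a file is in the cache, later calls return the cached copy even if the upstream URL has moved or disappeared, which is the property one wants for reproducibility.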

Is that all?

Yes, mostly. In the case of an upgrade to GRCh38, one would probably have to touch the VarFish code here and there; grch37 occurs in a number of places in the varfish-server source code.

https://github.com/bihealth/varfish-server/search?q=grch37

Most occurrences are in tests, and we should also write some tests for GRCh37 (no need to replicate everything, though). We would need additional sites for variant QC, e.g., we could get them from peddy. For now, I would also recommend fixing for each case whether it is GRCh37 or GRCh38. Further, one would have to implement lift-over if one wants to move a GRCh37 case to GRCh38, but such things could come later.
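The search linked above can also be run on a local checkout to see how the grch37 occurrences split between tests and other code. This is a rough sketch; the function name and the heuristic that any path component containing "test" marks test code are assumptions, not something taken from the varfish-server repository layout.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"grch37", re.IGNORECASE)


def count_occurrences(root, suffixes=(".py",)):
    """Count case-insensitive 'grch37' matches, split into tests and other code."""
    root = Path(root)
    counts = {"tests": 0, "other": 0}
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = len(PATTERN.findall(text))
        # Heuristic (assumption): "test" anywhere in the relative path means test code.
        rel = path.relative_to(root)
        is_test = any("test" in part.lower() for part in rel.parts)
        counts["tests" if is_test else "other"] += hits
    return counts
```

Calling `count_occurrences("path/to/varfish-server")` on a checkout gives a first estimate of how much non-test code would need attention for GRCh38.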
