VCF to TileDB Import Design
The VCF import process performs:
1. Registration of MetaDB related items from a list of VCF files.
2. Sorting, compression, and indexing of input VCF files using bcftools.
3. Construction of required configs for GenomicsDB import.
4. Optional loading of the TileDB array.
See utils/example_configs/vcf_import.config for an example.
Field | Mandatory | Description |
---|---|---|
workspace | Yes | Full path to TileDB workspace where the array will exist. |
array | Yes | Name of array to import VCFs into. |
assembly | Yes | Name of assembly. This can be an existing assembly in MetaDB or new assembly to be registered from VCF contig tags. |
dburi | Yes | MetaDB Instance to make a connection to, as defined in alembic.ini: driver://user:pass@localhost/dbname |
source_idx | Yes | Specifies the ordering (position) of the normal sample among the samples in the VCF file. |
target_idx | Yes | Specifies the ordering (position) of the tumor sample among the samples in the VCF file. |
callset_loc | No | Required only if sample tags are used. Describes how to retrieve sample names if sample tags exist. |
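
As a rough illustration of the table above, the sketch below checks a config for the mandatory fields. It assumes the config is JSON (the `"callset_loc": "SampleName"` notation used on this page suggests so, but this is not confirmed here). The helper name `validate_vcf_import_config` and every example value, including the 0/1 index values, are hypothetical.

```python
import json

# Mandatory fields, taken from the table above.
MANDATORY_FIELDS = (
    "workspace", "array", "assembly", "dburi", "source_idx", "target_idx",
)


def validate_vcf_import_config(config):
    """Return the mandatory fields that are missing from a config dict."""
    return [field for field in MANDATORY_FIELDS if field not in config]


# Hypothetical example values, for illustration only.
# Whether the indices are 0-based or 1-based is an assumption here.
example_config = json.loads("""
{
    "workspace": "/path/to/workspace",
    "array": "example_array",
    "assembly": "b37",
    "dburi": "driver://user:pass@localhost/dbname",
    "source_idx": 0,
    "target_idx": 1,
    "callset_loc": "SampleName"
}
""")

missing = validate_vcf_import_config(example_config)
if missing:
    raise ValueError("missing mandatory fields: " + ", ".join(missing))
```

The real importer may validate its config differently; see utils/example_configs/vcf_import.config for the authoritative layout.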
Unlike the MAF importer, the VCF import process does not require an intermediate CSV before passing to the GenomicsDB loading process. The VCF import process for GenomicsDB requires that the VCFs are sorted, block compressed, and indexed (addressed by step 2 above). Much like the MAF/CSV import process, it also requires three import configuration files for the GenomicsDB vcf2tiledb import binary: i) callset_mapping, ii) vid_mapping, and iii) loader config (addressed by steps 3 and 4 above).
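
The sort, compress, and index requirement can be sketched with bcftools driven from Python. This is a minimal sketch, not the importer's actual code: the function names are hypothetical, and the choice of a tabix index (`-t`) rather than a CSI index is an assumption, since the index type is not stated here.

```python
import subprocess


def bcftools_commands(vcf_path, out_path):
    """Build bcftools commands that sort, block compress, and index a VCF."""
    return [
        # Sort records and write block-compressed (bgzip) VCF output.
        ["bcftools", "sort", "-O", "z", "-o", out_path, vcf_path],
        # Build a tabix (.tbi) index next to the compressed file (assumed type).
        ["bcftools", "index", "-t", out_path],
    ]


def prepare_vcf(vcf_path, out_path):
    """Run the commands in order; raises CalledProcessError on failure."""
    for cmd in bcftools_commands(vcf_path, out_path):
        subprocess.run(cmd, check=True)
    return out_path
```

Calling `prepare_vcf("sample1.vcf", "sample1.vcf.gz")` requires bcftools on the PATH; `bcftools_commands` alone can be used to inspect the commands without running them.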
The current VCF import process is designed to import VCFs produced by a somatic variant calling pipeline. These VCFs contain two sample columns, one for the NORMAL sample and one for the TUMOR sample. Each column is represented as a CallSet in the variant store, so each VCF has two callsets associated with it. Because a sample name can be labeled in a VCF in two ways, the importer reads from the config how to retrieve sample information. The two ways are described below: i) sample in header and ii) sample tag identifiers. See the config fields above for the fields related to sample information handling.
In the simplest case, the sample information is available in the header line, e.g. sample1N and sample1T below.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1N sample1T
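
For this simple case, reading the names from the header line could look like the sketch below. The function name is hypothetical, and treating `source_idx` and `target_idx` as 0-based positions among the sample columns is an assumption, not something confirmed by the importer.

```python
def samples_from_header(header_line, source_idx, target_idx):
    """Return (normal, tumor) sample names from a #CHROM header line."""
    # Real VCF headers are tab-delimited; split() also handles the
    # space-separated rendering used on this page, but would break on
    # sample names that contain whitespace.
    columns = header_line.strip().split()
    # The first nine columns (CHROM through FORMAT) are fixed;
    # everything after them is a sample column.
    samples = columns[9:]
    # Assumption: the indices are 0-based positions among sample columns.
    return samples[source_idx], samples[target_idx]


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1N sample1T"
normal, tumor = samples_from_header(header, source_idx=0, target_idx=1)
```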
Often, the sample information in the header is more generic and cannot be used to uniquely identify the sample, e.g. NORMAL and TUMOR below.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
If this is the case, there must be two sample tags in the header that provide the identifier for each sample, and the config file callset_loc field should specify how to retrieve this information. For example, the tags below would require "callset_loc": "SampleName".
##SAMPLE=<ID=NORMAL,Description="Wild type",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1N>
##SAMPLE=<ID=TUMOR,Description="Mutant",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1T>
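
A minimal sketch of how the `callset_loc` lookup might work on tags like these is shown below. The function and pattern names are hypothetical, and the parsing is a simplification that assumes each value is either double-quoted or free of commas.

```python
import re

# Matches "##SAMPLE=<...>" and captures the text between the angle brackets.
SAMPLE_TAG = re.compile(r"^##SAMPLE=<(.*)>")
# Matches key=value pairs where the value is either quoted or comma-free.
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def sample_names_from_tags(header_lines, callset_loc):
    """Map each ##SAMPLE ID (e.g. NORMAL) to the value of the callset_loc key."""
    names = {}
    for line in header_lines:
        match = SAMPLE_TAG.match(line.strip())
        if not match:
            continue  # Not a ##SAMPLE line.
        fields = {k: v.strip('"') for k, v in KEY_VALUE.findall(match.group(1))}
        if callset_loc not in fields:
            raise KeyError(
                f"{callset_loc!r} not found in SAMPLE tag {fields.get('ID')!r}"
            )
        names[fields["ID"]] = fields[callset_loc]
    return names


header_lines = [
    '##SAMPLE=<ID=NORMAL,Description="Wild type",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1N>',
    '##SAMPLE=<ID=TUMOR,Description="Mutant",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1T>',
]
names = sample_names_from_tags(header_lines, "SampleName")
```

Here `names` maps the generic column labels NORMAL and TUMOR to sample1N and sample1T, which is the information the header line alone could not provide.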
ReferenceSet and Reference registration is derived from the contig tags of the first VCF file. The contig section of the header (truncated below) is required for proper VCF import:
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
...
##contig=<ID=MT,assembly=b37,length=16569>
##contig=<ID=X,assembly=b37,length=155270560>
##contig=<ID=Y,assembly=b37,length=59373566>
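
To illustrate what can be derived from these tags, the sketch below extracts the contig name, assembly, and length from header lines. The function name is hypothetical and the code does not show how MetaDB actually registers a ReferenceSet or Reference; it also assumes simple comma-separated key=value pairs without quoted commas.

```python
import re

# Matches "##contig=<...>" and captures the text between the angle brackets.
CONTIG_TAG = re.compile(r"^##contig=<(.*)>")


def parse_contigs(header_lines):
    """Return a list of (name, assembly, length) tuples from ##contig lines."""
    contigs = []
    for line in header_lines:
        match = CONTIG_TAG.match(line.strip())
        if not match:
            continue  # Not a ##contig line.
        # Split "ID=1,assembly=b37,length=249250621" into a dict.
        fields = dict(item.split("=", 1) for item in match.group(1).split(","))
        contigs.append(
            (fields["ID"], fields.get("assembly"), int(fields["length"]))
        )
    return contigs


contigs = parse_contigs([
    "##contig=<ID=1,assembly=b37,length=249250621>",
    "##contig=<ID=MT,assembly=b37,length=16569>",
])
```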