Indexing Variant Data v0.8.0


Indexing VCF files

VCF files can be indexed using either the pipeline implemented in the CLI or the Java API. The aim of indexing is to allow queries over the indexed data. Indexing happens in two consecutive steps: transformation and load. During transformation, the VCF data is normalized and converted into an internal variant data model (see Data Models). During load, the normalized and validated file is loaded into the active storage engine plugin. For more information about the indexing process, see OpenCGA Storage Overview.

For this tutorial, we are going to use sample VCF data from the 1000 Genomes Project. You can use any other file, but all the examples below use the VCF file ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.
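
Before indexing, it can help to confirm that the file is an intact gzip archive that starts with a VCF header. The following is a minimal sketch, not part of OpenCGA: the function name is_gzipped_vcf is hypothetical, and the check relies only on the fact that a VCF file begins with a "##fileformat=VCF" meta line.

```shell
#!/bin/sh
# Hypothetical helper (not an OpenCGA command): succeed only if "$1" is a
# readable gzip file whose first decompressed line is a VCF header.
is_gzipped_vcf() {
    vcf_file=$1
    [ -r "$vcf_file" ] || return 1
    # gzip -t tests the integrity of the compressed stream.
    gzip -t "$vcf_file" 2>/dev/null || return 1
    # A VCF file starts with a "##fileformat=VCF..." meta line.
    first_line=$(gzip -dc "$vcf_file" 2>/dev/null | head -n 1)
    case $first_line in
        '##fileformat=VCF'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_gzipped_vcf ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz && echo "looks like a VCF" prints the message only when both checks pass.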

Using the Command Line Interface

OpenCGA provides three ways of indexing VCF data:

1. Using the opencga command line. This command line can be run from any computer and accesses the web services to request the indexing. In short, the web service creates a job that will be detected by one of the daemons. The daemon starts the indexing job as soon as possible, depending on the server capabilities.
2. Using the opencga-analysis command line. This can only be run by users that have direct access to the server where OpenCGA is running. The indexing starts as soon as the command line is executed.
3. Using the opencga-storage command line. This is the low-level command line, which is completely independent of OpenCGA Catalog. You should only use this one if you already have a metadata server and only need the Storage indexing capabilities.

In terms of OpenCGA Catalog, indexing with method 1 or method 2 is equivalent: the results will be exactly the same. Method 3, however, does not affect Catalog in any way.

Getting started with OpenCGA Catalog

The complete description of the OpenCGA command line interface is available at Command Line; this is just a quick start example. This tutorial works for version 0.8.0. To check the installed version, execute:

./opencga.sh --version
Version 0.8-dev
git version: develop ?????
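
As the output above shows, the reported version can be a development build such as 0.8-dev rather than exactly 0.8.0. A minimal sketch of how a script could check for the 0.8 series follows. The function names reported_version and is_0_8_series are hypothetical, and the parsing assumes the first output line has the form "Version X" as shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OpenCGA.

# Read `--version` output on stdin and print the version string.
# Assumes a line of the form "Version 0.8-dev", as in the output above.
reported_version() {
    awk '$1 == "Version" { print $2; exit }'
}

# Succeed if the given version string belongs to the 0.8 series
# (0.8, 0.8.x or 0.8-suffix).
is_0_8_series() {
    case $1 in
        0.8|0.8.*|0.8-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A script could then run version=$(./opencga.sh --version | reported_version) and stop early when is_0_8_series "$version" fails.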

We will assume that we already have a user account, a project and a study with no files ready to work with. Otherwise, see Getting started.

Assumptions:

User id: user
Project alias: default
Study alias: study

First, we have to link the downloaded VCF file. To do that, run the following command line:

./build/bin/opencga.sh files link -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -s study --path vcfs/ -P

We use --path to specify the virtual directory within Catalog where the VCF file will be registered. We add -P to create the "vcfs/" folder in case it doesn't exist yet. It is important to know that the file will not be physically copied into the Catalog structure, so if the file is deleted from disk, Catalog will no longer be able to access it.
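
Because Catalog only references the file on disk, a script may want to confirm that the file is readable before linking it. The sketch below wraps the link command shown above. The function link_vcf and the variables OPENCGA_BIN and DRY_RUN are hypothetical names introduced for illustration; only the flags already used in this tutorial appear in the command.

```shell
#!/bin/sh
# Hypothetical wrapper around the `files link` command from this tutorial.
# OPENCGA_BIN and DRY_RUN are illustrative variables, not OpenCGA settings.
OPENCGA_BIN=${OPENCGA_BIN:-./build/bin/opencga.sh}
DRY_RUN=${DRY_RUN:-1}   # 1: only print the command; anything else: run it

link_vcf() {
    vcf_file=$1 study=$2 catalog_path=$3
    # Catalog does not copy the file, so it must exist and be readable.
    if [ ! -r "$vcf_file" ]; then
        echo "error: cannot read $vcf_file" >&2
        return 1
    fi
    set -- "$OPENCGA_BIN" files link -i "$vcf_file" -s "$study" --path "$catalog_path" -P
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With the default DRY_RUN=1, calling link_vcf ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz study vcfs/ only prints the command, which makes it easy to review before running it for real.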

Method 1: OpenCGA catalog command line

For this method to work, the daemons need to be running in addition to the web services and MongoDB. To launch them, run:

./build/bin/opencga-admin.sh catalog daemon --start -p

To run the whole indexing process, run:

./build/bin/opencga.sh files index --id ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -o vcfs/ [--annotate --calculate-stats]

Notice that --annotate and --calculate-stats, shown in square brackets above, are optional parameters and should be included only if the annotations and the genotype statistics are desired.

After running the command, a job will be created. One of the daemons will take that job and the indexation will start.

Method 2: OpenCGA analysis command line

This method assumes that the user has access to the server where OpenCGA is installed. No job is created as in method 1, so no daemon is needed. In this case, the user needs to provide the session id, which can be found in the ~/.opencga/session.json file. For the examples, we will assume that the session id is nfXzmz0EvxO7uU34DZSy.
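
To avoid copying the session id by hand, a script could read it from the session file. The sketch below is only an illustration: the layout of session.json is not described here, so the field name "sessionId" is an assumption and must be adjusted to match the actual file. The function name read_session_id is also hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of OpenCGA.
# ASSUMPTION: the session file stores the id in a JSON string field named
# "sessionId". Check your own session.json and adjust the field name.
read_session_id() {
    session_file=${1:-$HOME/.opencga/session.json}
    # Print the value of the first "sessionId": "..." pair found.
    sed -n 's/.*"sessionId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$session_file" | head -n 1
}
```

The result can then be passed to --sid, for example with sid=$(read_session_id). An empty result means the assumed field name does not match the file.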

The "opencga-analysis.sh variant index" command behaves slightly differently depending on the actions and parameters given, as explained below. Its mandatory parameters are "file-id", the file to be transformed, loaded or indexed, and "outdir", a directory outside Catalog boundaries where temporary files and transformation files will be stored.

Transformation

In this step only, the VCF file is transformed into an Avro file, which may or may not be stored and registered in Catalog.

To obtain the transformation files without registering any of them in Catalog, run:

./build/bin/opencga-analysis.sh variant index --outdir /tmp/test --file-id ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --sid nfXzmz0EvxO7uU34DZSy --transform

Later on, if the user wants to register the transformation files in Catalog, the following command line has to be run:

./build/bin/opencga.sh files link -i /tmp/test -s study --path vcfs/

Alternatively, to store the transformed files in Catalog directly, run the following:

./build/bin/opencga-analysis.sh variant index --outdir /tmp/test --file-id ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --sid nfXzmz0EvxO7uU34DZSy --transform --path vcfs/

Notice that in this case, the produced Avro and JSON files will be stored together with the VCF file.

Load

In order to load, both the VCF file and the corresponding transformed file must be registered in Catalog. To start the load, run the command line as follows:

./build/bin/opencga-analysis.sh variant index --outdir /tmp/test --file-id ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --sid nfXzmz0EvxO7uU34DZSy --load [--annotate --calculate-stats]

⚠️ The --transformed-files argument is present in the command line but is not working yet. It will be available soon to load externally transformed files.
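
The transformation and load steps can be chained so that the load only starts when the transformation succeeded. The following is a minimal sketch using only the flags shown in this section. The functions run_step and transform_then_load and the variables ANALYSIS_BIN and DRY_RUN are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of OpenCGA. It only combines the
# --transform and --load invocations documented in this section.
ANALYSIS_BIN=${ANALYSIS_BIN:-./build/bin/opencga-analysis.sh}
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands; anything else: run them

run_step() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

transform_then_load() {
    file_id=$1 sid=$2 outdir=$3 catalog_path=$4
    # Step 1: transform and register the output in Catalog under catalog_path,
    # so that the transformed file is available to the load step.
    run_step "$ANALYSIS_BIN" variant index --outdir "$outdir" --file-id "$file_id" \
        --sid "$sid" --transform --path "$catalog_path" || return 1
    # Step 2: load. Only reached when the transformation succeeded.
    run_step "$ANALYSIS_BIN" variant index --outdir "$outdir" --file-id "$file_id" \
        --sid "$sid" --load
}
```

With the default DRY_RUN=1 the two commands are printed instead of executed. Set DRY_RUN=0 to run them.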

Index

./build/bin/opencga-analysis.sh variant index --outdir /tmp/test --file-id ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --sid nfXzmz0EvxO7uU34DZSy [--annotate --calculate-stats]

Method 3: OpenCGA Storage command line

⚠️ This is a low-level CLI. Any metadata recording must be done by the application.

A VCF indexation can be done in one or two steps, depending on whether you want to delay the database load.

A simple indexation can be run with the following command. Note that at this level you must manage your own ids. We will use 1 and 2, for instance:

./opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --database chr22_test_db

If your dataset is big and you want to proceed in smaller steps, it is recommended to split the ETL process in two:

./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --transform 

./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.variants.json.gz --database chr22_test_db --load
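
Note that the load step takes the transformed file as input, not the original VCF. A minimal sketch that generates both commands from one set of parameters is shown below. The functions transformed_name and storage_commands and the variable STORAGE_BIN are hypothetical, and the ".variants.json.gz" suffix is taken from the load command above; it may differ for other output formats or versions.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OpenCGA.
STORAGE_BIN=${STORAGE_BIN:-./bin/opencga-storage.sh}

# Name of the transformed file, following the load command in this tutorial:
# the input file name plus ".variants.json.gz".
transformed_name() {
    echo "$1.variants.json.gz"
}

# Print the transform and load commands for review; nothing is executed.
storage_commands() {
    study_id=$1 file_id=$2 vcf_file=$3 database=$4
    echo "$STORAGE_BIN index-variants --studyId $study_id --file-id $file_id -i $vcf_file --transform"
    echo "$STORAGE_BIN index-variants --studyId $study_id --file-id $file_id -i $(transformed_name "$vcf_file") --database $database --load"
}
```

Calling storage_commands 1 2 ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz chr22_test_db prints the two commands shown above, so the ids and the database name only need to be written once.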

OpenCGA is an open source project and it is freely available.
