Querying Variants using CLI

Nacho edited this page Sep 1, 2015 · 1 revision

Overview

There are two main ways of querying loaded data from OpenCGA Storage using the Command Line Interface (CLI):

  • opencga.sh: a top-level CLI for querying data using OpenCGA Catalog.
  • opencga-storage.sh: a low-level CLI for querying data using variant attributes such as region, gene, annotation or genotypes.

Both CLIs offer similar functionality and parameters for querying by variant attributes such as region, annotation or stats. The main difference is that the top-level CLI can use the information in OpenCGA Catalog to make more complex queries, such as querying by family or sample annotations.

Both can be found in the $OPENCGA_HOME/bin folder.

Using opencga-storage.sh

In version v0.6.0 this is the most complete way of querying data. It allows you to query by:

  • genomic regions and feature IDs such as genes and SNPs
  • variant annotation such as consequence types, conservation scores, PolyPhen, SIFT or population frequencies
  • sample genotypes
  • variant stats in the study
  • some basic aggregations such as ranks, group-by or counts

All these filters can be combined. There are some query modifiers implemented:

  • skip and limit
  • count: this can be added to any command line and returns the number of results

To see all the parameters, execute the following from the $OPENCGA_HOME folder:

./bin/opencga-storage.sh fetch-variants -h
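The skip and limit modifiers lend themselves to paging through large result sets. The following is a minimal dry-run sketch that only prints the commands it would run. It assumes the skip modifier is exposed as --skip (only --limit appears in the examples on this page, so confirm the flag name in the help output); the page size and page count are illustrative values.

```shell
#!/bin/sh
# Dry-run sketch: prints one fetch-variants command per page instead of running it.
# ASSUMPTION: the skip modifier is exposed as --skip; confirm with "fetch-variants -h".
page_size=100   # illustrative value
pages=3         # illustrative value

page=0
while [ "$page" -lt "$pages" ]; do
    skip=$((page * page_size))   # number of results to skip for this page
    echo "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --skip $skip --limit $page_size"
    page=$((page + 1))
done
```

Each printed line can be copied and executed once the flag names have been checked against the help output.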

NOTE: for security reasons, in a standard OpenCGA installation you need to log in to OpenCGA before using this CLI. This guarantees that you only access the data you have permission for. To log in, execute:

./bin/opencga.sh users login -u USER -p PASSWORD

A session token will be stored in your home directory and used internally by OpenCGA Storage.

Design considerations

There are some design decisions you must be aware of:

  1. The comma character ',' is used in different places in the CLI and can behave in two different ways. If the comma is used to enumerate query values such as regions, genes, SO terms, ... then it behaves as a logical OR, as in region 1:1800000-1900000,1:2000000-2100000. But if the comma is used to separate query fields, as in "sift<0.2,polyphen<0.5", then it acts as a logical AND.

  2. Regardless of where regions, genes or SNP IDs appear in the CLI, they always behave as a logical OR. For instance, in the next command the region and gene parameters act as a logical OR:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2

  3. For all the other CLI parameters a logical AND is executed, so in the next query only variants in the specified region with a SIFT score below 0.2 AND a PolyPhen score below 0.5 are returned:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5" --return-study STUDY_ID
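The two comma behaviours can be illustrated with a small dry-run sketch that assembles the parameter values and prints the resulting command without executing it. The helper join_by_comma is hypothetical and not part of OpenCGA; DATABASE_NAME and STUDY_ID are the same placeholders used elsewhere on this page.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints a fetch-variants command without running it.

# join_by_comma is a hypothetical helper, not part of OpenCGA.
join_by_comma() {
    joined=""
    for value in "$@"; do
        joined="${joined:+$joined,}$value"
    done
    printf '%s' "$joined"
}

# Values of the same field: the comma acts as a logical OR.
regions=$(join_by_comma 1:1800000-1900000 1:2000000-2100000)

# Different fields inside one parameter: the comma acts as a logical AND.
scores=$(join_by_comma 'sift<0.2' 'polyphen<0.5')

# Separate parameters (--region, --protein-substitution) are combined with AND.
echo "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region $regions --protein-substitution \"$scores\""
```

The printed query would match variants in either region that also satisfy both score thresholds.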

Example queries

Using variant attributes

To fetch variants for a specific region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000

and for several regions, separate them with ',':

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000

you can also add a list of genes:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53

Note: remember that regions and genes are always combined with a logical OR.

If you want SNVs, INDELs or SVs you can use the --type parameter:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --type INDEL

Using variant annotation info

To query by SIFT or PolyPhen2, use --protein-substitution:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2"

or using both, remember that here the ',' acts as a logical AND:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"

To only count the number of variants, remember that you can always add --count:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:1500000-2000000 --protein-substitution "sift<0.2" --count

To query using Consequence Type terms from Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html. Use commas to add several terms:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

You can always combine parameters in a logical AND, so the next query returns variants annotated with those SO terms in the specified region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

To query using conservation scores you can use --conservation. The next query uses both PhastCons and PhyloP separated by ','. Since they are different query fields, they act as a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1 --conservation "phastCons<0.1,phylop<0.2" --count

You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC using the --population-freqs parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01" --count

or several populations together, separated by commas. Since they are different populations and therefore different query fields, this is a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01" --count

Sample genotype

To query by specific sample genotypes you can use the --sample-genotype parameter. You must separate samples with ';', and the accepted genotypes for each sample with ','. This executes an AND between samples and an OR between the genotypes, so in:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --sample-genotype "15:0/0;20:0/1,1/1" --limit 15

variants which are 0/0 for sample 15 and either 0/1 or 1/1 for sample 20 are returned. (Note: in a few days sample names will be allowed.)
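To make the AND/OR reading of a --sample-genotype value explicit, the following minimal sketch splits a filter string on ';' and ',' and prints it as a plain-language condition. The helper describe_filter is hypothetical and only mirrors the rules described above; it does not call OpenCGA.

```shell
#!/bin/sh
# Sketch: print how a --sample-genotype value is read
# (';' separates samples = AND, ',' separates genotypes = OR).
# describe_filter is a hypothetical helper, not part of OpenCGA.
describe_filter() {
    description=""
    old_ifs=$IFS
    IFS=';'                              # split the filter into per-sample entries
    for entry in $1; do
        sample=${entry%%:*}              # text before the first ':'
        genotypes=${entry#*:}            # text after the first ':'
        alternatives=$(printf '%s' "$genotypes" | sed 's/,/ OR /g')
        description="${description:+$description AND }sample $sample is $alternatives"
    done
    IFS=$old_ifs
    printf '%s\n' "$description"
}

explanation=$(describe_filter '15:0/0;20:0/1,1/1')
echo "$explanation"
# prints: sample 15 is 0/0 AND sample 20 is 0/1 OR 1/1
```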

Building more complex queries

You can combine all the parameters above to execute more complex queries:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:50000-3000000 --sample-genotype "15:0/0;20:0/1,1/1" --protein-substitution "sift<0.2,polyphen<0.5" --conservation "phastCons<0.1"
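Long combined queries are easier to read and review when each filter sits on its own line. The following is one possible dry-run sketch that collects the same arguments as positional parameters and prints the command instead of running it. It uses only the parameters shown on this page; the layout is a suggestion, not an OpenCGA convention.

```shell
#!/bin/sh
# Dry-run sketch: collect the filters one per line, then print the final command.
# Single quotes keep '<', ';' and ',' from being interpreted by the shell.
set -- ./bin/opencga-storage.sh fetch-variants \
    --database opencga_test_demo \
    --return-study 2 \
    --region 1:50000-3000000 \
    --sample-genotype '15:0/0;20:0/1,1/1' \
    --protein-substitution 'sift<0.2,polyphen<0.5' \
    --conservation 'phastCons<0.1'

arg_count=$#
printf '%s\n' "$*"
# To run the query for real, replace the printf line with: "$@"
```

Because every parameter is a separate AND condition, adding or removing a line narrows or widens the query without touching the others.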

Some aggregations and rankings

To group variants per gene or consequence type you can use the --group-by parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --group-by gene

You can also rank genes or consequence types using --rank:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --rank gene
