Querying Variants using CLI
There are two main ways of querying loaded data from OpenCGA Storage using the Command Line Interface (CLI). The two CLIs are:
- opencga.sh: a top-level CLI for querying data using OpenCGA Catalog.
- opencga-storage.sh: a low-level CLI for querying data using variant attributes such as region, gene, annotation or genotypes.
Both CLIs offer similar functionality and accept similar parameters for querying by variant attributes such as region, annotation or stats. The main difference is that the top-level CLI can use OpenCGA Catalog, and can therefore run more complex queries such as querying by family or sample annotations.
Both can be found in the $OPENCGA_HOME/bin folder.
In version v0.6.0 this is the most complete way of querying data. It allows you to query by:
- genomic regions and feature IDs such as gene and SNP
- variant annotations such as consequence types, conservation scores, PolyPhen, SIFT or population frequencies
- sample genotypes
- variant stats in the study
- some basic aggregations such as ranks, group-by or counts
All these filters can be combined. There are some query modifiers implemented:
- skip and limit
- count: this can be added to all CLIs and returns the number of results
To see all the parameters, execute the following from the $OPENCGA_HOME folder:
./bin/opencga-storage.sh fetch-variants -h
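The skip and limit modifiers listed above can be combined to page through a large result set. The following is a minimal sketch that only prints the command for each page (a dry run) rather than executing it. It assumes the skip modifier is exposed as a --skip flag (only --limit is shown on this page, so check the -h output), and PAGE_SIZE, PAGES and LAST_SKIP are hypothetical names chosen for illustration:

```shell
#!/bin/sh
# Dry run: print one fetch-variants command per page instead of running it.
# ASSUMPTION: "--skip" is the flag name of the skip modifier; verify with -h.
PAGE_SIZE=1000   # hypothetical page size
PAGES=3          # hypothetical number of pages to print
page=0
LAST_SKIP=
while [ "$page" -lt "$PAGES" ]; do
    skip=$((page * PAGE_SIZE))
    echo "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --skip $skip --limit $PAGE_SIZE"
    LAST_SKIP=$skip
    page=$((page + 1))
done
```

Each printed command requests the next block of PAGE_SIZE results; DATABASE_NAME is a placeholder.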
NOTE: for security reasons, in a standard OpenCGA installation you need to log in to OpenCGA before using this CLI. This guarantees that you only access data you have permission for. To log in, execute:
./bin/opencga.sh users login -u USER -p PASSWORD
A session token will be stored in your home directory and used internally by OpenCGA Storage.
There are some design decisions you must be aware of:
- The comma character ',' is used in different places in the CLI and can behave in two different ways. If the comma enumerates query values such as regions, genes or SO terms, it behaves as a logical OR, as in region 1:1800000-1900000,1:2000000-2100000. If the comma separates query fields, as in "sift<0.2,polyphen<0.5", it acts as a logical AND.
- Regardless of where regions, genes or SNP IDs appear in the CLI, they always behave as a logical OR. For instance, in the next command the region and gene parameters act as a logical OR:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2
- For all the other CLI parameters a logical AND is executed, so in the next query only variants in the specified region with a SIFT score below 0.2 AND a PolyPhen score below 0.5 are returned:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5" --return-study STUDY_ID
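The comma rules above can be sketched in a small script that assembles parameter values from separate items. This is only an illustration: join_by_comma is a hypothetical helper, DATABASE_NAME and STUDY_ID are placeholders, and the command is printed rather than executed.

```shell
#!/bin/sh
# Join all arguments with commas. For values of one field (regions, genes)
# the CLI treats the commas as OR; between query fields it treats them as AND.
join_by_comma() {
    old_ifs=$IFS
    IFS=','
    joined="$*"          # "$*" joins the arguments with the first char of IFS
    IFS=$old_ifs
    printf '%s\n' "$joined"
}

REGIONS=$(join_by_comma 1:1800000-1900000 1:2000000-2100000)   # region OR region
PROTEIN=$(join_by_comma 'sift<0.2' 'polyphen<0.5')             # sift AND polyphen

# Dry run: print the command. The filter is quoted so the shell does not
# interpret '<' as a redirection when the command is actually run.
echo "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region $REGIONS --protein-substitution \"$PROTEIN\""
```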
To fetch variants for a specific region:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000
For several regions, separate them with ',':
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000
You can also add a list of genes:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53
Note: remember that regions and genes are always combined with a logical OR.
To restrict the results to SNVs, INDELs or SVs, use the --type parameter:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --type INDEL
To query by SIFT or PolyPhen2, use --protein-substitution:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2"
or use both; remember that here the ',' acts as a logical AND:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"
To only count the number of variants, remember you can always add --count:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:1500000-2000000 --protein-substitution "sift<0.2" --count
To query using Consequence Type terms from the Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html. Use commas to add terms:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count
You can always combine parameters in a logical AND, so the next query returns variants annotated with those SO terms in the specified region:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count
To query using conservation scores you can use --conservation. The next query uses both PhastCons and PhyloP separated by ','; since they are different query fields, they act as a logical AND:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1 --conservation "phastCons<0.1,phylop<0.2" --count
You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC with the --population-freqs parameter:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01" --count
or several populations together, separated by commas. Since they are different populations and query fields, this is a logical AND:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01" --count
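When the same frequency threshold applies to several populations, the filter string can be generated instead of typed by hand. The sketch below assumes the STUDY:POPULATION<FREQ syntax shown in the examples above; the variable names are hypothetical, and the command is printed rather than executed.

```shell
#!/bin/sh
# Build one "STUDY:POP<FREQ" filter per population and join them with ','.
# Different populations are different query fields, so the CLI ANDs them.
STUDY="1000GENOMES_phase_1"
MAX_FREQ="0.01"
FILTER=""
for pop in EUR AFR; do
    # ${FILTER:+$FILTER,} adds the separating comma only after the first entry.
    FILTER="${FILTER:+$FILTER,}${STUDY}:${pop}<${MAX_FREQ}"
done

# Dry run: print the resulting command.
echo "./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs \"$FILTER\" --count"
```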
To query by specific sample genotypes you can use the --sample-genotype parameter. You must separate samples with ';' and the accepted genotypes for each sample with ','. This executes an AND between samples and an OR between genotypes, so in:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --sample-genotype "15:0/0;20:0/1,1/1" --limit 15
variants which are 0/0 for sample 15 and 0/1 or 1/1 for sample 20 are returned. (Note: in a few days sample names will be allowed.)
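The --sample-genotype value follows a fixed pattern (';' between samples, ',' between genotypes), so it can be assembled from per-sample lists. A minimal sketch, where add_sample and GENOTYPES are hypothetical names and the command is only printed:

```shell
#!/bin/sh
# GENOTYPES accumulates "SAMPLE:GT1,GT2" clauses joined by ';'.
GENOTYPES=""

# add_sample SAMPLE_ID GENOTYPE...
# Genotypes of one sample are joined with ',' (OR between genotypes);
# clauses of different samples are joined with ';' (AND between samples).
add_sample() {
    sample=$1
    shift
    old_ifs=$IFS
    IFS=','
    clause="${sample}:$*"
    IFS=$old_ifs
    GENOTYPES="${GENOTYPES:+$GENOTYPES;}$clause"
}

add_sample 15 0/0
add_sample 20 0/1 1/1

# Dry run: the value must be quoted, otherwise the shell treats ';' as a
# command separator.
echo "./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --sample-genotype \"$GENOTYPES\" --limit 15"
```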
You can combine all the parameters above to execute more complex queries:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:50000-3000000 --sample-genotype "15:0/0;20:0/1,1/1" --protein-substitution "sift<0.2,polyphen<0.5" --conservation "phastCons<0.1"
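Long combined queries are easier to review when each filter sits on its own line. One possible way to do this in a POSIX shell is to collect the arguments with "set --" and pass them on with "$@". The sketch below only prints the arguments; ARG_COUNT is a hypothetical name used to check that nothing was split or dropped.

```shell
#!/bin/sh
# Collect the query one filter per line. Each value is quoted once here,
# which protects ';' and '<' from the shell.
set -- fetch-variants \
    --database opencga_test_demo \
    --return-study 2 \
    --region 1:50000-3000000 \
    --sample-genotype "15:0/0;20:0/1,1/1" \
    --protein-substitution "sift<0.2,polyphen<0.5" \
    --conservation "phastCons<0.1"

ARG_COUNT=$#

# Dry run: print each argument on its own line. To execute the query,
# replace the printf line with:
#   ./bin/opencga-storage.sh "$@"
printf '%s\n' "$@"
```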
To group variants per gene or consequence type you can use the --group-by parameter:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --group-by gene
You can also rank genes or consequence types using --rank:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --rank gene
OpenCGA is an open source project and it is freely available.