Obtaining
Now that you've installed orchid, it's time to get some data! There are two types of data required by orchid:
- Mutational Data - Information about the location of tumor variants in the genome
- Feature Data - Annotation information concerning the relevance of a given mutation
Various software packages can call variants from tumor sequence data, most often writing the results to a VCF file. You are free to use any such file with orchid. Additionally, several consortia have begun to catalog tumor sequence data. For example, the International Cancer Genome Consortium (ICGC) has thousands of tumor variants across many tissue types. The variants are in a proprietary ICGC format, but orchid supports this file format as well. Feel free to explore the ICGC data portal, and when you have a desired set of donors (patients), download their simple somatic mutation and (optionally) copy number data. You may also want to download clinical information, since it may contain important labels for machine learning!
Other data sources exist as well, including The Cancer Genome Atlas (TCGA) and COSMIC. If you obtain data from these sources, please use the VCF file format.
Once the data is downloaded and extracted, orchid simply needs to know the name of the simple somatic mutation data file. This can be specified in the /config file:
params.filename = 'simple_somatic_mutation_data.vcf'
Orchid will look for this file in the /data/mutations folder by default, so place the mutation file there or soft link it to this location.
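As a minimal sketch of the soft-link option, the hypothetical helper below (not part of orchid) creates a symbolic link to a downloaded file inside a target folder. It assumes you pass the path to orchid's data/mutations folder yourself, since the exact location depends on where orchid is installed.

```python
from pathlib import Path


def link_into(source: str, target_dir: str) -> Path:
    """Create a symbolic link to `source` inside `target_dir` and return the link path.

    Hypothetical helper for illustration; orchid itself only needs the file
    (or a link to it) to be present in its mutations folder.
    """
    src = Path(source).resolve()
    if not src.is_file():
        raise FileNotFoundError(f"no such file: {src}")

    dest_dir = Path(target_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    link = dest_dir / src.name
    # exists() is False for a dangling link, so check is_symlink() as well.
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"refusing to overwrite: {link}")

    link.symlink_to(src)
    return link
```

For example, `link_into("downloads/simple_somatic_mutation_data.vcf", "data/mutations")` would leave the original file in place and expose it under the name given in params.filename.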
If you'd like to also include copy number variation (CNV) information downloaded from ICGC as a feature, you must first convert the file into a format compatible with orchid. The script /code/etc/parse_cnv.sh can be used to do this. Simply provide the CNV file as the first argument and an output file as the second. For example:
/code/etc/parse_cnv.sh ICGC_CNV_file.tsv orchid_cnv_file.bed
Then, update orchid's /config file with the name of the converted file:
params.copy_number_file = 'orchid_cnv_file.bed'
Again, orchid expects this file in the /data/mutations folder, but you can soft link it to this location if you'd prefer.
NOTE: The CNV data is technically processed as a special feature by orchid but is stored with the mutation data and not in the feature folder. If you are not using CNV data, please comment out the params.copy_number_file line like so:
//params.copy_number_file = 'orchid_cnv_file.bed'
Any number of potential features can be used to annotate mutation data. Orchid accepts properly formatted bed or tabix files out-of-the-box, but it can also be used with other formats as long as you provide a way to parse them. The UCSC genome browser has many genome-wide annotation tracks downloadable in bed format, or you can scour the internet for additional annotations. All features should be downloaded or linked to the /data/features folder. To download the features used for the publication of this software, head over here.
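This page does not spell out what orchid treats as "properly formatted", so the sketch below only checks general BED conventions: tab-separated lines with at least three columns (chromosome, start, end), integer coordinates, and an end that is not before the start. The function name is hypothetical and the check is not part of orchid.

```python
from typing import Iterable, List


def bed_problems(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in BED-formatted lines.

    An empty list means every data line passed these basic checks.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue

        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(
                f"line {number}: expected at least 3 tab-separated columns, "
                f"found {len(fields)}"
            )
            continue

        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append(f"line {number}: start and end must be integers")
            continue

        if start < 0 or end < start:
            problems.append(f"line {number}: invalid interval {start}-{end}")
    return problems
```

Running it over an open file, for example `bed_problems(open("targetscans.bed"))`, gives a quick sanity check before the file is referenced from the config.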
If going the way of bed or tabix files, your job is easy. Simply provide a feature definition in the config file directly below the line that reads (~ line 125):
features = [:]
You'll notice a detailed description of how to write a feature definition at that location, but a minimal definition looks like this:
features['targetscans'] =
[
    processor    : 'bed',
    feature_type : 'integer',
    file         : env.FEATURE_DIR + '/targetscans.bed',
]
In a nutshell, the feature name (e.g. 'targetscans') is placed in the first set of square brackets, and the feature's key: value parameters are placed in the second. The three parameters shown here are required: processor can be bed or tabix, depending on the feature file type; feature_type can be integer (whole numbers), float (continuous numbers), category (text annotation), or boolean (True/False), depending on the type of the feature's values; and file specifies the location and name of the feature file, which should be in the /data/features directory as defined by env.FEATURE_DIR. Other parameters are described in the config file and modify how orchid stores and processes the corresponding feature.
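To make the allowed values concrete, here is a small sketch that renders a minimal feature definition as text and rejects values outside the ones listed above. The helper name and the idea of generating the stanza are illustrative assumptions, not something orchid provides; the rendered text simply mirrors the example definition and would still need to be pasted into the config file by hand.

```python
PROCESSORS = {"bed", "tabix"}
FEATURE_TYPES = {"integer", "float", "category", "boolean"}


def render_feature_definition(name: str, processor: str,
                              feature_type: str, filename: str) -> str:
    """Render a minimal feature definition with the three required parameters."""
    if processor not in PROCESSORS:
        raise ValueError(f"processor must be one of {sorted(PROCESSORS)}")
    if feature_type not in FEATURE_TYPES:
        raise ValueError(f"feature_type must be one of {sorted(FEATURE_TYPES)}")

    return "\n".join([
        f"features['{name}'] =",
        "[",
        f"    processor    : '{processor}',",
        f"    feature_type : '{feature_type}',",
        f"    file         : env.FEATURE_DIR + '/{filename}',",
        "]",
    ])
```

Calling `render_feature_definition("targetscans", "bed", "integer", "targetscans.bed")` reproduces the minimal definition shown earlier, while a typo such as `feature_type="int"` fails immediately instead of surfacing later in the pipeline.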
NOTE: If using a categorical feature, please try to keep the number of potential categories to a minimum, or machine learning performance may suffer.
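One way to see how many categories a feature file would introduce is to tally the distinct labels before adding it. The sketch below assumes the category label sits in the fourth column of a tab-separated BED file (the conventional BED name column); that column choice is an assumption, so adjust `column` to match your file.

```python
from collections import Counter
from typing import Iterable


def category_counts(lines: Iterable[str], column: int = 3) -> Counter:
    """Count how often each label appears in the given zero-based column."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts
```

With an open feature file, `len(category_counts(handle))` gives the number of distinct categories, and `category_counts(handle).most_common(10)` shows the most frequent labels, which can help decide whether rare categories should be merged first.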
Several features, including trinucleotide context, KEGG cancer pathway membership, frequency (of mutation within the variant file), and functional consequence, are provided by orchid for free. You may not even need additional features for fairly performant machine learning models!
TIP: There are also feature definitions already written for quite a few features, along with descriptions and download links. To use these, simply procure the feature file and uncomment the feature definition!
It is also possible to write your own parsing engine in the /workflow/annotate file, but this will require knowledge of how Nextflow works and a bit of intuition about the orchid_db annotation command. Feel free to email clint.cario ‘at’ ucsf.edu if you really need to do this and get stuck trying to figure it out. In the long run, it's probably easier to convert such files to tabix or bed format.