diff --git a/README.md b/README.md
index 535f6b46..11502e99 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # ***A tool for computational cancer driver discovery***
-![Github version](https://img.shields.io/badge/version-1.0.1-yellow.svg)
+![Github version](https://img.shields.io/badge/version-1.0.2-yellow.svg)
 [![GitHub license](https://img.shields.io/badge/license-AGPL-blue.svg)](./LICENSE)
 [![PyPI version](https://badge.fury.io/py/DriverPower.svg)](https://badge.fury.io/py/DriverPower)
 [![Documentation Status](https://readthedocs.org/projects/driverpower/badge/?version=latest)](http://driverpower.readthedocs.io/en/latest/?badge=latest)
diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index 015b0c68..be434c92 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -147,14 +147,50 @@ aka, burden-test only:
         --feature test_feature.hdf5 \
         --response test_y.tsv \
         --model ./output/tutorial.GBM.model.pkl \
-        --name 'DriverPower' \
+        --name 'DriverPower_burden' \
         --outDir ./output/
 
 To use functional information, one or more types of functional measurements (e.g., CADD, EIGEN, LINSIGHT etc)
-need to be collected first. The CADD scores can be retrived via its
-`web interface `_ without downloading the large file for all possible SNVs (~80 G).
+need to be collected first. The CADD scores can be retrieved via its
+`web interface `_ (up to 100K variants each time) without downloading the
+large file for all possible SNVs (~80 G). If you have more than 100K variants, you can either split your file and run
+the web app multiple times, or download the large file and try ``tabix``.
+Other scores can be obtained using a similar method after download.
 After obtaining the per-mutation score, you can calculate the average score per element, which will be used by DriverPower.
+Here we show how to score 1,000 mutations and calculate per-element score:
+
+.. code-block:: bash
+
+    # We omit INDELs here; but CADD can score INDELs in VCF format
+    zcat ./random_mutations.tsv.gz | \
+        awk 'BEGIN{OFS="\t"} $4 != "-" && $5 != "-" {print $1,$3,".",$4,$5}' | \
+        head -n 1000 | gzip > random_mutations.1K.vcf.gz
+    # Upload formatted variants (random_mutations.1K.vcf.gz) to CADD's web interface
+    # and download the result file (something like GRCh37-v1.4_f8600bd0c0aa23d4f6abc99eb8201222.tsv.gz).
+    #####
+    # Intersect the score file (we use the PHRED score) with test elements
+    zcat ./GRCh37-v1.4_f8600bd0c0aa23d4f6abc99eb8201222.tsv.gz | \
+        tail -n +3 | awk 'BEGIN {OFS="\t"} {print "chr"$1, $2-1, $2, $6}' | \
+        bedtools intersect -a ./test_elements.tsv -b stdin -wa -wb > CADD_ele.tsv
+    # The 4th column is the element ID and the 8th column is the CADD PHRED score
+    printf "binID\tCADD\n" > CADD_per_ele_score.tsv
+    bedtools groupby -i ./CADD_ele.tsv -g 4 -c 8 -o mean >> CADD_per_ele_score.tsv
+
+We can now supply the per-element score file to DriverPower and call driver candidates:
+
+.. code-block:: bash
+
+    driverpower infer \
+        --feature test_feature.hdf5 \
+        --response test_y.tsv \
+        --model ./output/tutorial.GBM.model.pkl \
+        --name 'DriverPower_burden_function' \
+        --outDir ./output/ \
+        --funcScore CADD_per_ele_score.tsv \
+        --funcScoreCut "CADD:0.01"
+
 4: Misc.
 --------
+TODO
\ No newline at end of file
diff --git a/driverpower/__init__.py b/driverpower/__init__.py
index cd7ca498..a6221b3d 100644
--- a/driverpower/__init__.py
+++ b/driverpower/__init__.py
@@ -1 +1 @@
-__version__ = '1.0.1'
+__version__ = '1.0.2'
diff --git a/setup.py b/setup.py
index e867a9e1..0a9400c4 100644
--- a/setup.py
+++ b/setup.py
@@ -20,6 +20,8 @@
         'scikit-learn >= 0.18',
         'statsmodels >= 0.6.1',
         'xgboost >= 0.6a',
+        'pybedtools >= 0.7.10',
+        'pytables >= 3.4.4',
     ],
     entry_points = {
         'console_scripts': [