Merge pull request #2 from databio/dev

Development changes into master
databio · Apr 17, 2017 · 1b0503e
2 parents f172220 + 47dc737
commit 1b0503e
Showing 20 changed files with 522 additions and 181 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+*.pyc
+.~lock*
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,20 @@
+# Change log
+All notable changes to this project will be documented in this file.
+
+## [0.2.0]
+### Added
+- FRiP can now be calculated based on reference peaks
+- Pipeline now reports Picard estimated library size statistic
+- Added option for pyadapt trimming
+- Added example project using 'gold standard' data
+- Added new resource package grades
+- Added preliminary 'exact cuts' scripts, but they are not yet used
+
+### Changed
+- Improved README
+- Changed filename of the TSS file
+- Reorganized structure of alignment code
+
+## [0.1.0]
+### Added
+- First release of ATAC-seq pypiper pipeline
diff --git a/README.md b/README.md
@@ -2,6 +2,22 @@
 
 This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs.
 
+## Pipeline features outlined
+
+**Decoy alignments.** Before aligning to the genome, we first align to decoy sequences. This has several advantages: it speeds up the process dramatically, reduces noise from erroneous alignments, and provides potential to analyze signal at repeats. The pipeline will align *sequentially* to these decoy sequences (if provided):
+
+- chrM (doubled; for non-circular aligners, to draw away reads from NuMTs)
+- Alu elements
+- alpha satellites
+- rDNA
+- repbase
+
+We have provided indexed assemblies for download for each of these **for human** in the [ref_decoy](https://github.com/databio/ref_decoy) repository (excluding repbase, which is not publicly available). Any assemblies not provided are skipped.
+
+**Fraction of reads in peaks (FRiP).** By default, the pipeline calculates the FRiP as a quality-control metric, using the peaks it identifies internally. Optionally, it can **additionally** calculate a FRiP using a reference set of peaks (for example, from another experiment). To enable this, provide a reference peak set (as a BED file) by adding a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)). In that column, specify the reference peak filename (or use a derived column and specify the path in the `data_sources` section of the project config file).
+
+
+
 ## Installing
 
 **Prerequisites**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a pipeline for a single sample, and [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation). You can do a user-specific install of both like this:
@@ -18,13 +34,14 @@ export PATH=$PATH:~/.local/bin
 
 **Required executables**. To run the pipeline, you will also need some common bioinformatics tools installed. The list is specified in the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) tools section.
 
-**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). The pipeline aligns serially to decoy sequences if you have them set up, which greatly improves pipeline performance. You can set up the decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
+**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You can set up the (optional) decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
 
 **Clone the pipeline**. Then, clone this repository using one of these methods:
 - using SSH: `git clone git@github.com:databio/ATACseq.git`
 - using HTTPS: `git clone https://github.com/databio/ATACseq.git`
 
 ## Configuring
+
 You can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, there is a pre-made config file and project template. Follow the instructions on the [Chang lab configuration](examples/chang_project) page.
 
 Option 1: **Default configuration** ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)). 
@@ -68,6 +85,29 @@ Your annotation file must specify these columns:
 
 Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns).
 
+## TSS enrichments
+
+In order to calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that.
+
+From refGene:
+
+```
+# Provide genome string and gene file
+GENOME="hg38"
+URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"
+
+wget -O ${GENOME}_TSS_full.txt.gz ${URL}
+zcat ${GENOME}_TSS_full.txt.gz | awk  '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}'  | LC_COLLATE=C sort -k1,1 -k2,2n -u > ${GENOME}_TSS.tsv
+echo ${GENOME}_TSS.tsv
+```
+
+Another option, from a GENCODE GTF:
+
+```
+grep "level 1" ${GENOME}.gtf | awk '$3=="gene"{if($7=="+"){print $1"\t"$4"\t"$4"\t"$7}else{print $1"\t"$5"\t"$5"\t"$7}}' | LC_COLLATE=C sort -u -k1,1V -k2,2n > ${GENOME}_TSS.tsv
+
+```
+
 ## Using a cluster
 
 Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project.  You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html).
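
Note on the TSS enrichments section added to the README in this diff: the refGene one-liner picks `txStart` as the TSS for `+` strand transcripts and `txEnd` for `-` strand transcripts, then sorts and de-duplicates by chromosome and position. The Python sketch below restates that logic for readers who find the `awk` version hard to follow. The function name `refgene_to_tss` is hypothetical and not part of the pipeline, and keeping the first record seen at each position is an assumption about how duplicates should be resolved.

```python
def refgene_to_tss(lines):
    """Convert tab-separated refGene rows into sorted TSS records.

    Column positions follow the awk command in the README:
    3 = chrom, 4 = strand, 5 = txStart, 6 = txEnd, 13 = gene name.
    """
    first_seen = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, strand, gene_name = fields[2], fields[3], fields[12]
        # txStart is the TSS on the + strand; txEnd is the TSS on the - strand.
        tss = int(fields[4]) if strand == "+" else int(fields[5])
        # Keep one record per (chrom, position), as `sort -u` does with its keys.
        first_seen.setdefault((chrom, tss), (chrom, tss, tss, strand, gene_name))
    # Sort by chromosome name (plain string order), then numerically by position.
    return [first_seen[key] for key in sorted(first_seen)]
```

Each returned tuple corresponds to one line of the `${GENOME}_TSS.tsv` file described above.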

diff --git a/cmd.sh b/cmd.sh
@@ -1,3 +1,3 @@
-# using pre-fix of fastq file 
-#python pipelines/ATACseq.py  -P 3 -M 100 -O test_out -R -S liver -G mm9  -Q paired  -C ATACseq.yaml  -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz  
-python pipelines/ATACseq.py  -P 3 -M 100 -O test_out -R -S liver -G hg19  -Q paired  -C ATACseq.yaml  -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz  
+# using pre-fix of fastq file
+#python pipelines/ATACseq.py  -P 3 -M 100 -O test_out -R -S liver -G mm9  -Q paired  -C ATACseq.yaml  -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
+python pipelines/ATACseq.py  -P 3 -M 100 -O test_out -R -S liver -G hg19  -Q paired  -C ATACseq.yaml  -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz  
diff --git a/config/pipeline_interface.yaml b/config/pipeline_interface.yaml
@@ -11,10 +11,21 @@ ATACseq.py:
     "--input2": read2
     "-gs mm": null
     "--single-or-paired": read_type
+  optional_arguments:
+    "--frip-ref-peaks": FRIP_ref
   resources:
     default:
       file_size: "0"
+      cores: "2"
+      mem: "4000"
+      time: "0-04:00:00"
+    normal:
+      file_size: "0.5"
+      cores: "4"
+      mem: "16000"
+      time: "2-00:00:00"
+    large:
+      file_size: "6"
       cores: "8"
       mem: "32000"
-      time: "2-00:00:00"
-      partition: "parallel"
+      time: "3-00:00:00"
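
Note on the resource packages added in this hunk: each package carries a `file_size` threshold, and the intent appears to be that larger inputs get more cores, memory, and time. The sketch below shows one plausible selection rule, assuming that the package with the largest `file_size` not exceeding the input size is chosen and that sizes are in gigabytes. Both points are assumptions to check against the looper documentation, and `select_resource_package` is a hypothetical helper, not a looper function.

```python
# Values copied from the new config/pipeline_interface.yaml in this diff.
RESOURCE_PACKAGES = {
    "default": {"file_size": "0", "cores": "2", "mem": "4000", "time": "0-04:00:00"},
    "normal": {"file_size": "0.5", "cores": "4", "mem": "16000", "time": "2-00:00:00"},
    "large": {"file_size": "6", "cores": "8", "mem": "32000", "time": "3-00:00:00"},
}


def select_resource_package(packages, input_size_gb):
    """Return the name of the package with the largest threshold <= input size."""
    eligible = [
        name for name, pkg in packages.items()
        if float(pkg["file_size"]) <= input_size_gb
    ]
    if not eligible:
        raise ValueError("no resource package accepts an input of this size")
    return max(eligible, key=lambda name: float(packages[name]["file_size"]))
```

Under these assumptions, a 0.1 GB input would use `default`, a 2 GB input `normal`, and a 10 GB input `large`.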
diff --git a/config/protocol_mappings.yaml b/config/protocol_mappings.yaml
@@ -1 +1,2 @@
-ATAC: ATACseq.py
+ATAC: ATACseq.py
+ATAC-SEQ: ATACseq.py
diff --git a/examples/gold_atac/README.md b/examples/gold_atac/README.md
@@ -0,0 +1,22 @@
+
+# Gold ATAC
+
+Testing ATAC-seq pipeline on gold standard public ATAC-seq data.
+
+## Grab data, project setup
+
+Download raw `fastq.gz` files from SRA (for example, with `fastq-dump`). You may also use `get_geo.py` to download raw ATAC-seq reads from SRA and metadata from GEO:
+
+```
+python get_geo.py -i ~/code/ATACseq/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq
+```
+
+I used the resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv).
+
+I created a project config file and sampled test data. The SRA fastq files should be stored in a folder `${SRAFQ}`; the project will then run with looper with no additional changes.
+
+## Run pipeline
+
+```
+looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d
+```
diff --git a/examples/gold_atac/metadata/annocomb_gold_atac_gse.csv b/examples/gold_atac/metadata/annocomb_gold_atac_gse.csv
@@ -0,0 +1,6 @@
+sample_name,Sample_title,Sample_source_name_ch1,organism,Sample_organism_ch1,library,Sample_library_selection,Sample_library_strategy,data_source,Sample_type,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model
+ATAC-seq_from_dendritic_cell_(ENCLB065VMV),ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000
+ATAC-seq_from_dendritic_cell_(ENCLB811FLK),ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000
+ATAC-seq_from_dendritic_cell_(ENCLB887PKE),ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500
+ATAC-seq_from_dendritic_cell_(ENCLB586KIS),ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000
+ATAC-seq_from_dendritic_cell_(ENCLB384NOX),ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000
diff --git a/examples/gold_atac/metadata/gold_atac_annotation.csv b/examples/gold_atac/metadata/gold_atac_annotation.csv
@@ -0,0 +1,7 @@
+sample_name,sample_description,treatment_description,organism,library,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model,read1,read2
+test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2
+gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
+gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
+gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2
+gold4,ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
+gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
diff --git a/examples/gold_atac/metadata/gold_atac_gse.csv b/examples/gold_atac/metadata/gold_atac_gse.csv
@@ -0,0 +1,5 @@
+GSE94182
+GSE94222
+GSE94177
+GSE94196
+GSE94173
diff --git a/examples/gold_atac/metadata/project_config.yaml b/examples/gold_atac/metadata/project_config.yaml
@@ -0,0 +1,27 @@
+# This project config file describes your project. See looper docs for details.
+
+metadata:  # relative paths are relative to this config file
+  sample_annotation: gold_atac_annotation.csv  # sheet listing all samples in the project
+  output_dir: ${PROCESSED}gold_atac  # ABSOLUTE PATH to the parent, shared space where project results go
+  pipelines_dir: "${CODEBASE}ATACseq"  # ABSOLUTE PATH to the directory where looper will find the pipeline repository
+
+# in your sample_annotation, columns with these names will be populated as described 
+# in the data_sources section below
+derived_columns: [read1, read2]  
+
+data_sources:  # This section describes paths to your data
+  # specify the ABSOLUTE PATH of input files using variable path expressions
+  # These keys then correspond to values in your sample annotation columns.
+  # Variables specified using brackets are populated from sample_annotation columns. 
+  # Variable syntax: {column_name}. For example, use {sample_name} to populate
+  # the file name with the value in the sample_name column for each sample.
+  # example_data_source: "/path/to/data/{sample_name}_R1.fastq.gz"
+  SRA: "${SRABAM}{SRR}.bam"
+  SRA_1: "${SRAFQ}{SRR}_1.fastq.gz"
+  SRA_2: "${SRAFQ}{SRR}_2.fastq.gz"
+  TEST_1: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r1.fastq.gz"
+  TEST_2: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r2.fastq.gz"
+
+genomes:
+  human: hg38
+  mouse: mm10
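
Note on the `derived_columns` and `data_sources` sections of this project config: the value in a derived column (for example `SRA_1` in `read1`) names a `data_sources` template, whose `${VAR}` parts come from environment variables and whose `{column}` parts come from the sample's annotation row. The sketch below illustrates that two-step expansion. It is an approximation of the behavior described in the config comments, not looper's actual implementation, and `resolve_derived_column` is a hypothetical name.

```python
import os
from string import Template


def resolve_derived_column(sample, column, data_sources, env=None):
    """Expand a derived column into a concrete file path.

    sample:       dict of annotation columns for one sample
    column:       name of the derived column, e.g. "read1"
    data_sources: dict mapping keys such as "SRA_1" to path templates
    env:          mapping used for ${VAR} expansion (defaults to os.environ)
    """
    env = os.environ if env is None else env
    template = data_sources[sample[column]]
    # Step 1: fill in ${VAR} from the environment; {column} fields are untouched.
    with_env = Template(template).substitute(env)
    # Step 2: fill in {column} fields from the sample's annotation values.
    return with_env.format(**sample)


data_sources = {
    "SRA_1": "${SRAFQ}{SRR}_1.fastq.gz",
    "TEST_1": "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r1.fastq.gz",
}
gold1 = {"sample_name": "gold1", "SRR": "SRR5210416", "read1": "SRA_1"}
# The directory below is a placeholder value for the SRAFQ environment variable.
print(resolve_derived_column(gold1, "read1", data_sources, env={"SRAFQ": "/data/sra/"}))
```

Environment variables are expanded first because `${SRAFQ}` would otherwise be misread as a `{SRAFQ}` column placeholder by `str.format`.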
diff --git a/examples/test_data/test1_r1.fastq.gz b/examples/test_data/test1_r1.fastq.gz
diff --git a/examples/test_data/test1_r2.fastq.gz b/examples/test_data/test1_r2.fastq.gz