Version 1.0
Date 11/12/2018
Description DPM-LGCP stands for Dirichlet Prior mixture for Log Gaussian Cox Process. It is a non-parametric Bayesian clustering technique based on non-homogeneous Poisson process which has been used to identify transcriptional regulatory modules in distinct chromatin states. The algorithm clusters transcription factors sharing similar intensity function (binding patterns). It has broader applications to cases requiring un-supervised clustering of large datasets with prior information on clusters. Author Sharmi Banerjee, Honxiao Zhu, Man Tang, Xiaowei Wu, Wu Feng, David Xie
The following content details on how to use the package in addition to a chromatin identification tool to predict Transcriptional Regulatoy Modules (TRMs) within the different chromatin states.
DP-LGCP is written in R software. Please install it on the system before moving on to the next steps.
DP-LGCP is independent of operating systems as it is written in R. Basic requirements for running the pipeline include installing R and the following libraries- ‘miceadds’, ‘foreach’, ‘doParallel’, ‘igraph’, ‘caTools’, ‘ComplexHeatmap’, ‘pvclust’, ‘MASS’, ‘circlize’, ‘RColorBrewer’, ‘pheatmap’. INLA package in provided here. Please see description below on how to add the package to the R libraries.
Name | Description | Scenario |
---|---|---|
INLA | Source files included with INLA package | Used in both 'Real' and 'Simulation' |
Real | Contains input, scripts and output for 'Real' scenario | 'Real |
Real/Data | Contains input files for 'Real' scenario | 'Real' |
Real/Main_code | Contains R scripts for clustering in the 'Real' scenario | 'Real' |
Real/Results | Contains output files generate after running the scripts for 'Real' scenario | 'Real' |
Simulation | Contains input, scripts and output for 'Simulation' scenario | 'Simulation' |
Simulation/data | Contains input files for 'Simulation' scenario | 'Simulation' |
Simulation/code | Contains R scripts for clustering in the 'Simulation' scenario | 'Simulation' |
Simulation/results | Contains output files generated for 'Simulation' scenario | 'Simulation' |
-
'INLA': contains the R INLA library
-
'Real': contains script, data and results for transcription factors from neural stem cells
-'Data': peak files for twenty-one transcription factors, diHMM generated domain bed files for thirty domains and mm10 refseq genes. The basic input data requirements towards using the peipeline are
-
File generate by a genome segmentation software such as diHMM. This file is similar to a bed file with the first three columns being Chromosome, Start, End, and the fourth column containing the segment label such as ‘D1’, representing a domain-level state. We followed the instructions from the diHMM website (https://github.com/gcyuan/diHMM) on how to use the software and generate the results. On the Github link the file name is ‘NSC_nD30_nB30_domainLevelStatesColor.bed’ which means the file was generated by using data from Neural Stem Cells (NSC), with 30 Domain-level and Nucleosome-level states (diHMM parameters specified on the diHMM website)
-
Peak files for different transcription factors. These files should be at least 3–column bed formatted files (Chromosome, Start, End) and can be generated by following a standard ChIP-seq pipeline. We downloaded the peak files from ChIP-Atlas (they used MACS) for peak calling
-'Main_Code': scripts to generate clusters in chromatin states
-'Plotting_Code': scripts to generate enrichment maps of TF peaks and TF modules in chromatin states
-'Results': clustering results (RData files) on chromatin states
-
-
'Simulation': has code and results for simulated data
-'code': scripts to simulate transcription factors binding sites with twenty TFs and three clusters and evaluate results by applying clustering algorithm using non-homogeneous Poisson point process
-'data': simulated data generated from scripts in ’code’ folder
-'results': clustering results on simulated data
-'Neal_Alg3_funs': has the main algorithm clustering functions
Download and copy INLA folder under LGCP-package to the R library of the system. For example for a R 3.3 version, the path is - /Library/Frameworks/R.framework/Versions/3.3/Resources/library
- Set working directory in R to the path up to the Simulation directory in LGCP package. E.g. (on R console) -
setwd('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Simulation/')
- Set a variable 'workpath' in R console. E.g. (on R console) -
workpath = "/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Simulation/code/”
- Source the INLA functions needed. E.g.(on R console)
source.all('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Neal_Alg3_funs/')
source('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Simulation/code/lgcp_simu_funs_final.R')
- Run simulation script. E.g. (on R console) -
source("script_simulate_data.R")
- Evaluate clustering result. E.g. (on R console)
source("script_run_Neal_alg3.R")
• Three 1D non-homogeneous Poisson processes with Log-Gaussian intensity were generated. Next, binding site locations of twenty transcription factors were simulated by drawing independent samples from these non-homogeneous Poisson processes
• The user does not need to provide any additional input. Simulation input data is put under the folder ‘Simulation/data’ folder with the name ‘lgcp_simu_data_v0.RData’. The user can generate her/his own simulation data by changing parameters in the script ‘script_simulate_data.R’ under ‘Simulation/code/’
• Using the simulated binding site locations, we applied the clustering algorithm. The TFs were randomly assigned to any cluster at initializations. The results are shown after ten iterations –
- initial error rate: 0.39
- estimation error rate: 0
- overall true marginal likelihood: 76.73
- overall estimated marginal likelihood: 76.73
- true marginal likelihood per cluster: 27.27, 14.95, 34.49
- estimated marginal likelihood per cluster: 27.27, 14.95, 34.49
- Set working directory in R to the path up to the Real directory in LGCP package. E.g. (on R console)-
setwd('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Real/Main_Code/')
- Set a variable 'workpath' in R console. E.g. (on R console) -
workpath = "/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/”
- Source the INLA functions needed E.g. (on R console)
source.all('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Neal_Alg3_funs/')
- Run the following scripts (on R console)–
source("script_generate_TF_peaks_in_diHMM_states.R")
This step generates a results (in .RData) format containing the TF peaks contained within each chromatin state
source("script_get_clusters_for_each_diHMM_state.R")
This step generates clustering results within each chromatin state using results from step (a)
-
Files generated
a. Clustering results will be stored in the ‘results’ folder under ‘Real’ with the file names diHMM_celltype_cluster_domainname.RData
b. TF peaks in each chromatin state will be generated with names celltype_TFpeaks_diHMM_chromatinstate.RData
- Set working directory in R to the path up to the Real directory in LGCP package. E.g.(on R console) -
setwd('/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/Real/Plotting_Code/')
- Set work path in R console. E.g. (on R console)-
workpath = “/Users/sharmibanerjee/Documents/Summer2017/LGCP_package/”
- Run the following scripts to generate heat map of TF peaks within different chromatin states. E.g. (on R console)
source("TF_peak_enrichment_in _states.R")