Skip to content

dcHiC V1 Wiki Documentation

ay-lab edited this page Nov 30, 2021 · 4 revisions

dcHiC V1 Documentation

We just pushed out our newest version of dcHiC, one complete with sizable methodology and implementation changes. As such, the previous dcHiC wiki pages are now being merged into one global one. See the sidebar for any information/topics of interest.

Enriched Gene Ontologies

Significant differential compartment changes directly reflect biological changes. To utilize this, we provide a utility with a broad set of features to perform experiment and experiment-group specific GO enrichment. It extracts significant biological relationships from compartment changes alone. The core idea is a three step process:

  1. Using MFA-normalized, comparable PC values, identify bins where specific cell line(s) have PC values that are comparatively greater/less than the others (different run modes explained below).
  2. Overlap these with significant differential bins (as identified by dcHiC) and find all genes.
  3. Using ToppGene's web API, use these genes to perform GO enrichment.

All of this can be run in one step using gofilter.py. Results will be output in a tab-delimited text file. There are a few user options:

Configuration

In a cohort of cell lines, you can examine biological processes based on compartment changes in one specific cell line or in a group of cell lines. Determining the comparison configuration—which cell line(s) to examine—is the first step.

This is done with a two-line configuration file, where the first line contains experiment(s) you specifically wish to examine and the second line contains all others (comma-separated). These should be the same as those in the dcHiC -inputFile.

For instance, let's say you want to compare five different cell lines: A, B, C, D, and E. If you want to see A-specific biological processes, this is the configuration file:

A
B,C,D,E

If A and B belong to one class of cells and C/D/E belong to another, you can examine A and B together (as a cell group) like this:

A,B
C,D,E

Orientation

Once you have decided which cell line/cell group to examine, you must choose whether you want to look for positively/negatively enriched biological processes. Specifically, there are two run orientations: "active to inactive" and "inactive to active."

The first will take all bins where the specified cell line/cell group is positively enriched (see "Run Option" for more specifics). The second will take all bins where the specified cell line/cell group is negatively enriched.

Run Option

There are 3 run options: "direction," "sign," and "both." The combination of run option + orientation determines the analysis procedure.

"Sign" is the most general option. It only looks at the PC value sign of the specified cell line/group. For "active to inactive," it takes all bins where the cell line/group has positive PC values. For "inactive to active," it takes bins where the cell line/group has negative values.

"Direction" is more specific. For "active to inactive," it takes bins where PC values of the specified cell line/group are greater than all others. For "inactive to active," it takes bins where PC values of the specified cell line/group are less than all others. This is the typically-used run option.

"Both" combines these. For "active to inactive," it takes bins where PC values of the specified cell line/group are greater than all others and positive. For "inactive to active," it takes bins where PC values of the specified cell line/group are less than all others and negative.

Slack

With the configuration, orientation, and run option, dcHiC can identify all bins where a cell line/group is positively or negatively enriched. The final step is to overlap these with significant differential bins and perform GO enrichment on those genes. In the overlap process, however, some important genes may lay in bins just outside of significance thresholds or adjacent to very significant PC changes. As such, users can specify a "slack" in this overlap process (the default is 0).

Typical Run Mode

The number of options may seem intimidating. To save you some time, we offer a word of advice from our testing on the best run modes. For orientation, we typically used "active to inactive" is used to find biologically processes that are positively enriched in a given cell line/group. In run option, "direction" and "both" provided more specificity than "sign," and we used a slack of 0.

Run Options

Argument Meaning
-dir Overarching run directory, with all chromosome ("chr_XX") directories
-diffcompt Differential compartment file (be sure not to use "full compartment details")
-config Experiment configuration text file
-outprefix Prefix for output files
-genome Genome: mm10/mm9/hg38/hg19
-geneBed Bed file containing gene positions (hg38/mm10 examples available in "files").
-runOption 1 for direction, 2 for sign, 3 for both
-orientation 1 for Active to Inactive, 2 for Inactive to Active
-slack Number of bins around a differential compartment to take genes from. Default = 0.

Example

Consider a scenario examining three cell lines: mESC (mouse embryonic stem cells), NPC (neural progenitors), and CN (cortical neurons). When examining the significant differential compartments called by dcHiC where mESC > NPC and mESC > CN, we expect to find important genes for mESC-specific processes. In this example, we would use -runOption 1 (direction). From there, identifying important mESC GO biological processes in these compartments is a two-step process.

#1) Create the configuration file:

mESC
NPC,CN

#2) Run the script.

python gofilter.py -config config.txt -genome mm10 -geneBed /path/to/mm10_gene_pos.bed -runOption 1 -outprefix mESC_direction -orientation 1 -dir /path/to/Directory -diffcompt /path/to/Differential_Compartments.bedGraph 

Look for a final file titled "GO_mESC_direction.txt" that will have GO results in a tab-delimited format.

If instead interested in neuronal-specific genes, you might want to look at genes in differential compartments where NPC and CN are both positive and have greater compartment values than mESC. In that case, the configuration file would look like this:

NPC,CN
mESC

Mice Neural Differentiation Tutorial

This is a quick tutorial on running dcHiC with neural differentiation data from "Multiscale 3D Genome Rewiring During Mice Neural Development," a 2017 paper from Bonev, et al. It will cover all steps, including HMFA, differential calling and visualization.

First, download some sample pre-processed data from this Dropbox link. This contains 100kb data for two replicates for chromosomes 16 and 17, the gene position bed file, and LaminB data for mESC/NPC. Also download the blacklist from "files" in this GitHub repository and the goldenPath GC/TSS data for eigenvector alignment here.

Initial Steps

First, make sure your environment contains all of the necessary pre-requisite packages, and that you've cloned the code repository to a directory. Create a separate repository to run everything. In there:

Create the tab-delimited input.txt file:

mESC.Rep1    mESC    /path/to/mESC_Rep1_Data
mESC.Rep2    mESC    /path/to/mESC_Rep2_Data
NPC.Rep1     NPC     /path/to/NPC_Rep1_Data
NPC.Rep2     NPC     /path/to/NPC_Rep2_Data
CN.Rep1      CN      /path/to/CN_Rep1_Data
CN.Rep2      CN      /path/to/CN_Rep2_Data

Create the chr.txt file:

16
17

Running dcHiC and Visualization

From there, we run dcHiC with the two chromosomes in sequence (will require 4+ GB of RAM). Be sure to change the paths to dchic.py and the mm10_goldenpathData. Also be sure to have the blacklist from our repository. You can download goldenpathData for both mice and human genomes here.

python /path/to/dchic.py -res 100000 -inputFile input.txt -chrFile chr.txt -input 2 -alignData /path/to/mm10_goldenpathData -genome mm10 -blacklist mm10blacklist_sorted.bed

This will produce all results specified in the documentation.

To visualize pairwise comparisons, create a tab-delimited visualization.txt file.

NPC_vs_mESC_full_compartment_details.bedGraph      NPC_vs_mESC    compartment
CN_vs_NPC_full_compartment_details.bedGraph        CN_vs_NPC      compartment
CN_vs_meSC_full_compartment_details.bedGraph       CN_vs_mESC     compartment
MultiComparison_full_compartment_details.bedGraph  MultiComp      compartment
mESC_LaminB1_mm10.10Kb.bedGraph                    mESC_LaminB1   lamin
NPC_LaminB1_mm10.10Kb.bedGraph                     NPC_LaminB1    lamin

Using this, run:

Rscript /path/to/igvtrack.r mm10 visualization.txt

Results

For a quick check, your visualization results should match the chr16/chr17 results here. This web file also has RNA-Seq TPM values.

You may also check the ~45-50 mb region of mESC to NPC/CN, where there is a well-documented compartment change of the Dppa4 gene domain. This gene, one associated with STEM-cell pluripotency, should be in the A compartment in mESC's and in the B compartment elsewhere.


Pre-Processing O/E Correlation Matrices

When performing compartment analysis on Hi-C datasets, the first step is to obtain O/E correlation matrices. dcHiC uses dense tab-delimited matrices as its input, like this one. We offer a local processing pipeline to convert sparse matrices (and corresponding bed files) to O/E correlation matrices; dcHiC also accepts input from HOMER and Juicebox. See input options below:

Hi-C Pro

Within the "data" section of the Hi-C Pro "hic_results" folder, there are validPairs files containing filtered reads for each data replicate processed in the format below. Because dcHiC learns parameters from replicates, it needs these validPairs interactions. These are not the allValidPairs, except possibly in cases of running dcHiC without replicates.

read name / chr_reads1 / pos_reads1 / strand_reads1 / chr_reads2 / pos_reads2 / strand_reads2 

You will further see in the "matrix" section of the "hic_results" directory that there are sparse matrices and their corresponding bed files.

Using the "build_matrix" executable (cpp here) under the HiC-Pro utilities folder, first create the sparse matrix/bed files for each replicate's validPairs. After doing this, you will need to run createMatrix.R, which requires an input text file for specifications.

It should be like the one below, with rows delimiting Hi-C profiles: the first column for matrices, the second for bed files, and the third for the file prefix. All matrices should be created from raw Hi-C data without any normalization (KR, etc.).

/path/to/HMEC_Rep_1.matrix    /path/to/HMEC_Rep_1.bed    HMEC.1
/path/to/HMEC_Rep_2.matrix    /path/to/HMEC_Rep_2.bed    HMEC.2
/path/to/T47D_Rep_1.matrix    /path/to/T47D_Rep_1.bed    T47D.1
... 

From here, run createMatrix.R. The first argument is the input text file, and the second is the user-desired name for a directory containing results.

Rscript createMatrix.R input.txt CorrelationMatrices

The resulting directory will contain a directory for each prefix (like "HMEC.1_mat") with ".matrix" files that are ready to use in dcHiC. Be sure to specify "-input 2" when running dchic.py.

.hic

To process .hic files, dcHiC uses the native Pearsons function of the Java JAR of Juicebox tools to perform calculations and merely does some reformatting in preprocess.py, whose arguments are shown below:

Argument Meaning
-input The O/E contact matrix from Juicebox or the .cool file (see below)
-mode A number to specify processing mode: 1 for Juicebox input and 2 for .cool input
-genomeFile Chromosome size files (two column tab-separated file with chromosomes and their sizes)
-res Resolution of processing (i,e. 100000)
-prefix For a Juicebox input, enter the chromosome of the file. For a .cool input, enter the prefix.

To run, follow appropriate steps to run O/E correlation matrix processing (Using the JAR, add a -p for resolutions higher than 500kb).

java -jar juicer_tools.jar pearsons NONE hicfile.hic 3 BP 100000 pearsons_100Kb.txt -p # pearsons_100Kb.txt is the O/E correlation of chr3
python preprocessing.py -input pearsons_100kb.txt -mode 1 -genomeFile hg38sizes.txt -res 100000 -prefix 3 

**After preprocess.py, O/E matrices are ready to use in dcHiC. Be sure to specify "-input 2" in dchic.py arguments. **

.cool

To process .cool files, dcHiC uses the cooler dump feature to obtain the sparse matrix and uses the provided -genomeFile to produce a corresponding bed index file. It accepts .mcool and .cool, and the resulting data can be processed into O/E matrices with createMatrix.R.

python preprocessing.py -input coolfile.mcool -mode 2 -genomeFile hg38sizes.txt -res 100000 -prefix coolfile

**NOTE: If your .cool/.mcool file only covers certain chromosome(s), change the -genomeFile so that it only specifies those. **

HOMER

For those familiar with HOMER, it can also be quickly repurposed to pre-process O/E matrices to use in dcHiC.

To begin, create a tag directory with Hi-C data in any way you find preferable. For HiC-Pro data, you can use the replicate validPairs and "makeTagDirectory" with the -HiCSummary option.

Afterward, run "runHiCpca.pl" with a few edits. Ensure that the functions "analyzeHiC" and "prepForR.pl" point to their original locations, and then comment out everything with the PCA step and after (as you will be performing MFA!) beyond the lines:

while ($id >= 0) {
        $id = wait();
}

Also remove "chrTmpFile2" in this line, as these files are the O/E correlation matrices.

`rm -f "$rInputFile" "$chrTmpFile" "$chrTmpFile2"`;

The resulting O/E files will have the extension "chr__.2.tmp." The complete command may look like this:

/path/to/makeTagDirectory mESC_vP -format HiCsummary /path/to/mESCrepValidPairs
/path/to/runHiCpca.pl mESC_out mESC_VP -res 100000 -genome mm10 -pc 2

**After HOMER processing, these files are ready to use in dcHiC. Be sure to specify "-input 1" in dchic.py program arguments. **


Running dcHiC Without Replicates

Differential calling with dcHiC "learns" the amount that PC (compartment) values vary between biological replicate datasets and uses those parameters for significance thresholds. However, it is also possible to run dcHiC from start to finish without replicates (for users using HiC-Pro, this means using the allValidPairs file).

In the input.txt file, put the same name for the "replicate" and "cell line" columns.

HMEC  HMEC    /path/to/HMEC
MCF7  MCF7    /path/to/MCF7
MCF10 MCF10   /path/to/MCF10

When running the application, specify the replicate parameters (for differential calling) using the -repParams option. The sample replicate parameter files for human datasets were created using Tier 1 ENCODE GM12878 and HMEC datasets for human samples/ The mice replicate parameter files were created using gold standard mice neural differentiation data (the same as that in our tutorial).

The final command might look like this:

python dchic.py -res 100000 -inputFile input.txt -chrFile chr.txt -input 2 -parallel 4 -genome mm10 -alignData /path/to/mm10_goldenPathData -repParams miceparams.txt -blacklist mm10blacklist_sorted.bed