This repository hosts a Nextflow workflow to call TAD cliques.
The workflow is largely based on this tutorial.
- Nextflow (v22.10.8 or newer; the pipeline was developed using v24.04.2)
- Docker or Singularity/Apptainer
Running the pipeline without containers is technically possible, but it is not recommended.
If you absolutely cannot use containers...
Have a look at the env.yml for the list of dependencies to be installed.
To install the dependencies in a Conda environment named myenv, run the following:
conda env update --name myenv --file env.yml --prune
You will also need to compile NCHG from the source code available at paulsengroup/NCHG.
The workflow can be run in two ways:
- Using a sample sheet (recommended, supports processing multiple samples at once)
- By specifying options directly on the CLI or using a config
The samplesheet should be a TSV file with the following columns:
sample | hic_file | resolution | tads |
---|---|---|---|
sample_name | myfile.hic | 50000 | tads.bed |
4DNFI74YHN5W | 4DNFI74YHN5W.mcool::/resolutions/50000 | 50000 | 4DNFI74YHN5W_domains.bed |
- sample: Sample name/ID. This field is used as a prefix in the output file names (see below).
- hic_file: Path to a file in .hic or Cooler format.
- resolution: Resolution to be used for the data analysis (50-100kbp are good starting points).
- tads (optional): path to a BED3+ file with the list of TADs. When not specified, the workflow will use hicFindTADs from the HiCExplorer suite to call TADs.
URI syntax for multi-resolution Cooler files is supported (e.g. myfile.mcool::/resolutions/bin_size).
Furthermore, all contact matrices (and TADs when provided) should use the same reference genome assembly.
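The column rules above can be checked before launching the workflow. The following is a minimal sketch, not part of the pipeline: the function names `split_cooler_uri` and `check_samplesheet` are hypothetical, and the checks only cover what this section describes (tab-separated columns, a non-empty sample and hic_file, an integer resolution, an optional tads column).

```python
import csv
import io

# Column names as described in the samplesheet section; tads is optional.
REQUIRED_COLUMNS = ("sample", "hic_file", "resolution")


def split_cooler_uri(hic_file):
    """Split 'file.mcool::/resolutions/50000' into (path, group).

    The group is None when the value is a plain path (e.g. a .hic file).
    """
    path, sep, group = hic_file.partition("::")
    return path, (group if sep else None)


def check_samplesheet(text):
    """Return a list of human-readable problems found in a samplesheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        # A header typed with spaces instead of tabs is read as a single
        # column, so it is reported here as well.
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    for lineno, row in enumerate(reader, start=2):
        if not row["sample"]:
            problems.append(f"line {lineno}: empty sample name")
        if not row["hic_file"]:
            problems.append(f"line {lineno}: empty hic_file")
        resolution = row["resolution"] or ""
        if not resolution.isdigit() or int(resolution) <= 0:
            problems.append(f"line {lineno}: resolution must be a positive integer")
    return problems
```

An empty list means that no problem was detected by these checks; it does not guarantee that the workflow's own samplesheet validation will pass.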
Without using a samplesheet
To run the workflow without a samplesheet, the following parameters are required:
- sample
- hic_file
- resolution
Parameters have the same meaning as the header fields outlined in the previous section.
The above parameters can be passed directly through the CLI when calling nextflow run:
nextflow run --sample='4DNFI74YHN5W' \
--hic_file='data/4DNFI74YHN5W.mcool' \
--resolution=50000
...
Alternatively, parameters can be written to a config file:
user@dev:/tmp$ cat myconfig.txt
sample = '4DNFI74YHN5W'
hic_file = 'data/4DNFI74YHN5W.mcool'
resolution = 50000
and the config file is then passed to nextflow run:
nextflow run -c myconfig.txt ...
In addition to the mandatory parameters, the pipeline accepts the following parameters:
- cytoband: path to a cytoband file. Used to mask centromeric regions.
- assembly_gaps: path to a BED file with the list of assembly gaps/unmappable regions.
- custom_mask: path to a BED file with a list of custom regions to be masked out.
Note that NCHG by default uses the MAD-max filter to remove bins with suspiciously high or low marginals, so providing the above files is usually not required.
One exception is when dealing with genomes affected by structural variants, in which case we recommend masking out the affected regions using custom_mask.
- hicexplorer_hic_norm: normalization to use when calling TADs from .hic files.
- hicexplorer_cool_norm: normalization to use when calling TADs from .[m]cool files.
- nchg_mad_max: cutoff used by NCHG when performing the MAD-max filtering.
- nchg_bad_bin_fraction: bad bin fraction used by NCHG to discard domains overlapping with a high fraction of bad bins.
- nchg_fdr_cis: adjusted p-value cutoff used by NCHG to filter significant cis interactions.
- nchg_log_ratio_cis: log ratio cutoff used by NCHG to filter significant cis interactions.
- nchg_fdr_trans: adjusted p-value cutoff used by NCHG to filter significant trans interactions.
- nchg_log_ratio_trans: log ratio cutoff used by NCHG to filter significant trans interactions.
- clique_size_thresh: minimum clique size.
- call_cis_cliques: call cliques overlapping cis regions of the Hi-C matrix.
- call_trans_cliques: call cliques overlapping trans regions of the Hi-C matrix.
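For the custom_mask parameter described above, the regions to mask have to be provided as a BED file. The sketch below shows one way to produce BED3 text from a list of regions, for example regions known to be affected by structural variants. The function names and the example coordinates are hypothetical, and merging overlapping intervals is a tidy-up choice of this sketch rather than a documented requirement of the workflow.

```python
def merge_regions(regions):
    """Sort (chrom, start, end) tuples and merge overlapping or adjacent ones.

    Coordinates follow the BED convention (0-based start, exclusive end).
    Chromosomes are sorted lexicographically (chr10 sorts before chr2).
    """
    merged = []
    for chrom, start, end in sorted(regions):
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of emitting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def format_bed3(regions):
    """Return tab-separated BED3 text, one merged region per line."""
    return "".join(
        f"{chrom}\t{start}\t{end}\n" for chrom, start, end in merge_regions(regions)
    )


if __name__ == "__main__":
    # Hypothetical regions, used for illustration only.
    example_regions = [("chr3", 100, 200), ("chr1", 50, 80), ("chr3", 150, 300)]
    print(format_bed3(example_regions), end="")
```

The resulting text can be saved to a file and the file path passed through the custom_mask parameter.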
By default, the workflow results are published under result/. The output folder can be customized through the outdir parameter.
For a complete list of parameters supported by the workflow refer to the workflow main config file.
First, download the example datasets using the utils/download_example_datasets.sh script.
# This will download files inside folder data/
utils/download_example_datasets.sh data/
Next, create a samplesheet.tsv file like the following (make sure you are using tabs, not spaces!):
sample hic_file resolution tads
example data/4DNFI74YHN5W.mcool 50000
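Since editors often convert tabs to spaces silently, the samplesheet can also be generated programmatically. The sketch below uses Python's csv module with a tab delimiter; the function name `format_samplesheet` is hypothetical, and writing an empty trailing tads field for samples without TADs is an assumption about what the workflow accepts.

```python
import csv
import io

COLUMNS = ["sample", "hic_file", "resolution", "tads"]


def format_samplesheet(rows):
    """Return samplesheet text with tab-separated columns.

    Each row is a dict keyed by column name; a missing tads entry is
    written as an empty field.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        {
            "sample": "example",
            "hic_file": "data/4DNFI74YHN5W.mcool",
            "resolution": 50000,
        }
    ]
    with open("samplesheet.tsv", "w", newline="") as fh:
        fh.write(format_samplesheet(rows))
```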
Finally, run the workflow with:
user@dev:/tmp$ nextflow run --max_cpus=8 \
--max_memory=16.GB \
--max_time=2.h \
--sample_sheet=samplesheet.tsv \
--outdir=data/results/ \
https://github.com/robomics/call_tad_cliques \
-r v0.4.0 \
-with-singularity # Replace this with -with-docker to use Docker instead
N E X T F L O W ~ version 24.04.2
Launching `https://github.com/robomics/call_tad_cliques` [desperate_easley] DSL2 - revision: f89b6c923c [v0.4.0]
executor > local (41)
[c4/ae5ade] SAMPLESHEET:CHECK_SYNTAX | 1 of 1 ✔
[6a/eca471] SAMPLESHEET:CHECK_FILES | 1 of 1 ✔
[43/a52d61] TADS:SELECT_NORMALIZATION_METHOD (example) | 1 of 1 ✔
[6b/6a7925] TADS:APPLY_NORMALIZATION (example (50000; weight)) | 1 of 1 ✔
[34/a35f83] TADS:HICEXPLORER_FIND_TADS (example) | 1 of 1 ✔
[- ] TADS:COPY -
[27/c7fe33] NCHG:GENERATE_MASK | 1 of 1 ✔
[3e/21fbf1] NCHG:MASK_DOMAINS (example) | 1 of 1 ✔
[4e/de0183] NCHG:EXPECTED (example) | 1 of 1 ✔
[a5/1d67bc] NCHG:GENERATE_CHROMOSOME_PAIRS (example) | 1 of 1 ✔
[c8/2ab36d] NCHG:DUMP_CHROM_SIZES (example) | 1 of 1 ✔
[68/a97434] NCHG:COMPUTE (example (chr1:chr1)) | 21 of 21 ✔
[ef/423514] NCHG:MERGE (example (cis)) | 1 of 1 ✔
[de/121f6a] NCHG:FILTER (example (cis)) | 1 of 1 ✔
[0e/f58cfb] NCHG:VIEW (example (cis)) | 1 of 1 ✔
[e0/15439e] NCHG:CONCAT (example) | 1 of 1 ✔
[0d/307019] NCHG:PLOT_EXPECTED (example) | 1 of 1 ✔
[7d/27fffc] NCHG:GET_HIC_PLOT_RESOLUTION (example) | 1 of 1 ✔
[ba/c1f985] NCHG:PLOT_SIGNIFICANT (example) | 1 of 1 ✔
[a1/87e559] CLIQUES:CALL (example) | 1 of 1 ✔
[26/a2cab8] CLIQUES:PLOT_MAXIMAL_CLIQUE_SIZE_DISTRIBUTION_BY_TAD (cis) | 1 of 1 ✔
[f8/a17299] CLIQUES:PLOT_CLIQUE_SIZE_DISTRIBUTION (cis) | 1 of 1 ✔
Completed at: 07-Jun-2024 16:38:56
Duration : 1m 21s
CPU hours : 0.1
Succeeded : 41
This will create a data/results/ folder with the following files:
- cliques/example_cis_cliques.tsv.gz: TSV with the list of cliques computed from cis significant interactions (i.e. intra-chromosomal interactions).
- cliques/example_cis_domains.bed.gz: BED file with the list of domains part of cliques computed from cis significant interactions. The last column encodes the domain ID.
- nchg/example.filtered.tsv.gz: TSV with the statistically significant interactions detected by NCHG.
- nchg/expected_values_example.cis.h5: HDF5 file with the expected values computed by NCHG.
- plots/cliques/cis_clique_size_distribution*: Plots showing the clique size distribution.
- plots/cliques/cis_tad_max_clique_size_distribution*: Plots showing the maximal clique size distribution.
- plots/nchg/example/example.*.*.png: Plots showing the log ratio computed by NCHG for each chromosome pair analyzed.
- plots/nchg/example/example_cis.png: Plot showing the expected value profile computed by NCHG.
- tads/example_tads.bed.gz: TADs used to generate the list of genomic coordinates to be tested for significance.
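After a run completes, the list above can be used to confirm that nothing is missing from the output folder. The sketch below is illustrative only: the helper names are hypothetical, and generalizing the `example` prefix to an arbitrary sample name is an assumption based on the sample column being described as the output file prefix.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Patterns taken from the file list above, with "example" replaced by a
# placeholder for the sample name.
EXPECTED_TEMPLATES = [
    "cliques/{sample}_cis_cliques.tsv.gz",
    "cliques/{sample}_cis_domains.bed.gz",
    "nchg/{sample}.filtered.tsv.gz",
    "nchg/expected_values_{sample}.cis.h5",
    "plots/cliques/cis_clique_size_distribution*",
    "plots/cliques/cis_tad_max_clique_size_distribution*",
    "plots/nchg/{sample}/{sample}.*.*.png",
    "plots/nchg/{sample}/{sample}_cis.png",
    "tads/{sample}_tads.bed.gz",
]


def list_relative_files(outdir):
    """Return all files under outdir as POSIX-style paths relative to it."""
    root = Path(outdir)
    return [p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()]


def missing_outputs(relative_paths, sample):
    """Return the expected patterns that match none of the given paths."""
    patterns = [template.format(sample=sample) for template in EXPECTED_TEMPLATES]
    return [
        pattern
        for pattern in patterns
        if not any(fnmatchcase(path, pattern) for path in relative_paths)
    ]


if __name__ == "__main__":
    for pattern in missing_outputs(list_relative_files("data/results"), "example"):
        print(f"missing: {pattern}")
```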
The list of pairs of interacting domains can be generated using bin/generate_cliques_bedpe.py:
user@dev:/tmp$ bin/generate_cliques_bedpe.py data/results/cliques/example_cis_domains.bed.gz data/results/cliques/example_cis_cliques.tsv.gz |
chr3 94300000 95850000 chr3 94300000 95850000 CLIQUE_#0
chr3 94300000 95850000 chr3 108150000 108950000 CLIQUE_#0
chr3 94300000 95850000 chr3 116000000 116900000 CLIQUE_#0
chr3 94300000 95850000 chr3 137750000 138500000 CLIQUE_#0
chr3 94300000 95850000 chr3 152100000 153900000 CLIQUE_#0
chr3 108150000 108950000 chr3 94300000 95850000 CLIQUE_#0
chr3 108150000 108950000 chr3 108150000 108950000 CLIQUE_#0
chr3 108150000 108950000 chr3 116000000 116900000 CLIQUE_#0
chr3 108150000 108950000 chr3 137750000 138500000 CLIQUE_#0
chr3 108150000 108950000 chr3 152100000 153900000 CLIQUE_#0
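The excerpt above lists each pair in both orientations and also pairs every domain with itself. When only the distinct domains and distinct pairs per clique are of interest, the output can be condensed with a few lines of Python. This is a sketch under the assumption that every record has the seven whitespace-separated columns shown above (two sets of coordinates followed by the clique ID); the function name `summarize_cliques` is hypothetical.

```python
from collections import defaultdict


def summarize_cliques(lines):
    """Map each clique ID to (distinct domains, distinct unordered pairs).

    Self-pairs are ignored when counting pairs, and A-B is counted once
    even when B-A is also listed.
    """
    domains = defaultdict(set)
    pairs = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        chrom1, start1, end1, chrom2, start2, end2, clique = fields
        first = (chrom1, int(start1), int(end1))
        second = (chrom2, int(start2), int(end2))
        domains[clique].update((first, second))
        if first != second:
            pairs[clique].add(frozenset((first, second)))
    return {clique: (len(domains[clique]), len(pairs[clique])) for clique in domains}


if __name__ == "__main__":
    import sys

    # Example: bin/generate_cliques_bedpe.py ... | python summarize.py
    for clique, (n_domains, n_pairs) in summarize_cliques(sys.stdin).items():
        print(f"{clique}\t{n_domains}\t{n_pairs}")
```

Note that the counts only reflect the records that were read, so a truncated listing yields partial numbers.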
Troubleshooting
If you get permission errors when using -with-docker:
- Pass option -process.containerOptions="--user root" to nextflow run
If you get an error similar to:
Cannot find revision `v0.4.0` -- Make sure that it exists in the remote repository `https://github.com/robomics/call_tad_cliques`
try removing the folder ~/.nextflow/assets/robomics/call_tad_cliques before running the workflow again.
If you are having trouble running the workflow, feel free to reach out by starting a new discussion here.
Bug reports and feature requests can be submitted by opening an issue.