Skip to content

3. Run the pipeline

Sebastian Gregoricchio edited this page Jun 12, 2023 · 37 revisions

3.1 Snakemake principles

The basic snHiC pipeline requires only two files:

  • snHiC.snakefile, containing all the rules that will be run;
  • configuration.yaml file, in which the user can define and customize all the parameters for the different pipeline steps.

Then, if grouped or differential compartment analyses are required by the user, a third file is required. This file will contain the sample names (without the read file suffix and extension, e.g sample.A_R1.fastq.gz --> sample.A) and the group to which they belong. Here follows an example (tab-delimited txt file):

sample group
DU145_rep1 DU145
DU145_rep2 DU145
PC3_rep1 PC3
PC3_rep2 PC3

3.1.1 Dry-run and run

To partially avoid unexpected errors during the execution of the pipeline, a so called 'dry-run' is strongly recommended. Indeed, adding a -n at the end of the snakemake running command will allow snakemake to check that all links and file/parameters dependencies are satisfied before to run the "real" processes. This command will therefore help the debugging process.

snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12 \
-n

If no errors occur, the pipeline can be run with the same command line without the terminal -n:

snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12

3.1.2 Partial run

Notice that the absence of errors does not mean that the pipeline will run without any issues; the "dry-run" is only checking whether all the resources are available.
Furthermore, there is the possibility to run the pipeline only partially. An example of usage could be if someone wants to have a look to the fast quality controls (fastQC and multiQC reports) before to perform the alignment. To do that, it is sufficient to run a dry-run (-n mode), then pick the name of the rule at which you want the pipeline to stop and lastly type the following command:

snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12 \
--until <stop_rule_name>

3.1.3 Re-run from a "checkpoint"

Analyses can be re-run from a specific point/step by removing/renaming a specific output folder: snakemake will not find a specific set of output files and all the rules depending on this output will be re-run.



3.2 Snakefile

The snakefile are contained all the condition checks and rules (processing steps) that will be performed by the pipeline. In the following schematic mapping the main steps performed by the pipeline are depicted.
Briefly, first of all a quality control (fastQC) of the raw fastq data is performed. Then, bwa-mem2 is used to align the paired-end sequences, but separately, onto the genome of reference. The aligned reads are filtered and used to generate the Hi-C matrices at the lowest resolution using HiCExplorer (J. Wolff et. al, Nuc. Acids Res. 2020).
If necessary, the matrices' bins are merge to generate higher resolution matrices. Further, if required by the user, matrices belonging to the same group are summed and processed in parallel to the "single-sample" ones.
These matrices are then normalized and corrected and used to generate quality control plots as well preform sample correlation analyses.
Finally, TAD and Loops detection can be performed by HiCExplorer, while differential compartment analyses can be run using dcHiC on both individual and grouped samples.

More details on parameters and structure/meaning of the results can be found in the next paragraphs.

snHiC workflow

All the possible rule names are listed below. The example is based on (for details see the example description):

  • Sample: 4
  • Groups: 2 (2 samples per group)
  • Resolutions: 10, 20, 50, 100, 1000 kb
  • Cores provided: 10
  • System: HPC (GNU/Linux, x86_64), 165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 (5.4.0-148-generic)
job                                                                      count    min threads    max threads
---------------------------------------------------------------------  -------  -------------  -------------
AAA_initialization                                                           1              1              1
A_fastQC_raw                                                                 8              1              1
B_multiQC_raw                                                                1              1              1
C_bwa_align                                                                  8              5              5
D_generate_restriction_file_and_get_chrSizes                                 1              1              1
E1_interaction_matrix_generation_at_smallest_resolution                      4              5              5
E2_multiQC_report_for_HiC_matrices                                           1              1              1
E3_merging_interaction_matrix_bins_for_all_resolutions                      16              1              1
F1_matrices_normalization                                                    5              1              1
F2_samples_correlation                                                       1              1              1
G1_matrices_correction__diagnosticPlot_and_MAD                              20              1              1
G2_matrices_correction__getting_threshold_values                            20              1              1
G3_matrices_correction__correction                                          20              1              1
H1_matrices_format_conversion__cool                                          1              1              1
H2_matrices_format_conversion__hicpro                                        1              1              1
I_call_TADs_HiCexplorer                                                     20              2              2
J_plotting_intraChr_distances                                                5              1              1
L1_sum_matrices_by_group                                                     1              1              1
L2_merging_grouped_interaction_matrix_bins_for_all_resolutions               8              1              1
M_grouped_matrices_normalization                                             5              1              1
N1_summed_matrices_correction__diagnosticPlot_and_MAD                       10              1              1
N2_summed_matrices_correction__getting_threshold_values                     10              1              1
N3_summed_matrices_correction__correction                                   10              1              1
N4_summed_matrices_correction__cool_conversion                               1              1              1
N5_summed_matrices_correction__hicpro_conversion                             1              1              1
O_call_TADs_on_summed_matrices_HiCexplorer                                  10              2              2
P_detect_loops_singleSamples_HiCexplorer                                     8              2              2
Q_detect_loops_groupedSamples_HiCexplorer                                    4              2              2
R1_detect_compartments_dcHiC_singleSamples__inputFile_all_vs_all             1              1              1
R2_detect_compartments_dcHiC_singleSamples__call_compartments                1             10             10
R3_detect_compartments_dcHiC_singleSamples__bedGraphToBigWig                 1              1              1
R4_detect_compartments_dcHiC_singleSamples__call_compartments_combos         1             10             10
R5_detect_compartments_dcHiC_singleSamples__bedGraphToBigWig_combos          1              1              1
S1_detect_compartments_dcHiC_groupedSamples__inputFile_all_vs_all            1              1              1
S2_detect_compartments_dcHiC_groupedSamples__call_compartments               1             10             10
S3_detect_compartments_dcHiC_groupedSamples__bedGraphToBigWig                1              1              1
S4_detect_compartments_dcHiC_groupedSamples__call_compartments_combos        1             10             10
S5_detect_compartments_dcHiC_groupedSamples__bedGraphToBigWig_combos         1              1              1
T_differential_contacts_SELFISH_groupedSamples                               5              1              1
U1_stripe_detection_STRIPPEN_singleSamples                                   4             10             10
U2_stripe_detection_STRIPPEN_groupedSamples                                  2             10             10
---------------------------------------------------------------------  -------  -------------  -------------
total                                                                      222              1             10
---------------------------------------------------------------------  -------  -------------  -------------

3.3 Configuration file

The configuration file is a yaml-formatted file containing all the parameters that are passed to different steps of the pipelines such as the directory with the input files, reference genome, etc.
The snHiC configuration file is divided in two sections:

  • experiment-specific, with al the parameters that most likely are changing from experiment to experiment;
  • common, with parameters that are quite stable independently of the experiments design. The latter should be changed only for very specific needs.
    Hereafter, the meaning of the different parameters is described.

3.3.1 Experiment-specific section

Parameter Description
runs_directory The full path to the directory were the input fastq files are contained, e.g. $HOME/HiC/00_runs/. Importantly, the name of the files, deprived of the read suffix (e.g., _R1/_R2) and file extension (e.g., .fastq.gz) will be used as sample name.
output_directory The full path to the folder in which the results should be stored, e.g. "$HOME/HiC_results/". If not already existing, it will be generated automatically.
fastq_extension Default: .fastq.gz. The extension of the input fastq files. This string will be removed from the input file names to obtain the sample names. Examples: ".fastq.gz", ".fq.gz", ".fasta", ".fq".
runs_suffix Default: ["_R1", "_R2"]. A list (python format) with the two reads suffixes corresponding to the read1 and read2 for the paired-end sequencing.
character_subsitution_dashes_and_points_sample_name Default "" (nothing). dcHiC does not allow any . or - in the sample names. Thus, this parameter is used to define which character should be used to substitute dots and dashes in the sample name before preforming the compartment analyses.
chr_filtering_string To optimize dcHiC compartment computation, low coverage chromosomes should be excluded. Provide a string with the name of the chromosomes to exclude separated by a `
genome_fasta The full path to a fasta-formatted file containing the sequence of there reference genome into which the reads will be aligned. If the index (.fai) file is not present in the same folder, it will be generated by the pipeline during its execution (this process may take long time to be finalized). The reference genomes can be downloaded, for instance, from the UCSC golden path.
genome_assembly_name A string with the name of the genome (e.g., "hg19"). This will be used to download/generate some files.
matrix_resolution A numeric list (e.g., [10,20,40,50,100]) indicated all the matrices resolutions that should be generated expressed in kilo-base-pairs (kb). Importantly, each resolution must be a multiple of the lowest one, which must be greater or equal to 1kb.
generate_bam Default: False. A logical value (True/False) indicating whether during the generation of the lowest resolution matrices a bam file with the filtered reads should be written. Notably, this process may be significantly time consuming.
restriction_enzyme A string indicating the enzyme used to generate the HiC libraries. This parameter is used to identify the restriction sites on the genome. Possible choices are: 'DpnII', 'MboI', 'NlaIII', 'Csp6I', 'CviQI', 'HindIII', 'EcoRI', 'BamHI', 'BglII', 'Sau3AI', 'Arima_4combo'. Other enzymes can be added in the pipeline by modifying the restriction_table at the beginning of the snHiC.Snakefile.
TAD_caller Default "HiCExplorer"`. Possible options: "HiCExplorer", "GENOVA".
groups
  • perform_grouped_analyses: Deafult: True. Logical value to indicate whether to perform grouped analyses.
  • sample_metadata: The path to a tab-delimited txt file containing two columns: sample and the name of the group at which it belongs. This sample names must be indicated without the read file suffix and extension (e.g sample.A_R1.fastq.gz --> sample.A).
loops
  • detect_loops: default True. Logical value to indicate whether to perform loop calling.
  • loop_caller: default "HiCExplorer". Possible choices: "HiCExplorer", "Mustache".
  • max_resolution_loops: default 20kb. An integer indicating the maximum resolution (in kilo base-pairs, kb) at which detect loops.
compartments
  • detect_compartments: Default True. A logical value (True/False) indicating whether differential compartment analyses should be performed by dcHiC. This function requires the parameter sample_metadata.
  • minimal_resolution_compartments: Deafult 5kb (dcHiC has been show to work up to 5kb resolution). A number indicating the minimal resolution needed to run the compartment analyses, in kilo-base-pairs (bp).
  • dcHiC_analyses_type: Default: "cis". A string to indicate whether the compartment analyses will be performed on "cis" or "trans" interaction matrices.
perform_differential_contacts_analyses Default True. A logical value (True/False) indicating whether perform differential Hi-C contacts analyses among different groups (all combination will be performed) by SELFISH.
perform_stripes_analyses Default True. A logical value (True/False) indicating whether detect stripes by Stripenn.



3.3.2 Common section

Parameter Description
mapQ_cutoff Default: 15. All reads with a mapping quality (MAPQ) score lower than this value will be filtered out from the bam files.
min_distance Default: 300bp. Minimum distance between restriction sites: restriction sites that are closer than this distance are merged into one.
max_distance Default: 1000bp. This is equivalent to the maxLibraryInsertSize. More details on this parameter can be found at the hicBuildMatrix documentation page.
heatmap_color Default: 'RdBu'. A string indicating the color gradient pattern to use for the correlation heatmaps. This value is passed to matplotlib. Therefore, available options (see matplotlib page for examples) are the following: 'Accent', 'Blues', 'BrBG', 'BuGn', 'BuPu', 'CMRmap', 'Dark2', 'GnBu', 'Greens', 'Greys', 'OrRd', 'Oranges', 'PRGn', 'Paired', 'Pastel1', 'Pastel2', 'PiYG', 'PuBu', 'PuBuGn', 'PuOr', 'PuRd', 'Purples', 'RdBu', 'RdGy', 'RdPu', 'RdYlBu', 'RdYlGn', 'Reds', 'Set1', 'Set2', 'Set3', 'Spectral', 'Wistia', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd', 'afmhot', 'autumn', 'binary', 'bone', 'brg', 'bwr', 'cividis', 'cool', 'coolwarm', 'copper', 'cubehelix', 'flag', 'gist_earth', 'gist_gray', 'gist_heat', 'gist_ncar', 'gist_rainbow', 'gist_stern', 'gist_yarg', 'gnuplot', 'gnuplot2', 'gray', 'hot', 'hsv', 'icefire', 'inferno', 'jet', 'magma', 'mako', 'nipy_spectral', 'ocean', 'pink', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'tab10', 'tab20', 'tab20b', 'tab20c', 'terrain', 'twilight', 'twilight_shifted', 'viridis', 'vlag', 'winter'.
correlation_method Default: 'pearson'. Possible choices: 'pearson', 'spearman'. Method to use for the sample correlation.
normalization_method Default: 'smallest'. Possible choices: 'norm_range' (0-to-1 range), 'smallest' (all matrices to the lowest read count of the given matrices), 'multiplicative'. Method to use for the matrices normalization. Values passed to hicNormalize.
correction_method Default: "ICE". Possible choices: 'ICE', 'KR'. Method to use for the matrices correction. Values passed to hicCorrectMatrix. KR balances a matrix using a fast balancing algorithm introduced by Knight and Ruiz (2012). ICE: iterative correction of a Hi-C matrix (Imakaev et al. 2012, Nature Methods).
extra_findTAD_parameters Default: '--thresholdComparisons 0.01'. A string containing any additional parameter, separated by a space (e.g., '--paramA X --paramB Y'), to pass to hicFindTADs.
hicDetectLoops_params
  • maxLoopDistance: default: 2000000bp (2Mb). Maximum loop distance, in base-pairs (bp), to be used to detect loops by hicDetectLoops.
  • loop_windowSize: default: 10kb. The window size for the neighborhood region the peak is located in. All values from this region (exclude the values from the peak region) are tested against the peak region for significant difference. For more info see hicDetectLoops.
  • loop_peakWidth: default: 6kb. The width of the peak region in bins. The square around the peak will include (2 * peakWidth)^2 bins. For more info see hicDetectLoops.
  • loop_pValuePreselection: default: 0.05. Only candidates with p-values less the given threshold will be considered as candidates. For each genomic distance a negative binomial distribution is fitted and for each pixel a p-value given by the cumulative density function is given. This does NOT influence the p-value for the neighborhood testing. For more info see hicDetectLoops.
  • loop_pValue: default: 0.05. Rejection level for Anderson-Darling or Wilcoxon-rank sum test for H0. H0 is peak region and background have the same distribution. For more info see hicDetectLoops.
mustache_params
  • pThreshold: Default 0.1. P-value threshold for an interaction to be reported in the final output file.
  • extra_params: Default "". A string indicating any other optional parameter to pass to Mustache (for details visit Moustache GitHub page).
selfish_params
  • qValue_threshold: Default 0.05. Q-value threshold for filtering of the output file.
  • qValue_threshold: Default "". A string indicating any other optional parameter to pass to SELFISH (for details visit SELFISH GitHub page).
stripenn_params
  • max_stripes_resolution: Default 20kb. A number to indicate the maximum resolution at which compute the stripes detection (higher resolutions analyses have an increased chance to fail). The number is expressed in kilo base-pairs (kb).
  • pValue_threshold: default 0.1. P-value cutoff for stripe significance.
  • extra_params: Default "". A string indicating any other optional parameter to pass to Mustache (for details visit Stripenn GitHub page, compute function).