-
Notifications
You must be signed in to change notification settings - Fork 0
3. Run the pipeline
The basic snHiC
pipeline requires only two files:
-
snHiC.snakefile
, containing all the rules that will be run; -
configuration.yaml
file, in which the user can define and customize all the parameters for the different pipeline steps.
Then, if grouped or differential compartment analyses are required by the user, a third file is required. This file will contain the sample names (without the read file suffix and extension, e.g sample.A_R1.fastq.gz --> sample.A) and the group to which they belong. Here follows an example (tab-delimited txt file):
sample | group |
---|---|
DU145_rep1 | DU145 |
DU145_rep2 | DU145 |
PC3_rep1 | PC3 |
PC3_rep2 | PC3 |
To partially avoid unexpected errors during the execution of the pipeline, a so called 'dry-run' is strongly recommended. Indeed, adding a -n
at the end of the snakemake running command will allow snakemake to check that all links and file/parameters dependencies are satisfied before to run the "real" processes. This command will therefore help the debugging process.
snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12 \
-n
If no errors occur, the pipeline can be run with the same command line without the terminal -n
:
snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12
Notice that the absence of errors does not mean that the pipeline will run without any issues; the "dry-run" is only checking whether all the resources are available.
Furthermore, there is the possibility to run the pipeline only partially. An example of usage could be if someone wants to have a look to the fast quality controls (fastQC and multiQC reports) before to perform the alignment. To do that, it is sufficient to run a dry-run (-n
mode), then pick the name of the rule at which you want the pipeline to stop and lastly type the following command:
snakemake \
-s </target/folder>/snHiC/workflow/snHiC.snakefile \
--configfile </target/folder>/snHiC/config/snHiC_config.yaml \
--cores 12 \
--until <stop_rule_name>
Analyses can be re-run from a specific point/step by removing/renaming a specific output folder: snakemake will not find a specific set of output files and all the rules depending on this output will be re-run.
The snakefile are contained all the condition checks and rules (processing steps) that will be performed by the pipeline. In the following schematic mapping the main steps performed by the pipeline are depicted.
Briefly, first of all a quality control (fastQC) of the raw fastq data is performed. Then, bwa-mem2 is used to align the paired-end sequences, but separately, onto the genome of reference. The aligned reads are filtered and used to generate the Hi-C matrices at the lowest resolution using HiCExplorer (J. Wolff et. al, Nuc. Acids Res. 2020).
If necessary, the matrices' bins are merge to generate higher resolution matrices. Further, if required by the user, matrices belonging to the same group are summed and processed in parallel to the "single-sample" ones.
These matrices are then normalized and corrected and used to generate quality control plots as well preform sample correlation analyses.
Finally, TAD and Loops detection can be performed by HiCExplorer, while differential compartment analyses can be run using dcHiC on both individual and grouped samples.
More details on parameters and structure/meaning of the results can be found in the next paragraphs.
All the possible rule names are listed below. The example is based on (for details see the example description):
- Sample: 4
- Groups: 2 (2 samples per group)
- Resolutions: 10, 20, 50, 100, 1000 kb
- Cores provided: 10
- System: HPC (GNU/Linux, x86_64), 165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 (5.4.0-148-generic)
job count min threads max threads
--------------------------------------------------------------------- ------- ------------- -------------
AAA_initialization 1 1 1
A_fastQC_raw 8 1 1
B_multiQC_raw 1 1 1
C_bwa_align 8 5 5
D_generate_restriction_file_and_get_chrSizes 1 1 1
E1_interaction_matrix_generation_at_smallest_resolution 4 5 5
E2_multiQC_report_for_HiC_matrices 1 1 1
E3_merging_interaction_matrix_bins_for_all_resolutions 16 1 1
F1_matrices_normalization 5 1 1
F2_samples_correlation 1 1 1
G1_matrices_correction__diagnosticPlot_and_MAD 20 1 1
G2_matrices_correction__getting_threshold_values 20 1 1
G3_matrices_correction__correction 20 1 1
H1_matrices_format_conversion__cool 1 1 1
H2_matrices_format_conversion__hicpro 1 1 1
I_call_TADs_HiCexplorer 20 2 2
J_plotting_intraChr_distances 5 1 1
L1_sum_matrices_by_group 1 1 1
L2_merging_grouped_interaction_matrix_bins_for_all_resolutions 8 1 1
M_grouped_matrices_normalization 5 1 1
N1_summed_matrices_correction__diagnosticPlot_and_MAD 10 1 1
N2_summed_matrices_correction__getting_threshold_values 10 1 1
N3_summed_matrices_correction__correction 10 1 1
N4_summed_matrices_correction__cool_conversion 1 1 1
N5_summed_matrices_correction__hicpro_conversion 1 1 1
O_call_TADs_on_summed_matrices_HiCexplorer 10 2 2
P_detect_loops_singleSamples_HiCexplorer 8 2 2
Q_detect_loops_groupedSamples_HiCexplorer 4 2 2
R1_detect_compartments_dcHiC_singleSamples__inputFile_all_vs_all 1 1 1
R2_detect_compartments_dcHiC_singleSamples__call_compartments 1 10 10
R3_detect_compartments_dcHiC_singleSamples__bedGraphToBigWig 1 1 1
R4_detect_compartments_dcHiC_singleSamples__call_compartments_combos 1 10 10
R5_detect_compartments_dcHiC_singleSamples__bedGraphToBigWig_combos 1 1 1
S1_detect_compartments_dcHiC_groupedSamples__inputFile_all_vs_all 1 1 1
S2_detect_compartments_dcHiC_groupedSamples__call_compartments 1 10 10
S3_detect_compartments_dcHiC_groupedSamples__bedGraphToBigWig 1 1 1
S4_detect_compartments_dcHiC_groupedSamples__call_compartments_combos 1 10 10
S5_detect_compartments_dcHiC_groupedSamples__bedGraphToBigWig_combos 1 1 1
T_differential_contacts_SELFISH_groupedSamples 5 1 1
U1_stripe_detection_STRIPPEN_singleSamples 4 10 10
U2_stripe_detection_STRIPPEN_groupedSamples 2 10 10
--------------------------------------------------------------------- ------- ------------- -------------
total 222 1 10
--------------------------------------------------------------------- ------- ------------- -------------
The configuration file is a yaml-formatted file containing all the parameters that are passed to different steps of the pipelines such as the directory with the input files, reference genome, etc.
The snHiC configuration file is divided in two sections:
- experiment-specific, with al the parameters that most likely are changing from experiment to experiment;
-
common, with parameters that are quite stable independently of the experiments design. The latter should be changed only for very specific needs.
Hereafter, the meaning of the different parameters is described.
Parameter | Description |
---|---|
runs_directory | The full path to the directory were the input fastq files are contained, e.g. $HOME/HiC/00_runs/ . Importantly, the name of the files, deprived of the read suffix (e.g., _R1/_R2) and file extension (e.g., .fastq.gz) will be used as sample name. |
output_directory | The full path to the folder in which the results should be stored, e.g. "$HOME/HiC_results/" . If not already existing, it will be generated automatically. |
fastq_extension | Default: .fastq.gz . The extension of the input fastq files. This string will be removed from the input file names to obtain the sample names. Examples: ".fastq.gz" , ".fq.gz" , ".fasta" , ".fq" . |
runs_suffix | Default: ["_R1", "_R2"] . A list (python format) with the two reads suffixes corresponding to the read1 and read2 for the paired-end sequencing. |
character_subsitution_dashes_and_points_sample_name | Default "" (nothing). dcHiC does not allow any . or - in the sample names. Thus, this parameter is used to define which character should be used to substitute dots and dashes in the sample name before preforming the compartment analyses. |
chr_filtering_string | To optimize dcHiC compartment computation, low coverage chromosomes should be excluded. Provide a string with the name of the chromosomes to exclude separated by a ` |
genome_fasta | The full path to a fasta-formatted file containing the sequence of there reference genome into which the reads will be aligned. If the index (.fai) file is not present in the same folder, it will be generated by the pipeline during its execution (this process may take long time to be finalized). The reference genomes can be downloaded, for instance, from the UCSC golden path. |
genome_assembly_name | A string with the name of the genome (e.g., "hg19" ). This will be used to download/generate some files. |
matrix_resolution | A numeric list (e.g., [10,20,40,50,100]) indicated all the matrices resolutions that should be generated expressed in kilo-base-pairs (kb). Importantly, each resolution must be a multiple of the lowest one, which must be greater or equal to 1kb. |
generate_bam | Default: False . A logical value (True/False) indicating whether during the generation of the lowest resolution matrices a bam file with the filtered reads should be written. Notably, this process may be significantly time consuming. |
restriction_enzyme | A string indicating the enzyme used to generate the HiC libraries. This parameter is used to identify the restriction sites on the genome. Possible choices are: 'DpnII', 'MboI', 'NlaIII', 'Csp6I', 'CviQI', 'HindIII', 'EcoRI', 'BamHI', 'BglII', 'Sau3AI', 'Arima_4combo'. Other enzymes can be added in the pipeline by modifying the restriction_table at the beginning of the snHiC.Snakefile. |
TAD_caller | Default "HiCExplorer" `. Possible options: "HiCExplorer", "GENOVA". |
groups |
|
loops |
|
compartments |
|
perform_differential_contacts_analyses | Default True . A logical value (True/False) indicating whether perform differential Hi-C contacts analyses among different groups (all combination will be performed) by SELFISH. |
perform_stripes_analyses | Default True . A logical value (True/False) indicating whether detect stripes by Stripenn. |
Parameter | Description |
---|---|
mapQ_cutoff | Default: 15 . All reads with a mapping quality (MAPQ) score lower than this value will be filtered out from the bam files. |
min_distance | Default: 300 bp. Minimum distance between restriction sites: restriction sites that are closer than this distance are merged into one. |
max_distance | Default: 1000 bp. This is equivalent to the maxLibraryInsertSize. More details on this parameter can be found at the hicBuildMatrix documentation page . |
heatmap_color | Default: 'RdBu' . A string indicating the color gradient pattern to use for the correlation heatmaps. This value is passed to matplotlib. Therefore, available options (see matplotlib page for examples) are the following: 'Accent', 'Blues', 'BrBG', 'BuGn', 'BuPu', 'CMRmap', 'Dark2', 'GnBu', 'Greens', 'Greys', 'OrRd', 'Oranges', 'PRGn', 'Paired', 'Pastel1', 'Pastel2', 'PiYG', 'PuBu', 'PuBuGn', 'PuOr', 'PuRd', 'Purples', 'RdBu', 'RdGy', 'RdPu', 'RdYlBu', 'RdYlGn', 'Reds', 'Set1', 'Set2', 'Set3', 'Spectral', 'Wistia', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd', 'afmhot', 'autumn', 'binary', 'bone', 'brg', 'bwr', 'cividis', 'cool', 'coolwarm', 'copper', 'cubehelix', 'flag', 'gist_earth', 'gist_gray', 'gist_heat', 'gist_ncar', 'gist_rainbow', 'gist_stern', 'gist_yarg', 'gnuplot', 'gnuplot2', 'gray', 'hot', 'hsv', 'icefire', 'inferno', 'jet', 'magma', 'mako', 'nipy_spectral', 'ocean', 'pink', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'tab10', 'tab20', 'tab20b', 'tab20c', 'terrain', 'twilight', 'twilight_shifted', 'viridis', 'vlag', 'winter'. |
correlation_method | Default: 'pearson' . Possible choices: 'pearson' , 'spearman' . Method to use for the sample correlation. |
normalization_method | Default: 'smallest' . Possible choices: 'norm_range' (0-to-1 range), 'smallest' (all matrices to the lowest read count of the given matrices), 'multiplicative' . Method to use for the matrices normalization. Values passed to hicNormalize . |
correction_method | Default: "ICE" . Possible choices: 'ICE' , 'KR' . Method to use for the matrices correction. Values passed to hicCorrectMatrix . KR balances a matrix using a fast balancing algorithm introduced by Knight and Ruiz (2012). ICE : iterative correction of a Hi-C matrix (Imakaev et al. 2012, Nature Methods). |
extra_findTAD_parameters | Default: '--thresholdComparisons 0.01' . A string containing any additional parameter, separated by a space (e.g., '--paramA X --paramB Y'), to pass to hicFindTADs. |
hicDetectLoops_params |
|
mustache_params |
|
selfish_params |
|
stripenn_params |
|