Skip to content

DNA methylation analysis pipeline for reduced representation bissulfite sequencing data

License

Notifications You must be signed in to change notification settings

AnaValente/DNA-methylation-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DNA-methylation-analysis

DOI

This RRBS Nextflow pipeline was created to discover the genes associated with differentially methylated regions from the CpG methylation patterns using MethylDackel and Metilene.

The pipeline inputs BAM files, and outputs multiple txt and bedGraph files (according to the number of samples):

  • Per base methylation metrics (.bedGraph)
  • Differentially methylated regions (.bedGraph)
  • Correlation matrix and PCA (.png)
  • Heatmap with signature differences between the controls and samples (.png)
  • Genomic distribution across the hg38 reference genome of CpGs with different methylation frequencies between samples and controls (.png)
  • Genomic distribution across the hg38 reference genome of differentially methylated regions (.png)
  • Closest RefSeq genes (version from 2024-09-11) to the differentially methylated regions (.txt/.bedGraph)
  • Venn diagram of the closest genes (only if two or more samples were used as input) (.png)

image

Note

This pipeline uses by default the hg38 reference genome

Other reference genomes are available to use, but the genomic regions output will be skipped

Install conda environment

To use this pipeline you need to have installed conda and Nextflow.

git clone https://github.com/AnaValente/DNA-methylation-analysis/
cd DNA-methylation-analysis
conda env create -f methylation_env.yml
conda activate methylation

Usage

Mandatory inputs:

  • --files            Path to scripts and samples folder
  • --samples         [String] sample names separated by comma (always write the control name first!)
  • --replicates    [Integer] number of sample replicates
  • --genome           Path to the hg38 or other reference genome file (.fa.gz) (available in: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz)
  • --refseqGenes Path to the refseq genes file in BED format (version from 2024-09-11 can be found in scripts folder for hg38 genome)

Optional inputs:

  • --concat       Option to concatenate BAM files from different runs
  • --cellTpm     Optional file containing two collumns, one with gene names and the other with expression levels in transcripts per million (TPM) for a cell line or cell type identical or similar to the cells under study (available in: https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770/Results) for gene name filtering
  • --noHg38    Option to skip genomic regions output if not using hg38 reference genome
  • --cutoffRegions    [Integer] cutoff (from 1 to 100) for the difference between samples methylation frequency vs control methylation frequency for genomic annotations (default: 75)
  • --cutoffHeatmap    [Integer] cutoff (from 1 to 100) for the difference between samples methylation frequency vs control methylation frequency for clustering analysis (default: 100)
  • --inclusionBed    Optional BED file containing regions of interest (A BED file containing CpG islands regions for hg38 reference genome can be found in scripts folder)

Important

  • All samples and additional files must be placed in the scripts folder

  • BAM files should start with the samples name provided in the –-samples input, followed by replicate number

  • If there are no replicates, input the number 1

  • Example:

    • File names: Control_rep1_library.bam; Sample1_rep1_library.bam
    • Pipeline : --samples "Control","Sample1" --replicates 1

Examples

Example

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome hg38.fa.gz --refseqGenes RefSeq_genes.bed

Example with BAM concatenation

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome hg38.fa.gz --refseqGenes RefSeq_genes.bed --concat

Example with genes filtered by file and regions of interest

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome hg38.fa.gz --refseqGenes RefSeq_genes.bed --cellTpm E-MTAB-2770-query-results.tsv --inclusionBed CpGIslands_mod.bed

Example with different cutoffs

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome hg38.fa.gz --refseqGenes RefSeq_genes.bed --cutoffRegions 50 --cutoffHeatmap 75