pb-CpG-tools provides the tool aligned_bam_to_cpg_scores
, which can generate site methylation probabilities from
mapped HiFi reads, including probabilities for each haplotype when reads are haplotagged. The site probabilities are
reported as both bed and bigwig file outputs. The example below from HG002 shows how these bigwig tracks can be used to
visualize both combined and haplotype-specific methylation probabilities in IGV
To use aligned_bam_to_cpg_scores
download the latest release tarball compiled for 64-bit linux platforms on the
github release channel, then unpack the tar file.
As an example, the v2.3.2 release can be obtained as follows:
wget https://github.com/PacificBiosciences/pb-CpG-tools/releases/download/v2.3.2/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz
tar -xzf pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz
# Run help option to test binary and see latest usage details:
pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores --help
aligned_bam_to_cpg_scores
includes a number of ways to summarize site propabilities for 5mC methylation,
detailed below. The recommended default workflow will use the model
pileup mode and denovo
modsites mode.
Continuing from the example above, the script below runs the tool on a mapped WGS bam for HG002:
pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores \
--bam HG002.hg38.pbmm2.bam \
--output-prefix HG002.hg38.pbmm2 \
--model pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite \
--threads 8
See the cmdline usage help for the full list of other commandline options:
aligned_bam_to_cpg_scores --help
The input alignment file should contain mapped reads in BAM or CRAM format and have an associated index file (.bai
,
.csi
or .crai
). 5mC base modification values are read from the MM
and ML
auxiliary tags which encode base
modifications and confidence values. These tags are further described in the
SAM tag specification document.
These are the default tags generated by jasmine for 5mC modifications
on HiFi reads.
Methylated reads must be mapped using a method that does not hard-clip supplementary alignments, such as:
- pbmm2 (recommended)
- minimap2 with the
-Y
flag.
The reason for this restriction is that any process which shortens reads in the BAM file without also recalculating
the MM
and ML
auxiliary tags will lead to an invalid decoding of the methylation values.
Note the following additional requirements on the input alignment file:
- CRAM format is supported but the corresponding fasta reference file must be specified in this case.
- Multiple base modification types may be present in the file's
MM
andML
tags, but only theC+m
modification key will be processed, other modification keys such asA+a
will be ignored. 5mC modifications encoded with a multiple-modification key, such asC+mh
, cannot be used.
There are bed (.bed
) and bigwig (.bw
) files generated for the complete read set and each separate
haplotype (when haplotype tags are available in the bam). The bed files are a text format that is easier to review and
contains additional site data, the bigwig files are an indexed binary format designed to be loaded into IGV for
visualization.
The following 2 files are always generated:
[output-prefix].combined.bed
[output-prefix].combined.bw
If haplotype information is present in the input alignment file, an additional 4 output files are expected:
[output-prefix].hap1.bed
[output-prefix].hap1.bw
[output-prefix].hap2.bed
[output-prefix].hap2.bw
The bed file columns will differ between the model
and count
pileup methods, but both share the first six columns:
- reference name
- start coordinate
- end coordinate
- modification score
- haplotype
- coverage
Here the modification score is the methylation probability at the site expressed as a percentage. See the section
on --pileup-mode
below for a description of how this is calculated for each mode.
Additional columns in the bed output file depend on which pileup mode is selected. For the count
pileup mode,
four additional columns are present:
- modified site count
- unmodified site count
- average mod score
- average unmod score
For the model
pileup mode, three additional columns are present:
- estimated modified site count (extrapolated from model modification probability)
- estimated unmodified site count (extrapolated from model modification probability)
- discretized modification probability (calculated from estimated mod/unmod site counts)
The bigwig files are an indexed binary format which contain columns 1-4 listed above, and are preferred for loading 5mC tracks in IGV.
The --pileup-mode
argument selects the modification probability calculation
model
: (default) This is the recommended pileup mode. It uses distributions of modification scores and a machine-learning model to calculate the modification probabilities across CpG sites. When using this option a path to a tflite model file must be provided with the--model
argument. The current recommended model ispileup_calling_model.v1.tflite
found in the models directory.count
: For a given site, all bases with a 5mC modification probability greater than 0.5 are classified as modified, and all other bases are classified as unmodified. The methylation probability for a site is set to the proportion of bases classified as modified. In addition, the count and average modification probability (expressed as a percentage), are provided for both the modified and unmodified sets of bases in the bed file output.
The --modsites-mode
argument determines which sites will be reported
denovo
: (default) This option will identify and output all CG sites found in the consensus sequence from the reads in the pileup. This allows reporting of CG sites with zero modification probability. This mode does not ensure that the reference also displays a CG site (e.g., there could be sequence mismatches between the reads and reference).reference
: This option will identify and output all CG sites found in the reference sequences. This allows reporting of CG sites with zero modification probability. This mode does not ensure that aligned reads also display a CG site (e.g., there could be sequence mismatches between the reads and reference).
Using the --hap-tag
flag allows an arbitrary SAM tag to be used to identify haplotypes, rather than the default HP
tag. The haplotype values must be 0
, 1
, and 2
, where 0
is not assigned/ambiguous.
An aligned BAM file containing HiFi reads with 5mC tags (HG002.GRCh38.haplotagged.bam
) is freely available for
download: https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/
The sample is HG002/NA24385 from the Human Pangenome Reference Consortium HG002 Data Freeze v1.0, and is aligned to GRCh38. There are also four unaligned bam files containing the HiFi reads.
Given a human ~30x WGS aligned BAM, such as the example data above, it should take approximately 40 minutes to complete default mode pileup track generation on 8 threads using 6 Gb of memory. Runtime should scale approximately linearly with thread count until limited by the filesystem's ability to read the bam file in parallel. Peak memory demand should be approximately constant for a human sample at 6 Gb regardless of thread count.
5mC site analysis was previously provided by the aligned_bam_to_cpg_scores.py
python script in version 1.x releases.
The version 2 release of aligned_bam_to cpg_scores
should have approximate feature parity with the previous script,
although the command-line details have changed.
Given the same bam file and selected output modes, version 2 results should logically match the version 1 script, however the results will contain a number of minor expected differences:
- Floating point thresholds and rounding differences lead to a small number of scoring differences.
- In
denovo
modsites mode, tie-breaking details of the python implementation lead to a very small number of marginallydenovo
detected CpG not being included in this version. - The version 1 script output has some minor artifacts at its parallel 'chunksize' boundaries. These issues do not occur in version 2.
- In count pileup mode, the bam ML tag values are translated into probabilities by taking bin mid-points, so all input probabilities are shifted by +(1/512) compared to version 1, which has a corresponding influence on the summarized site output.
The python version of the code is deprecated, but if needed the final version of this script before the version 2 update is available here.
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.