Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nonCG, single context reporting #102

Closed
amvondras opened this issue Jan 27, 2022 · 5 comments
Closed

nonCG, single context reporting #102

amvondras opened this issue Jan 27, 2022 · 5 comments

Comments

@amvondras
Copy link

I'm using methtuple (v1.6.0) with the —mt flag, ex.

$MT --gzip --nfff --mt CG -m 2  ${i}.bam  
$MT --gzip --nfff --mt CHG -m 2  ${i}.bam  
$MT --gzip --nfff --mt CHH -m 2  ${i}.bam  

I noticed that —mt CG is outputting CG tuples, ex. CS06.h1.CG.2.tsv.gz

chr	strand	pos1	pos2	MM	MU	UM	UU
VITVvi_vCabFran04_v1.1.hap1.chr01	+	240	251	1	0	0	0 ***
VITVvi_vCabFran04_v1.1.hap1.chr01	+	251	259	1	0	0	0 ***
VITVvi_vCabFran04_v1.1.hap1.chr01	+	259	278	1	0	0	0
VITVvi_vCabFran04_v1.1.hap1.chr01	+	278	290	1	0	0	0

but —mt CHH is outputting CHH and CG, ex. CS06.h1.CG_CHH.2.tsv.gz

chr	strand	pos1	pos2	MM	MU	UM	UU
VITVvi_vCabFran04_v1.1.hap1.chr01	+	234	235	0	0	0	1
VITVvi_vCabFran04_v1.1.hap1.chr01	+	235	240	0	0	1	0
VITVvi_vCabFran04_v1.1.hap1.chr01	+	240	251	1	0	0	0 ***
VITVvi_vCabFran04_v1.1.hap1.chr01	+	251	259	1	0	0	0 ***
VITVvi_vCabFran04_v1.1.hap1.chr01	+	259	274	0	1	0	0
VITVvi_vCabFran04_v1.1.hap1.chr01	+	274	276	0	0	0	1

Am I making a use or conceptual error? I had thought —mt CHH would only output CHH tuples.

@PeteHaitch
Copy link
Owner

PeteHaitch commented Feb 10, 2022

Hi @amvondras,

Thanks for the report and my apologies for it taking me a while to reply.
You're correct in your diagnosis, this is in fact a long-standing bug in the methtuple that I never noticed! (or if I did then I've forgotten the details ...)

This is how the option --mt is parsed:

Options.add_argument('--mt', '--methylation-type',
choices = ['CG', 'CHG', 'CHH', 'CNN'],
default = ['CG'],
action = 'append',
help = 'The methylation type. Multiple methylation types may be analysed jointly by repeated use of this argument, e.g., --methylation-type CG --methylation-type CHG')

You can see that the default value is CG and that if the user supplies --mt then it is appended.
My intention there was to allow a user to specify multiple methylation loci contexts in the one call, e.g., --mt CHG --mt CHH.
However, it turns out that argparse always includes the default, e.g. running --mt CHH is equivalent to running --mt CG --mt CHH, which is not what I intended!
This is documented behaviour for argparse but I didn't know that (it's a long time since I wrote any Python code and even then I was a bit of a novice); it seems to be due to the issue described in https://bugs.python.org/issue16399.

So far, the suggested ways to deal with this seem to be setting the default value to None and then doing some post-processing to make it such that this means CG methylation loci are analysed (e.g., https://stackoverflow.com/a/43663319).
I'm working on a patch that implements this.
In the meantime, can you post-hoc filter out CG loci from your CG_CHH output files?
Sorry for the inconvenience.

reprex

(for my own records; note the Finished extracting CG/CHH 2-tuples. in the output)

pip install methtuple
git clone git@github.com:PeteHaitch/methtuple.git
# This is v1.6.0
peter@PC1331:~/GitHub/methtuple (master)*$ ~/.local/bin/methtuple -m 2 --methylation-type CHH data/se_directional.fq.gz_bismark_bt2.bam
[E::idx_find_and_load] Could not retrieve index file for 'data/se_directional.fq.gz_bismark_bt2.bam'
methtuple (v1.5.3)

Input BAM file = data/se_directional.fq.gz_bismark_bt2.bam
Output file of CG/CHH 2-tuples = data/se_directional.fq.gz_bismark_bt2.CG_CHH.2.tsv
Reads that fail to pass QC filters will be written to = data/se_directional.fq.gz_bismark_bt2.reads_that_failed_QC.txt 

Assuming quality scores are Phred33

Assuming BAM file was created with Bismark version >= 0.8.3.

Ignoring improper read-pairs
Using all positions of each read (if data are single-end) or of each read_1 (if data are paired-end).
Using all positions of each read_2 (if data are paired-end).
Ignoring methylation calls with base-quality less than 0
Ignoring reads with mapQ less than 0
Ignoring any overlapping positions of paired-end reads where the XM-tags disagree and counting once the remaining overlapping positions.

Creating CG/CHH 2-tuples
WARNING: 2-tuples may still have intervening methylation loci (i.e. NIC > 0). Such 2-tuples generally occur in paired-end reads with non-overlapping mates but can also be caused by filtering methylation calls by base quality, read-position, etc. You may wish to post-hoc filter 2-tuples with NIC > 0.

Verified that the XR-, XG- and XM-tags are set for the first mapped read.

Finished extracting CG/CHH 2-tuples.
Now writing output to data/se_directional.fq.gz_bismark_bt2.CG_CHH.2.tsv ...

Summary of the number of DNA fragments processed by methtuple
Number of DNA fragments in file = 8266
Number of DNA fragments skipped due to being marked as PCR duplicates = 0
Number of DNA fragments skipped due to failing the --min-mapq filter = 0
Number of DNA fragments skipped due to the read or its mate being unmapped = 0
Number of DNA fragments skipped due to the read-pair being improperly paired = 0
Number of DNA fragments skipped due to mates mapping to different chromosomes = 0
Number of DNA fragments skipped due to a complicated CIGAR in the read or its mate = 0
Number of DNA fragments informative for CG/CHH 2-tuples = 8254 (99.9% of total fragments)

Writing histogram with the number of CG/CHH methylation loci per DNA fragment that passed QC filters to data/se_directional.fq.gz_bismark_bt2.CG_CHH_per_read.hist ...

@PeteHaitch
Copy link
Owner

In the meantime, can you post-hoc filter out CG loci from your CG_CHH output files?

I've realised this probably won't give you what you want because the way the tuples are created in v1.6.0 means that the tuples include those that are comprised of a CG and a CHH, which I think will mean it 'misses' some tuples of the same type.
E.g., (CHH, CG, CHH) in the reference will create 2-tuples: (CHH, CG) and (CG, CHH) but not (CHH, CHH).

@PeteHaitch
Copy link
Owner

I'm working on a patch that implements this.

This is currently available on https://github.com/PeteHaitch/methtuple/tree/issue-102
I need to do a little more testing and then, once its working, remember how to make it available via PyPI ...

PeteHaitch added a commit that referenced this issue Feb 10, 2022
@PeteHaitch
Copy link
Owner

Hi @amvondras,

I've made a new version of methtuple (v1.7.0) that fixes the issue.
However, in doing so I've (inadvertanly) broken compatibility with Python 2.7.
I'm strongly inclined to just remove any support for Python 2 since it reached end-of-life on 2020-01-01 (https://www.python.org/downloads/) and because I don't have much time to support methtuple.
Are you able to use Python3 and install methtuple (v1.7.0)?

Cheers,
Pete

@amvondras
Copy link
Author

new version of methtuple (v1.7.0)

I will try this using Python3 and report back. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants