Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comethylation may not work with soft-clipped reads or reads containing indels #45

Closed
PeteHaitch opened this issue Jan 14, 2014 · 10 comments
Assignees
Labels

Comments

@PeteHaitch
Copy link
Owner

comethylation. uses read.alen to infer read length. This may give odd results for soft-clipped reads or reads containing indels. I need to investigate which of the length methods is most appropriate: alen, inferred_length, qlen or rlen.

To be on the safe side, the initial public release of comethylation.py will simply skip reads containing indels.

@ghost ghost assigned PeteHaitch Jan 14, 2014
@PeteHaitch PeteHaitch mentioned this issue Jan 14, 2014
@PeteHaitch
Copy link
Owner Author

Bismark does not allow soft-clipping of reads.

Bismark with Bowtie1
Bowtie1 does not do soft-clipping.

Bismark with Bowtie2
While Bowtie2 can do soft-clipping, Bismark does not permit Bowtie2 to soft-clip reads, i.e. it runs Bowtie2 with the --end-to-end flag and not the --local flag.
From the Bismark (v10.1) manual:

Bismark limits Bowtie2 to only perform end-to-end alignments, i.e. searches for alignments involving all read characters (also called untrimmed or unclipped alignments).

@PeteHaitch
Copy link
Owner Author

This is a (non-bisulfite) read aligned with bwa mem that is soft-clipped.

DJTPB5M1:306:D278CACXX:4:1311:16397:99259       147     chr1    3002039 60      90M10S  =       3001933 -196    TAATTTTGTTAAAGATATTTGCTGGTCCTTTAAGTTGAAAATCTTCATTCTCACCTACTCTTGTTGTCCATATGTTTGGGCTTTTCATTGTGTCCTAGAT    BB@ABBCCBAA@CC@@BBBACCABAACC??>?A@AAB?AA?ACBBC>ABAABACB>@A@BBAA@A@ACC?@?A@@@ABBCC@A@AC?AAAAA@CB>A>=?    BD:Z:LMMEIQKPPNGPLPMMNFNOROOOLNPNEMKMLLMMKDDLMJMLJNLLJIJGKNNMKIJMLGLMGKMNNKKMGLDMMIOQMDDLJNLNHHHMORRQKOLL       PG:Z:MarkDuplicates.8   RG:Z:277_D278CACXX_GTGAAA_L004  BI:Z:RNRJNUTSTUNTSTUTTKSRTRUPQSQQJQRRQPRSPKKNOPQPOQRPPPOOSQPRPPPQRPORPPROPSQRPPIQSNQRQIIONPQRQOQQSQPSQRPP       NM:i:0  MQ:i:60 AS:i:90 XS:i:22

The last 10 cycles of the read have been soft-clipped:

>>> read.cigarstring
'90M10S'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen.
read.alen

>>> read.alen
90

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
90

read.rlen

>>> read.rlen
100

@PeteHaitch
Copy link
Owner Author

This is a (non-bisulfite) read aligned with bwa mem that contains an insertion.

DJTPB5M1:306:D278CACXX:3:2209:2706:73074        147     chr1    3089005 0       70M1I29M        =       3088915 -189    GTCCCCTATCTGATTTAGGATAGGTAAAGATCCTTTCCCAATCTGTTGNTGGTCTTTTTGTCTTATTGACAAGTGTCTTTTGCCTTACAGAAGCTTTGCA    DDDDEEAACECD@AA?BABBACB@?@@CB@ACC@@ACCC??AAABAA?#AB@AB@@@@A@AC@@@@AB@C?A@A@ABA@@ABCBA?@BBCABCC??@>??    BD:Z:LNJJTSOQMQPQMFOOPPOLNOONLENKOMMOMDLMINMLMJNGLLKKKNNKILCDDLGJILMKKLMNGMMLGGKJLCDLNOOMMMGNKLNQRPGPOPLL       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:QSNNUTUSSTUTTLTTSRSTSSPRSKSQRPSQQJPSNPQNPPQPPRPPPSNQOPIIIQPQPQQQQQRROPQPNOPPQIIQPQPPQQOQQPRTSQJSQRPP       NM:i:4  MQ:i:0  AS:i:80 XS:i:90

There is a one base insertion at cycle 71 in the read.

>>> read.cigarstring
'70M1I29M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen.
read.alen

>>> read.alen
99

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
100

read.rlen

>>> read.rlen
100

@PeteHaitch
Copy link
Owner Author

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion and soft-clipped cycles.

DJTPB5M1:306:D278CACXX:3:2307:21152:57403       83      chr1    3006697 60      5S60M1D35M      =       3006563 -230    GAAATTGGAAGGGCCGGGAGGGAGGAAAAGAAAAAAATTATGCAATTATATTTTAATATCCAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAA    ####################################################################?A?A@BAA???A@ABBCA?@@;BD@@BBB?@@    BD:Z:LLLLPPOOONNMMMMMLLLLLLLLLLLKKKKKKKKKKKKKKKKJKKKKKKKKJJJJJJJJJJJJKKKKDDDDDDDCCDDDDMJJLNLMMOPNOQRNOLLL       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:PPPPTSSSSSRRRRRRQQQQQQQPPPPPPOPPPPPOPPPPOPPOPPPPPPPPPOPOOOOOOOOOOOOOJJJJJJJJJJJJJQPOPQQQROQPNTUPQPPP       NM:i:8  MQ:i:60 AS:i:53 XS:i:32

The first 5 cycles are soft-clipped and there is a deletion at cycle 66 of the read.

>>> read.cigarstring
'5S60M1D35M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen.
read.alen

>>> read.alen
96

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
95

read.rlen

>>> read.rlen
100

@PeteHaitch
Copy link
Owner Author

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion.

DJTPB5M1:306:D278CACXX:5:2216:11367:15711       83      chr1    3006733 70      24M1D76M        =       3006639 -195    GAAATTATATTGTAATACCAAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAAAAAAAAAGAAAGAAAAAAGAAAAAGAAAAGAAAAAGAAAAA    ####################AB=;088BBBBBBBBBBBAAACCCB?BA:BB??BBBA????????BC??AB?????AB????AB???BCAAA@BB@@>@A    OC:Z:17M1D83M   BD:Z:LLLLPPOONNNNMMMMLLLLEEEEEEDDDDDDDDDDDDDCMJJLMLLKMNKLLMHMJCCCCCCDMJKDMJKDDDDLIJDDDMJKDENKMFFGPNOIEELL       PG:Z:MarkDuplicates.7   RG:Z:277_D278CACXX_GTGAAA_L005  BI:Z:PPPPTSSSSSSRRRQQQQQQLLLKKKKKKKJKKKKJKKKKRQPQQRQROPPMSSNPOJJJJJJJQPOJQPOJJJJQPOJJJQPOJJQPOKKKRRQLKKPP       NM:i:5  MQ:i:60 AS:i:81 XS:i:33

There is a deletion at cycle 25 of the read.

>>> read.cigarstring
'24M1D76M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen.
read.alen

>>> read.alen
101

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
100

read.rlen

>>> read.rlen
100

@PeteHaitch
Copy link
Owner Author

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion, an insertion and soft-clipped cycles.

DJTPB5M1:306:D278CACXX:3:1101:8598:39906        163     chr1    4995124 28      40S22M4I12M1D22M        =       4995234 152     CAATGAGATCCTTGCAGCCAGGAATCCCATTGCATTCTCCTGGAAAGAATGTTTGATGAAATTTAAAAAAAAAAAAATCTTGCCTTGCTGGGGATTGCAA    ?@=?A@CA@CBBACBABBBABCA@?BBBA??CBA??BABBACCA@@BA@?BAAAB@?BA@@????@@?@?B@BAAA?ACCA@DDBADDCEEEEBABBCDC    BD:Z:LLMLRMLLOOOOMMONNPONNOLKKNNINMLMONMLKJIMMNNMLCLILKNGLDMJMNJLDKLDKMDDDDDEEEEEELOKNNQQPOOQRQQLMQQQNPOM       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:PPRRQOPPPRQRRPQPPRQSPPRPRQPNSRMPRPRMOPPQRQOROIPOPQOPQKQOPPOPJRNKRRJJJJJJJKJKKSSRSRSSTTSTVSSQQVSRQRQR       NM:i:6  MQ:i:28 AS:i:34 XS:i:33

The first 40 cycles are soft-clipped, there is an insertion at cycle 63 and a deletion at cycle 79 of the read.

>>> read.cigarstring
'40S22M4I12M1D22M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen.
read.alen

>>> read.alen
57

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
60

read.rlen

>>> read.rlen
100

@PeteHaitch
Copy link
Owner Author

Another point to check when using soft-clipped reads will be the definition of n_overlap, which is currently:

n_overlap = read_1.alen + read_2.alen - abs(read_1.tlen)

@PeteHaitch
Copy link
Owner Author

comethylation will only process a read if it contains M in the CIGAR string, i.e. even = and X are disallowed.

@PeteHaitch
Copy link
Owner Author

comethylation correctly handles reads with indels as of v1.3.0 and should also correctly handle soft-clipped reads. However, since Bismark does not allow soft-clipping the latter needs to be more thoroughly tested once I have BS-seq data aligned with software that allows soft-clips (such as bwa-meth).

@PeteHaitch
Copy link
Owner Author

Closing as comethylation now supports indels and, in principle, soft-clipped reads. As support of soft-clipped reads is only in principle I will open a new issue to address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant