Support CIGARs with >65535 operations in BAM files #227
SAMv1.tex
Outdated
This workaround is applied to BAM files \emph{only}. SAM and CRAM files are not
affected. If tag {\tt CG} is present, BAM parsing libraries are expected to
seamlessly update {\sf n\_cigar\_op} and {\sf cigar} with the real {\sf CIGAR}
stored in the {\tt CG} tag.}
Plus "and remove the now-redundant CG tag"?
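The reader-side behaviour proposed in this hunk, together with the suggested removal of the tag, might look roughly like the sketch below. It is only an illustration: the record is a hypothetical plain dict (keys `n_cigar_op`, `cigar`, `tags`), not the data structure of any real BAM library, and it applies the rule exactly as worded at this point in the thread (act whenever `CG` is present).

```python
def restore_long_cigar(record):
    """Return a copy of `record` with the real CIGAR moved out of the CG tag.

    `record` is a hypothetical dict with keys 'n_cigar_op', 'cigar'
    (list of BAM-encoded uint32 values) and 'tags' (dict of tag -> value).
    """
    real_cigar = record["tags"].get("CG")
    if real_cigar is None:
        # No CG tag: the cigar field already holds the real CIGAR.
        return record

    restored = dict(record)
    restored["cigar"] = list(real_cigar)
    restored["n_cigar_op"] = len(real_cigar)
    # Drop the now-redundant CG tag, keeping every other tag untouched.
    restored["tags"] = {k: v for k, v in record["tags"].items() if k != "CG"}
    return restored
```

Returning a copy rather than mutating in place keeps the sketch easy to test; a real parser would more likely rewrite the record buffer directly.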
SAMtags.tex
Outdated
@@ -58,6 +58,7 @@ \section{Standard tags}
{\tt BC} & Z & Barcode sequence \\
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CG} & B,I & BAM-only tag to store the real {\sf CIGAR} if it contains $>$65535 operations\\
I think that it should be allowed to leave the CG tag in a SAM or a CRAM. The "recommended practice" can talk about the fact that only BAM needs it, but there's no reason to complicate things in the spec by making it illegal in anything but BAM.
Sure, point taken. How about "Intended to store real CIGAR if it contains >65535 operations"? I would like to choose wording such that this tag won't be abused to frequently store normal CIGARs.
👍
We intend to use CG over CIGAR in bam only if we have more than 64k, therefore the phrase "intended to store the real CIGAR..." is misleading. We have no explicit intentions of deliberately using this for all formats.
However, when all is said and done, we need to decide what we wish to do when we see a CG tag in SAM or in CRAM. If we can decide on the behaviour then we can write the spec correctly to indicate this.
Do we ignore it and honour the real CIGAR, do we complain (it's an indication of using an out-dated piece of software that didn't correctly translate back), or do we try to patch the situation up and migrate CG to the cigar field once more?
My gut feeling is ignoring it is easiest and cleanest, but I could see a reason why issuing a warning may be useful.
I have updated this PR to encode long cigar with fake cigar Do we also want to add a
I don't understand the CG:Y/N header... how can we tell in advance if there will be records with CG tags?
You can't. I guess the intended use is that a user needs to ask the mapper to output a @HD CG:Y if he/she expects long-cigar alignment. This tag has no effect if there are no long cigars. However, if there is a long cigar and there is no CG:Y, samtools will throw an error. One possible scenario, though, is that everyone might be outputting CG:Y regardless of long cigars.
that will be useless then...
Tools that don't know how to deal with long cigars won't be parsing and checking the CG:Y header. Tools that do know how to deal with long cigars don't need a header to inform them of their presence. I forget now why it came up, but apparently in response to something I said.
Hmm. I'm tempted however to suggest that we ought to bump the SAM version number, even though this is technically a BAM only change. Why? Because there is no BAM version number other than the sledgehammer complete format change and there is the risk that this BAM only change will leak into SAM anyway, hence having a fixed version where we know it can occur (or vice versa, where we know it cannot) acts in a similar way to the CG header tag, but is more useful IMO. Plus, like it or not, the BAM and SAM formats have been inextricably linked in the same document since day 1.
I believe
I would defer the decision on CG:Y to another PR. It is not essential, but affects more users and is likely to require lengthy discussions on the precise behavior and implementation. We should merge this PR, the core component of the proposal, first.
I understand samtools/htslib#560 will take time to test. Any concerns with this PR?
SAMtags.tex
Outdated
@@ -60,7 +60,8 @@ \section{Standard tags}
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\
{\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\
All the other lines have a space before the \\. Please keep with that style.
Please mention here that it's for BAM format only
SAMv1.tex
Outdated
@@ -857,6 +857,19 @@ \subsection{The BAM format}
\footnotetext{As noted in Section~\ref{sec:alnrecord}, reserved {\sf FLAG} bits
should be written as zero and ignored on reading by current software.}
\stepcounter{footnote}
\footnotetext{With 16 bits, {\sf n\_cigar\_op} can keep at most 65535 CIGAR
operations in BAM files. For an alignment with more CIGAR operations, BAM
stores the real {\sf CIGAR}, in its binary form, to the {\tt CG} optional tag
to->in
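The 16-bit limit that the footnote refers to can be seen with a few lines of Python. This is a minimal sketch rather than real BAM-writing code: `pack_n_cigar_op` and `MAX_N_CIGAR_OP` are names made up for the illustration, and only the operation count is packed, not a full alignment record.

```python
import struct

MAX_N_CIGAR_OP = 0xFFFF  # largest value a 16-bit unsigned field can hold (65535)

def pack_n_cigar_op(n_ops):
    """Pack an operation count as a little-endian uint16."""
    if not 0 <= n_ops <= MAX_N_CIGAR_OP:
        raise ValueError(f"{n_ops} CIGAR operations do not fit in 16 bits")
    return struct.pack("<H", n_ops)
```

An alignment with 65535 operations still packs into two bytes, while 65536 raises `ValueError`; that second case is the one the `CG` tag workaround is meant to cover.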
SAMtags.tex
Outdated
@@ -131,6 +132,13 @@ \subsection{Additional Template and Mapping data}
\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if it contains $>$65535 operations. This is
If this is in binary form, we need to be more specific about which binary format and what parts of the binary format... Would it be easier to just store it as a string? What would happen if an "old tool" took the BAM and converted it to a SAM? How would the tag be encoded?
I have changed the wording a little bit, emphasizing that CG is encoded exactly the same way as the cigar field in BAM. This allows you to move the entire cigar around with memory copy/move, much more efficient and easier to implement than keeping text CIGAR in the CG tag.
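For readers unfamiliar with the binary form being discussed, here is a small sketch of the BAM cigar encoding: each operation is one 32-bit unsigned integer, `op_len << 4 | op`, with the operations `MIDNSHP=X` numbered 0 to 8. The helper names (`encode_cigar`, `decode_cigar`, `to_bytes`) are invented for this example, and the claim that the same little-endian bytes can serve as the value payload of a `B:I` array follows the comment above rather than any particular implementation.

```python
import re
import struct

# Operation codes used by the BAM cigar field: MIDNSHP=X map to 0..8.
BAM_CIGAR_OPS = "MIDNSHP=X"

def encode_cigar(cigar):
    """Turn a text CIGAR such as '3M1I2D' into a list of uint32 values."""
    values = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        # Length goes in the high 28 bits, operation code in the low 4 bits.
        values.append((int(length) << 4) | BAM_CIGAR_OPS.index(op))
    return values

def decode_cigar(values):
    """Inverse of encode_cigar: uint32 values back to a text CIGAR."""
    return "".join(f"{v >> 4}{BAM_CIGAR_OPS[v & 0xF]}" for v in values)

def to_bytes(values):
    """Serialise the values as a little-endian uint32 array."""
    return struct.pack(f"<{len(values)}I", *values)
```

Because the array layout is identical in both places, moving a CIGAR between the cigar field and the tag is a byte copy with no re-parsing, which is the efficiency argument made in the comment.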
if-> if (and only if)
This PR is a useful description of the CG tag solution to this 64K-cigar-operations issue, a proof of concept of the spec changes necessary, if you will. There are various details that need to be in the spec: I was pleased to see that the text describes how an implementation should recognise that a CG tag is being used, but for example the spec needs to describe exactly how the CG tag array is laid out (e.g. in what order do the operator and length appear in the array?) and doesn't at all at the moment, as @yfarjoun has also noticed. I do not think this PR should be merged into the spec as is. I am in the process of expanding and reorganising the added text in what is IMHO a better way, and covering missing details.
Barring some minor updates, I approve. However please also remove the committed SAMv1.pdf to ease merging this pull request.
SAMtags.tex
Outdated
@@ -60,7 +60,8 @@ \section{Standard tags}
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\
{\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\
I don't like "Intended" here as the specification isn't about what we'd like, but about what must happen.
How about "BAM only: stores the real {\sf CIGAR} field if it contains
SAMtags.tex
Outdated
@@ -131,6 +132,13 @@ \subsection{Additional Template and Mapping data}
\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if it contains $>$65535 operations. This is
intended to be a BAM file only tag as a workaround of BAM's incapability to
Please remove "intended to be". Otherwise I'm happy with the wording.
@jmarshall If you're in the process of revising things like footnotes, please also look at the first commit of Rob's BAMv2 proposal. This was a revamp of the BAM table to avoid the implementation specific bit-packing methods and to replace overly heavy footnotes with their own sections. I was thinking of reviewing and likely merging this after we get Heng's CG tag done. |
This was a mistake. How can I remove it from the pull request? Check out an older version of the PDF and then commit again?
git checkout origin/master -- SAMv1.pdf
git commit --amend SAMv1.pdf
ed6e333 to 924ed9f
Hi,
It looks like you're deleting the CM tag! Why is this in the 65535 CIGAR thread?
We've always used CM as the number of colour mismatches and NM as the number of base mismatches. A SNP will be one NM and 2 CMs, and a single CM may not even add to NM. There are still people using Colour Space.
Best Regards, Colin
I somehow deleted a space at the end of the CM line, so you saw the line changed, but it is still there. I didn't delete the line. I have just added the space back.
Hi Heng,
Okay, no problem. I wondered if it was a mistake after I posted. It did clearly show as a - CM line and a + CG line.
Best Regards, Colin
Can we have a progress report on this? @jmarshall are you revising the text? If so can we get a draft? As far as I'm concerned, the text is fine but the placement and ever-increasing series of footnotes is not. However that change is more wide-ranging than just this PR, so I propose to merge this and then look at the 1st commit of Rob's BAM2 PR which tidies up the excessive footnotes (with an amendment for the new one added here too). Thoughts @yfarjoun?
Can we merge this PR? I think all the concerns have been addressed so far.
We will not be merging this PR before this week's meeting.
@jmarshall can you clarify, are you preparing an amended pull request in time for the meeting?
Yes. In any case, the topic is already on the agenda for the meeting, so no PR on this topic would be merged before then.
8a0e52b to 7583e0b
Changed "if" to "if (and only if)" as requested by @yfarjoun. Rebased and squashed to a single commit.
I disagree with See #227 (comment) |
This raises an interesting point. Also, we would probably like to test the CG tags using fewer than 64K operations. My concern is that folks will shift to exclusively using CG tags in BAM regardless of the length of their CIGARs (to be on the "safe side" in case a large cigar comes by)... Thoughts?
I don't think that's likely. People will likely continue to use old CIGAR for compatibility with older toolchains as long as they can. Only those motivated by extremely long reads will likely migrate to CG tags. |
I think we have reached a consensus during the call? We should merge this first. Create a new PR if you like and I will comment there. |
I guess it is better to start with the stricter option and possibly relax it than vice versa.
I'm fairly ambivalent about it really and can see either way has merits. I'm tempted to say keep it open and say "if" rather than "if and only if", but I'll go along with the latter if everyone else agrees it's best. As for testing it with shorter CGs, I did that for the htslib implementation, including long "fake" cigars and very short real ones placed in CG to stress test it (this found issues, now resolved). However that is all implementation detail: it can and now does work with whatever we seem to throw at it, so "the spec is workable" is about the only implementation detail we need for the spec decision.
It works by encoding the real CIGAR in the CG tag and writing a fake CIGAR `<readLen>S<refLen>N` as the CIGAR in BAM. samtools/htslib#560 has implemented the method and has been merged.
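The writer side of this scheme can be sketched as follows. The function name `split_long_cigar` is hypothetical, and the sketch assumes that `<readLen>` is the number of query bases consumed by the real CIGAR and `<refLen>` the number of reference bases it spans. For readability the real CIGAR is returned as text, whereas the PR stores it in binary form in the tag.

```python
import re

MAX_OPS = 65535                 # most operations the 16-bit n_cigar_op can count
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference

def split_long_cigar(cigar):
    """Return (cigar_field, cg_value) for a text CIGAR.

    cg_value is None when the CIGAR fits in the regular field.
    """
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if len(ops) <= MAX_OPS:
        return cigar, None

    read_len = sum(n for n, op in ops if op in QUERY_CONSUMING)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    # Placeholder: soft-clip the whole read, then skip the reference span.
    return f"{read_len}S{ref_len}N", cigar
```

One motivation for this placeholder shape is that a tool unaware of `CG` still sees a syntactically valid CIGAR covering the same read length and reference span, even though the alignment detail is hidden from it.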
@jmarshall can we get an update on this? Above you said you'd post an updated PR before the meeting, but we're ~1 week after and still no feedback. We need to get moving on this, but cannot do so if we don't know what the blocking issues are and nor can we review a revised PR. |
Still waiting on an update. What's the delay please? Both myself and @yfarjoun are in agreement, as is the original PR author @lh3. @jmarshall - you indicated you had concerns and an updated PR, but have so far been unable to tell us what these are nor offer any updated text. I propose we merge this PR as it currently stands and your updates can arrive as a subsequent PR if and when ready. |
The new CG tag field BAM representation for long CIGAR strings (PR samtools#227, merged as dab57f4) will be unnoticed by older code. Such code will see the placeholder CIGAR string, so it needs to be possible to signal the presence of CG tags via the @HD-VN version number header field.
This commit addresses #40. It added optional tag CG and explained the workaround to store alignments with >65535 CIGAR operations in BAM files. The proposal is implemented in samtools/htslib#560.