Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CIGARs with >65535 operations in BAM files #227

Merged
merged 1 commit into from
Nov 28, 2017

Conversation

lh3
Copy link
Member

@lh3 lh3 commented Jul 14, 2017

This commit addresses #40. It added optional tag CG and
explained the workaround to store alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.

@lh3 lh3 mentioned this pull request Jul 14, 2017
SAMv1.tex Outdated
This workaround is applied to BAM files \emph{only}. SAM and CRAM files are not
affected. If tag {\tt CG} is present, BAM parsing libraries are expected to
seamlessly update {\sf n\_cigar\_op} and {\sf cigar} with the real {\sf CIGAR}
stored in the {\tt CG} tag.}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus "and remove the now-redundant CG tag"?

SAMtags.tex Outdated
@@ -58,6 +58,7 @@ \section{Standard tags}
{\tt BC} & Z & Barcode sequence \\
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CG} & B,I & BAM-only tag to store the real {\sf CIGAR} if it contains $>$65535 operations\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it should be allowed to leave the CG tag in a sam or a cram. the "recommended practice" can talk about the fact that only bam needs it, but there's no reason to complicate thing in the spec by making it illegal in anything but bam.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, point taken. How about "Intended to store real CIGAR if it contains >65535 operations"? I would like to choose wording such that this tag won't be abused to frequently store normal CIGARs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jkbonfield jkbonfield Jul 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intend to use CG over CIGAR in bam only if we have more than 64k, therefore the phrase "intended to store the real CIGAR..." is misleading. We have no explicit intentions of deliberately using this for all formats.

However when all said and done we need to decide what we wish to do when we see a CG tag in SAM or in CRAM. If we can decide on the behaviour then we can write the spec correctly to indicate this.

Do we ignore it and honour the real CIGAR, do we complain (it's an indication of using an out-dated piece of software that didn't correctly translate back), or do we try to patch the situation up and migrate CG to the cigar field once more?

My gut feeling is ignoring it is easiest and cleanest, but I could see a reason why issuing a warning may be useful.

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

I have updated this PR to encode long cigar with fake cigar <readLength>S<refLength>N.

Do we also want to add a CG:Y/N tag at the @HD line with this PR? I am thinking to merge this and samtools/htslib#560 first and then to address the CG header tag with another pair of pull requests to both hts-specs and htslib.

@yfarjoun
Copy link
Contributor

yfarjoun commented Nov 3, 2017

I don't understand the CG:Y/N header....how can we tell in advance if there will be records with CG tags?

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

You can't. I guess the intended use is that a user needs to ask the mapper to output a @HD CG:Y if he/she expects long-cigar alignment. This tag has no effect if there are no long cigars. However, if there is a long cigar and there is no CG:Y, samtools will throw an error.

One possible scenario, though, is that everyone might be outputting CG:Y regardless of long cigars.

@yfarjoun
Copy link
Contributor

yfarjoun commented Nov 3, 2017 via email

@jkbonfield
Copy link
Contributor

Tools that don't know how to deal with long cigars won't be parsing and checking the CG:Y header. Tools that do know how to deal with long cigars don't need a header to inform them of their presence. I forget now why it came up, but apparently in response to something I said. Hmm.

I'm tempted however to suggest that we ought to bump the SAM version number, even though this is technically a BAM only change. Why? Because there is no BAM version number other than the sledgehammer complete format change and there is the risk that this BAM only change will leak into SAM anyway, hence having a fixed version where we know it can occur (or vice versa, where we know it cannot) acts in a similar way to CG header tag, but more useful IMO. Plus, like it or not, the BAM and SAM formats have been inextricably linked in the same document since day 1.

@sjackman
Copy link

sjackman commented Nov 6, 2017

I believe CG:Y or CG:long was initially my suggestion, though I'm having trouble finding the original post now. The idea was that tools should output either entirely short CG or long CG, but not a mixture of both, to prevent an old tool from silently ignoring just those reads, and producing incorrect results. The following suggestion allows the entire pipeline to use either short or long CG as necessary, without changing the API. The default behaviour of the tools should just work in most situations.

  • A SAM writer, like a long-read aligner, should output either CG:Y or CG:N. The default behaviour is up to the tool author. The behaviour may be controlled by a command line option.
  • A BAM writer should output all short CG or long CG, but not a mixture of both.
  • A BAM reader and writer (like samtools view or sort) that does not see CG:Y in its input header should output exclusively short CG, and an error message if it encounters a long CIGAR.
  • A BAM reader and writer that sees CG:Y in its input header should output exclusively long CG.

@lh3
Copy link
Member Author

lh3 commented Nov 7, 2017

I would defer the decision on CG:Y to another PR. It is not essential, but affects more users and is likely to require lengthy discussions on the precise behavior and implementation. We should merge this PR, the core component of the proposal first.

@lh3
Copy link
Member Author

lh3 commented Nov 9, 2017

I understand samtools/htslib#560 will take time to test. Any concerns with this PR?

SAMtags.tex Outdated
@@ -60,7 +60,8 @@ \section{Standard tags}
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\
{\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the other lines have a space before the \\...please keep with that style.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention here that it's for BAM format only

SAMv1.tex Outdated
@@ -857,6 +857,19 @@ \subsection{The BAM format}
\footnotetext{As noted in Section~\ref{sec:alnrecord}, reserved {\sf FLAG} bits
should be written as zero and ignored on reading by current software.}
\stepcounter{footnote}
\footnotetext{With 16 bits, {\sf n\_cigar\_op} can keep at most 65535 CIGAR
operations in BAM files. For an alignment with more CIGAR operations, BAM
stores the real {\sf CIGAR}, in its binary form, to the {\tt CG} optional tag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to->in

SAMtags.tex Outdated
@@ -131,6 +132,13 @@ \subsection{Additional Template and Mapping data}
\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if it contains $>$65535 operations. This is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is in binary form, we need to be more specific about which binary format and what parts of the binary format....would it be easier to just store it as a string? What would happen is an "old tool" took the bam and converted it to a sam? how would the tag be encoded?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the wording a little bit, emphasizing that CG is encoded exactly the same way as the cigar field in BAM. This allows you to move the entire cigar around with memory copy/move, much more efficient and easier to implement than keeping text CIGAR in the CG tag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if-> if (and only if)

@jmarshall
Copy link
Member

This PR is a useful description of the CG tag solution to this 64K-cigar-operations issue, a proof of concept of the spec changes necessary if you will.

There are various details that need to be in the spec: I was pleased to see that the text describes how an implementation should recognise that a CG tag is being used, but for example the spec needs to describe exactly how the CG tag array is laid out (e.g. what order are the operator and length in in the array?) and doesn't at all at the moment, as @yfarjoun has also noticed.

I do not think this PR should be merged into the spec as is. I am in the process of expanding and reorganising the added text in IMHO a better way and covering missing details.

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Barring some minor updates, I approve. However please also remove the commited SAMv1.pdf to ease merging this pull request.

SAMtags.tex Outdated
@@ -60,7 +60,8 @@ \section{Standard tags}
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\
{\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "Intended" here as the specification isn't about what we'd like, but about what must happen.

How about "BAM only: stores the real {\sf CIGAR} field if it contains $&gt;65535$ operations".

SAMtags.tex Outdated
@@ -131,6 +132,13 @@ \subsection{Additional Template and Mapping data}
\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if it contains $>$65535 operations. This is
intended to be a BAM file only tag as a workaround of BAM's incapability to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove "intended to be". Otherwise I'm happy with the wording.

@jkbonfield
Copy link
Contributor

@jmarshall If you're in the process of revising things like footnotes, please also look at the first commit of Rob's BAMv2 proposal. This was a revamp of the BAM table to avoid the implementation specific bit-packing methods and to replace overly heavy footnotes with their own sections.

9ab9e3a

I was thinking of reviewing and likely merging this after we get Heng's CG tag done.

@lh3
Copy link
Member Author

lh3 commented Nov 9, 2017

However please also remove the commited SAMv1.pdf to ease merging this pull request.

This was a mistake. How can I remove it from the pull request? Checkout an older version of the PDF and then commit again?

@sjackman
Copy link

sjackman commented Nov 9, 2017

git checkout origin/master -- SAMv1.pdf
git commit --amend SAMv1.pdf

@lh3 lh3 force-pushed the cigar-64k branch 2 times, most recently from ed6e333 to 924ed9f Compare November 9, 2017 17:34
@colinhercus
Copy link

colinhercus commented Nov 10, 2017 via email

@lh3
Copy link
Member Author

lh3 commented Nov 10, 2017

I somehow deleted a space at the end of CM line, so you saw the line changed, but it is still there. I didn't delete the line. I have just added the space back.

@colinhercus
Copy link

colinhercus commented Nov 10, 2017 via email

@jkbonfield
Copy link
Contributor

Can we have a progress report on this. @jmarshall are you revising the text? If so can we get a draft?

As far as I'm concerned, the text is fine but the placement and ever-increasing series of footnotes is not. However that change is more wide ranging that just this PR so I propose to merge this and then look at the 1st commit of Rob's BAM2 PR which tidies up the excessive footnotes (with an amendment for the new one added here too).

Thoughts @yfarjoun?

@lh3
Copy link
Member Author

lh3 commented Nov 14, 2017

Can we merge this PR? I think all the concerns have been addressed so far.

@jmarshall
Copy link
Member

We will not be merging this PR before this week's meeting.

@tk2
Copy link

tk2 commented Nov 14, 2017

@jmarshall can you clarify, are you preparing an amended pull request in time for the meeting?

@jmarshall
Copy link
Member

Yes. In any case, the topic is already on the agenda for the meeting — so no PR on this topic would be being merged before then.

@lh3 lh3 force-pushed the cigar-64k branch 3 times, most recently from 8a0e52b to 7583e0b Compare November 18, 2017 15:31
@lh3
Copy link
Member Author

lh3 commented Nov 18, 2017

Changed "if" to "if (and only if)" as is requested by @yfarjoun. Rebased and squashed to a single commit.

@sjackman
Copy link

sjackman commented Nov 18, 2017

Real CIGAR in its binary form if it contains $>$65535 operations.

I disagree with if and only if and prefer if. A BAM file that uses exclusively either long CG tags or short CIGAR strings is preferable in my opinion to one that uses mixed encodings. Old software that reads a mixed encoding would ignore just the long alignments and produce inaccurate results that are not readily apparent. It's better to use CG tag for all alignments, so old software would ignore all alignments, and its failure would be more apparent. I'm not arguing that this be required behaviour. Changing the wording to if and only if however precludes using a file composed of only CG tags for all reads, which is in my opinion the better format.

See #227 (comment)
and https://github.com/samtools/hts-specs/pull/227/files#r151760409

@yfarjoun
Copy link
Contributor

This raises an interesting point.

Also, we would probably like to test the CG tags using shorter than 64K operations.

My concerns is that folks will shift to using exclusive CG tags in BAM regardless of the length of their CIGARS (to be on the "safe side" in case a large cigar comes by)....

Thoughts?

@sjackman
Copy link

My concerns is that folks will shift to using exclusive CG tags in BAM regardless of the length of their CIGARS (to be on the "safe side" in case a large cigar comes by)

I don't think that's likely. People will likely continue to use old CIGAR for compatibility with older toolchains as long as they can. Only those motivated by extremely long reads will likely migrate to CG tags.

@lh3
Copy link
Member Author

lh3 commented Nov 18, 2017

I think we have reached a consensus during the call? We should merge this first. Create a new PR if you like and I will comment there.

@yfarjoun
Copy link
Contributor

yfarjoun commented Nov 18, 2017 via email

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 18, 2017

I'm fairly abivalent about it really and can see either way has merits. I'm tempted to say keep it open and say "if" rather than "if and only if", but I'll go along with the latter if everyone else agrees it's best.

As for testing it with shorter CGs, I did that for the htslib implementation, including long "fake" cigars and very short real ones placed in CG to stress test it (this found issues, now resolved). However that is all implementation detail - it can and now does work with whatever we seem to throw at it so the spec is workable with is about the only implementation detail we need for the spec decision.

It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
@jkbonfield
Copy link
Contributor

@jmarshall can we get an update on this? Above you said you'd post an updated PR before the meeting, but we're ~1 week after and still no feedback.

We need to get moving on this, but cannot do so if we don't know what the blocking issues are and nor can we review a revised PR.

@jkbonfield
Copy link
Contributor

Still waiting on an update. What's the delay please? Both myself and @yfarjoun are in agreement, as is the original PR author @lh3. @jmarshall - you indicated you had concerns and an updated PR, but have so far been unable to tell us what these are nor offer any updated text.

I propose we merge this PR as it currently stands and your updates can arrive as a subsequent PR if and when ready.

@jkbonfield jkbonfield merged commit dab57f4 into samtools:master Nov 28, 2017
@jkbonfield
Copy link
Contributor

Merged. Next up we need to look at 9ab9e3a (from #259) to see which bits of the tidy up we should merge in, but this is a separate issue.

jmarshall added a commit to jmarshall/hts-specs that referenced this pull request May 2, 2018
The new CG tag field BAM representation for long CIGAR strings
(PR samtools#227, merged as dab57f4)
will be unnoticed by older code. Such code will see the placeholder
CIGAR string, so it needs to be possible to signal the presence of
CG tags via the @HD-VN version number header field.
jmarshall added a commit to jmarshall/hts-specs that referenced this pull request May 22, 2018
The new CG tag field BAM representation for long CIGAR strings
(PR samtools#227, merged as dab57f4)
will be unnoticed by older code. Such code will see the placeholder
CIGAR string, so it needs to be possible to signal the presence of
CG tags via the @HD-VN version number header field.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants