Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam_parse_cigar possibly adds data to a wrong place #1650

Closed
oleksii-nikolaienko opened this issue Jul 17, 2023 · 6 comments · Fixed by #1651
Closed

bam_parse_cigar possibly adds data to a wrong place #1650

oleksii-nikolaienko opened this issue Jul 17, 2023 · 6 comments · Fixed by #1651

Comments

@oleksii-nikolaienko
Copy link

I am trying to create a BAM file from scratch and using bam_set1 to create records. When bam_set1 is called with 0 for size_t n_cigar and NULL for const uint32_t *cigar and those fields are filled later using bam_parse_cigar, then resulting BAM file appeared to be corrupted. Seems that bam_parse_cigar adds CIGAR data not before seq but after qual.

I'm not sure I'm using it correctly though, I'm sorry if it appears to be my mistake...

int test_write_bam ()
{
  int result = 0;
  htsFile *out_fp = hts_open("/tmp/tmp.bam", "wb");
  bam_hdr_t *out_hdr = sam_hdr_init();
  bam1_t *out_rec = bam_init1();
  
  size_t n_cigar = 0;
  uint32_t *cigar = NULL;
  result = sam_parse_cigar("4M", NULL, &cigar, &n_cigar);
  cout << "sam_parse_cigar: " << result << std::endl;
  
  result = sam_hdr_add_lines(out_hdr, "@SQ\tSN:seq1\tLN:100", 0);
  cout << "sam_hdr_add_lines: " << result << std::endl;
  result = sam_hdr_write(out_fp, out_hdr);
  cout << "sam_hdr_write: " << result << std::endl;
  //*bam, l_qname, *qname, flag, tid, pos, mapq, n_cigar, *cigar, mtid, mpos, isize, l_seq, *seq, *qual, l_aux
  result = bam_set1(out_rec, 2, "q1", 4, 0, 0, 60, n_cigar, cigar, 0, 0, 4, 4, "TGAC", "    ", 0);
  cout << "bam_set1: " << result << std::endl;
  
  cout << "data: ";
  for (int i=0; i<out_rec->l_data; i++) Rprintf("%u,", out_rec->data[i]);
  cout << std::endl;
  
  result = bam_parse_cigar("2M2I", NULL, out_rec);
  cout << "bam_parse_cigar: " << result << std::endl;
  
  cout << "data: ";
  for (int i=0; i<out_rec->l_data; i++) Rprintf("%u,", out_rec->data[i]);
  cout << std::endl;
  
  cout << "n_cigar: " << out_rec->core.n_cigar << std::endl;
  
  result = sam_write1(out_fp, out_hdr, out_rec);
  cout << "sam_write1: " << result << std::endl;
  
  bam_hdr_destroy(out_hdr);
  bam_destroy1(out_rec);
  hts_close(out_fp);
  
  return result;
}

Output:

sam_parse_cigar: 1
sam_hdr_add_lines: 0
sam_hdr_write: 0
bam_set1: 14
data: 113,49,0,0,64,0,0,0,132,18,32,32,32,32,
bam_parse_cigar: 2
data: 113,49,0,0,64,0,0,0,132,18,32,32,32,32,32,0,0,0,33,0,0,0,
n_cigar: 1
sam_write1: 57
@oleksii-nikolaienko
Copy link
Author

And another little thing - this definition couple of times says "integer" instead of "float". Not very important, of course

@jkbonfield
Copy link
Contributor

Obviously a cut and paste error. Thanks for reporting.

I'll take a look at the cigar thing.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 17, 2023

I think the problem here is bam_parse_cigar was never intended to be used to modify an existing BAM record, and instead is part of the existing parsing of SAM, and hence it's only called at the appropriate time in the decode loop.

It appeared in 9a55e4e and it's unclear if the plan was for it to later become than it currently is, and to act as a generic BAM modification tool. As writyen it clearly just appends to b->data + b->l_data, which in the case where it's called from the SAM parsing loop is the same as bam_get_cigar(b). When we already have an existing BAM record however, this is not true.

The code could therefore be modified to check for this and become a genuine CIGAR rewriting tool. There is potentially a use for this, even if it's just clearing a CIGAR.

@whitwham
Copy link
Contributor

Agreed. To use bam_set1 I think you need all the data ready before it is called (barring the aux data maybe). I don't think we have anything to modify the cigar string in memory.

@jkbonfield
Copy link
Contributor

It's not that hard to do though. It just needs a memmove sticking in there for the case where b->data + b->l_data != bam_get_cigar(b). I'm just testing if it has performance impacts on the more normal case though.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jul 17, 2023
partially parsed ones.

It makes little sense for this to exist as a public API when it's only
capable of handling the internal during-SAM-parse situation, and the
changes are relatively minor.

Also fixes an undocumented assumption that end == &in.

Fixes samtools#1650
@oleksii-nikolaienko
Copy link
Author

oleksii-nikolaienko commented Jul 17, 2023

Btw, for my information: what values one should pass as mtid and mpos in bam_set1 for a correct single-end record? Is it -1?

whitwham pushed a commit that referenced this issue Jul 31, 2023
partially parsed ones.

It makes little sense for this to exist as a public API when it's only
capable of handling the internal during-SAM-parse situation, and the
changes are relatively minor.

Also fixes an undocumented assumption that end == &in.

Fixes #1650
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this issue Aug 17, 2023
partially parsed ones.

It makes little sense for this to exist as a public API when it's only
capable of handling the internal during-SAM-parse situation, and the
changes are relatively minor.

Also fixes an undocumented assumption that end == &in.

Fixes samtools#1650
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants