Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bedtools tag broken for small input files since 2.27.1 #762

Closed
mvdbeek opened this issue Sep 6, 2019 · 12 comments
Closed

bedtools tag broken for small input files since 2.27.1 #762

mvdbeek opened this issue Sep 6, 2019 · 12 comments

Comments

@mvdbeek
Copy link
Contributor

mvdbeek commented Sep 6, 2019

bedtools tag -names -i srma_in3.bam -files tagBed1.bed|samtools view works fine on 2.27.1, but is broken on 2.28.0 and 2.29.0 and returns

[main_samview] truncated file.

You can find the test data in https://github.com/galaxyproject/tools-iuc/blob/3e955a4c015cdb4df68fb530ace165be9877f94a/tools/bedtools/test-data/srma_in3.bam?raw=true

Let me know if you need more details.

@mvdbeek mvdbeek changed the title bedtools tag broken since 2.27.1 bedtools tag broken for small input files since 2.27.1 Sep 6, 2019
@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

This might have to do with the very tiny input file, larger files seem to work fine.

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

Actually it's probably because the header doesn't contain the @HD line, adding that fixes the test. I'm not sure if that's really a regression, but even without this the input seems to be a valid BAM file and samtools view has no issue with it.

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

A similar issue also seem to happen with bedtools bedpetobam: bedtools bedpetobam -mapq 255 -i bedpeToBamBed1.bed -g mm9.len|samtools view, test data is in https://github.com/galaxyproject/tools-iuc/tree/master/tools/bedtools/test-data

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

The generated header looks fine fwiw:

samtools view -h output.bam                            ✔  10121  15:05:22  .venv3  (jupyter)
@HD	VN:1.0	SO:unsorted
@PG	ID:BEDTools_bedpeToBam	VN:Vv2.29.0
@SQ	SN:chr1	AS:test-data/mm9.len	LN:197195432
@SQ	SN:chr10	AS:test-data/mm9.len	LN:129993255
@SQ	SN:chr11	AS:test-data/mm9.len	LN:121843856
@SQ	SN:chr12	AS:test-data/mm9.len	LN:121257530
@SQ	SN:chr13	AS:test-data/mm9.len	LN:120284312
@SQ	SN:chr13_random	AS:test-data/mm9.len	LN:400311
@SQ	SN:chr14	AS:test-data/mm9.len	LN:125194864
@SQ	SN:chr15	AS:test-data/mm9.len	LN:103494974
@SQ	SN:chr16	AS:test-data/mm9.len	LN:98319150
@SQ	SN:chr16_random	AS:test-data/mm9.len	LN:3994
@SQ	SN:chr17	AS:test-data/mm9.len	LN:95272651
@SQ	SN:chr17_random	AS:test-data/mm9.len	LN:628739
@SQ	SN:chr18	AS:test-data/mm9.len	LN:90772031
@SQ	SN:chr19	AS:test-data/mm9.len	LN:61342430
@SQ	SN:chr1_random	AS:test-data/mm9.len	LN:1231697
@SQ	SN:chr2	AS:test-data/mm9.len	LN:181748087
@SQ	SN:chr3	AS:test-data/mm9.len	LN:159599783
@SQ	SN:chr3_random	AS:test-data/mm9.len	LN:41899
@SQ	SN:chr4	AS:test-data/mm9.len	LN:155630120
@SQ	SN:chr4_random	AS:test-data/mm9.len	LN:160594
@SQ	SN:chr5	AS:test-data/mm9.len	LN:152537259
@SQ	SN:chr5_random	AS:test-data/mm9.len	LN:357350
@SQ	SN:chr6	AS:test-data/mm9.len	LN:149517037
@SQ	SN:chr7	AS:test-data/mm9.len	LN:152524553
@SQ	SN:chr7_random	AS:test-data/mm9.len	LN:362490
@SQ	SN:chr8	AS:test-data/mm9.len	LN:131738871
@SQ	SN:chr8_random	AS:test-data/mm9.len	LN:849593
@SQ	SN:chr9	AS:test-data/mm9.len	LN:124076172
@SQ	SN:chr9_random	AS:test-data/mm9.len	LN:449403
@SQ	SN:chrM	AS:test-data/mm9.len	LN:16299
@SQ	SN:chrUn_random	AS:test-data/mm9.len	LN:5900358
@SQ	SN:chrX	AS:test-data/mm9.len	LN:166650296
@SQ	SN:chrX_random	AS:test-data/mm9.len	LN:1785075
@SQ	SN:chrY	AS:test-data/mm9.len	LN:15902555
@SQ	SN:chrY_random	AS:test-data/mm9.len	LN:58682461
[main_samview] truncated file.

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

I should note that bamtofastq also fails with a segmentation fault on the https://github.com/galaxyproject/tools-iuc/blob/3e955a4c015cdb4df68fb530ace165be9877f94a/tools/bedtools/test-data/srma_in3.bam?raw=true
I guess this is all linked ?

mvdbeek added a commit to mvdbeek/tools-iuc that referenced this issue Sep 6, 2019
@jmarshall
Copy link
Contributor

BAM files contain two representations of the headers: the same text as in a SAM file, and a binary table of the @SQ chromosome names and lengths. Your srma_in3.bam file has a binary table listing dummy_chr and no text headers at all. (It is uncommon but valid to have binary sequence tables entries but no corresponding text @SQ headers in a BAM file.)

samtools view outputs SAM by synthesising @SQ text for any binary table entries if @SQ text is otherwise missing. So samtools view srma_in3.bam displays @SQ SN:dummy_chr … even though the input file doesn't really contain that in textual form.

bedtools tag is outputting the headers by just passing the text form of the input headers directly, which is empty, which leads to the truncated file error you saw. Instead, it needs to do the same thing as samtools view: output the headers by calling sam_hdr_write on the input bam_hdr_t* rather than via an abbreviated text form of the headers.

The bamtofastq segfault is a separate problem and is a simple typo:

index 30c0a968..1f88a351 100644
--- a/src/bamToFastq/bamToFastq.cpp
+++ b/src/bamToFastq/bamToFastq.cpp
@@ -38,8 +38,8 @@ BamToFastq::~BamToFastq(void) {}
 void BamToFastq::SingleFastq() {
     // open the 1st fastq file for writing
     _fq = new ofstream(_fastq1.c_str(), ios::out);
-    if ( !*_fq1 ) {
-        cerr << "Error: The first fastq file (" << _fastq1 << ") could not be opened.  Exiting!" << endl;
+    if ( !*_fq ) {
+        cerr << "Error: The fastq file (" << _fastq1 << ") could not be opened.  Exiting!" << endl;
         exit (1);
     }
     // open the BAM file

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

Awesome, thanks for the detailed explanation @jmarshall! I guess I was able to fix the test file by doing a bam->sam->bam conversion. That only leaves the bedpetobam issue as unexplained.

@arq5x
Copy link
Owner

arq5x commented Sep 6, 2019

Ugh, nice catch @jmarshall - burnt by a missing test again.

@mvdbeek
Copy link
Contributor Author

mvdbeek commented Sep 6, 2019

We caught this with our test suite in galaxyproject/tools-iuc#2399, if you ping us before a release we're happy to flag any surprising changes or bugs.

@jmarshall
Copy link
Contributor

Indeed, @SQ text headers were synthesised during the bam->sam step of your conversion.

I was expecting the bedpetobam issue to be the same problem, but it turns out to be another separate problem with a simple fix:

index 4afa3cdc..7e6721d9 100644
--- a/src/bedpeToBam/bedpeToBam.cpp
+++ b/src/bedpeToBam/bedpeToBam.cpp
@@ -189,8 +189,8 @@ void ProcessBedPE(BedFilePE *bedpe, GenomeFile *genome,  int mapQual, bool uncom
 
             if (bedpe->bedType >= 10) {
                 ConvertBedPEToBam(bedpeEntry, bamEntry1, bamEntry2, chromToId,  mapQual, lineNum);
-                writer->SaveAlignment(bamEntry1);
-                writer->SaveAlignment(bamEntry2);
+                writer->SaveAlignment(bamEntry1, true);
+                writer->SaveAlignment(bamEntry2, true);

However somewhere in or nearby SyncExtraData() the QNAME is getting its final character truncated, so there appears to be a buffer length off-by-1 problem in there. Another opportunity for a few more tests 😄

@arq5x
Copy link
Owner

arq5x commented Sep 6, 2019

Thanks @jmarshall - @38 could you address this and add a few tests?

@arq5x
Copy link
Owner

arq5x commented Feb 1, 2020

Sorry for the delay, I finally applied and pushed @jmarshall's suggested patch above. I missed this previously.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants