ADAM output is corrupt in S3 #117

Closed
fnothaft opened this issue Feb 25, 2016 · 5 comments

@fnothaft

Brief synopsis is in #116, but the TL;DR is that the BAM written by ADAM is corrupt when downloaded from S3. I'm working to sort out whether something is going wrong when ADAM writes the BAM out, or if something is going wrong when the BAM is uploaded to S3. The header for the file is good (w00t) but the rest of the file can't be read.
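One way to separate the two possibilities is to compare a checksum of the BAM as written by ADAM against a checksum of the copy downloaded from S3. The sketch below is only an illustration and assumes the pre-upload file is still available on disk; the function names and the file paths in the usage note are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(local_path, downloaded_path):
    """True if the two files have identical contents (by SHA-256)."""
    return file_digest(local_path) == file_digest(downloaded_path)
```

For example, `same_bytes("toil.local.bam", "toil.from_s3.bam")` returning `True` would suggest the upload is not altering the file, which points back at the writer.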

@fnothaft

This issue is somewhere in ADAM. I've opened bigdatagenomics/adam#962 in parallel for tracking.

@fnothaft

OK, we've got a fix at bigdatagenomics/adam#964. I'm going to update the ADAM Docker container to pull this fix in, and then will test on the cluster and report back.

@fnothaft

No dice. From inside of the GATK:

##### ERROR MESSAGE: SAM/BAM/CRAM file toil.bam is malformed: java.lang.Integer cannot be cast to java.lang.String 

I'm wondering if we wrote out a bad tag? Time to dig in more...

@fnothaft

We are writing record group IDs as integers, not strings:

$ samtools view http://s3-us-west-2.amazonaws.com/fnothaft-fc-test-west-2/analysis/SRR062643/SRR062643.adam.bam | head -n 1
[knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged.
SRR062643.6349712       147     HLA-A*01:01:01:01       36      60      100M    chr6    29942202        0       CCAGGCGTGGCTCTCAGGGTCTCAGGCCCCGAAGGCGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAGTTTCTTTTCTCCCTC     ##########@:@>@A666?A?@=?/=<A8@?AB@1=193;:@.@ABBAAA@=>?<BAB<DABBDAAAAB@BDDBEACBD;ABD?BDDDEEFFDCCDBC!     LB:Z:SRR062643  RG:i:0  PU:Z:12345

From the SAM spec:

Tag | Type | Description
RG | Z | Read group. Value matches the header RG-ID tag if @RG is present in the header.

Should be a simple fix, although I'm a bit perplexed as to how we did that.
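
A quick check for this class of problem can be run over `samtools view` output. The sketch below is a minimal illustration, not ADAM's or htsjdk's code: it assumes tab-delimited SAM text lines, and the function names are hypothetical. It parses the optional `TAG:TYPE:VALUE` fields that follow the 11 mandatory columns and reports whether `RG` carries the `Z` (string) type.

```python
def optional_tags(sam_line):
    """Map each optional tag in a SAM record to its (type, value) pair."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory
        if not field:
            continue  # tolerate a trailing tab
        # Split at most twice so values containing ':' stay intact.
        tag, tag_type, value = field.split(":", 2)
        tags[tag] = (tag_type, value)
    return tags


def rg_is_string(sam_line):
    """True if the record has no RG tag, or its RG tag has type Z."""
    tag = optional_tags(sam_line).get("RG")
    return tag is None or tag[0] == "Z"
```

A record ending in `RG:i:0`, like the one above, makes `rg_is_string` return `False`; the same record with `RG:Z:0` returns `True`.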

@fnothaft

Fix at fnothaft/adam@14b41d5. Retesting on the cluster...
