SAMFormatException: Unrecognized tag type: ^@ #1657

Closed
heuermh opened this issue Aug 3, 2017 · 9 comments

@heuermh
Member

heuermh commented Aug 3, 2017

INFO rdd.ADAMContext: Loading hdfs://spark-master:8020/data/sample.bam as BAM/CRAM/SAM and converting to AlignmentRecords.
INFO rdd.ADAMContext: Loaded header from hdfs://spark-master:8020/data/sample.bam
...
INFO read.RDDBoundAlignmentRecordRDD: Saving data in ADAM format
...
WARN scheduler.TaskSetManager: Lost task 135.0 in stage 0.0 (TID 147, ip-10-0-0-9.ec2.internal):
htsjdk.samtools.SAMFormatException: Unrecognized tag type: ^@
        at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:351)
        at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
        at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:313)
        at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:293)
        at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2004)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:795)
        at htsjdk.samtools.BAMFileReader$BAMFileIndexIterator.<init>(BAMFileReader.java:947)
        at htsjdk.samtools.BAMFileReader.getIterator(BAMFileReader.java:482)
        at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:172)
        at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
        at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:156)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
...
INFO scheduler.TaskSetManager: Lost task 135.1 in stage 0.0 (TID 176) on executor
ip-10-0-0-9.ec2.internal: htsjdk.samtools.SAMFormatException (Unrecognized tag type: ^@) [duplicate 1]
...
INFO scheduler.TaskSetManager: Lost task 135.2 in stage 0.0 (TID 200) on executor
ip-10-0-0-9.ec2.internal: htsjdk.samtools.SAMFormatException (Unrecognized tag type: ^@) [duplicate 2]
@fnothaft
Member

fnothaft commented Aug 3, 2017

Do you have a line in the BAM with that tag type?

@heuermh
Member Author

heuermh commented Aug 3, 2017

Will investigate. Odd that only one file of 455 from the same source pipeline would have that as a tag type.

@heuermh
Member Author

heuermh commented Aug 8, 2017

Can't find any occurrences of ^@, or any part of it under various escapings, in less:

$ samtools view -h sample.bam | less
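Searching by eye in less is easy to get wrong, since ^@ is caret notation for a NUL byte (0x00) rather than a literal two-character string. A minimal sketch of a scripted check on the SAM text follows. It assumes only the SAM optional-field layout (TAG:TYPE:VALUE from column 12 onward, with types A, i, f, Z, H, B); the script name `check_tags.py` and the function names are hypothetical, and this is not what htsjdk or ADAM run internally.

```python
import sys

# Value types allowed in SAM optional fields (TAG:TYPE:VALUE), per the SAM spec.
VALID_SAM_TYPES = set("AifZHB")


def suspicious_fields(line):
    """Return the optional fields of one SAM alignment line that look malformed."""
    columns = line.rstrip("\n").split("\t")
    bad = []
    for field in columns[11:]:  # the first 11 columns are mandatory
        parts = field.split(":", 2)
        malformed = (
            len(parts) != 3
            or len(parts[0]) != 2
            or parts[1] not in VALID_SAM_TYPES
            or "\x00" in field  # a NUL byte is what less displays as ^@
        )
        if malformed:
            bad.append(field)
    return bad


def scan(lines):
    """Yield (line_number, bad_fields) for each alignment line with problems."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):  # header lines carry no optional fields
            continue
        bad = suspicious_fields(line)
        if bad:
            yield number, bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_tags.py excerpt.sam
    with open(sys.argv[1], newline="") as handle:
        for number, bad in scan(handle):
            print(number, bad)
```

Run against a SAM file written by `samtools view -h`, this prints the line number and offending fields for any record whose tag type is not one of the six SAM types. If it prints nothing, the decoded text is clean, which would point away from the file contents and toward how the BAM bytes are being read.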

Samtools itself doesn't seem to complain

$ samtools flagstat sample.bam
986020586 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
103643908 + 0 duplicates
928658758 + 0 mapped (94.18% : N/A)
986020586 + 0 paired in sequencing
493010293 + 0 read1
493010293 + 0 read2
896125586 + 0 properly paired (90.88% : N/A)
910853506 + 0 with itself and mate mapped
17805252 + 0 singletons (1.81% : N/A)
3849910 + 0 with mate mapped to a different chr
1764304 + 0 with mate mapped to a different chr (mapQ>=5)

and an excerpt converted to SAM format transformed OK:

$ samtools view -h \
  sample.bam \
  chr1:99000-100000 > sample-chr1-99000-100000.sam

$ hadoop fs -put \
  sample-chr1-99000-100000.sam \
  /data/sample-chr1-99000-100000.sam

$ adam-submit \
  transformAlignments \
  /data/sample-chr1-99000-100000.sam \
  /data/sample-chr1-99000-100000.alignments.adam
...

@fnothaft
Member

fnothaft commented Aug 8, 2017

This might be a bad Hadoop-BAM split? @ryan-williams has been tracking these down...

@heuermh
Member Author

heuermh commented Aug 8, 2017

The error was reported more than once, on the same executor though, so I suppose it could be a bad split. I've asked the data producer to help us confirm the BAM hasn't been corrupted since it was created, and I'll try to do something with htsjdk directly next.

@fnothaft
Copy link
Member

fnothaft commented Aug 8, 2017

If the file reads OK on a single node, it looks a lot like a bad split to me...
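To illustrate why a bad split could surface as this particular exception, here is a rough sketch of walking BAM auxiliary fields (2-byte tag, 1-byte type, then the value) from a byte buffer. It is an independent toy based on the BAM layout, not htsjdk's `BinaryTagCodec`, and the function name `walk_tags` and the example bytes are made up for illustration. Whether this is what actually happened to sample.bam is not confirmed anywhere in this thread.

```python
import struct

# Byte widths of the fixed-size BAM auxiliary value types.
FIXED_WIDTH = {"A": 1, "c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4, "f": 4}


def walk_tags(buf, offset=0):
    """Walk BAM auxiliary fields in buf from offset, returning (tag, type) pairs."""
    seen = []
    while offset < len(buf):
        tag = buf[offset:offset + 2].decode("latin-1")
        type_code = chr(buf[offset + 2])
        offset += 3
        if type_code in FIXED_WIDTH:
            offset += FIXED_WIDTH[type_code]
        elif type_code in ("Z", "H"):
            # String values are NUL-terminated.
            offset = buf.index(b"\x00", offset) + 1
        elif type_code == "B":
            # Array: 1-byte element type, 4-byte little-endian count, then elements.
            sub_type = chr(buf[offset])
            (count,) = struct.unpack_from("<I", buf, offset + 1)
            offset += 5 + count * FIXED_WIDTH[sub_type]
        else:
            raise ValueError("Unrecognized tag type: %r" % type_code)
        seen.append((tag, type_code))
    return seen


# Made-up example bytes: NM:C:0 followed by MD:Z:50.
EXAMPLE_AUX = b"NMC\x00" + b"MDZ50\x00"
```

Starting at the true field boundary, `walk_tags(EXAMPLE_AUX)` walks both fields cleanly. Starting one byte late, `walk_tags(EXAMPLE_AUX, 1)` treats `MC` as the tag and the NUL value byte as the type, and raises `ValueError` with an "Unrecognized tag type" message. A split that guesses a record start at the wrong offset would misread perfectly valid bytes in the same way, which would also be consistent with samtools reading the whole file without complaint.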

@ryan-williams
Member

I have a bunch of utilities for investigating the bad split possibility. lmk if I can be of assistance / would love to get a look at the BAM

@fnothaft
Member

fnothaft commented Jan 9, 2018

@heuermh ping to retest with latest Hadoop-BAM in ToT.

@heuermh
Member Author

heuermh commented Jan 26, 2018

Works for me with ADAM version 0.23.0.

heuermh closed this as completed Jan 26, 2018
heuermh added this to the 0.24.0 milestone Jan 26, 2018