
Missing SAM header when using pipe API #2322

Closed
heuermh opened this issue Jun 14, 2021 · 9 comments

@heuermh
Member

heuermh commented Jun 14, 2021

Created to track a separate issue raised in #2321; see comment
#2321 (comment)

@heuermh
Member Author

heuermh commented Jun 14, 2021

Hello @hxdhan, I created this separate issue to track your comment linked above.

Have you tried https://github.com/bigdatagenomics/cannoli to call STAR instead of using the pipe API directly?

$ cannoli-submit --help
                              _ _
                             | (_)
   ___ __ _ _ __  _ __   ___ | |_
  / __/ _` | '_ \| '_ \ / _ \| | |
 | (_| (_| | | | | | | | (_) | | |
  \___\__,_|_| |_|_| |_|\___/|_|_|

Usage: cannoli-submit [<spark-args> --] <cannoli-args>

Choose one of the following commands:

CANNOLI
...
                star : Align paired-end reads in a fragment dataset with STAR-Mapper.

@bioxmd

bioxmd commented Jun 14, 2021

Hello,
I think I have a similar problem. I tried to run Bowtie2 using Cannoli, and the header section of the SAM file seems to be incomplete; it does not look like the header from Bowtie2 run locally.

@hxdhan

hxdhan commented Jun 22, 2021

@heuermh
I think the issue is `reader.iterator()`, which does not include the reader's SAM header information.

```scala
private case class SAMIteratorConverter(val reader: SamReader) extends Iterator[Alignment] {

  val iter = reader.iterator()

  // make converter and empty dicts
  val converter = new AlignmentConverter

  def hasNext: Boolean = {
    iter.hasNext
  }

  def next: Alignment = {
    assert(iter.hasNext)
    converter.convert(iter.next)
  }
}
```

@heuermh
Member Author

heuermh commented Jun 22, 2021

> I have tried to run Bowtie2 using Cannoli, and the header section of the SAM file seems to be incomplete and it does not look like a header from Bowtie2 running locally.

The header section of the SAM file written by Cannoli is written by Cannoli, not by Bowtie2, so it will not necessarily look exactly the same.

What is incomplete?
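For context, the SAM specification defines the header as the lines beginning with `@`: `@HD` (format version), `@SQ` (reference sequences), `@RG` (read groups), and `@PG` (program records). A minimal sketch in plain Python (illustrative header text, not actual Cannoli output) of checking which record types a header contains:

```python
# Count SAM header record types (@HD, @SQ, @RG, @PG) in SAM text.
# The header example below is illustrative, not output from Cannoli.

def header_record_counts(sam_text):
    """Return a dict mapping header record type (e.g. '@SQ') to its count."""
    counts = {}
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            break  # header lines always precede alignment records
        record_type = line.split("\t", 1)[0]
        counts[record_type] = counts.get(record_type, 0) + 1
    return counts

header = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:2\tLN:243199373",
    "@RG\tID:NA12878\tSM:NA12878",
])

print(header_record_counts(header))
# {'@HD': 1, '@SQ': 2, '@RG': 1}
```

A "complete" header for downstream tools usually means the `@SQ` and `@RG` records are present, which is exactly what is at issue here.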

@hxdhan

hxdhan commented Jun 26, 2021

> Hello @hxdhan, I created this separate issue to track your comment linked above.
>
> Have you tried https://github.com/bigdatagenomics/cannoli to call STAR instead of using the pipe API directly?

Cannoli's `star` command just calls the ADAM pipe API, so the result is the same.

@heuermh
Member Author

heuermh commented Jun 27, 2021

ADAM, and Cannoli by extension, do write out valid SAM format headers.

Interleaved FASTQ format includes no metadata, so if you need additional metadata in the SAM header produced by the bowtie2 or star pipe API transformations (or any other transformation), you need to specify it yourself.

For example

$ ./bin/cannoli-shell

scala> import org.bdgenomics.adam.ds.ADAMContext._
scala> import org.bdgenomics.cannoli.Cannoli._
scala> import org.bdgenomics.cannoli.Bowtie2Args
scala> import org.bdgenomics.adam.models.{ ReadGroup, ReadGroupDictionary }

scala> val fragments = sc.loadInterleavedFastqAsFragments("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.ifq")
fragments: org.bdgenomics.adam.ds.fragment.FragmentDataset = RDDBoundFragmentDataset
  with 0 reference sequences, 0 read groups, and 0 processing steps

scala> val args = new Bowtie2Args()
args: org.bdgenomics.cannoli.Bowtie2Args = org.bdgenomics.cannoli.Bowtie2Args@9d2ddb1

scala> args.indexPath = "hs37d5.fa"
args.indexPath: String = hs37d5.fa

scala> val readGroup = ReadGroup("NA12878", "NA12878")
readGroup: org.bdgenomics.adam.models.ReadGroup = ReadGroup(NA12878,NA12878,None,None,None,None,None,None,None,None,None)

scala> val readGroups = ReadGroupDictionary(Seq(readGroup))
readGroups: org.bdgenomics.adam.models.ReadGroupDictionary = ReadGroupDictionary(ReadGroup(NA12878,NA12878,None,None,None,None,None,None,None,None,None))

scala> val references = sc.loadSequenceDictionary("hs37d5.dict")
references: org.bdgenomics.adam.models.SequenceDictionary =
SequenceDictionary{1->249250621, ...}

scala> val alignments = fragments.alignWithBowtie2(args)
alignments: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 0 reference sequences, 0 read groups, and 0 processing steps

scala> alignments.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.sam", asSingleFile = true)

scala> val withReadGroups = alignments.replaceReadGroups(readGroups)
withReadGroups: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 0 reference sequences, 1 read groups, and 0 processing steps

scala> withReadGroups.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.sam", asSingleFile = true)

scala> val withReferences = withReadGroups.replaceReferences(references)
withReferences: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 86 reference sequences, 1 read groups, and 0 processing steps

scala> withReferences.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.references.sam", asSingleFile = true)

scala> :quit

With fragments from interleaved FASTQ format only

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.sam
@HD	VN:1.6	SO:unsorted
20FUKAAXX100202:2:22:10007:37406	83	21	10149576	1	39M1D62M	=	10149299	-379	ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA	@GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	MD:Z:39^T44G17	XG:i:1	NM:i:2	XM:i:1	XN:i:0	XO:i:1	AS:i:-13	XS:i:-13	YS:i:0	YT:Z:CP

With fragments from interleaved FASTQ and read groups

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.sam
@HD	VN:1.6	SO:unsorted
@RG	ID:NA12878	SM:NA12878
20FUKAAXX100202:2:22:10007:37406	83	21	10149576	1	39M1D62M	=	10149299	-379	ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA	@GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	MD:Z:39^T44G17	XG:i:1	NM:i:2	XM:i:1	XN:i:0	XO:i:1	AS:i:-13	XS:i:-13	YS:i:0	YT:Z:CP

With fragments from interleaved FASTQ, read groups, and references

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.references.sam
@HD	VN:1.6	SO:unsorted
@SQ	SN:1	LN:249250621	M5:1B22B98CDEB4A9304CB5D48026A85128	UR:file:/hs37d5.fa
@SQ	SN:2	LN:243199373	M5:A0D9851DA00400DEC1098A9255AC712E	UR:file:/hs37d5.fa
...
@SQ     SN:GL000192.1   LN:547496       M5:325BA9E808F669DFEEE210FDD7B470AC     UR:file:/hs37d5.fa
@SQ     SN:NC_007605    LN:171823       M5:6743BD63B3FF2B5B8985D8933C53290A     UR:file:/hs37d5.fa
@SQ     SN:hs37d5       LN:35477943     M5:5B6A4B3A81A2D3C134B7D14BF6AD39F1     UR:file:/hs37d5.fa
@RG     ID:NA12878      SM:NA12878
20FUKAAXX100202:2:22:10007:37406        83      21      10149576        1       39M1D62M        =       10149299        -379    ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA   @GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH   MD:Z:39^T44G17  XG:i:1  NM:i:2  XM:i:1  XN:i:0  XO:i:1  AS:i:-13        XS:i:-13        YS:i:0  YT:Z:CP

Alternatively, you can start from unaligned BAM/CRAM/SAM format, or preferably Fragments in Apache Parquet format (the default and preferred on-disk format for ADAM and Cannoli data), all of which include this metadata.
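To see why the metadata must be supplied separately: a FASTQ record carries only a read name, sequence, and quality string, so there is nothing for ADAM to build `@SQ` or `@RG` header lines from. A minimal sketch in plain Python (illustrative records, not real data) of parsing an interleaved FASTQ fragment:

```python
# Parse interleaved FASTQ: records alternate read 1 / read 2 of each fragment.
# Each record is 4 lines - @name, sequence, '+', qualities - with no reference
# or read-group metadata anywhere. The records below are illustrative.

def parse_interleaved_fastq(text):
    """Yield (name, sequence, qualities) tuples from interleaved FASTQ text."""
    lines = text.splitlines()
    for i in range(0, len(lines), 4):
        name, seq, _, qual = lines[i:i + 4]
        yield (name.lstrip("@"), seq, qual)

ifq = "\n".join([
    "@frag1/1", "ACGT", "+", "IIII",
    "@frag1/2", "TTGA", "+", "IIII",
])

print(list(parse_interleaved_fastq(ifq)))
# [('frag1/1', 'ACGT', 'IIII'), ('frag1/2', 'TTGA', 'IIII')]
```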

Hope this helps!

@hxdhan

hxdhan commented Jun 29, 2021

I tried to use hs37d5.dict for the metadata, but I found it is not the same as the metadata generated by STAR.

hs37d5.zip
star.zip

@heuermh
Member Author

heuermh commented Jul 2, 2021

$ head hs37d5.dict star.dict
==> hs37d5.dict <==
@HD	VN:1.6
@SQ	SN:1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:2	LN:243199373	M5:a0d9851da00400dec1098a9255ac712e	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:3	LN:198022430	M5:fdfd811849cc2fadebc929bb925902e5	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:4	LN:191154276	M5:23dccd106897542ad87d2765d28a19a1	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:5	LN:180915260	M5:0740173db9ffd264d728f32784845cd7	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:6	LN:171115067	M5:1d3a93a248d92a729ee764823acbbc6b	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:7	LN:159138663	M5:618366e953d6aaad97dbe4777c29375e	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:8	LN:146364022	M5:96f514a9929e410c6651697bded59aec	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:9	LN:141213431	M5:3e273117f15e0a400f01055d9f393768	UR:file:/mnt/d/code/dev/hs37d5.fa

==> star.dict <==
@HD	VN:1.6
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717

Right, you have to use the same reference that was provided to STAR.

If you need to generate a .dict file, you can use Picard

$ java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict

or dsh-bio

$ dsh-bio create-sequence-dictionary -i ref.fa -o ref.dict
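To check whether two .dict files describe the same reference, one can compare their `@SQ` names and lengths. A minimal sketch in plain Python (illustrative `@SQ` lines, not the actual hs37d5/STAR files) of that comparison:

```python
# Compare the @SQ records of two sequence dictionaries by name and length.
# The dictionary text below is illustrative, abbreviated to two sequences.

def sq_records(dict_text):
    """Map reference name (SN) to length (LN) from .dict / SAM header text."""
    refs = {}
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            refs[fields["SN"]] = int(fields["LN"])
    return refs

b37 = "@HD\tVN:1.6\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373"
grch38 = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529"

print(sq_records(b37) == sq_records(grch38))
# False: the names (1 vs chr1) and lengths differ, so these are
# different references and their alignments are not interchangeable
```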

@heuermh heuermh added this to the 0.37.0 milestone Aug 23, 2021
@heuermh
Member Author

heuermh commented Aug 23, 2021

Closing as invalid; please feel free to reopen if necessary.

@heuermh heuermh closed this as completed Aug 23, 2021