
Missing SAM header when using pipe API #2322

Closed
heuermh opened this issue Jun 14, 2021 · 9 comments

@heuermh
Member

heuermh commented Jun 14, 2021

Created to track a separate issue raised in #2321; see comment
#2321 (comment)

@heuermh
Member Author

heuermh commented Jun 14, 2021

Hello @hxdhan, I created this separate issue to track your comment linked above.

Have you tried https://github.com/bigdatagenomics/cannoli to call STAR instead of using the pipe API directly?

$ cannoli-submit --help
                              _ _
                             | (_)
   ___ __ _ _ __  _ __   ___ | |_
  / __/ _` | '_ \| '_ \ / _ \| | |
 | (_| (_| | | | | | | | (_) | | |
  \___\__,_|_| |_|_| |_|\___/|_|_|

Usage: cannoli-submit [<spark-args> --] <cannoli-args>

Choose one of the following commands:

CANNOLI
...
                star : Align paired-end reads in a fragment dataset with STAR-Mapper.

@bioxmd

bioxmd commented Jun 14, 2021

Hello,
I think I have a similar problem. I tried to run Bowtie2 using Cannoli, and the header section of the SAM file seems to be incomplete; it does not look like the header from Bowtie2 run locally.

@hxdhan

hxdhan commented Jun 22, 2021

@heuermh
I think the issue is `reader.iterator()`, which does not include the reader's SAM header information.

```scala
private case class SAMIteratorConverter(val reader: SamReader) extends Iterator[Alignment] {

  val iter = reader.iterator()

  // make converter and empty dicts
  val converter = new AlignmentConverter

  def hasNext: Boolean = {
    iter.hasNext
  }

  def next: Alignment = {
    assert(iter.hasNext)
    converter.convert(iter.next)
  }
}
```

@heuermh
Member Author

heuermh commented Jun 22, 2021

> I have tried to run Bowtie2 using Cannoli, and the header section of the SAM file seems to be incomplete and it does not look like a header from Bowtie2 running locally.

The header section of the SAM file written by Cannoli is written by Cannoli, not by Bowtie2, so it will not necessarily look exactly the same.

What is incomplete?
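For context, the SAM specification defines the header as the lines beginning with `@`: `@HD` (format version), `@SQ` (reference sequences), `@RG` (read groups), and `@PG` (program records). A minimal sketch in plain Python (illustrative header text, not actual Cannoli output) of checking which record types a header contains:

```python
# Count SAM header record types (@HD, @SQ, @RG, @PG) in SAM text.
# The header example below is illustrative, not output from Cannoli.

def header_record_counts(sam_text):
    """Return a dict mapping header record type (e.g. '@SQ') to its count."""
    counts = {}
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            break  # header lines always precede alignment records
        record_type = line.split("\t", 1)[0]
        counts[record_type] = counts.get(record_type, 0) + 1
    return counts

header = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:2\tLN:243199373",
    "@RG\tID:NA12878\tSM:NA12878",
])

print(header_record_counts(header))
# {'@HD': 1, '@SQ': 2, '@RG': 1}
```

A "complete" header for downstream tools usually means the `@SQ` and `@RG` records are present, which is exactly what is at issue here.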

@hxdhan

hxdhan commented Jun 26, 2021

> Hello @hxdhan, I created this separate issue to track your comment linked above.
>
> Have you tried https://github.com/bigdatagenomics/cannoli to call STAR instead of using the pipe API directly?

Cannoli's `star` command just calls the ADAM pipe API, so the result is the same.

@heuermh
Member Author

heuermh commented Jun 27, 2021

ADAM, and Cannoli by extension, do write out valid SAM format headers.

Interleaved FASTQ format includes no metadata, so if you need additional metadata in the SAM header produced by the bowtie2 or star pipe API transformations (or any other transformation), you need to specify it yourself.

For example

$ ./bin/cannoli-shell

scala> import org.bdgenomics.adam.ds.ADAMContext._
scala> import org.bdgenomics.cannoli.Cannoli._
scala> import org.bdgenomics.cannoli.Bowtie2Args
scala> import org.bdgenomics.adam.models.{ ReadGroup, ReadGroupDictionary }

scala> val fragments = sc.loadInterleavedFastqAsFragments("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.ifq")
fragments: org.bdgenomics.adam.ds.fragment.FragmentDataset = RDDBoundFragmentDataset
  with 0 reference sequences, 0 read groups, and 0 processing steps

scala> val args = new Bowtie2Args()
args: org.bdgenomics.cannoli.Bowtie2Args = org.bdgenomics.cannoli.Bowtie2Args@9d2ddb1

scala> args.indexPath = "hs37d5.fa"
args.indexPath: String = hs37d5.fa

scala> val readGroup = ReadGroup("NA12878", "NA12878")
readGroup: org.bdgenomics.adam.models.ReadGroup = ReadGroup(NA12878,NA12878,None,None,None,None,None,None,None,None,None)

scala> val readGroups = ReadGroupDictionary(Seq(readGroup))
readGroups: org.bdgenomics.adam.models.ReadGroupDictionary = ReadGroupDictionary(ReadGroup(NA12878,NA12878,None,None,None,None,None,None,None,None,None))

scala> val references = sc.loadSequenceDictionary("hs37d5.dict")
references: org.bdgenomics.adam.models.SequenceDictionary =
SequenceDictionary{1->249250621, ...}

scala> val alignments = fragments.alignWithBowtie2(args)
alignments: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 0 reference sequences, 0 read groups, and 0 processing steps

scala> alignments.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.sam", asSingleFile = true)

scala> val withReadGroups = alignments.replaceReadGroups(readGroups)
withReadGroups: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 0 reference sequences, 1 read groups, and 0 processing steps

scala> withReadGroups.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.sam", asSingleFile = true)

scala> val withReferences = withReadGroups.replaceReferences(references)
withReferences: org.bdgenomics.adam.ds.read.AlignmentDataset = RDDBoundAlignmentDataset
  with 86 reference sequences, 1 read groups, and 0 processing steps

scala> withReferences.saveAsSam("CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.references.sam", asSingleFile = true)

scala> :quit

With fragments from interleaved FASTQ format only

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.sam
@HD	VN:1.6	SO:unsorted
20FUKAAXX100202:2:22:10007:37406	83	21	10149576	1	39M1D62M	=	10149299	-379	ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA	@GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	MD:Z:39^T44G17	XG:i:1	NM:i:2	XM:i:1	XN:i:0	XO:i:1	AS:i:-13	XS:i:-13	YS:i:0	YT:Z:CP

With fragments from interleaved FASTQ and read groups

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.sam
@HD	VN:1.6	SO:unsorted
@RG	ID:NA12878	SM:NA12878
20FUKAAXX100202:2:22:10007:37406	83	21	10149576	1	39M1D62M	=	10149299	-379	ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA	@GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	MD:Z:39^T44G17	XG:i:1	NM:i:2	XM:i:1	XN:i:0	XO:i:1	AS:i:-13	XS:i:-13	YS:i:0	YT:Z:CP

With fragments from interleaved FASTQ, read groups, and references

$ head CEUTrio.HiSeq.WGS.b37.NA12878.20.21.tiny.unaligned.bowtie2.readGroups.references.sam
@HD	VN:1.6	SO:unsorted
@SQ	SN:1	LN:249250621	M5:1B22B98CDEB4A9304CB5D48026A85128	UR:file:/hs37d5.fa
@SQ	SN:2	LN:243199373	M5:A0D9851DA00400DEC1098A9255AC712E	UR:file:/hs37d5.fa
...
@SQ     SN:GL000192.1   LN:547496       M5:325BA9E808F669DFEEE210FDD7B470AC     UR:file:/hs37d5.fa
@SQ     SN:NC_007605    LN:171823       M5:6743BD63B3FF2B5B8985D8933C53290A     UR:file:/hs37d5.fa
@SQ     SN:hs37d5       LN:35477943     M5:5B6A4B3A81A2D3C134B7D14BF6AD39F1     UR:file:/hs37d5.fa
@RG     ID:NA12878      SM:NA12878
20FUKAAXX100202:2:22:10007:37406        83      21      10149576        1       39M1D62M        =       10149299        -379    ATCATCTAAATTACTTTTAATGGAAAGGATTCTGACAAGTCTCTTAAATATGGTTTCAGATAACTTTGGGGATCAAACCATTGAACTAGGAAAATCTTCCA   @GHHGHHHHHIHGHHIGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH   MD:Z:39^T44G17  XG:i:1  NM:i:2  XM:i:1  XN:i:0  XO:i:1  AS:i:-13        XS:i:-13        YS:i:0  YT:Z:CP

Alternatively, you can start from unaligned BAM/CRAM/SAM format, or preferably Fragments in Apache Parquet format (the default and preferred on-disk format for ADAM and Cannoli data), all of which include this metadata.
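To see why the metadata must be supplied separately: a FASTQ record carries only a read name, sequence, and quality string, so there is nothing for ADAM to build `@SQ` or `@RG` header lines from. A minimal sketch in plain Python (illustrative records, not real data) of parsing an interleaved FASTQ fragment:

```python
# Parse interleaved FASTQ: records alternate read 1 / read 2 of each fragment.
# Each record is 4 lines - @name, sequence, '+', qualities - with no reference
# or read-group metadata anywhere. The records below are illustrative.

def parse_interleaved_fastq(text):
    """Yield (name, sequence, qualities) tuples from interleaved FASTQ text."""
    lines = text.splitlines()
    for i in range(0, len(lines), 4):
        name, seq, _, qual = lines[i:i + 4]
        yield (name.lstrip("@"), seq, qual)

ifq = "\n".join([
    "@frag1/1", "ACGT", "+", "IIII",
    "@frag1/2", "TTGA", "+", "IIII",
])

print(list(parse_interleaved_fastq(ifq)))
# [('frag1/1', 'ACGT', 'IIII'), ('frag1/2', 'TTGA', 'IIII')]
```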

Hope this helps!

@hxdhan

hxdhan commented Jun 29, 2021

I tried to use hs37d5.dict for the metadata, but I found it is not the same as the metadata generated by STAR.

hs37d5.zip
star.zip

@heuermh
Member Author

heuermh commented Jul 2, 2021

$ head hs37d5.dict star.dict
==> hs37d5.dict <==
@HD	VN:1.6
@SQ	SN:1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:2	LN:243199373	M5:a0d9851da00400dec1098a9255ac712e	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:3	LN:198022430	M5:fdfd811849cc2fadebc929bb925902e5	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:4	LN:191154276	M5:23dccd106897542ad87d2765d28a19a1	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:5	LN:180915260	M5:0740173db9ffd264d728f32784845cd7	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:6	LN:171115067	M5:1d3a93a248d92a729ee764823acbbc6b	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:7	LN:159138663	M5:618366e953d6aaad97dbe4777c29375e	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:8	LN:146364022	M5:96f514a9929e410c6651697bded59aec	UR:file:/mnt/d/code/dev/hs37d5.fa
@SQ	SN:9	LN:141213431	M5:3e273117f15e0a400f01055d9f393768	UR:file:/mnt/d/code/dev/hs37d5.fa

==> star.dict <==
@HD	VN:1.6
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717

Right, you have to use the same reference that was provided to STAR.

If you need to generate a .dict file, you can use Picard

$ java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict

or dsh-bio

$ dsh-bio create-sequence-dictionary -i ref.fa -o ref.dict
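To check whether two .dict files describe the same reference, one can compare their `@SQ` names and lengths. A minimal sketch in plain Python (illustrative `@SQ` lines, not the actual hs37d5/STAR files) of that comparison:

```python
# Compare the @SQ records of two sequence dictionaries by name and length.
# The dictionary text below is illustrative, abbreviated to two sequences.

def sq_records(dict_text):
    """Map reference name (SN) to length (LN) from .dict / SAM header text."""
    refs = {}
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            refs[fields["SN"]] = int(fields["LN"])
    return refs

b37 = "@HD\tVN:1.6\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373"
grch38 = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529"

print(sq_records(b37) == sq_records(grch38))
# False: the names (1 vs chr1) and lengths differ, so these are
# different references and their alignments are not interchangeable
```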

@heuermh heuermh added this to the 0.37.0 milestone Aug 23, 2021
@heuermh
Member Author

heuermh commented Aug 23, 2021

Closing as invalid; please feel free to reopen if necessary.

@heuermh heuermh closed this as completed Aug 23, 2021