BAM/BED to parquet #2376

darked89 · 2022-08-10T14:27:40Z

Hello,

Would it be possible to provide a minimal example, in Scala, Python, or on the command line, of how to convert a BAM file to ADAM Parquet? The same for a canonical 6-column BED file.

DK

heuermh · 2022-08-10T21:46:10Z

Command line

$ adam-submit transformAlignments sample.bam sample.alignments.adam
$ adam-submit transformFeatures annotation.bed annotation.features.adam

Scala

import org.bdgenomics.adam.ds.ADAMContext._

val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

val features = sc.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Python

from bdgenomics.adam.adamContext import ADAMContext
ac = ADAMContext(sc)

alignments = ac.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

features = ac.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Hope this helps!

darked89 · 2022-08-11T05:59:08Z

Thank you very much for such a quick answer.

A quick follow-up: are the resulting .adam files in a Parquet format readable by, say, Arrow?

heuermh · 2022-08-11T22:30:08Z

Yes, I've never had any issues reading Parquet with Apache Arrow. At some point there was a mis-specification between the JVM Parquet and the C++ Parquet implementations with regard to LZ4 compression; I don't know if that is still a problem. Other compression algorithms should be fine.

I did have some issues with incomplete support for Parquet via DuckDB, details here
https://github.com/heuermh/bdg-formats-duckdb

As of that effort, DuckDB did not support Parquet enums or nested schema, both features that we use in bdg-formats/ADAM.
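A minimal Python sketch of reading such output through Arrow follows. It rests on assumptions not stated in the thread: that the .adam path is a Spark-style directory of part files ending in .parquet, alongside sidecar files such as _SUCCESS that start with "_" or ".", and that pyarrow is installed. The helper names parquet_parts and read_adam_parquet are hypothetical, not part of ADAM or Arrow.

```python
from pathlib import Path


def parquet_parts(adam_dir):
    """List the Parquet part files inside an ADAM output directory.

    Assumption: part files end in ".parquet"; sidecar files whose names
    start with "_" or "." (for example _SUCCESS) are skipped.
    """
    return sorted(
        p
        for p in Path(adam_dir).iterdir()
        if p.is_file()
        and p.suffix == ".parquet"
        and not p.name.startswith(("_", "."))
    )


def read_adam_parquet(adam_dir, columns=None):
    """Read all part files into one pyarrow.Table, optionally selecting columns."""
    # Imported lazily so that parquet_parts works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    tables = [pq.read_table(str(p), columns=columns) for p in parquet_parts(adam_dir)]
    return pa.concat_tables(tables)
```

Calling read_adam_parquet("sample.alignments.adam") would then return a single Arrow table. Passing a short columns list keeps memory use down, since the ADAM schemas are wide.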

darked89 · 2022-08-12T15:31:29Z

Hello,

I can confirm that so far I have had no issues reading Parquet files created by ADAM using Python polars.
The only slightly confusing thing was a test RNA-Seq BAM produced by STAR (2x 150 bp reads), where I somehow got a minimum insert size of -911256.0. Is that a true insert size, or a location offset of the second read in the pair?

As for the .bed to adam/parquet, I noticed that the 6 column bed got transformed into 26 column parquet with obviously empty columns for values not in the input. Not a problem, just a note that the parquets created from BED files contain such extra slots.

Well, this should let me start experimenting with ADAM once I am back from vacation.

Many thanks for your help

Darek Kedra
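
On the negative insert size above: the thread does not answer the question, so the following is only a sketch of how the SAM specification defines the signed TLEN field. It assumes that ADAM's insert size column is populated from TLEN, which is not confirmed here. Per the specification, the absolute value is the reference span from the leftmost mapped base to the rightmost mapped base of the template, the leftmost segment gets a plus sign, and the rightmost a minus sign. For spliced RNA-Seq alignments that span includes any introns, so large magnitudes are possible. The function name template_length and the coordinates are hypothetical, and aligners may compute TLEN differently.

```python
def template_length(start1, end1, start2, end2):
    """Signed TLEN for both mates of a pair mapped to the same reference.

    Coordinates are 1-based inclusive alignment start/end positions.
    Returns (tlen_for_mate1, tlen_for_mate2).
    """
    # Span from the leftmost mapped base to the rightmost mapped base.
    span = max(end1, end2) - min(start1, start2) + 1
    if start1 <= start2:
        # Mate 1 is leftmost (a tie is broken arbitrarily in favor of mate 1).
        return span, -span
    return -span, span


# Hypothetical pair whose mates lie roughly 911 kb apart on the reference.
print(template_length(1000, 1149, 912106, 912255))  # (911256, -911256)
```

Under this definition, a minimum of -911256 would belong to the rightmost mate of a pair spanning 911,256 reference bases, not to a 911 kb sequenced fragment.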

heuermh · 2022-08-12T18:31:36Z

> As for the .bed to adam/parquet, I noticed that the 6 column bed got transformed into 26 column parquet with obviously empty columns for values not in the input. Not a problem, just a note that the parquets created from BED files contain such extra slots.

We use a rather rich schema for each of the various genomic data types, defined in Avro at
https://github.com/bigdatagenomics/bdg-formats

The Feature schema was designed to support all of GFF2/GTF, GFF3, BED, Genbank, NarrowPeak, and IntervalList formats. A chart with attribute mappings can be found at
https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md
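
To illustrate why a 6-column BED file ends up with mostly empty columns, here is a rough Python sketch that places one BED6 line into a wider record. The target field names (referenceName, start, end, name, score, strand) are my recollection of the Feature schema and should be checked against the chart linked above. The function bed6_to_record and the extra fields in the usage example are hypothetical, and strand is kept as the raw BED symbol rather than converted to the schema's enum.

```python
# Assumed Feature field for each BED6 column, in BED column order.
BED6_COLUMNS = ("referenceName", "start", "end", "name", "score", "strand")


def bed6_to_record(line, schema_fields):
    """Place one tab-separated BED6 line into a record with all schema fields.

    Fields that BED6 does not provide stay None, which is what shows up
    as empty columns in the resulting Parquet file.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BED6_COLUMNS):
        raise ValueError(f"expected 6 columns, got {len(values)}")

    record = dict.fromkeys(schema_fields)  # every field starts as None
    record.update(zip(BED6_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # BED allows "." as a placeholder for a missing score.
    record["score"] = None if record["score"] == "." else float(record["score"])
    return record


# Usage with two illustrative extra fields standing in for the wider schema.
fields = BED6_COLUMNS + ("geneId", "transcriptId")
print(bed6_to_record("chr1\t100\t200\tfeat1\t0\t+", fields))
```

With the full schema in place of the short fields tuple, the same six values would be populated and the remaining fields would stay null.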

darked89 closed this as completed Aug 12, 2022

heuermh added this to the 1.1 milestone Aug 12, 2022

heuermh modified the milestones: 1.1, 1.0.1 Jan 4, 2024
