BAM/BED to parquet #2376

Closed
darked89 opened this issue Aug 10, 2022 · 5 comments

@darked89

Hello,

Would it be possible to provide a minimal example, in Scala, Python, or the CLI, of how to convert, say, a BAM file to ADAM Parquet? The same for a canonical 6-column BED file.

DK

@heuermh
Member

heuermh commented Aug 10, 2022

Command line

$ adam-submit transformAlignments sample.bam sample.alignments.adam
$ adam-submit transformFeatures annotation.bed annotation.features.adam

Scala

import org.bdgenomics.adam.ds.ADAMContext._

val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

val features = sc.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Python

from bdgenomics.adam.adamContext import ADAMContext
ac = ADAMContext(sc)

alignments = ac.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

features = ac.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Hope this helps!

@darked89
Author

Thank you very much for such a quick answer.

A bit of a follow-up:
are the resulting .adam files in a Parquet format readable by, say, Arrow?

@heuermh
Member

heuermh commented Aug 11, 2022

Yes, I've never had any issues with Parquet in Apache Arrow. At some point there was a mis-specification between the JVM Parquet and the C++ Parquet implementations with regard to LZ4 compression; I don't know if that is still a problem. Other compression algorithms should be fine.

I did have some issues with incomplete support for Parquet via DuckDB; details here:
https://github.com/heuermh/bdg-formats-duckdb

As of that effort, DuckDB did not support Parquet enums or nested schemas, both features that we use in bdg-formats/ADAM.

@darked89
Author

Hello,

I can confirm that so far I have had no issues reading Parquet files created by ADAM using Python polars.
The only slightly confusing thing was a test RNA-Seq BAM produced by STAR (2x 150 bp reads), where I somehow got a minimum insert size of -911256.0. Is that a true insert size, or a location offset of the second read in the pair?
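
One plausible reading, assuming the insert size column simply carries over the SAM TLEN field (an assumption, not confirmed in this thread): in the SAM specification TLEN is the signed template length, positive for the leftmost mate and negative for the rightmost, and it covers the whole reference span, introns included. The sketch below recomputes it from mate positions and CIGAR strings; the function names and the example coordinates are made up for illustration.

```python
import re

# CIGAR operations that consume reference bases according to the SAM spec.
_REF_OPS = set("MDN=X")


def reference_span(cigar):
    """Reference bases covered by one alignment, skipped regions (N) included."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in ops if op in _REF_OPS)


def template_lengths(start1, cigar1, start2, cigar2):
    """Signed TLEN per mate: positive for the leftmost mate, negative for the other."""
    end1 = start1 + reference_span(cigar1)  # exclusive end
    end2 = start2 + reference_span(cigar2)
    span = max(end1, end2) - min(start1, start2)
    if start1 <= start2:
        return span, -span
    return -span, span


# Two unspliced 150 bp mates placed roughly 911 kb apart on the reference.
tlen1, tlen2 = template_lengths(1000, "150M", 912106, "150M")
```

Under that reading, a minimum of -911256 would come from the rightmost mate of a pair whose alignments sit about 911 kb apart on the reference (for example across a very long intron or a questionable pairing), not from a 911 kb fragment.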

As for the .bed to ADAM/Parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet file, with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.

Well, this should let me start experimenting with ADAM after getting back from vacation.

Many thanks for your help

Darek Kedra

@heuermh
Member

heuermh commented Aug 12, 2022

> As for the .bed to ADAM/Parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet file, with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.

We use rather rich schemas for all the various genomic data types, defined in Avro at
https://github.com/bigdatagenomics/bdg-formats

The Feature schema was designed to support all of GFF2/GTF, GFF3, BED, Genbank, NarrowPeak, and IntervalList formats. A chart with attribute mappings can be found at
https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md
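
To illustrate why a 6-column BED leaves most of the columns empty, here is a rough Python sketch of which Feature fields a BED6 record can fill. The field names (`referenceName`, `start`, `end`, `name`, `score`, `strand`) and the strand values are my assumption about the Feature schema, and `bed6_to_feature` is a hypothetical helper, not ADAM code; the chart linked above is the authoritative mapping.

```python
# Assumed mapping of BED strand symbols to the schema's strand enum values.
STRANDS = {"+": "FORWARD", "-": "REVERSE", ".": "INDEPENDENT"}


def bed6_to_feature(line):
    """Map one tab-separated BED6 line to the Feature fields it can populate."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return {
        "referenceName": chrom,
        "start": int(start),  # coordinates copied as-is in this sketch
        "end": int(end),
        "name": name,
        "score": None if score == "." else float(score),
        "strand": STRANDS.get(strand, "UNKNOWN"),
    }


record = bed6_to_feature("chr1\t1000\t5000\tgeneA\t960\t+")
```

Only these six fields get values; every other field of the schema has nothing to draw from in a BED6 line, which would account for the empty columns in the Parquet output.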

@heuermh heuermh added this to the 1.1 milestone Aug 12, 2022
@heuermh heuermh modified the milestones: 1.1, 1.0.1 Jan 4, 2024