Use of dataset api in ADAM #1018
Comments
Can you check how much time is spent in GC in the old vs. new version?
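One way to measure this programmatically (a minimal sketch; the listener class and its name are illustrative, not part of ADAM) is to sum `jvmGCTime` from Spark's task metrics with a `SparkListener` and compare the totals between the two runs:

```scala
import org.apache.spark.scheduler.{ SparkListener, SparkListenerTaskEnd }

// Illustrative listener: accumulates JVM GC time (ms) across all finished tasks.
class GcTimeListener extends SparkListener {
  @volatile var totalGcMillis: Long = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      totalGcMillis += metrics.jvmGCTime
    }
  }
}

// Usage: register before running each version of the job, then compare.
// val gc = new GcTimeListener
// sc.addSparkListener(gc)
// ... run the job ...
// println(s"Total GC time: ${gc.totalGcMillis} ms")
```

The per-task "GC Time" column on the stage detail page of the Spark UI shows the same metric without any code.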
Thanks for looking at this, @rxin. Other than the duration metric itself, I don't see any metric that captures the duration difference. It doesn't appear to be stragglers; the slowdown is evenly distributed, as from the min to the median to the max, tasks take roughly twice as long to complete in the slow version.
@jpdna, is this work stalled at the moment? Is there anything an eager person could pick up?
@hubertp, thanks for your interest in contributing! The cause of the performance oddity between the two different ways of arranging the data prior to running groupBy, as described above (with links to Google Docs and GitHub), is still not understood. It would be interesting if you want to try to replicate it in a simple, general example.

Otherwise, on the general Dataset API front, I plan to return to that shortly, though you are certainly welcome to look at what next steps make sense to you and open a PR. Is there a specific area you are interested in? One idea for where you could help: similar to https://github.com/jpdna/adam/blob/dataset_api_markdups_v4/adam-core/src/main/scala/org/bdgenomics/adam/dataset/AlignmentRecordLimitProjDS.scala, we need to make a Dataset API case class for the VCF-based variant data like https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation, and then explore using that for variant-related operations; see the sketch below. If you want to take a look, a PR on that subject might help push the effort along.

We haven't yet merged my Dataset API work into the main ADAM repo due to the need to bump to Scala 2.11 to allow larger case classes. I plan to rebase the work I was doing at https://github.com/jpdna/adam/tree/dataset_api_markdups_v4 onto the current ADAM repo today, so that will be a better place for you to start from if you want to work on Dataset API features in ADAM. I'll update this issue with a link to that when it's ready.
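For example, such a class might look roughly like this (a sketch only; the class name, the getter names, and the field subset are assumptions based on the bdg-formats Variant schema, and a real version would need many more fields):

```scala
// Hypothetical Dataset-API projection of the bdg-formats Variant record,
// analogous to AlignmentRecordLimitProjDS for reads.
case class VariantLimitProjDS(
    contigName: Option[String],
    start: Option[Long],
    end: Option[Long],
    referenceAllele: Option[String],
    alternateAllele: Option[String])

object VariantLimitProjDS {
  // Convert from the Avro-generated model (getter names are assumptions).
  def fromAvro(v: org.bdgenomics.formats.avro.Variant): VariantLimitProjDS =
    VariantLimitProjDS(
      Option(v.getContigName).map(_.toString),
      Option(v.getStart).map(_.longValue),
      Option(v.getEnd).map(_.longValue),
      Option(v.getReferenceAllele).map(_.toString),
      Option(v.getAlternateAllele).map(_.toString))
}
```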
Hope this merges in the next release.
Lo and behold: I have a candidate fix for this. It needs cleaning up, but I will open a PR tomorrow.
Resolves #1018. Adds the `adam-codegen` module, which generates classes that:

1. Implement the Scala Product interface and thus can be read into a Spark SQL Dataset.
2. Have a complete constructor that is compatible with the constructor Spark SQL expects to see when exporting a Dataset back to Scala.
3. Have methods for converting to/from the bdg-formats Avro models.

We then build these model classes in the `org.bdgenomics.adam.sql` package and use them for export from the Avro-based GenomicRDDs. With a Dataset, we can then export to a DataFrame, which enables us to expose data through Python via RDD->Dataset->DataFrame. This is important since the Avro classes generated by bdg-formats can't be pickled, so we can't do a Java RDD to Python RDD crossing with them.
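For illustration, the general shape of a generated class might look roughly like the following (a hand-written sketch, not actual adam-codegen output; only a few AlignmentRecord fields are shown and the converters are simplified):

```scala
// Sketch of the shape adam-codegen produces: a Product-compatible case class
// with a complete constructor, plus to/from-Avro converters.
case class AlignmentRecord(
    readName: Option[String],
    sequence: Option[String],
    start: Option[Long]) { // remaining fields elided

  // Rebuild the Avro model for interop with the existing RDD code paths.
  def toAvro: org.bdgenomics.formats.avro.AlignmentRecord = {
    val builder = org.bdgenomics.formats.avro.AlignmentRecord.newBuilder()
    readName.foreach(n => builder.setReadName(n))
    sequence.foreach(s => builder.setSequence(s))
    start.foreach(p => builder.setStart(p))
    builder.build()
  }
}

object AlignmentRecord {
  // Wrap the Avro model's nullable fields in Options for the Dataset side.
  def fromAvro(rec: org.bdgenomics.formats.avro.AlignmentRecord): AlignmentRecord =
    AlignmentRecord(
      Option(rec.getReadName).map(_.toString),
      Option(rec.getSequence).map(_.toString),
      Option(rec.getStart).map(_.longValue))
}
```

With a class of this shape, export becomes a matter of mapping `fromAvro` over the underlying RDD and calling `toDS()`.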
Here is a prototype using the Spark Dataset API for the ADAM markduplicates operation:
here
The Dataset API version of ADAM markduplicates above is now at least 10% faster than the current RDD implementation.
Prior to the modification described below, using the Dataset API was 30-50% slower than the current RDD version.
Below I describe the modification to our use of the Dataset API that produced this improvement:
A seemingly minor modification makes a large (2x) difference in total processing time: returning from a mapGroups function a tuple containing one Seq, as opposed to the same data split into three different Seqs (either individually or nested within a case class).
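As a minimal, self-contained sketch of the two shapes (the `ReadEnds` case class, its fields, the grouping key, and the score thresholds are all made up for illustration; this is not the actual ADAM markduplicates code):

```scala
import org.apache.spark.sql.SparkSession

// Stand-in for the real per-read case class used by markduplicates.
case class ReadEnds(library: String, position: Long, score: Int)

object MapGroupsShapes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mapGroups-shapes").getOrCreate()
    import spark.implicits._

    val reads = Seq(
      ReadEnds("libA", 100L, 35),
      ReadEnds("libA", 100L, 25),
      ReadEnds("libB", 200L, 40)).toDS()

    // Slow shape: each group's records split into three Seqs in the returned tuple.
    val threeSeqs = reads
      .groupByKey(r => (r.library, r.position))
      .mapGroups { (key, it) =>
        val rs = it.toSeq
        (rs.filter(_.score > 30), rs.filter(_.score == 30), rs.filter(_.score < 30))
      }

    // Fast shape: one Seq per group; any splitting happens downstream.
    val oneSeq = reads
      .groupByKey(r => (r.library, r.position))
      .mapGroups((key, it) => it.toSeq)

    threeSeqs.explain()
    oneSeq.explain()
  }
}
```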
This magnitude of performance gain from passing the same amount of data as one Seq instead of three is surprising, given that the total amount of data being returned, and then shuffled in the subsequent groupBy, would seem to be the same; the only difference is the packaging as a single Seq versus three.
While I'd expect some additional overhead from multiple Seqs versus one, I am surprised by the size of the difference, and I am curious whether there is a best practice suggesting that Seqs of case classes in a Dataset should be avoided because of this behavior.
The Spark UI output and .explain() for the fast version can be found here,
and for the slow version here.
Stage 2 goes from 59 seconds to 2.2 minutes.