Use of dataset api in ADAM #1018

Closed
jpdna opened this issue Apr 26, 2016 · 6 comments
jpdna commented Apr 26, 2016

Here is a prototype using the Spark Dataset API for the ADAM markduplicates operation: here

The Dataset API version of ADAM markduplicates above is now at least 10% faster than the current RDD implementation.

Prior to the modification described below, using the Dataset API was 30-50% slower than the current RDD version.

Below I describe the modification to our use of the Dataset API that produced this improvement:

A seemingly minor modification, returning from a mapGroups function a tuple containing a single Seq rather than the same data split into three separate Seqs (either individually or nested within a case class), makes a large (2x) difference in total processing time.

This magnitude of performance gain from passing the same amount of data as one Seq instead of three seems surprising, given that the total amount of data being returned, and then shuffled in the subsequent groupBy, would appear to be the same; the only difference is whether it is packaged as a single Seq or split into three.

While I'd expect some additional overhead from multiple Seqs versus one, I am surprised by the size of the difference, and I am curious whether there is a best practice that would suggest avoiding Seqs of case classes in a Dataset because of this behavior.
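
For concreteness, here is a minimal, hypothetical Scala sketch of the two result shapes being compared. The `Read` fields, bucket names, and filter predicates are illustrative assumptions rather than ADAM's actual markduplicates code, and the sketch assumes the Spark 2.x Dataset API:

```scala
import org.apache.spark.sql.Dataset

// Illustrative record type; not the real ADAM AlignmentRecord.
case class Read(name: String, group: String, kind: Int)

// Slow shape: the group's reads split across three Seqs in a case class.
case class SplitBucket(primary: Seq[Read], secondary: Seq[Read], unmapped: Seq[Read])

// Fast shape: the same reads carried in a single Seq.
case class MergedBucket(reads: Seq[Read])

def slowShape(ds: Dataset[Read]): Dataset[(String, SplitBucket)] = {
  import ds.sparkSession.implicits._
  ds.groupByKey(_.group).mapGroups { (key, it) =>
    val rs = it.toSeq
    (key, SplitBucket(
      rs.filter(_.kind == 0),
      rs.filter(_.kind == 1),
      rs.filter(_.kind == 2)))
  }
}

def fastShape(ds: Dataset[Read]): Dataset[(String, MergedBucket)] = {
  import ds.sparkSession.implicits._
  ds.groupByKey(_.group).mapGroups((key, it) => (key, MergedBucket(it.toSeq)))
}
```

In both variants the same reads come back from mapGroups; only the packaging of the returned value differs.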

The Spark UI output and .explain() for the fast version can be found here, and for the slow version here. Stage 2 goes from 59 sec to 2.2 minutes.
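
For anyone reproducing the comparison, the plans were captured along these lines, assuming the hypothetical `fastShape`/`slowShape` pipelines sketched above and an input Dataset named `reads`:

```scala
// Print parsed, analyzed, optimized, and physical plans for each variant.
fastShape(reads).explain(true)
slowShape(reads).explain(true)
```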

rxin commented Apr 26, 2016

Can you check how much time is spent in GC in the old vs new version?

jpdna commented Apr 26, 2016

> Can you check how much time is spent in GC in the old vs new version?

Thanks for looking at this @rxin
There doesn't seem to be much difference in GC time.
I have now pasted the Stage details screenshots, including the summary metrics, in the last two pages of the Google Docs for the slow and fast versions.

Other than the duration metric itself, I don't see any metric that captures the difference. It doesn't appear to be stragglers; rather, the slowdown is evenly distributed: from the min to the median to the max, tasks take around twice as long to complete in the slow version.

hubertp commented Jun 12, 2016

@jpdna is this work stalled atm? Anything an eager person could pick up?

jpdna commented Jun 13, 2016

@hubertp - thanks for your interest in contributing! The cause of the performance oddity between the two different ways of arranging the data prior to running groupBy, described above with links to the Google Docs and GitHub, is still not understood. It would be interesting if you wanted to try to replicate it in a simple, general example. Otherwise, on the general Dataset API front, I plan to return to that shortly, though you are certainly welcome to look at what next steps make sense to you and make a PR. Is there a specific area you are interested in being involved with?

One idea for where you could help: similar to https://github.com/jpdna/adam/blob/dataset_api_markdups_v4/adam-core/src/main/scala/org/bdgenomics/adam/dataset/AlignmentRecordLimitProjDS.scala, we need to make a Dataset API case class for the VCF-based variant data, like https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation, and then explore using it for variant-related operations. If you want to take a look at that, a PR on that subject might help push efforts along; a rough sketch of the idea follows below. We haven't yet merged my own Dataset API work into the main ADAM repo, due to the need to bump to Scala 2.11 to allow larger case classes. I plan to rebase the work I was doing at https://github.com/jpdna/adam/tree/dataset_api_markdups_v4 onto the current ADAM repo today, so that will be a better place to start from if you want to work on Dataset API stuff in ADAM. I'll update this issue with a link when that is ready.
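
As a starting point, a limited-projection case class for variants might look roughly like the following. This is a hypothetical sketch by analogy with AlignmentRecordLimitProjDS; the field subset and the loader function name `loadVariantsDS` are illustrative, not an exact match for the bdg-formats Variant schema or ADAM's API:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical limited projection of variant fields; not the actual schema.
case class VariantLimitProjDS(
    contigName: Option[String],
    start: Option[Long],
    end: Option[Long],
    referenceAllele: Option[String],
    alternateAllele: Option[String])

def loadVariantsDS(spark: SparkSession, path: String): Dataset[VariantLimitProjDS] = {
  import spark.implicits._
  // Read the Parquet-backed variant data, keeping only the projected columns.
  spark.read.parquet(path)
    .select("contigName", "start", "end", "referenceAllele", "alternateAllele")
    .as[VariantLimitProjDS]
}
```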

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 6, 2016
@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016
Fei-Guang commented
Hoping this gets merged in the next release.

fnothaft commented
Lo and behold: I have a candidate fix for this. I need to clean it up, but will PR tomorrow.

heuermh pushed a commit that referenced this issue Jul 11, 2017
Resolves #1018. Adds the `adam-codegen` module, which generates classes that:

1. Implement the Scala Product interface and thus can be read into a Spark SQL Dataset.
2. Have a complete constructor that is compatible with the constructor that Spark SQL expects to see when exporting a Dataset back to Scala.
3. Have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the `org.bdgenomics.adam.sql` package and use them for export from the Avro-based GenomicRDDs. With a Dataset, we can then export to a DataFrame, which enables us to expose data through Python via RDD->Dataset->DataFrame. This is important since the Avro classes generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to Python RDD crossing with them.
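
To make the commit description concrete, the generated classes follow roughly this pattern. The snippet below is a minimal hand-written sketch covering only a few fields; it is not actual adam-codegen output, and the `toAvro`/`fromAvro` method names and field subset are illustrative assumptions:

```scala
import org.bdgenomics.formats.avro.{AlignmentRecord => AvroAlignmentRecord}

// A case class implements Product, so Spark SQL can encode it into a
// Dataset, and its single complete constructor matches what Spark SQL
// invokes when exporting a Dataset back to Scala objects.
case class AlignmentRecord(
    readName: Option[String],
    start: Option[Long],
    sequence: Option[String]) {

  // Convert back to the bdg-formats Avro model.
  def toAvro: AvroAlignmentRecord = {
    val builder = AvroAlignmentRecord.newBuilder()
    readName.foreach(builder.setReadName)
    start.foreach(s => builder.setStart(java.lang.Long.valueOf(s)))
    sequence.foreach(builder.setSequence)
    builder.build()
  }
}

object AlignmentRecord {
  // Build the Spark SQL friendly model from the Avro record.
  def fromAvro(rec: AvroAlignmentRecord): AlignmentRecord =
    AlignmentRecord(
      Option(rec.getReadName).map(_.toString),
      Option(rec.getStart).map(_.longValue),
      Option(rec.getSequence).map(_.toString))
}
```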