
Conversation

@NarineK (Contributor) commented Jul 7, 2016

What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
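For context, a minimal sketch of the kind of `gapply` example the guide adds (variable and column names here are illustrative, not taken verbatim from the PR):

```r
library(SparkR)
sparkR.session()

# Build a SparkDataFrame from the faithful dataset.
df <- createDataFrame(faithful)

# The schema declares the row format of the result and must match the
# types the UDF returns: here, two double columns.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

# For each waiting time, compute the largest eruption duration.
result <- gapply(df, "waiting", function(key, x) {
  data.frame(key, max(x$eruptions))
}, schema)

head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```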

How was this patch tested?

Existing test cases in R

@SparkQA commented Jul 7, 2016

Test build #61911 has finished for PR 14090 at commit 7781d1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Jul 7, 2016

cc @felixcheung @mengxr

docs/sparkr.md Outdated
Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: the grouping key and an R `data.frame` corresponding to
that key. The groups are chosen from the `SparkDataFrame`'s column(s).
The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
`SparkDataFrame`. It must match the R function's output.
@felixcheung (Member) commented Jul 7, 2016

It was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on or refer to what "match" means exactly. The type mapping between Spark and R is a bit fuzzy, and it would be good to explain that a bit more.

Member commented:

I suppose this could be explained in `dapply` above as well.

@NarineK (Contributor, Author) commented:

Thanks @felixcheung. Does this sound better?
"It must reflect the R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by the user." I could also bring up some examples.

Member commented:

I think gapply and dapply are the first important use cases where we require a strict mapping from Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further.

To be more concrete, what should the column type of the UDF's output R data.frame be if the SparkDataFrame has a column of double? It would be good to have a table on that.

That could be a separate PR, though.
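To make the question concrete, a hypothetical sketch (reusing `df` from the sketch above): a schema column declared as "double" needs the UDF to produce an R numeric in that position.

```r
# Hypothetical sketch: if the schema declares "double", the UDF must return
# an R numeric (serialized as double) in that column.
schema <- structType(structField("avg_eruption", "double"))
result <- gapply(df, "waiting", function(key, x) {
  data.frame(mean(x$eruptions))  # mean() returns an R numeric -> Spark double
}, schema)
```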

@NarineK (Contributor, Author) commented:

I see. I think we can describe the following type mapping in the programming guide.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
Those are the types used in the StructType's fields.

Contributor commented:

Yeah, but instead of a pointer to the code, it would be great if we could have a table in the documentation.

@NarineK (Contributor, Author) commented Jul 11, 2016

Thanks @shivaram.
Does the following mapping look fine to have in the table?


R            Spark
-----------  ---------
byte         byte
integer      integer
float        float
double       double
numeric      double
character    string
string       string
binary       binary
raw          binary
logical      boolean
timestamp    timestamp
date         date
array        array
map          map
struct       struct
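As a rough illustration of how a few of these mappings surface in a UDF schema (a sketch with made-up column names, `df` as in the earlier example):

```r
# Each R column type produced by the UDF serializes to the declared Spark type.
schema <- structType(structField("n",    "integer"),  # R integer   -> integer
                     structField("x",    "double"),   # R numeric   -> double
                     structField("name", "string"),   # R character -> string
                     structField("flag", "boolean"))  # R logical   -> boolean
df1 <- dapply(df, function(x) {
  data.frame(n = 1L, x = 1.5, name = "a", flag = TRUE, stringsAsFactors = FALSE)
}, schema)
```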

Contributor commented:

This looks good to me!

@NarineK (Contributor, Author) commented:

Thanks, I was looking at the types.R file and noticed that we have NAs for array, map, and struct.
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42
But I guess in our case we can map array, map, and struct to array, map, and struct correspondingly?

Contributor commented:

I think those mappings are only used to print things in `str`. A better list to consult would be the one at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, a `list` in R should become an array in Spark SQL, and an `env` in R should map to a map.
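For reference, a quick check of the R classes that the serialization code dispatches on (a sketch; output as in base R):

```r
l <- list(1, 2, 3)
class(l)        # "list"        -> array in Spark SQL
e <- new.env()
class(e)        # "environment" -> map in Spark SQL
```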

@felixcheung (Member) commented:

LGTM except for the comment on "schema matching".
Also, I wonder if we should rephrase "can only be used if the output of UDF run on all the partitions can fit in driver memory" - it seems neither as strong a warning nor as correct as "can fail if the output of UDF run on all the partitions cannot be pulled to the driver and fit in driver memory" (same in dapplyCollect).
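For context, a minimal sketch of the collect variant under discussion (assuming `df` from the earlier example; `gapplyCollect` takes no schema, and the result is a local R data.frame whose column names come from the UDF's output):

```r
# gapplyCollect runs the UDF per group and pulls every group's output to the
# driver, so it can fail when the combined result does not fit in driver memory.
result <- gapplyCollect(df, "waiting", function(key, x) {
  y <- data.frame(key, max(x$eruptions))
  colnames(y) <- c("waiting", "max_eruption")
  y
})
head(result)
```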

@NarineK (Contributor, Author) commented Jul 12, 2016

Added data type description

@SparkQA commented Jul 12, 2016

Test build #62145 has finished for PR 14090 at commit c1d7151.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62147 has finished for PR 14090 at commit 2af7243.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented:

@felixcheung Could you take one more look at this?

docs/sparkr.md Outdated
Apply a function to each partition of a `SparkDataFrame`. The function is applied to each partition of the `SparkDataFrame`
and should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`. It must match the [data types of the R function's output fields](#data-type-mapping-between-r-and-spark).
Member commented:

output fields --> return values or return value?
http://adv-r.had.co.nz/Functions.html#return-values
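A minimal `dapply` sketch matching the quoted passage above (again using `faithful`; the derived column name is illustrative):

```r
df <- createDataFrame(faithful)

# Declare the output row format: the input columns plus a derived column.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))

# Convert waiting time from minutes to seconds within each partition.
df1 <- dapply(df, function(x) {
  cbind(x, x$waiting * 60)
}, schema)
head(collect(df1))
```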

@SparkQA commented Jul 14, 2016

Test build #62300 has finished for PR 14090 at commit 5d34943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62299 has finished for PR 14090 at commit 8a2aff3.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
<td>map</td>
</tr>
<tr>
<td>struct</td>
Contributor commented:

I don't think R has any notion of a struct or map data type. Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html, I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a timestamp class in R. We should probably replace that with POSIXct or POSIXlt?

Member commented:

I don't think date is a type either.

@NarineK (Contributor, Author) commented Jul 15, 2016

@felixcheung, I think according to the following mapping we expect 'date':
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
And it seems that there is a 'Date' class in base R:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html
Do I understand correctly?
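(A quick check in base R, for reference:)

```r
class(Sys.Date())   # "Date"             -> Spark date
class(Sys.time())   # "POSIXct" "POSIXt" -> Spark timestamp
```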

@NarineK (Contributor, Author) commented Jul 15, 2016

@shivaram, I've looked at the following list:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92
It is called when creating the schema's fields, and it has map, struct, timestamp, etc.:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L131

Isn't the 'dataType' in 'createStructField' the one passed from R?

Contributor commented:

That's a good point - so users can create a schema with struct, and that maps to a corresponding SQL type. But they can't create any R objects that will be parsed as a struct. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can be used to, say, read JSON files or JDBC tables etc.

For the use case here, where users are returning a data.frame from a UDF, I don't think there is any valid mapping for struct from R.

Contributor commented:

And as you mentioned above, we can also change date to Date to be more specific. (Now that I think about it, it would be ideal to link these R types to the CRAN help pages. For example, we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for Date and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for POSIXct / POSIXlt.)

@NarineK (Contributor, Author) commented Jul 15, 2016

Sounds good. For the mappings POSIXct / POSIXlt to timestamp and Date to date, do we need to update the getSQLDataType method?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91

Contributor commented:

Not really - as I mentioned, getSQLDataType looks at the schema. The method that looks at the R objects is in serialize.R:

POSIXlt = writeTime(con, object),

Member commented:

Yes, it should be Date, not date.

Member commented:

And environment instead of env?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html

> e <- new.env()
> class(e)
[1] "environment"

@SparkQA commented Jul 15, 2016

Test build #62369 has finished for PR 14090 at commit 19e849f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

should be a `data.frame`. However, the schema is not required to be passed. Note that `dapplyCollect` can fail if the output of the UDF run on all the partitions cannot be pulled to the driver and fit in driver memory.
<div data-lang="r" markdown="1">
Contributor commented:

I think we need a new line before the `<div>`? Right now the div markings show up in the generated doc. I've attached a screenshot.

[screenshot attached]

@shivaram (Contributor) commented:

Thanks @NarineK for the updates. As a final thing, I had some formatting problems when I tested out this change locally. Let me know if you can't reproduce them. I just ran:

cd docs
SKIP_API=1 jekyll build
open _site/sparkr.html

@NarineK (Contributor, Author) commented Jul 15, 2016

Thanks @shivaram, @felixcheung for the comments. I'll address those today.

@NarineK (Contributor, Author) commented Jul 16, 2016

Thanks, I've generated the docs the way you suggested, @shivaram, but I'm not sure I see the same thing as you.
I still see some '{% highlight r %}' markers and some formatting issues in general, e.g.:
{% highlight r %} sparkR.session() {% endhighlight %}
I also followed this documentation:
https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html
Please let me know if you still see the issues after my latest commit.

@SparkQA commented Jul 16, 2016

Test build #62411 has finished for PR 14090 at commit f584416.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented:

Thanks @NarineK - I tried it on a fresh Ubuntu VM and the rendered docs looked fine there. I think it has something to do with the ruby / jekyll versions.

LGTM. @felixcheung, could you also take one final look?

@felixcheung (Member) commented:

LGTM. Thanks for putting this together!

@shivaram (Contributor) commented:

Merging this to master, branch-2.0

asfgit pushed a commit that referenced this pull request Jul 16, 2016
## What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.

## How was this patch tested?
Existing test cases in R

Author: Narine Kokhlikyan <narine@slice.com>

Closes #14090 from NarineK/gapplyProgGuide.

(cherry picked from commit 4167304)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 4167304 Jul 16, 2016
