
Conversation

@NarineK (Contributor) commented Jul 7, 2016

What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
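For context, a minimal sketch of the kind of `gapply` example the guide adds (variable and column names here are illustrative, not taken verbatim from the PR):

```r
library(SparkR)
sparkR.session()

# Build a SparkDataFrame from the faithful dataset.
df <- createDataFrame(faithful)

# The schema declares the row format of the result and must match the
# types the UDF returns: here, two double columns.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

# For each waiting time, compute the largest eruption duration.
result <- gapply(df, "waiting", function(key, x) {
  data.frame(key, max(x$eruptions))
}, schema)

head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```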

How was this patch tested?

Existing test cases in R

@SparkQA commented Jul 7, 2016

Test build #61911 has finished for PR 14090 at commit 7781d1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Jul 7, 2016

cc @felixcheung @mengxr

docs/sparkr.md Outdated
Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: the grouping key and an R `data.frame` corresponding to
that key. The groups are chosen from the `SparkDataFrame`'s column(s).
The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
`SparkDataFrame`. It must match the R function's output.
@felixcheung (Member) commented Jul 7, 2016

It was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on or refer to what "match" means exactly. The type mapping between Spark and R is a bit fuzzy, and it would be good to explain that a bit more.

Member commented:

I suppose this could be explained in `dapply` above as well.

@NarineK (Contributor, Author) commented:

Thanks @felixcheung. Does this sound better?
"It must reflect the R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by the user." I could also bring up some examples.

Member commented:

I think gapply and dapply are the first important use cases where we require a strict mapping from Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further.

To be more concrete, what should the column type of the UDF's output R data.frame be if the SparkDataFrame has a column of double? It would be good to have a table on that.

That could be a separate PR, though.
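To make the question concrete, a hypothetical sketch (reusing `df` from the sketch above): a schema column declared as "double" needs the UDF to produce an R numeric in that position.

```r
# Hypothetical sketch: if the schema declares "double", the UDF must return
# an R numeric (serialized as double) in that column.
schema <- structType(structField("avg_eruption", "double"))
result <- gapply(df, "waiting", function(key, x) {
  data.frame(mean(x$eruptions))  # mean() returns an R numeric -> Spark double
}, schema)
```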

@NarineK (Contributor, Author) commented:

I see. I think we can describe the following type mapping in the programming guide.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
Those are the types used in the StructType's fields.

Contributor commented:

Yeah, but instead of a pointer to the code, it would be great if we could have a table in the documentation.

@NarineK (Contributor, Author) commented Jul 11, 2016

Thanks @shivaram.
Does the following mapping look fine to have in the table?


R            Spark
-----------  ---------
byte         byte
integer      integer
float        float
double       double
numeric      double
character    string
string       string
binary       binary
raw          binary
logical      boolean
timestamp    timestamp
date         date
array        array
map          map
struct       struct
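As a rough illustration of how a few of these mappings surface in a UDF schema (a sketch with made-up column names, `df` as in the earlier example):

```r
# Each R column type produced by the UDF serializes to the declared Spark type.
schema <- structType(structField("n",    "integer"),  # R integer   -> integer
                     structField("x",    "double"),   # R numeric   -> double
                     structField("name", "string"),   # R character -> string
                     structField("flag", "boolean"))  # R logical   -> boolean
df1 <- dapply(df, function(x) {
  data.frame(n = 1L, x = 1.5, name = "a", flag = TRUE, stringsAsFactors = FALSE)
}, schema)
```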

Contributor commented:

This looks good to me!

@NarineK (Contributor, Author) commented:

Thanks, I was looking at the types.R file and noticed that we have NAs for array, map, and struct.
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42
But I guess in our case we can map array, map, and struct to array, map, and struct correspondingly?

Contributor commented:

I think those mappings are only used to print things in `str`. A better list to consult would be the one at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, a `list` in R should become an array in Spark SQL, and an `env` in R should map to a map.
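For reference, a quick check of the R classes that the serialization code dispatches on (a sketch; output as in base R):

```r
l <- list(1, 2, 3)
class(l)        # "list"        -> array in Spark SQL
e <- new.env()
class(e)        # "environment" -> map in Spark SQL
```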

@felixcheung (Member) commented:

LGTM except for the comment on "schema matching".
Also, I wonder if we should rephrase "can only be used if the output of UDF run on all the partitions can fit in driver memory" - it seems neither as strong a warning nor as correct as "can fail if the output of UDF run on all the partitions cannot be pulled to the driver and fit in driver memory" (same in dapplyCollect).
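For context, a minimal sketch of the collect variant under discussion (assuming `df` from the earlier example; `gapplyCollect` takes no schema, and the result is a local R data.frame whose column names come from the UDF's output):

```r
# gapplyCollect runs the UDF per group and pulls every group's output to the
# driver, so it can fail when the combined result does not fit in driver memory.
result <- gapplyCollect(df, "waiting", function(key, x) {
  y <- data.frame(key, max(x$eruptions))
  colnames(y) <- c("waiting", "max_eruption")
  y
})
head(result)
```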

@NarineK (Contributor, Author) commented Jul 12, 2016

Added data type description

@SparkQA commented Jul 12, 2016

Test build #62145 has finished for PR 14090 at commit c1d7151.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 12, 2016

Test build #62147 has finished for PR 14090 at commit 2af7243.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented:

@felixcheung Could you take one more look at this?

docs/sparkr.md Outdated
Apply a function to each partition of a `SparkDataFrame`. The function is applied to each partition of the `SparkDataFrame`
and should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`. It must match the [data types of the R function's output fields](#data-type-mapping-between-r-and-spark).
Member commented:

output fields --> return values or return value?
http://adv-r.had.co.nz/Functions.html#return-values
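A minimal `dapply` sketch matching the quoted passage above (again using `faithful`; the derived column name is illustrative):

```r
df <- createDataFrame(faithful)

# Declare the output row format: the input columns plus a derived column.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))

# Convert waiting time from minutes to seconds within each partition.
df1 <- dapply(df, function(x) {
  cbind(x, x$waiting * 60)
}, schema)
head(collect(df1))
```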

@SparkQA commented Jul 14, 2016

Test build #62300 has finished for PR 14090 at commit 5d34943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62299 has finished for PR 14090 at commit 8a2aff3.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
<td>map</td>
</tr>
<tr>
<td>struct</td>
Contributor commented:

I don't think R has any notion of a struct or map data type. Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html, I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a timestamp class in R. We should probably replace that with POSIXct or POSIXlt?

Member commented:

I don't think date is a type either.

@NarineK (Contributor, Author) commented Jul 15, 2016

@felixcheung, I think according to the following mapping we expect 'date':
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
And it seems that there is a 'Date' class in base R:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html
Do I understand correctly?
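(A quick check in base R, for reference:)

```r
class(Sys.Date())   # "Date"             -> Spark date
class(Sys.time())   # "POSIXct" "POSIXt" -> Spark timestamp
```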

@NarineK (Contributor, Author) commented Jul 15, 2016

@shivaram, I've looked at the following list:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92
It is called when creating the schema's fields, and it has map, struct, timestamp, etc.:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L131

Isn't the 'dataType' in 'createStructField' the one passed from R?

Contributor commented:

That's a good point - so users can create a schema with struct, and that maps to a corresponding SQL type. But they can't create any R objects that will be parsed as a struct. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can be used to, say, read JSON files or JDBC tables etc.

For the use case here, where users are returning a data.frame from a UDF, I don't think there is any valid mapping for struct from R.

Contributor commented:

And as you mentioned above, we can also change date to Date to be more specific. (Now that I think about it, it would be ideal to link these R types to the CRAN help pages. For example, we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for Date and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for POSIXct / POSIXlt.)

@NarineK (Contributor, Author) commented Jul 15, 2016

Sounds good. For the mappings POSIXct / POSIXlt to timestamp and Date to date, do we need to update the getSQLDataType method?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91

Contributor commented:

Not really - as I mentioned, getSQLDataType looks at the schema. The method that looks at the R objects is in serialize.R:

POSIXlt = writeTime(con, object),

Member commented:

Yes, it should be Date, not date.

Member commented:

And environment instead of env?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html

> e <- new.env()
> class(e)
[1] "environment"

@SparkQA commented Jul 15, 2016

Test build #62369 has finished for PR 14090 at commit 19e849f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

should be a `data.frame`. However, the schema is not required to be passed. Note that `dapplyCollect` can fail if the output of the UDF run on all the partitions cannot be pulled to the driver and fit in driver memory.
<div data-lang="r" markdown="1">
Contributor commented:

I think we need a new line before the `<div>`? Right now the div markings show up in the generated doc. I've attached a screenshot.

[screenshot attached]

@shivaram (Contributor) commented:

Thanks @NarineK for the updates. As a final thing, I had some formatting problems when I tested out this change locally. Let me know if you can't reproduce them. I just ran:

cd docs
SKIP_API=1 jekyll build
open _site/sparkr.html

@NarineK (Contributor, Author) commented Jul 15, 2016

Thanks @shivaram, @felixcheung for the comments. I'll address those today.

@NarineK (Contributor, Author) commented Jul 16, 2016

Thanks, I've generated the docs the way you suggested, @shivaram, but I'm not sure I see the same thing as you.
I still see some '{% highlight r %}' markers and some formatting issues in general, e.g.:
{% highlight r %} sparkR.session() {% endhighlight %}
I also followed this documentation:
https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html
Please let me know if you still see the issues after my latest commit.

@SparkQA commented Jul 16, 2016

Test build #62411 has finished for PR 14090 at commit f584416.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented:

Thanks @NarineK - I tried it on a fresh Ubuntu VM and the rendered docs looked fine there. I think it has something to do with the ruby / jekyll versions.

LGTM. @felixcheung, could you also take one final look?

@felixcheung (Member) commented:

LGTM. Thanks for putting this together!

@shivaram (Contributor) commented:

Merging this to master, branch-2.0

asfgit pushed a commit that referenced this pull request Jul 16, 2016
## What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.

## How was this patch tested?
Existing test cases in R

Author: Narine Kokhlikyan <narine@slice.com>

Closes #14090 from NarineK/gapplyProgGuide.

(cherry picked from commit 4167304)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 4167304 Jul 16, 2016
