
Conversation

@actuaryzhang
Contributor

@actuaryzhang actuaryzhang commented May 18, 2017

What changes were proposed in this pull request?

Grouped documentation for the aggregate functions for Column.

@actuaryzhang
Contributor Author

actuaryzhang commented May 18, 2017

@felixcheung @HyukjinKwon

Per this suggestion, I'm creating more meaningful examples for the SQL functions.

In this PR, I implemented grouped documentation for all aggregate functions for Column, except those that are also defined for other classes (count, first and last). This achieves the following (a sketch of the pattern is shown after the list):

  • Centralized documentation for easy navigation (users can see many related methods on one page).
  • Reduced number of items in See also.
  • Better examples using shared data. This avoids creating a data frame for each function, as happens when they are documented separately. More importantly, users can copy and paste the examples and run them directly!
  • Cleaner structure and far fewer Rd files.
  • Removes duplicated @param definitions (since the functions share exactly the same argument).
  • No need to write meaningless examples for trivial functions (because of grouping).
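
A minimal sketch of the pattern, assuming the tag layout and the shared mtcars data discussed later in this thread (the actual diff may differ):

#' Aggregate functions for Column operations
#'
#' Aggregate functions defined for \code{Column}.
#'
#' @param x Column to compute on.
#' @param ... additional argument(s).
#' @name column_aggregate_functions
#' @rdname column_aggregate_functions
#' @examples
#' \dontrun{
#' # shared data: every example on the page can be copied and run directly
#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))}
NULL

#' @rdname column_aggregate_functions
#' @aliases avg avg,Column-method
setMethod("avg", signature(x = "Column"),
          function(x) {
            jc <- callJStatic("org.apache.spark.sql.functions", "avg", x@jc)
            column(jc)
          })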

@actuaryzhang
Contributor Author

This is what the 'column_aggregate_functions.Rd' doc looks like:

[screenshots: the generated column_aggregate_functions.Rd help page]

@SparkQA

SparkQA commented May 18, 2017

Test build #77046 has finished for PR 18025 at commit 5c8cd1e.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

so I think better examples are great, and we might be too verbose with individual pages for each function, so it might be a good idea to consolidate them. but one question: does this affect discoverability?

can functions still be found on http://spark.apache.org/docs/latest/api/R/index.html?
also, can one find a function in the shell with ?abs?

@actuaryzhang
Contributor Author

actuaryzhang commented May 19, 2017

@felixcheung Thanks for your feedback.

  • This does not affect discoverability: the name of the method is still on the index list.
  • No problem with help either; e.g., one can use ?avg.

[screenshot: the API index entry and the ?avg help output]

Another important benefit is that we can get rid of most of the warnings about missing examples, since we now document all the tiny functions together. The new commit removes all warnings!
[screenshot: check output with no warnings]

I feel this change is important and straightforward. However, it is a laborious effort and a lot of review work for you guys. I would like to get your thoughts on how to proceed. Maybe I can finish all functions in the datetime group to show the full picture? Thanks.

@SparkQA

SparkQA commented May 19, 2017

Test build #77084 has started for PR 18025 at commit 8d21ebf.

Member

what's the reason for the @noRd here? I think it's fairly standard to "document" the generic, even though it's not useful most of the time

Contributor Author

@felixcheung I intentionally suppressed the documentation for generic functions. The main reason is that documenting the generics conflicts with documenting groups of methods together. For example, if I document the avg method for the Column class together with the other Column aggregate methods, here are a couple of scenarios that could happen to the avg generic:

  1. Without @noRd, roxygen will create an avg.Rd file to document ONLY the avg generic. Since the generic's definition has no @title or @description, the package won't compile.
  2. Suppose we get rid of @noRd and document all generic functions under a new generics Rd name. This will compile, but then we have conflicting definitions of the arguments. For example, ... could mean different things in different generics. But this is a minor issue.

The current approach is to suppress the documentation of the generics, since we document each method of the generics. This does not lose any information IMO. The user can always use getGeneric("avg") to get the prototype of the generic. I would like to hear your thoughts.
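
For concreteness, a minimal sketch of the suppression in generics.R (the generic's body follows standard SparkR style):

#' @noRd
setGeneric("avg", function(x, ...) { standardGeneric("avg") })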

Member

@felixcheung felixcheung May 20, 2017

hmm, I think some of the problems are orthogonal - for example, to avoid an avg.Rd file we could just change avg in generics.R to have @rdname column_aggregate_functions (which is where avg in functions.R is changed to)?

I feel the standard is to have documentation on the generic (and have it on the same Rd), like https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html

Member

@felixcheung felixcheung May 19, 2017

wait - having these here is intentional? if a function supports different types, we don't want the documentation to go to DataFrame.R, because then Column.R will look empty.

Member

@felixcheung felixcheung May 19, 2017

see here https://github.com/apache/spark/pull/18025/files#diff-04c14efaae2b7b0f0a45038482f2590cR135
how do we decide this goes to Column.R and not DataFrame.R? it would be very easy for someone to later add more comments in DataFrame.R thinking there is no documentation, and then the help content gets duplicated
(this has happened a few times before)

Contributor Author

@actuaryzhang actuaryzhang May 19, 2017

@felixcheung Thanks for pointing this out. In this case, it should be "a Column", not "a SparkDataFrame or a Column". Indeed, this is a problem of documenting only the generics and inheriting the parameter doc from the generics in each method. Ideally, each specific method of a generic should have its own documentation, not the way we document the generics now. Suppose we add a new alias method for another class: the doc in the generics needs to be updated to include this new class, but it's not clear to the developer where to make the change. If we instead document each method of the generics, it will be much clearer.
I will do a thorough cleanup of these issues after we decide on the big items.

Member

@felixcheung felixcheung May 20, 2017

wait, I think some wires are crossed. let me clarify. let's take coalesce as an example.

there are 2 coalesces: one coalesce(df) like repartition, and one coalesce(df$foo) on a column, like in SQL. so these are in fact 2 coalesces, and x is either a SparkDataFrame or a Column.

and to elaborate, the history behind the approach we have today is that we used to have this

@param x a Column
...
function(x = "Column"...)

and at the same time, in a different .R file

@param x a SparkDataFrame
...
function(x = "SparkDataFrame"

they seem fine when we write the code, and it seems logical/easy to maintain, but when the Rd/doc page is generated it has

x a SparkDataFrame
x a Column

here x is explained twice. worse, the order is largely random (it's the alphabetical order of the .R files)

and it goes against the standard R pattern of one description line for each parameter, with the choice of type separated by "or", like https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html - not to mention CRAN checks, which I think complain about a parameter being documented more than once.

so in short, having one @param x a SparkDataFrame or a Column is intentional. since this is describing 2 things, from the discussions back then it felt more natural to put it somewhere independent - in fact, as you are touching on, I'd argue it's better to look it up in generics.R rather than trying to figure out which other class already has an overload of that method. to be more concrete, in this change I'd need to know to go to column.R for alias but DataFrame.R for gapply - and alias is also in DataFrame.R.
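
A sketch of that intentional pattern, with the shared description attached once (its placement in generics.R is an assumption):

# generics.R: one description line covering both overloads
#' @param x a SparkDataFrame or a Column
#' @rdname coalesce
setGeneric("coalesce", function(x, ...) { standardGeneric("coalesce") })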

@felixcheung
Member

thanks for looking into this, and yes, it will be useful to understand this better before proceeding further.

how does the ? help search or the API index page work in this case? I wonder if it's because of the @aliases tag (or might it be @name?) - aliases are something we are discussing ways to get rid of, since they create duplicated links everywhere. I just want to make sure we don't break usability - people not being able to find help in the shell is a huge problem ;)

@actuaryzhang
Contributor Author

actuaryzhang commented May 19, 2017

@felixcheung Thanks for asking this. I should have been clearer.

@name is the "name" of the Rd object, which is unique. An Rd object can have multiple aliases pointing to it. Usually, when one specifies @name xxx, an \alias{xxx} is automatically created to reference xxx.

In the avg case, I actually don't need @name avg, since the doc for avg goes to column_aggregate_functions. In the new commit, I use @aliases avg avg,Column-method, which creates the two aliases \alias{avg} and \alias{avg,Column-method}.

This allows the following shortcuts for help search, all of which lead to the same Rd object, column_aggregate_functions (the @name object):

  • ?avg
  • ?"avg,Column-method"
  • help("avg")
  • ?column_aggregate_functions
  • Indeed, since this is an S4 method, we can also do method?avg("Column").

Details of R documentation here:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-R-documentation-files
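
Typed at the SparkR shell, all of these resolve to that one page (the last is the S4-style lookup from the list above):

?avg
?"avg,Column-method"
help("avg")
?column_aggregate_functions
method?avg("Column")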

@SparkQA

SparkQA commented May 19, 2017

Test build #77105 has finished for PR 18025 at commit ead9781.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

thanks, I am somewhat familiar with all the tags. I see that you have @aliases avg instead of @name avg. may I direct you to this JIRA - https://issues.apache.org/jira/browse/SPARK-18825 - this is the one tracking all the double/triple links caused by aliases and @seealso.

I think we are trying to remove @aliases avg,Column-method (ie. the -method ones), so it's better to make sure we come up with a consistent way to handle this.

@actuaryzhang
Contributor Author

actuaryzhang commented May 20, 2017

@felixcheung
I'm curious what the motivation is to remove the avg,Column-method style.
This is the default and preferred (I believe) way to reference S4 methods in R.
I could be wrong, but if the motivation is to remove duplicate references, I think it makes more sense to remove avg, not avg,xx-method. It is important to keep the option method?avg("Column"), because users familiar with S4 methods are likely to use it more often.

Suppose we have new classes and methods for avg, say avg,DataFrame-method, avg,RDD-method, avg,Row-method, etc. Is the idea to document all avg methods in a single doc and use one \alias{avg} to reference them?

If that's the case, then the idea of grouping different methods by class will not be useful.

@felixcheung
Member

I guess this comes from not seeing a lot of S4 methods? In fact, an overwhelming percentage of all methods in packages are S3, and I think users are used to searching with ?avg and also expect to see different classes documented on the same page.

granted, I am aware some (many?) of these don't make sense (like coalesce(DF) and coalesce(col), which don't have much in common), and that is a problem.

would you propose they go to different pages then? would that also solve https://issues.apache.org/jira/browse/SPARK-18825 ?

@actuaryzhang
Contributor Author

@felixcheung I think we may want to distinguish a few cases:

  1. For methods that are mainly defined for only one class, e.g., most function methods for Column, it makes sense to group and document them together. For example, most aggregate functions for Column go into one single Rd, since they are not defined for other classes. In this case, avg will go into this doc, since it is not used by other classes.
  2. For methods that are defined for multiple classes, e.g., the show method defined for SparkDataFrame, GroupedData, Column and StreamingQuery, we can still document them in show.Rd. In this case, show will go into this doc, which shows the help for all classes that define a show method. (A sketch of this case follows the list.)
  3. When it makes sense, we can also combine 1 & 2 above. For example, gapply and gapplyCollect are defined for both SparkDataFrame and GroupedData, but we can still document them together and create shared examples.

Let me know if this makes sense.
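
A minimal sketch of case 2, with both methods pointing at one shared show.Rd (tag values assumed; the bodies are illustrative stand-ins, not necessarily SparkR's exact implementations):

#' @rdname show
#' @aliases show,Column-method
setMethod("show", "Column",
          function(object) {
            # illustrative body: print the underlying JVM column expression
            cat("Column", callJMethod(object@jc, "toString"), "\n")
          })

#' @rdname show
#' @aliases show,GroupedData-method
setMethod("show", "GroupedData",
          function(object) {
            # illustrative body
            cat("GroupedData\n")
          })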

@felixcheung
Member

Thanks for summarizing. I think they make sense. To be clear though, we should also talk about:

  • what if a method is defined in one class and belongs in a group, but is also defined for another class (eg. the SQL function cov)?
  • what if it is defined for multiple classes but the meanings are drastically different (eg. coalesce(DF) and coalesce(col) in my example above)?

@actuaryzhang
Contributor Author

Great point.

  • For a method that is defined in one class and belongs in a group, like cov, we can document it in its own Rd and add a link to it in the SeeAlso section of the group doc. In this case, \alias{cov} will be in cov.Rd (see the sketch below).
  • For a method that is defined for multiple classes with drastically different meanings: I think we can still document them in one Rd, and add a details section to describe the method for each class.
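
A sketch of the first bullet for cov - a roxygen fragment that would sit above the existing cov method in functions.R (tag values assumed):

#' @rdname cov
#' @name cov
#' @aliases cov,Column-method
#' @seealso \link{column_aggregate_functions}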

@felixcheung
Member

For a method that is defined for multiple classes with drastically different meanings: I think we can still document them in one Rd, and add a details section to describe the method for each class.

would this be too confusing, though? particularly if we are moving to one -method per doc page

@actuaryzhang
Contributor Author

actuaryzhang commented May 24, 2017

@felixcheung I just made a new commit, which I think has the cleanest solution so far. In it, I implemented grouping for all aggregate functions for Column, except those that are also defined for other classes (count, first and last). As you can see, it achieves the following:

  • Centralized documentation for easy navigation (users can see 24 related methods on one single page).
  • Reduced number of items in See also (only three now).
  • Better examples using shared data. This avoids creating a data frame for each function, as happens when they are documented separately. More importantly, users can copy and paste the examples and run them directly!
  • Cleaner structure and far fewer Rd files (again, about 20 Rd files removed).
  • Removes duplicated @param definitions (since the functions share exactly the same argument).
  • No need to write meaningless examples for trivial functions (because of grouping).

Of course, the scope of this PR does not include removing the -method entries from the index.

In this version, I also demonstrate how to handle the methods defined by multiple classes (count, first and last). I still document them in their own Rd, and simply add a link in the SeeAlso section. Of course, we could combine the doc for these three methods into something like shared_methods.Rd, since each of them is tiny.

Also, to facilitate review, perhaps we can break the changes into several PRs, one for each of aggregate_functions, datetime_functions, math_functions, string_functions, normal_functions and misc_functions?

After making the change to the Column methods, I will work on the docs for SparkDataFrame and GroupedData. Please let me know your thoughts.

@actuaryzhang
Contributor Author

actuaryzhang commented May 24, 2017

Attaching snapshots of parts of the doc to show the idea.

[screenshots: parts of the grouped documentation page]

@SparkQA

SparkQA commented May 24, 2017

Test build #77313 has finished for PR 18025 at commit 10a24b3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang actuaryzhang changed the title [WIP][SparkR] Update doc and examples for sql functions [WIP][SparkR] Grouped documentation for sql functions May 25, 2017
@SparkQA

SparkQA commented May 25, 2017

Test build #77318 has finished for PR 18025 at commit 782ffc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I think we need to give it a title explicitly - see the header/first line of https://cloud.githubusercontent.com/assets/11082368/26429381/64dd117e-409b-11e7-9661-659b5fbe8206.png

@felixcheung
Member

felixcheung commented May 25, 2017

I guess we don't need the link to stddev_samp since it's on the same page.
shouldn't std_dev and var_samp also be on this page?

@felixcheung
Member

also, since we have one Rd now, what do you think about collecting all the examples into one? that should eliminate the "Not run" printed on every other line.

I think this would also be a great opportunity to do more than a simple head(select(...)) - something expanded and more practical? what do you think?

also this #18025 (comment)

I like this approach - these are my comments from your screenshot - I'll review more closely after more changes, thanks!

@actuaryzhang
Contributor Author

actuaryzhang commented May 25, 2017

@felixcheung

  • The links to stddev_samp etc. were already removed in the latest commit. Sorry that the snapshot was not updated.
  • About collecting all the examples into one: I think that will work for this particular page, but I'm not sure about it in general. These methods are still spread out in the R file, and if we decide to change the grouping of these functions later on, it will be very difficult if we don't keep the examples with those methods.
  • For a method that is defined for multiple classes with drastically different meanings, I agree that it's best to document by class. One downside is that a generic ?coalesce can only go to one help page; e.g., it could go to the help for SparkDataFrame but not the other classes. Two things we can do to mitigate this: 1) add links to the coalesce methods for the other classes in the SeeAlso section; 2) use ?"coalesce,Column-method" or method?coalesce("Column") to get to the help page for a specific class.
  • Regarding the first line, which reads approxCountDistinct: this is very unfortunate, and I don't know how to fix it without dropping the docs for the generics. The issue is that when I document the generic functions in the same Rd files as the methods, roxygen automatically generates a \name{} entry, which overwrites the \name{column_aggregate_functions} I set manually. One solution is NOT to document the generics: they really don't add value from a documentation perspective anyway. As you can see from the snapshot, there are two entries for approxCountDistinct, the first for the generic and the second for the method defined for Column. Same for avg. The doc for the generics is redundant IMO. Do you want to see a version without docs for the generics?

@felixcheung
Member

re: title, would explicitly adding @title help?
re: multiple classes - agreed, a link or @seealso should be good. wouldn't ?coalesce show the overloads, though?

@actuaryzhang
Contributor Author

actuaryzhang commented May 25, 2017

  • The new commit resolves the Name issue. @title does not work: it sets the header on the second line, \title{Aggregate functions for Column operations}, not the first. The solution is to use @name NULL for the generics (see the sketch at the end of this comment). Now we have:

[screenshot: the help page header, now reading column_aggregate_functions]

  • Also added several more practical examples. But most of these functions are very straightforward to use.

[screenshot: the expanded examples section]
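
A sketch of that fix in generics.R (tag values as discussed above; a hypothetical excerpt, not the exact diff):

#' @rdname column_aggregate_functions
#' @name NULL
setGeneric("avg", function(x, ...) { standardGeneric("avg") })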

@actuaryzhang
Contributor Author

@felixcheung Your comments are all addressed now. Please let me know if there is anything else needed.

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78149 has finished for PR 18025 at commit 014b9f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78152 has finished for PR 18025 at commit 0a7f5fc.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78153 has finished for PR 18025 at commit 19d063c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 16, 2017

Test build #78190 has finished for PR 18025 at commit 978e13b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

@felixcheung Could you take another look and let me know if there is anything else needed?

@HyukjinKwon
Member

@actuaryzhang, are corr, covar_pop and covar_samp the same kind of cases, but missed? It looks like these are also aggregate functions in functions.scala, but they seem to have been missed here, and I can see R has those functions.

@actuaryzhang
Contributor Author

@HyukjinKwon Thanks for catching this. They were incorrectly labeled as math functions instead of aggregate functions in SparkR, which is why I did not change them.
The new commit fixes this. Note that they are still documented in their own Rd, because there is also a method defined for SparkDataFrame. I did some cleanup and updated the example to be runnable.
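
A runnable sketch of the kind of example this enables (not necessarily the exact example in the commit):

df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
# Column aggregate functions, usable inside select()
head(select(df, corr(df$mpg, df$hp)))
head(select(df, covar_samp(df$mpg, df$hp), covar_pop(df$mpg, df$hp)))
# the SparkDataFrame method takes column names and returns a number
cov(df, "mpg", "hp")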

@SparkQA

SparkQA commented Jun 17, 2017

Test build #78215 has finished for PR 18025 at commit 875db0d.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 17, 2017

Test build #78216 has finished for PR 18025 at commit 79d9fdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @examples \dontrun{corr(df$c, df$d)}
#' @examples
#' \dontrun{
#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
Member

do we need a space/newline in front of this example like the other ones?

Contributor Author

this one does not need the extra newline, since it's in its own Rd and there are no examples before it.

Member

great - I know we talked about it, but we might consider getting all examples on the Rd into one \dontrun block again. as of now it's very hard to review new PRs without knowing whether a newline is needed or not...

Member

or maybe we should have a newline at the end of every @examples block (when there are multiple examples in one Rd)? this way we don't have to know which goes first

#' cov <- cov(df, "title", "gender")
#' }
#'
#' \dontrun{
Member

shouldn't the newline be after the dontrun?

Contributor Author

@actuaryzhang actuaryzhang Jun 19, 2017

No. The newline should be between @examples and \dontrun, to separate multiple dontruns. See the screenshot above; a small sketch follows.
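
A minimal sketch of the convention, assuming the shared Rd already contains earlier examples:

#' @examples
#'
#' \dontrun{
#' df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
#' head(select(df, avg(df$mpg)))}

The blank #' line between @examples and \dontrun is what separates consecutive \dontrun blocks on the generated page.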

collect(dataFrame(sct))
})

#' Calculate the sample covariance of two numerical columns of a SparkDataFrame.
Member

hmm, this is one of the tricky ones where there is one page for DataFrame & Column.
I think it's useful to touch on how this works with a SparkDataFrame and keep this line in some form?

Contributor Author

@actuaryzhang actuaryzhang Jun 19, 2017

The method for SparkDataFrame is still there. I'm just removing redundant doc here.
See the screenshots here.

[screenshots: the cov help pages for Column and SparkDataFrame]

Member

@felixcheung felixcheung Jun 19, 2017

I see. in that case, can we add this to the @details for cov?
I feel like this has bits of info that could be useful:
"Calculate the sample covariance of two numerical columns of a SparkDataFrame" - say, numerical columns of one SparkDataFrame, as opposed to cov(df, df$name, df2$bar)
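
A sketch of what that @details entry could look like (wording from the line quoted above; the \emph placement is an assumption):

#' @details
#' \code{cov}: When applied to a SparkDataFrame, this calculates the sample
#' covariance of two numerical columns of \emph{one} SparkDataFrame.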

#'
#' @param colName1 the name of the first column
#' @param colName2 the name of the second column
#' @return The covariance of the two columns.
Member

what would the @return line look like in the final doc?

Contributor Author

@actuaryzhang actuaryzhang Jun 19, 2017

OK. I added this back. The doc should be very clear even without the return value; indeed, most functions in SparkR do not document their return values. See what it looks like in the image above.

Member

Possibly, but better clarity wouldn't hurt, right?

@actuaryzhang
Contributor Author

actuaryzhang commented Jun 19, 2017

This is what the structure of the doc looks like (these are snapshots of the main parts, for illustration purposes):

[screenshots: the main sections of the generated doc]

@SparkQA

SparkQA commented Jun 19, 2017

Test build #78244 has finished for PR 18025 at commit 6eae126.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

very cool, thanks. I guess there's only this last comment: #18025 (comment)

@actuaryzhang
Contributor Author

OK. Updated the doc for the cov method for SparkDataFrame.

@SparkQA

SparkQA commented Jun 19, 2017

Test build #78264 has finished for PR 18025 at commit 4cf5ab9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

haha. I like the \emph

@felixcheung
Member

the AppVeyor failure is unfortunate, but it passed before, and this is a doc-only change.

@felixcheung
Member

merged to master, thanks!

@asfgit asfgit closed this in 8965fe7 Jun 20, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
## What changes were proposed in this pull request?
Grouped documentation for the aggregate functions for Column.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes apache#18025 from actuaryzhang/sparkRDoc4.
@actuaryzhang actuaryzhang deleted the sparkRDoc4 branch July 1, 2017 00:37