
Conversation

@HyukjinKwon (Member) commented Oct 29, 2016

What changes were proposed in this pull request?

This PR proposes to change the documentation for functions. Please refer to the discussion in #15513

The changes include:

  • Re-indenting the documentation
  • Adding examples/arguments in extended where a function takes multiple arguments or arguments with a specific format (e.g. XML/JSON); a further illustration follows the examples below.

For example, the documentation was updated as below:

Functions with single line usage

Before

  • pow

    Usage: pow(x1, x2) - Raise x1 to the power of x2.
    Extended Usage:
    > SELECT pow(2, 3);
     8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
    No example for current_timestamp.

After

  • pow

    Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
    Extended Usage:
        Examples:
          > SELECT pow(2, 3);
           8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
        No example/argument for current_timestamp.

Functions with (already) multiple line usage

Before

  • approx_count_distinct

    Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
        approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
          with relativeSD, the maximum estimation error allowed.
    
    Extended Usage:
    No example for approx_count_distinct.
  • percentile_approx

    Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of percentage must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
    
          percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
          percentile array of column `col` at the given percentage array. Each value of the
          percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
          a positive integer literal which controls approximation accuracy at the cost of memory.
          Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
          the approximation.
    
    Extended Usage:
    No example for percentile_approx.

After

  • approx_count_distinct

    Usage:
        approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
          `relativeSD` defines the maximum estimation error allowed.
    
    Extended Usage:
        No example/argument for approx_count_distinct.
  • percentile_approx

    Usage:
        percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of percentage must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
          When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
          In this case, returns the approximate percentile array of column `col` at the given
          percentage array.
    
    Extended Usage:
        Examples:
          > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
           [10.0,10.0,10.0]
          > SELECT percentile_approx(10.0, 0.5, 100);
           10.0
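
As a further illustration of the second bullet (arguments with a specific format), functions such as get_json_object also gained examples. Roughly (an illustrative sketch, not a verbatim capture of the diff):

  • get_json_object

    Usage: get_json_object(json_txt, path) - Extracts a json object from `path`.
    Extended Usage:
        Examples:
          > SELECT get_json_object('{"a":"b"}', '$.a');
           b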

How was this patch tested?

Manually tested

When examples are multiple

spark-sql> describe function extended reflect;
Function: reflect
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Extended Usage:
    Examples:
      > SELECT reflect('java.util.UUID', 'randomUUID');
       c33fb387-8500-4bfa-81d2-6e0e3e930df2
      > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
       a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

When Usage is in single line

spark-sql> describe function extended min;
Function: min
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Usage: min(expr) - Returns the minimum value of `expr`.
Extended Usage:
    No example/argument for min.

When Usage is already in multiple lines

spark-sql> describe function extended percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
    percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
      column `col` at the given percentage. The value of percentage must be between 0.0
      and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
      better accuracy, `1.0/accuracy` is the relative error of the approximation.
      When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
      In this case, returns the approximate percentile array of column `col` at the given
      percentage array.

Extended Usage:
    Examples:
      > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
       [10.0,10.0,10.0]
      > SELECT percentile_approx(10.0, 0.5, 100);
       10.0

When example/argument is missing

spark-sql> describe function extended rank;
Function: rank
Class: org.apache.spark.sql.catalyst.expressions.Rank
Usage:
    rank() - Computes the rank of a value in a group of values. The result is one plus the number
      of rows preceding or equal to the current row in the ordering of the partition. The values
      will produce gaps in the sequence.

Extended Usage:
    No example/argument for rank.

@HyukjinKwon (Member, Author)

cc @srowen, @rxin, @jodersky, @gatorsmile. I closed the previous one and reopened it here.

@HyukjinKwon (Member, Author)

@gatorsmile I double-checked the type descriptions again and tried to describe the types more specifically. Could you please take another look?

@HyukjinKwon (Member, Author) commented Oct 29, 2016

BTW, I hope we can fix all the minor comments together in one final pass, if they do not block each other.

@SparkQA commented Oct 29, 2016

Test build #67733 has finished for PR 15677 at commit 2b437fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extended = """
Arguments:
expr1 - numeric type expression.
expr2 - numeric type expression that defines the decimal places to round.
Review comment (Member):

numeric type expression -> a numeric type expression

@gatorsmile (Member) commented Oct 29, 2016

Since this PR has 2000+ lines of changes, I am unable to quickly go over all of them. Normally, creating such a big PR is not welcome, because it is hard for reviewers to read through all the changes.

I hope my comments can prompt you to think more and fix all the related issues, instead of my pointing them out one by one.

For example, my most recent comment is about the complex type support. Have you tried the nested types?

BTW, when my teammates do anything like this PR, I always tell them to write test cases if they are not 100% sure. Do not submit a PR without verification. I hope we can make the documentation changes more carefully, because external users will not read the source code. They might easily lose patience.

@HyukjinKwon (Member, Author) commented Oct 29, 2016

Hey @gatorsmile, I am not rushing you to review, and I didn't mean to ask you to find every issue. I meant I can wait for all the reviews and then sweep through them if they are only minor and do not block each other.

As I said, for testing, I tested them one by one when I was in doubt. Honestly, I have tested most of them multiple times (I admit it wasn't every type). So I believe the argument types are mostly correct. The recent changes include a few minor ones, such as adding except map.

In the case of this PR, basically, it introduces new sections in the documentation. I think fixing the same kind of instances in a single PR is reasonable. At least, I asked @rxin and @srowen before and after making this PR. I also said several times this would be big, and I believe it was approved.

There is always room for improvement. We can't make something perfect at once, and we still have plenty of time to fix things further. For example, we can add all the test cases in a follow-up; this would not even be released right after this PR, and others can still take a look and submit another PR. I mean, this does not really affect other code or functionality; it is a documentation improvement that fixes existing typos/mistakes without any downside.

Please let us just focus on this topic.

*/
@ExpressionDescription(
usage = "_FUNC_(col1, col2, col3, ...) - Creates a struct with the given field values.")
usage = "_FUNC_(expr1, expr2, expr2 ...) - Creates a struct with the given field values.",
Review comment (Member):

expr2, expr2 -> expr2, expr3

@gatorsmile (Member)

I am sorry, but I have to say I will not merge the PR if the quality is not good enough. Maybe my standards are too high. If the other committers have different opinions, they can merge it without considering my comments.

I do not think the changes are minor. Without except map, the argument descriptions are incorrect. When we write documentation like this, we should be more careful.

Writing documentation is the same as writing source code. Sometimes documentation might be more important, because users will read it. If the documentation does not match the behavior, users normally doubt the quality first.

extended = """
Arguments:
expr - a numeric or interval expression.
Examples:
Review comment (Member):

Nit: add an extra space

*/
@ExpressionDescription(
usage = "_FUNC_(key0, value0, key1, value1...) - Creates a map with the given key/value pairs.")
usage = "_FUNC_(key0, value0, key1, value1...) - Creates a map with the given key/value pairs.",
Review comment (Member):

Nit: please correct all of them to make the usage of ... consistent. For example, x, y, z, ...

@HyukjinKwon (Member, Author)

@gatorsmile OK, so, your concern is about potentially inaccurate types and typos. If you are really worried, maybe I can split this PR into several ones.

usage = "_FUNC_(timestamp) - Returns the minute component of the string/timestamp.",
extended = """
Arguments:
timestamp - a timestamp expression.
Review comment (Member):

This is inconsistent with the function description. Please check which one is right.

Also go over all the date/time/timestamp function descriptions. At least they should be consistent.

Reply (Member, Author):

Oh, yes. I knew this, but I was hesitant to fix string/timestamp because changing it to timestamp sounds like a (strict) regression in the documentation (I mean, string/timestamp is not inaccurate); however, I also could not add string in the extended part, as that would be inconsistent with the others.

Reply (Member):

In a DBMS, this is a very common issue. If users want to use strings to represent timestamps or other datetime values, they need to follow the formats. If I were a user, my first question would be which formats to follow. For example, is the precision 6 or 2?

string/timestamp is not clear. We definitely should improve it. Below is a document you can refer to: http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_datetimestringrepresentation.html
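
For instance (an illustrative session, not re-verified here; the exact accepted formats are what the docs should spell out):

> SELECT minute('2016-11-03 12:34:56');  -- the string is implicitly cast to a timestamp
 34
> SELECT minute('2016/11/03 12:34:56');  -- a format the implicit cast may not accept
 NULL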

val - an expression of any type.
Examples:
> SELECT _FUNC_("a", 1, "b", 2, "c", 3);
Review comment (Member):

Could you use a consistent way to specify string literals in the examples? Most of them use single quotes.

@gatorsmile (Member)

Not sure whether the other reviewers feel the same as I do. 2000+ LOC is too big, imo. The focus of this PR should be on the newly added argument types, descriptions, and examples.

@srowen (Member) commented Oct 29, 2016

@gatorsmile yes, this is why I suggested earlier that we stop and commit the changes we had in the last PR. We have to break this up to make it manageable in this case, because of the size. There are important doc fixes here that we do need to commit. You have indeed asked for a different additional set of logical changes, and now we're mostly down that path, which has led to one big bang PR. I don't think it's useful to question now whether it should be committed!

We can commit a change that is a coherent improvement, which does not make the docs worse. We do not have to commit a change that improves or even fixes everything in the existing docs. Some of the problems you're highlighting were there already. Some are improvements to new docs, where there was simply no information before.

Re-skimming this large change, it seems like it makes a coherent improvement to the docs (modulo a few more fixes you suggested). If the issue is down to type info, do you believe anything here is actually wrong-er than it was before? If so, let's zero in on those essential changes and make them, and merge this. If not let's merge this. If there is more to discuss, it will be much easier to address in a second pass, I think. We won't have a release in between the two PRs.

@gatorsmile (Member)

@srowen If the newly added contents are not right, should we commit them and fix them later?

If this PR does not add argument descriptions, I am fine with merging it (after one more pass), because it is an incremental improvement, as you said.

@gatorsmile (Member)

How about splitting this PR into two parts? One is the documentation improvement (fixing typos and adding examples). The other adds argument descriptions (many parts are not clearly defined; more discussion is needed).

extended = "> SELECT _FUNC_('Spark SQL');\n 'LQS krapS'")
extended = """
Arguments:
str - a string expression.
Review comment (Member):

Please check this. This is wrong.

@HyukjinKwon (Member, Author) commented Oct 30, 2016

@gatorsmile, I believe the changes here are mostly correct and accurate and make sense. At least, they are mostly logical and consistent. That does not mean "not right" or "wrong".

For example, if users run describe function extended reverse; (the one you pointed out right above)

Before

Function: reverse
Class: org.apache.spark.sql.catalyst.expressions.StringReverse
Usage: reverse(str) - Returns the reversed given string.
Extended Usage:
> SELECT reverse('Spark SQL');
 'LQS krapS'

After

Function: reverse
Class: org.apache.spark.sql.catalyst.expressions.StringReverse
Usage: reverse(str) - Returns the reversed given string.
Extended Usage:
    Arguments:
      str - a string expression.

    Examples:
      > SELECT reverse('Spark SQL');
       LQS krapS

Do you think this is completely incorrect, something users would easily lose patience with? I believe this is an improvement. Sure, we can improve them more and make them more consistent, but I don't think that is a reason to block this change.

@HyukjinKwon (Member, Author)

Could I please ask your opinion, @rxin? I would rather simply follow the majority.

@SparkQA commented Oct 30, 2016

Test build #67779 has finished for PR 15677 at commit b46404b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

    val df2 = Seq((2.0f, 345)).toDF("a", "b")
    checkAnswer(
      df2.select(reverse($"a"), reverse($"b")),
      Row("0.2", "543"))

The argument of reverse is not limited to String. Thus, "str - a string expression." is not right.

However, we should block the complex types. We also need to fix this in the code.

@gatorsmile (Member) commented Oct 30, 2016

Based on my review of this PR, I think we need to resolve the following three major issues when we document the argument types (a brief illustration follows this list):

  • First, implicit type casting should be clearly explained. I am not comfortable just letting external users figure it out by themselves. They do not know the rules.
  • Second, null handling of function arguments should also be documented. Most functions are null-intolerant and always return null when any argument is null. However, we still have a few exceptions, for example, sentences.
  • Third, the arguments also need different terms for literals/constants, foldable expressions, and non-foldable expressions.
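
For illustration (a hypothetical session; the outputs follow from the reverse test quoted above, not re-verified here):

> SELECT reverse(345);   -- implicit casting: the integer argument is cast to a string
 543
> SELECT reverse(NULL);  -- null handling: a null-intolerant function returns null for a null argument
 NULL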

@srowen (Member) commented Oct 30, 2016

@gatorsmile I suggest we handle all of that in a follow-on PR, because this is already a series of good, related changes that has nevertheless grown very big. Each of those was an existing issue in the docs, so I don't think this makes anything worse.

@gatorsmile (Member)

I am still uncomfortable with the newly added content regarding the argument types if we do not resolve the above three general issues.

If this were merged, we would basically need to rewrite almost all the argument types in the follow-up PR.

@srowen (Member) commented Oct 30, 2016

That's fine; they can be changed soon after. Put it this way: do you feel the same argument applies separately to the existing documentation? If it wasn't a "regression" in this PR, I think it's separable.

@gatorsmile (Member)

If the existing way to document argument types is not right, why do we still want to merge it into the master branch?

Why not remove them from this PR, if we are worried the PR has grown too big?

@srowen (Member) commented Oct 30, 2016

The argument would be: because it's not less-right? (I'm not even judging whether your points are right but I presume they are.)

I think we originally did not have nearly this much change regarding types, and I wanted to merge the last PR, so yes, I had the same question before. But we're here now, so what's the simplest path forward?

The review has been dragging on, which is OK if it's converging, but now you're suggesting you want to abandon this change. I don't think that's the right outcome. I'd rather resolve the stall by breaking down the change instead. I don't hear that you're arguing that the docs are wrong, just as incomplete in some ways as they were before.

More practically, it's hard for you to ask for a hundred changes in comments, and it's hard for me to track whether the changes affect what I've previously reviewed. I think we have to draw this to a checkpoint, so I'm asking the narrower question: is this actually making anything worse? It makes a number of different things better, according to its original purpose.

@gatorsmile (Member)

IMO, when we add argument types to the function descriptions, we should deliver the corresponding test cases in the same PR. Otherwise, I do not see how we can know whether the argument types are right or wrong, especially when we have to consider implicit type conversion. Strictly speaking, the newly added argument types are wrong.

I think we just need to remove the argument types from this PR. They are only one part of it.

@SparkQA commented Nov 2, 2016

Test build #67962 has finished for PR 15677 at commit b2fb5e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #67964 has finished for PR 15677 at commit d25abca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 2, 2016

@gatorsmile @HyukjinKwon do you both feel like this is ready to merge?

@HyukjinKwon (Member, Author)

Thanks for asking. Yes, I think it is ready.

@gatorsmile (Member) commented Nov 2, 2016

LGTM except two comments: one and another.

Thanks for your work!

… arguments

Address typos

Let's use English for the examples at least to reduce the lines

style off for line length

Address both comments
@HyukjinKwon (Member, Author)

Thanks @gatorsmile and @srowen. It seems ready.

@gatorsmile (Member)

LGTM pending test.

@SparkQA commented Nov 3, 2016

Test build #68031 has finished for PR 15677 at commit 4ed9d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68033 has finished for PR 15677 at commit 6ad2068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68027 has finished for PR 15677 at commit cd0c810.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

(#15677 (comment) was about a past commit. I resolved the conflicts twice.)

asfgit pushed a commit that referenced this pull request Nov 3, 2016
…ssion and improve documentation

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15677 from HyukjinKwon/SPARK-17963-1.

(cherry picked from commit 7eb2ca8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit closed this in 7eb2ca8 on Nov 3, 2016
@gatorsmile (Member)

Merging to master and 2.1! Thanks!

asfgit pushed a commit that referenced this pull request Nov 6, 2016
…stently with expressions

## What changes were proposed in this pull request?

This PR proposes to improve the documentation and fix some descriptions, equivalent to several minor fixes identified in #15677

Also, it changes `Note:` and `NOTE:` to `.. note::`, consistent with the others, which renders nicely.

## How was this patch tested?

Jenkins tests and manually.

For PySpark, changing `Note:` and `NOTE:` to `.. note::` renders the document as below:

**From**

![2016-11-04 6 53 35](https://cloud.githubusercontent.com/assets/6477701/20002648/42989922-a2c5-11e6-8a32-b73eda49e8c3.png)
![2016-11-04 6 53 45](https://cloud.githubusercontent.com/assets/6477701/20002650/429fb310-a2c5-11e6-926b-e030d7eb0185.png)
![2016-11-04 6 54 11](https://cloud.githubusercontent.com/assets/6477701/20002649/429d570a-a2c5-11e6-9e7e-44090f337e32.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002647/4297fc74-a2c5-11e6-801a-b89fbcbfca44.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002697/749f5780-a2c5-11e6-835f-022e1f2f82e3.png)

**To**

![2016-11-04 7 03 48](https://cloud.githubusercontent.com/assets/6477701/20002659/4961b504-a2c5-11e6-9ee0-ef0751482f47.png)
![2016-11-04 7 04 03](https://cloud.githubusercontent.com/assets/6477701/20002660/49871d3a-a2c5-11e6-85ea-d9a5d11efeff.png)
![2016-11-04 7 04 28](https://cloud.githubusercontent.com/assets/6477701/20002662/498e0f14-a2c5-11e6-803d-c0c5aeda4153.png)
![2016-11-04 7 33 39](https://cloud.githubusercontent.com/assets/6477701/20002731/a76e30d2-a2c5-11e6-993b-0481b8342d6b.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15765 from HyukjinKwon/minor-function-doc.

(cherry picked from commit 15d3926)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 6, 2016
…stently with expressions

@rxin (Contributor) commented Nov 20, 2016

@HyukjinKwon why did we add backticks to surround all parameters? It looks pretty weird actually.

@HyukjinKwon (Member, Author) commented Nov 20, 2016

Some of them already had backticks in the description and others did not. Matching them up with backticks or double quotes was initially suggested by some comments such as #15677 (comment), #15513 (comment), and #15677 (comment). I could not find a concrete reason not to follow that. I can remove all of them if you confirm.

@gatorsmile (Member)

Normally, most RDBMS docs use italic fonts for parameter/argument names. Linux man pages use underscores to highlight argument names in descriptions. I have also seen another way:

``expr1''

I do not know which way is better for our function descriptions.

@rxin (Contributor) commented Nov 20, 2016

These function docs are printed onto the console without formatting.

@gatorsmile (Member)

True. That is why I do not know which is the best way for us to show the argument/parameter names.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ssion and improve documentation

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…stently with expressions

@HyukjinKwon deleted the SPARK-17963-1 branch on January 2, 2018.