
Conversation

@HyukjinKwon (Member) commented Oct 29, 2016

What changes were proposed in this pull request?

This PR proposes to change the documentation for functions. Please refer to the discussion in #15513

The changes include:

  • Re-indenting the documentation
  • Adding examples/arguments in extended where a function takes multiple arguments or arguments with a specific format (e.g. XML/JSON); a further illustration follows the examples below.

For example, the documentation was updated as below:

Functions with single line usage

Before

  • pow

    Usage: pow(x1, x2) - Raise x1 to the power of x2.
    Extended Usage:
    > SELECT pow(2, 3);
     8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
    No example for current_timestamp.

After

  • pow

    Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
    Extended Usage:
        Examples:
          > SELECT pow(2, 3);
           8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
        No example/argument for current_timestamp.

Functions with (already) multiple line usage

Before

  • approx_count_distinct

    Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
        approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
          with relativeSD, the maximum estimation error allowed.
    
    Extended Usage:
    No example for approx_count_distinct.
  • percentile_approx

    Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of percentage must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
    
          percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
          percentile array of column `col` at the given percentage array. Each value of the
          percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
          a positive integer literal which controls approximation accuracy at the cost of memory.
          Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
          the approximation.
    
    Extended Usage:
    No example for percentile_approx.

After

  • approx_count_distinct

    Usage:
        approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
          `relativeSD` defines the maximum estimation error allowed.
    
    Extended Usage:
        No example/argument for approx_count_distinct.
  • percentile_approx

    Usage:
        percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of percentage must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
          When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
          In this case, returns the approximate percentile array of column `col` at the given
          percentage array.
    
    Extended Usage:
        Examples:
          > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
           [10.0,10.0,10.0]
          > SELECT percentile_approx(10.0, 0.5, 100);
           10.0
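
As a further illustration of the second bullet (arguments with a specific format), functions such as get_json_object also gained examples. Roughly (an illustrative sketch, not a verbatim capture of the diff):

  • get_json_object

    Usage: get_json_object(json_txt, path) - Extracts a json object from `path`.
    Extended Usage:
        Examples:
          > SELECT get_json_object('{"a":"b"}', '$.a');
           b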

How was this patch tested?

Manually tested

When examples are multiple

spark-sql> describe function extended reflect;
Function: reflect
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Extended Usage:
    Examples:
      > SELECT reflect('java.util.UUID', 'randomUUID');
       c33fb387-8500-4bfa-81d2-6e0e3e930df2
      > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
       a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

When Usage is in single line

spark-sql> describe function extended min;
Function: min
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Usage: min(expr) - Returns the minimum value of `expr`.
Extended Usage:
    No example/argument for min.

When Usage is already in multiple lines

spark-sql> describe function extended percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
    percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
      column `col` at the given percentage. The value of percentage must be between 0.0
      and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
      better accuracy, `1.0/accuracy` is the relative error of the approximation.
      When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
      In this case, returns the approximate percentile array of column `col` at the given
      percentage array.

Extended Usage:
    Examples:
      > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
       [10.0,10.0,10.0]
      > SELECT percentile_approx(10.0, 0.5, 100);
       10.0

When example/argument is missing

spark-sql> describe function extended rank;
Function: rank
Class: org.apache.spark.sql.catalyst.expressions.Rank
Usage:
    rank() - Computes the rank of a value in a group of values. The result is one plus the number
      of rows preceding or equal to the current row in the ordering of the partition. The values
      will produce gaps in the sequence.

Extended Usage:
    No example/argument for rank.

@HyukjinKwon (Member, Author)

cc @srowen, @rxin, @jodersky, @gatorsmile. I closed the previous one and reopened it here.

@HyukjinKwon (Member, Author)

@gatorsmile I double-checked the type descriptions again and tried to describe the types more specifically. Could you please take another look?

@HyukjinKwon (Member, Author) commented Oct 29, 2016

BTW, I hope we can fix all the minor comments together in one final pass, if they do not block each other.

@SparkQA commented Oct 29, 2016

Test build #67733 has finished for PR 15677 at commit 2b437fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extended = """
Arguments:
expr1 - numeric type expression.
expr2 - numeric type expression that defines the decimal places to round.
Review comment (Member):

numeric type expression -> a numeric type expression

@gatorsmile (Member) commented Oct 29, 2016

Since this PR has 2000+ lines of changes, I am unable to quickly go over all of them. Normally, creating such a big PR is not welcome, because it is hard for reviewers to read through all the changes.

I hope my comments can prompt you to think more and fix all the related issues, instead of my pointing them out one by one.

For example, my most recent comment is about the complex type support. Have you tried the nested types?

BTW, when my teammates do anything like this PR, I always tell them to write test cases if they are not 100% sure. Do not submit a PR without verification. I hope we can make the documentation changes more carefully, because external users will not read the source code. They might easily lose patience.

@HyukjinKwon (Member, Author) commented Oct 29, 2016

Hey @gatorsmile, I am not rushing you to review, and I didn't mean to ask you to find every issue. I meant I can wait for all the reviews and then sweep through them if they are only minor and do not block each other.

As I said, for testing, I tested them one by one when I was in doubt. Honestly, I have tested most of them multiple times (I admit it wasn't every type). So I believe the argument types are mostly correct. The recent changes include a few minor ones, such as adding except map.

In the case of this PR, basically, it introduces new sections in the documentation. I think fixing the same kind of instances in a single PR is reasonable. At least, I asked @rxin and @srowen before and after making this PR. I also said several times this would be big, and I believe it was approved.

There is always room for improvement. We can't make something perfect at once, and we still have plenty of time to fix things further. For example, we can add all the test cases in a follow-up; this would not even be released right after this PR, and others can still take a look and submit another PR. I mean, this does not really affect other code or functionality; it is a documentation improvement that fixes existing typos/mistakes without any downside.

Please let us just focus on this topic.

*/
@ExpressionDescription(
usage = "_FUNC_(col1, col2, col3, ...) - Creates a struct with the given field values.")
usage = "_FUNC_(expr1, expr2, expr2 ...) - Creates a struct with the given field values.",
Review comment (Member):

expr2, expr2 -> expr2, expr3

@gatorsmile (Member)

I am sorry, but I have to say I will not merge the PR if the quality is not good enough. Maybe my standards are too high. If the other committers have different opinions, they can merge it without considering my comments.

I do not think the changes are minor. Without except map, the argument descriptions are incorrect. When we write documentation like this, we should be more careful.

Writing documentation is the same as writing source code. Sometimes documentation might be more important, because users will read it. If the documentation does not match the behavior, users normally doubt the quality first.

extended = """
Arguments:
expr - a numeric or interval expression.
Examples:
Review comment (Member):

Nit: add an extra space

*/
@ExpressionDescription(
usage = "_FUNC_(key0, value0, key1, value1...) - Creates a map with the given key/value pairs.")
usage = "_FUNC_(key0, value0, key1, value1...) - Creates a map with the given key/value pairs.",
Review comment (Member):

Nit: please correct all of them to make the usage of ... consistent. For example, x, y, z, ...

@HyukjinKwon (Member, Author)

@gatorsmile OK, so, your concern is about potentially inaccurate types and typos. If you are really worried, maybe I can split this PR into several ones.

usage = "_FUNC_(timestamp) - Returns the minute component of the string/timestamp.",
extended = """
Arguments:
timestamp - a timestamp expression.
Review comment (Member):

This is inconsistent with the function description. Please check which one is right.

Also go over all the date/time/timestamp function descriptions. At least they should be consistent.

Reply (Member, Author):

Oh, yes. I knew this, but I was hesitant to fix string/timestamp because changing it to timestamp sounds like a (strict) regression in the documentation (I mean, string/timestamp is not inaccurate); however, I also could not add string in the extended part, as that would be inconsistent with the others.

Reply (Member):

In a DBMS, this is a very common issue. If users want to use strings to represent timestamps or other datetime values, they need to follow the formats. If I were a user, my first question would be which formats to follow. For example, is the precision 6 or 2?

string/timestamp is not clear. We definitely should improve it. Below is a document you can refer to: http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_datetimestringrepresentation.html
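
For instance (an illustrative session, not re-verified here; the exact accepted formats are what the docs should spell out):

> SELECT minute('2016-11-03 12:34:56');  -- the string is implicitly cast to a timestamp
 34
> SELECT minute('2016/11/03 12:34:56');  -- a format the implicit cast may not accept
 NULL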

val - an expression of any type.
Examples:
> SELECT _FUNC_("a", 1, "b", 2, "c", 3);
Review comment (Member):

Could you use a consistent way to specify string literals in the examples? Most of them use single quotes.

@gatorsmile (Member)

Not sure whether the other reviewers feel the same as I do. 2000+ LOC is too big, imo. The focus of this PR should be on the newly added argument types, descriptions, and examples.

@srowen (Member) commented Oct 29, 2016

@gatorsmile yes, this is why I suggested earlier that we stop and commit the changes we had in the last PR. We have to break this up to make it manageable in this case, because of the size. There are important doc fixes here that we do need to commit. You have indeed asked for a different additional set of logical changes, and now we're mostly down that path, which has led to one big bang PR. I don't think it's useful to question now whether it should be committed!

We can commit a change that is a coherent improvement, which does not make the docs worse. We do not have to commit a change that improves or even fixes everything in the existing docs. Some of the problems you're highlighting were there already. Some are improvements to new docs, where there was simply no information before.

Re-skimming this large change, it seems like it makes a coherent improvement to the docs (modulo a few more fixes you suggested). If the issue is down to type info, do you believe anything here is actually wrong-er than it was before? If so, let's zero in on those essential changes and make them, and merge this. If not let's merge this. If there is more to discuss, it will be much easier to address in a second pass, I think. We won't have a release in between the two PRs.

@gatorsmile (Member)

@srowen If the newly added contents are not right, should we commit them and fix them later?

If this PR does not add argument descriptions, I am fine with merging it (after one more pass), because it is an incremental improvement, as you said.

@gatorsmile (Member)

How about splitting this PR into two parts? One is the documentation improvement (fixing typos and adding examples). The other adds argument descriptions (many parts are not clearly defined; more discussion is needed).

extended = "> SELECT _FUNC_('Spark SQL');\n 'LQS krapS'")
extended = """
Arguments:
str - a string expression.
Review comment (Member):

Please check this. This is wrong.

@HyukjinKwon (Member, Author) commented Oct 30, 2016

@gatorsmile, I believe the changes here are mostly correct and accurate and make sense. At least, they are mostly logical and consistent. That does not mean "not right" or "wrong".

For example, if users run describe function extended reverse; (the one you pointed out right above)

Before

Function: reverse
Class: org.apache.spark.sql.catalyst.expressions.StringReverse
Usage: reverse(str) - Returns the reversed given string.
Extended Usage:
> SELECT reverse('Spark SQL');
 'LQS krapS'

After

Function: reverse
Class: org.apache.spark.sql.catalyst.expressions.StringReverse
Usage: reverse(str) - Returns the reversed given string.
Extended Usage:
    Arguments:
      str - a string expression.

    Examples:
      > SELECT reverse('Spark SQL');
       LQS krapS

Do you think this is completely incorrect, something users would easily lose patience with? I believe this is an improvement. Sure, we can improve them more and make them more consistent, but I don't think that is a reason to block this change.

@HyukjinKwon (Member, Author)

Could I please ask your opinion, @rxin? I would rather simply follow the majority.

@SparkQA commented Oct 30, 2016

Test build #67779 has finished for PR 15677 at commit b46404b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

    val df2 = Seq((2.0f, 345)).toDF("a", "b")
    checkAnswer(
      df2.select(reverse($"a"), reverse($"b")),
      Row("0.2", "543"))

The argument of reverse is not limited to String. Thus, "str - a string expression." is not right.

However, we should block the complex types. We also need to fix this in the code.

@gatorsmile (Member) commented Oct 30, 2016

Based on my review of this PR, I think we need to resolve the following three major issues when we document the argument types (a brief illustration follows this list):

  • First, implicit type casting should be clearly explained. I am not comfortable just letting external users figure it out by themselves. They do not know the rules.
  • Second, null handling of function arguments should also be documented. Most functions are null-intolerant and always return null when any argument is null. However, we still have a few exceptions, for example, sentences.
  • Third, the arguments also need different terms for literals/constants, foldable expressions, and non-foldable expressions.
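
For illustration (a hypothetical session; the outputs follow from the reverse test quoted above, not re-verified here):

> SELECT reverse(345);   -- implicit casting: the integer argument is cast to a string
 543
> SELECT reverse(NULL);  -- null handling: a null-intolerant function returns null for a null argument
 NULL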

@srowen (Member) commented Oct 30, 2016

@gatorsmile I suggest we handle all of that in a follow-on PR, because this is already a series of good, related changes that has nevertheless grown very big. Each of those was an existing issue in the docs, so I don't think this makes anything worse.

@gatorsmile (Member)

I am still uncomfortable with the newly added content regarding the argument types if we do not resolve the above three general issues.

If this were merged, we would basically need to rewrite almost all the argument types in the follow-up PR.

@srowen (Member) commented Oct 30, 2016

That's fine; they can be changed soon after. Put it this way: do you feel the same argument applies separately to the existing documentation? If it wasn't a "regression" in this PR, I think it's separable.

@gatorsmile (Member)

If the existing way to document argument types is not right, why do we still want to merge it into the master branch?

Why not remove them from this PR, if we are worried the PR has grown too big?

@srowen (Member) commented Oct 30, 2016

The argument would be: because it's not less-right? (I'm not even judging whether your points are right but I presume they are.)

I think we originally did not have nearly this much change regarding types, and I wanted to merge the last PR, so yes, I had the same question before. But we're here now, so what's the simplest path forward?

The review has been dragging on, which is OK if it's converging, but now you're suggesting you want to abandon this change. I don't think that's the right outcome. I'd rather resolve the stall by breaking down the change instead. I don't hear that you're arguing that the docs are wrong, just as incomplete in some ways as they were before.

More practically, it's hard for you to ask for a hundred changes in comments, and it's hard for me to track whether the changes affect what I've previously reviewed. I think we have to draw this to a checkpoint, so I'm asking the narrower question: is this actually making anything worse? It makes a number of different things better, according to its original purpose.

@gatorsmile (Member)

IMO, when we add argument types to the function descriptions, we should deliver the corresponding test cases in the same PR. Otherwise, I do not see how we can know whether the argument types are right or wrong, especially when we have to consider implicit type conversion. Strictly speaking, the newly added argument types are wrong.

I think we just need to remove the argument types from this PR. They are only one part of it.

@SparkQA commented Nov 2, 2016

Test build #67962 has finished for PR 15677 at commit b2fb5e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #67964 has finished for PR 15677 at commit d25abca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 2, 2016

@gatorsmile @HyukjinKwon do you both feel like this is ready to merge?

@HyukjinKwon (Member, Author)

Thanks for asking. Yes, I think it is ready.

@gatorsmile (Member) commented Nov 2, 2016

LGTM except two comments: one and another.

Thanks for your work!

… arguments

Address typos

Let's use English for the examples at least to reduce the lines

style off for line length

Address both comments
@HyukjinKwon (Member, Author)

Thanks @gatorsmile and @srowen. It seems ready.

@gatorsmile (Member)

LGTM pending test.

@SparkQA commented Nov 3, 2016

Test build #68031 has finished for PR 15677 at commit 4ed9d03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68033 has finished for PR 15677 at commit 6ad2068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68027 has finished for PR 15677 at commit cd0c810.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

(#15677 (comment) was about a past commit. I resolved the conflicts twice.)

asfgit pushed a commit that referenced this pull request Nov 3, 2016
…ssion and improve documentation

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15677 from HyukjinKwon/SPARK-17963-1.

(cherry picked from commit 7eb2ca8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit closed this in 7eb2ca8 on Nov 3, 2016
@gatorsmile (Member)

Merging to master and 2.1! Thanks!

asfgit pushed a commit that referenced this pull request Nov 6, 2016
…stently with expressions

## What changes were proposed in this pull request?

This PR proposes to improve the documentation and fix some descriptions, equivalent to several minor fixes identified in #15677

Also, it changes `Note:` and `NOTE:` to `.. note::`, consistent with the others, which renders nicely.

## How was this patch tested?

Jenkins tests and manually.

For PySpark, changing `Note:` and `NOTE:` to `.. note::` renders the document as below:

**From**

![2016-11-04 6 53 35](https://cloud.githubusercontent.com/assets/6477701/20002648/42989922-a2c5-11e6-8a32-b73eda49e8c3.png)
![2016-11-04 6 53 45](https://cloud.githubusercontent.com/assets/6477701/20002650/429fb310-a2c5-11e6-926b-e030d7eb0185.png)
![2016-11-04 6 54 11](https://cloud.githubusercontent.com/assets/6477701/20002649/429d570a-a2c5-11e6-9e7e-44090f337e32.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002647/4297fc74-a2c5-11e6-801a-b89fbcbfca44.png)
![2016-11-04 6 53 51](https://cloud.githubusercontent.com/assets/6477701/20002697/749f5780-a2c5-11e6-835f-022e1f2f82e3.png)

**To**

![2016-11-04 7 03 48](https://cloud.githubusercontent.com/assets/6477701/20002659/4961b504-a2c5-11e6-9ee0-ef0751482f47.png)
![2016-11-04 7 04 03](https://cloud.githubusercontent.com/assets/6477701/20002660/49871d3a-a2c5-11e6-85ea-d9a5d11efeff.png)
![2016-11-04 7 04 28](https://cloud.githubusercontent.com/assets/6477701/20002662/498e0f14-a2c5-11e6-803d-c0c5aeda4153.png)
![2016-11-04 7 33 39](https://cloud.githubusercontent.com/assets/6477701/20002731/a76e30d2-a2c5-11e6-993b-0481b8342d6b.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15765 from HyukjinKwon/minor-function-doc.

(cherry picked from commit 15d3926)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 6, 2016
…stently with expressions

@rxin (Contributor) commented Nov 20, 2016

@HyukjinKwon why did we add backticks to surround all parameters? It looks pretty weird actually.

@HyukjinKwon (Member, Author) commented Nov 20, 2016

Some of them already had backticks in the description and others did not. Matching them up with backticks or double quotes was initially suggested by some comments such as #15677 (comment), #15513 (comment), and #15677 (comment). I could not find a concrete reason not to follow that. I can remove all of them if you confirm.

@gatorsmile (Member)

Normally, most RDBMS docs use italic fonts for parameter/argument names. Linux man pages use underscores to highlight argument names in descriptions. I have also seen another way:

``expr1''

I do not know which way is better for our function descriptions.

@rxin (Contributor) commented Nov 20, 2016

These function docs are printed onto the console without formatting.

@gatorsmile (Member)

True. That is why I do not know which is the best way for us to show the argument/parameter names.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ssion and improve documentation

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…stently with expressions

@HyukjinKwon deleted the SPARK-17963-1 branch on January 2, 2018.