Conversation

@HyukjinKwon HyukjinKwon commented Oct 17, 2016

What changes were proposed in this pull request?

This PR proposes to change the documentation for functions.

The changes include

  • Re-indent the documentation
  • Add examples/arguments in the extended description where the arguments are multiple or have a specific format (e.g. XML/JSON).

For example, the documentation was updated as below (a code-level sketch of the annotation change follows these examples):

Functions with single-line usage

Before

  • pow

    Usage: pow(x1, x2) - Raise x1 to the power of x2.
    Extended Usage:
    > SELECT pow(2, 3);
     8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
    No example for current_timestamp.

After

  • pow

    Usage: pow(expr1, expr2) - Raise expr1 to the power of expr2.
    Extended Usage:
        Arguments:
          expr1 - a numeric expression.
          expr2 - a numeric expression.
    
        Examples:
          > SELECT pow(2, 3);
           8.0
  • current_timestamp

    Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
    Extended Usage:
        No example/argument for current_timestamp.

Functions with (already) multi-line usage

Before

  • approx_count_distinct

    Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
        approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
          with relativeSD, the maximum estimation error allowed.
    
    Extended Usage:
    No example for approx_count_distinct.
  • percentile_approx

    Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of percentage must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
    
          percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
          percentile array of column `col` at the given percentage array. Each value of the
          percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
          a positive integer literal which controls approximation accuracy at the cost of memory.
          Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
          the approximation.
    
    Extended Usage:
    No example for percentile_approx.

After

  • approx_count_distinct

    Usage:
        approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
          relativeSD defines the maximum estimation error allowed.
    
    Extended Usage:
        Arguments:
          expr - an expression of any type that represents data to count.
          relativeSD - a numeric literal that defines the maximum estimation error allowed.
  • percentile_approx

    Usage:
        percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
          column `col` at the given percentage. The value of `percentage` must be between 0.0
          and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
          controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
          better accuracy, `1.0/accuracy` is the relative error of the approximation.
          When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
    
    Extended Usage:
        Arguments:
          col - a numeric expression.
          percentage - a numeric literal or an array literal of numeric type that defines the
            percentile. For example, 0.5 means 50-percentile.
          accuracy - a numeric literal.
    
        Examples:
          > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
           [10.0,10.0,10.0]
          > SELECT percentile_approx(10.0, 0.5, 100);
           10.0
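
In code, each of these changes boils down to rewriting the @ExpressionDescription annotation on the expression class. A minimal sketch for the pow case above, following the annotation format quoted in the review comments below (the exact wording in the patch may differ, and the class declaration is abbreviated):

    // Sketch only: usage stays a one-liner; arguments and examples move to extended.
    @ExpressionDescription(
      usage = "_FUNC_(expr1, expr2) - Raise expr1 to the power of expr2.",
      extended = """
        Arguments:
          expr1 - a numeric expression.
          expr2 - a numeric expression.

        Examples:
          > SELECT _FUNC_(2, 3);
           8.0
      """)
    case class Pow(left: Expression, right: Expression)
      extends BinaryMathExpression(math.pow, "POWER")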

How was this patch tested?

Manually tested

When there are multiple examples

spark-sql> describe function extended reflect;
Function: reflect
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Extended Usage:
    Arguments:
      class - a string literal that represents a fully-qualified class name.
      method - a string literal that represents a method name.
      arg - a boolean, string or numeric expression except decimal that represents an argument for
        the method.

    Examples:
      > SELECT reflect('java.util.UUID', 'randomUUID');
       c33fb387-8500-4bfa-81d2-6e0e3e930df2
      > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
       a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
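
For context, reflect() conceptually performs a plain Java reflective call. A hedged Scala sketch of the equivalent lookup (not Spark's actual CallMethodViaReflection implementation):

    // Invoke a static method by its fully-qualified class name and method name.
    val clazz = Class.forName("java.util.UUID")
    val method = clazz.getMethod("randomUUID") // look up the zero-argument static method
    val result = method.invoke(null)           // null receiver: static invocation
    println(result)                            // e.g. c33fb387-8500-4bfa-81d2-6e0e3e930df2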

When Usage is a single line

spark-sql> describe function extended min;
Function: min
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Usage: min(expr) - Returns the minimum value of `expr`.
Extended Usage:
    Arguments:
      expr - an expression of any type.

When Usage already spans multiple lines

spark-sql> describe function extended percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
    percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
      column `col` at the given percentage. The value of `percentage` must be between 0.0
      and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
      better accuracy, `1.0/accuracy` is the relative error of the approximation.
      When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.

Extended Usage:
    Arguments:
      col - a numeric expression.
      percentage - a numeric literal or an array literal of numeric type that defines the
        percentile. For example, 0.5 means 50-percentile.
      accuracy - a numeric literal.

    Examples:
      > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
       [10.0,10.0,10.0]
      > SELECT percentile_approx(10.0, 0.5, 100);
       10.0

When an example/argument is missing

spark-sql> describe function extended rank;
Function: rank
Class: org.apache.spark.sql.catalyst.expressions.Rank
Usage:
    rank() - Computes the rank of a value in a group of values. The result is one plus the number
      of rows preceding or equal to the current row in the ordering of the partition. The values
      will produce gaps in the sequence.

Extended Usage:
    No example/argument for rank.

HyukjinKwon commented Oct 17, 2016

(This PR is not complete yet.) I am sorry for asking repeatedly, but this one is slightly different from the JIRA one. cc @rxin @srowen Could you please check whether this one is preferable? I am hesitant to go further without confirmation because the change would be large.

Member

documentation -> document?
xpath -> XPath
(Here and in a few other places)

Member

Nit: while we're here, should this say 'set' as well?

Member

Super nit but while you're changing this, there's no reason to capitalize skewness

Member

Same with kurtosis, they're not proper nouns

Member

Should this say 'list'? And it doesn't seem like they're necessarily unique, given the existing text?

Member Author

Oh, sorry, it was a typo. Yes, it should be list.

HyukjinKwon commented Oct 17, 2016

@srowen I will definitely double-check the changes here before proceeding further. Thank you for reviewing this before it got too big. I will proceed with this in 1-2 days.

Please let me know if anyone feels the format doesn't look good or should be fixed.

srowen commented Oct 17, 2016

I suppose I don't know the conventions here well, but the format looks better in your change, and more parameter documentation seems helpful.

Contributor

what does "type type" mean?

rxin commented Oct 17, 2016

I find this too verbose for the basic one. When looking at the basic one, I want a one-liner explanation, because it can also show up in the full list of UDFs.

I'd put the detailed argument types, etc. into the extended part rather than in the basic one.

SparkQA commented Oct 17, 2016

Test build #67071 has finished for PR 15513 at commit 2059374.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Skewness(expr: Expression) extends CentralMomentAgg(expr)

HyukjinKwon commented Oct 18, 2016

Thank you @rxin, I just updated the PR description. I left the usage as it was where it was already a single line, and just re-indented the multi-line ones. Also, I moved the arguments and examples into the extended part. I will proceed in a few days, in case the format should be fixed further.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-17963][SQL][Documentation] Add examples (extend) in each function and improve documentation with arguments [WIP][SPARK-17963][SQL][Documentation] Add examples (extend) in each function and improve documentation with arguments (not ready for code review but just format) Oct 18, 2016

SparkQA commented Oct 18, 2016

Test build #67130 has finished for PR 15513 at commit 22c7bcf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67199 has finished for PR 15513 at commit e314d09.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67204 has finished for PR 15513 at commit e9f98cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Oh, FWIW, this build seems to be failing legitimately in some tests. I will fix them, just to avoid being misleading.

SparkQA commented Oct 20, 2016

Test build #3366 has finished for PR 15513 at commit e9f98cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extended = """
Arguments:
expr1 - an expression of any type.
expr2 - an expression of any type.
Member

Are you sure it can support any type?

For logical operations (AND, OR, or others), I think the only acceptable type is boolean.

Member Author

I should change it; it was my mistake. Thanks!

@gatorsmile
Member

Could you take another pass at the changes, especially the argument types? I think this PR still has many related issues.

HyukjinKwon commented Oct 27, 2016

I mostly ran the examples myself whenever I was in doubt, so I expect they are mostly okay. Still, at least one major issue was identified above, so I will definitely look into this closely again and be back soon.

BTW, there were comments about argument descriptions (not typos but semantic changes): #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment) and #15513 (comment) (unless I missed a couple).

The valid ones are #15513 (comment) and #15513 (comment); only one of them is major (an inappropriate type) and the other is minor (taking the decimal type out of the numeric types). So I don't think this implies the PR has many related issues.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17963][SQL][Documentation] Add examples (extend) in each expression and improve documentation with arguments [WIP][SPARK-17963][SQL][Documentation] Add examples (extend) in each expression and improve documentation with arguments Oct 27, 2016
@HyukjinKwon
Member Author

I will take another look closely as suggested and then will let you all know.

SparkQA commented Oct 27, 2016

Test build #67612 has finished for PR 15513 at commit 498d69c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 27, 2016

Test build #3376 has finished for PR 15513 at commit 498d69c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-17963][SQL][Documentation] Add examples (extend) in each expression and improve documentation with arguments [SPARK-17963][SQL][Documentation] Add examples (extend) in each expression and improve documentation with arguments Oct 27, 2016
@HyukjinKwon
Member Author

I took another look and it seems generally fine. Could you all take a look?

SparkQA commented Oct 27, 2016

Test build #67646 has finished for PR 15513 at commit 400cee5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

SparkQA commented Oct 28, 2016

Test build #67674 has finished for PR 15513 at commit 400cee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Will review it tomorrow. Thanks!

HyukjinKwon commented Oct 28, 2016

Thanks @gatorsmile. Just FYI, I would like to note the rule I used for argument types (just to avoid extra effort when you review).

As suggested, I did not mention implicit casting. Instead, I did my best to describe the types using abstract terms such as numeric or integral, borrowed from the actual class names such as NumericType or IntegralType (scala.math.Integral and scala.math.Numeric), where possible. When that was not possible, I noted the original type each function requires.

For example, if a function takes an integer literal as its argument but allows implicit casting,
I noted it as a numeric literal, because other numeric literals such as 1BD, 1.0D, 1, and 1L are allowed (and note that the string literal "1" is also allowed).

Another reason I used this rule is to ease a potential future documentation update that mentions implicit casting. For example,

a numeric expression.

could easily be updated as below:

a numeric expression or any non-numeric type that can be implicitly cast to a numeric expression.

Another example would be..

a timestamp expression.

This could easily be updated as below:

a timestamp expression or any type that can be implicitly cast to a timestamp expression.
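
As a quick illustration of this rule (a hedged sketch, assuming a spark-shell session where `spark` is in scope; the commented results follow Spark's implicit cast rules but are not re-verified here):

    // pow is documented with "a numeric expression", but literals of other
    // numeric types, and even a string literal, are accepted via implicit casts.
    spark.sql("SELECT pow(2, 3)").show()     // 8.0 - integer literals
    spark.sql("SELECT pow(2L, 3.0D)").show() // 8.0 - long and double literals
    spark.sql("SELECT pow('2', 3)").show()   // 8.0 - string '2' cast to double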

@ExpressionDescription(
usage = "_FUNC_(date, fmt) - Returns returns date with the time portion of the day truncated to the unit specified by the format model fmt.",
extended = "> SELECT _FUNC_('2009-02-12', 'MM')\n '2009-02-01'\n> SELECT _FUNC_('2015-10-27', 'YEAR');\n '2015-01-01'")
usage = "_FUNC_(date, fmt) - Returns returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.",
Member

Returns returns -> Returns

@gatorsmile
Member

I still found a general issue in the type description.

"an expression of any type" appears 54 times in this PR. However, have you checked whether these work well for complex types, e.g. struct, array, and map?

HyukjinKwon commented Oct 28, 2016

Yes, I have. Could you point out an instance? If any are wrong, I will fix them and double-check the similar instances.

usage = "_FUNC_(expr AS type) - Casts the value `expr` to the target data type `type`.",
extended = """
Arguments:
expr - an expression of any type.
Member Author

spark-sql> SELECT cast(array(1) as string), cast(struct(1) as string), cast(map(1,1) as string);
[1] [1] keys: [1], values: [1]

""",
extended = """
Arguments:
expr - an expression of any type that represents data to count.
Member Author

spark-sql> SELECT count(array(1)), count(struct(1)), count(map(1,1));
1   1   1

""",
extended = """
Arguments:
expr - an expression of any type that represents data to collect the first.
Member Author

spark-sql> SELECT first(array(1)), first(struct(1)), first(map(1,1));
[1] {"col1":1}  {1:1}

""",
extended = """
Arguments:
expr - an expression of any type that represents data to count.
Member Author

spark-sql> SELECT approx_count_distinct(array(1)), approx_count_distinct(struct(1)), approx_count_distinct(map(1,1));
1   1   1

usage = "_FUNC_(expr) - Collects and returns a list of non-unique elements.",
extended = """
Arguments:
expr - an expression of any type that represents data to collect as a list.
Member Author

spark-sql> SELECT collect_list(array(1)), collect_list(struct(1)), collect_list(map(1, 1));
[[1]]   [{"col1":1}]    [{1:1}]

extended = """
Arguments:
expr1 - an expression of any type.
expr2 - an expression of any type.
Member Author

spark-sql> SELECT array(1) = array(1),  struct(1) = struct(1),  map(1, 1) = map(1, 1);
true    true    false

extended = """
Arguments:
expr1 - an expression of any type.
expr2 - an expression of any type.
Member Author

spark-sql> SELECT array(1) <=> array(1),  struct(1) <=> struct(1),  map(1, 1) <=> map(1, 1);
true    true    false

extended = """
Arguments:
strfmt - a string expression.
obj - an expression of any type.
Member Author

spark-sql> SELECT format_string("Hello World %d %s", 100, array(1), struct(1), map(1, 1));
Hello World 100 [1]

input - an expression of any type.
offset - a numeric expression. Default is 1.
default - an expression of any type. Default is null.
""")
@HyukjinKwon HyukjinKwon Oct 29, 2016

// Verifying lead() against complex input types (array and struct); assumes a
// spark-shell session, where spark.implicits._ is already imported.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

{
  val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
    .selectExpr("array(value) as value", "key")
  df.select(
    lead("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}
{
  val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
    .selectExpr("struct(value) as value", "key")
  df.select(
    lead("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}

Arguments:
input - an expression of any type.
offset - a numeric expression. Default is 1.
default - an expression of any type. Default is null.
@HyukjinKwon HyukjinKwon Oct 29, 2016

// The same check for lag(); assumes a spark-shell session, where
// spark.implicits._ is already imported.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

{
  val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
    .selectExpr("array(value) as value", "key")
  df.select(
    lag("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}
{
  val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
    .selectExpr("struct(value) as value", "key")
  df.select(
    lag("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}

HyukjinKwon commented Oct 29, 2016

Let me close this and open another PR. It has become really messy.

SparkQA commented Oct 29, 2016

Test build #67732 has finished for PR 15513 at commit 2b437fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-17963 branch January 2, 2018 03:44