-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17963][SQL][Documentation] Add examples (extend) in each expression and improve documentation with arguments #15513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documentation -> document?
xpath -> XPath
(Here and in a few other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: while we're here, should this say 'set' as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nit but while you're changing this, there's no reason to capitalize skewness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with kurtosis, they're not proper nouns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this say 'list'? and it doesn't seem like they're unique, necessarily, given existing text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, it was a typo. Yes, it should be list.
|
@srowen I will definitely double check the changes here before proceeding further. Thank you for your review before getting this too big. I will proceed this in 1-2 days. Please let me know if anyone feels the format looks not nice or should be fixed. |
|
I suppose I don't know the conventions here well, but, the format looks better in your change, and more params documentation seems helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does "type type" mean?
|
I find this too verbose for the basic one. When looking at the basic one, I want an one-liner explanation because it can also show up in the entire list of UDFs. I'd put the detailed argument type, etc into the extended part, rather than in basic. |
|
Test build #67071 has finished for PR 15513 at commit
|
|
Thank you @rxin, I just updated the PR description. I left the usage already having single-line usage as it was but just will indent the multiple line ones. Also, I moved the arguments and examples into extended part. Will proceed soon in few days in case the format should be fixed more. |
|
Test build #67130 has finished for PR 15513 at commit
|
|
Test build #67199 has finished for PR 15513 at commit
|
|
Test build #67204 has finished for PR 15513 at commit
|
|
Oh, FWIW, this build seems being failed correctly in some tests. I will fix them just to prevent misleading. |
|
Test build #3366 has finished for PR 15513 at commit
|
| extended = """ | ||
| Arguments: | ||
| expr1 - an expression of any type. | ||
| expr2 - an expression of any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure it can support any type?
For logical operations (AND, OR or others), I think the only acceptable types are boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should change. It was my mistake. Thanks!
|
Could you take another pass at changes? especially the argument types. I think this PR still has many related issues. |
|
I mostly ran it by myself when I was in doubt so I guess it'd be mostly okay. At least, one major issue was identified above, so I will definitely look into this closely again and be back soon. BTW, there were comments about argument description (not about a typo but semantic change), #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment), #15513 (comment) and #15513 (comment) (if I haven't missed a couple of ones). The valid ones are both #15513 (comment) and #15513 (comment) which I guess only one is the major one (inappropriate type) and the other one is minor (to take out the decimal type from numeric type). I guess this might not imply that it has many related issues about this. |
|
I will take another look closely as suggested and then will let you all know. |
|
Test build #67612 has finished for PR 15513 at commit
|
|
Test build #3376 has finished for PR 15513 at commit
|
|
I took another look and It seems generally fine. Could you take a look all? |
|
Test build #67646 has finished for PR 15513 at commit
|
|
retest this please |
|
Test build #67674 has finished for PR 15513 at commit
|
|
Will review it tomorrow. Thanks! |
|
Thanks @gatorsmile. Just FYI, I would like to note the rule I used for argument types (just to avoid extra efforts when you review). As we all know, I did not mention implicit casting as suggested. So, I kind of did the best efforts to describe this by using the abstract terms for types such as For example, If a function takes Another reason why I used this rule is, for the potential documentation update to mention implicit casting in the future. For example, would be easily updated as below: Another example would be.. This would be easily updated as below: |
| @ExpressionDescription( | ||
| usage = "_FUNC_(date, fmt) - Returns returns date with the time portion of the day truncated to the unit specified by the format model fmt.", | ||
| extended = "> SELECT _FUNC_('2009-02-12', 'MM')\n '2009-02-01'\n> SELECT _FUNC_('2015-10-27', 'YEAR');\n '2015-01-01'") | ||
| usage = "_FUNC_(date, fmt) - Returns returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns returns -> Returns
|
I still found a general issue in the type description.
|
|
Yes, I have. Could you point out an instance? I will fix them and double check the same instances. If wrong. |
| usage = "_FUNC_(expr AS type) - Casts the value `expr` to the target data type `type`.", | ||
| extended = """ | ||
| Arguments: | ||
| expr - an expression of any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT cast(array(1) as string), cast(struct(1) as string), cast(map(1,1) as string);
[1] [1] keys: [1], values: [1]| """, | ||
| extended = """ | ||
| Arguments: | ||
| expr - an expression of any type that represents data to count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT count(array(1)), count(struct(1)), count(map(1,1));
1 1 1| """, | ||
| extended = """ | ||
| Arguments: | ||
| expr - an expression of any type that represents data to collect the first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT first(array(1)), first(struct(1)), first(map(1,1));
[1] {"col1":1} {1:1}| """, | ||
| extended = """ | ||
| Arguments: | ||
| expr - an expression of any type that represents data to count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT approx_count_distinct(array(1)), approx_count_distinct(struct(1)), approx_count_distinct(map(1,1));
1 1 1| usage = "_FUNC_(expr) - Collects and returns a list of non-unique elements.", | ||
| extended = """ | ||
| Arguments: | ||
| expr - an expression of any type that represents data to collect as a list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT collect_list(array(1)), collect_list(struct(1)), collect_list(map(1, 1));
[[1]] [{"col1":1}] [{1:1}]| extended = """ | ||
| Arguments: | ||
| expr1 - an expression of any type. | ||
| expr2 - an expression of any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT array(1) = array(1), struct(1) = struct(1), map(1, 1) = map(1, 1);
true true false| extended = """ | ||
| Arguments: | ||
| expr1 - an expression of any type. | ||
| expr2 - an expression of any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT array(1) <=> array(1), struct(1) <=> struct(1), map(1, 1) <=> map(1, 1);
true true false| extended = """ | ||
| Arguments: | ||
| strfmt - a string expression. | ||
| obj - an expression of any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> SELECT format_string("Hello World %d %s", 100, array(1), struct(1), map(1, 1));
Hello World 100 [1]| input - an expression of any type. | ||
| offset - a numeric expression. Default is 1. | ||
| default - an expression of any type. Default is null. | ||
| """) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
.selectExpr("array(value) as value", "key")
df.select(
lead("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}
{
val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
.selectExpr("struct(value) as value", "key")
df.select(
lead("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}| Arguments: | ||
| input - an expression of any type. | ||
| offset - a numeric expression. Default is 1. | ||
| default - an expression of any type. Default is null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
.selectExpr("array(value) as value", "key")
df.select(
lag("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}
{
val df = Seq((1, "1"), (2, "2"), (1, "1"), (2, "2")).toDF("key", "value")
.selectExpr("struct(value) as value", "key")
df.select(
lag("value", 1).over(Window.partitionBy($"key").orderBy($"value"))).show()
}|
Let me close and reopen another. It seems really messy. |
|
Test build #67732 has finished for PR 15513 at commit
|
What changes were proposed in this pull request?
This PR proposes to change the documentation for functions.
The changes include
extendedwhere the arguments are multiple or specific format (e.g. xml/ json).For examples, the documentation was updated as below:
Functions with single line usage
Before
powcurrent_timestampAfter
powcurrent_timestampFunctions with (already) multiple line usage
Before
approx_count_distinctpercentile_approxUsage: percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate percentile array of column `col` at the given percentage array. Each value of the percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. Extended Usage: No example for percentile_approx.After
approx_count_distinctUsage: approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD defines the maximum estimation error allowed. Extended Usage: Arguments: expr - an expression of any type that represents data to count. relativeSD - a numeric literal that defines the maximum estimation error allowed.percentile_approxUsage: percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column `col` at the given percentage. The value of `percentage` must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0. Extended Usage: Arguments: col - a numeric expression. percentage - a numeric literal or an array literal of numeric type that defines the percentile. For example, 0.5 means 50-percentile. accuracy - a numeric literal. Examples: > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100); [10.0,10.0,10.0] > SELECT percentile_approx(10.0, 0.5, 100); 10.0How was this patch tested?
Manually tested
When examples are multiple
When
Usageis in single lineWhen
Usageis already in multiple linesWhen example/argument is missing