Skip to content

Conversation

@jashgala
Copy link
Contributor

@jashgala jashgala commented Feb 8, 2019

What changes were proposed in this pull request?

This PR addresses SPARK-23619: https://issues.apache.org/jira/browse/SPARK-23619

It adds additional comments indicating the default column names for the explode and posexplode
functions in Spark-SQL.

Functions for which comments have been updated so far:

  • stack
  • inline
  • explode
  • posexplode
  • explode_outer
  • posexplode_outer

How was this patch tested?

This is just a change in the comments. The package builds and tests successfullly after the change.

@maropu
Copy link
Member

maropu commented Feb 9, 2019

You need to update ExpressionDescription if you want to update the docs, e.g.,

usage = "_FUNC_(expr) - Separates the elements of array `expr` into multiple rows, or the elements of map `expr` into multiple rows and columns.",

https://spark.apache.org/docs/2.4.0/api/sql/index.html#explode

@maropu
Copy link
Member

maropu commented Feb 9, 2019

Also, could you check the other functions for the same fix. e.g., explode_outer, ...

@jashgala
Copy link
Contributor Author

jashgala commented Feb 9, 2019

@maropu
Thanks for the guidance.

I've fixed the comments, added comments for explode_outer and posexplode_outer and also added the ExpressionDescriptions.

Please let me know if there is anything else I might've missed.

@maropu
Copy link
Member

maropu commented Feb 11, 2019

Thanks. Could you cover all the related document fixes in this pr? How about Inline and Stack? They have default column names, right?

/**
* Creates a new row for each element in the given array or map column.
* Uses the default column name `col` for elements in the array and
* `key` and `value` for elements in the map unless specified otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update Python doc and R doc while we're here. Take a look for functions.py and functions.R

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... I forgot the viewpoint...thanks.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Feb 11, 2019

Test build #102194 has finished for PR 23748 at commit 4ed035b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 11, 2019

Test build #102195 has finished for PR 23748 at commit 4ed035b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jashgala
Copy link
Contributor Author

@maropu @HyukjinKwon

While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
+----+----+
|col0|col1|
+----+----+
|   1|   2|
|   3|null|
+----+----+


scala>  spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
+----+----+
|col1|col2|
+----+----+
|   1|   a|
|   2|   b|
+----+----+

This feels like an issue with consistency. Is there a reason why this column naming convention is kept different?

@maropu
Copy link
Member

maropu commented Feb 13, 2019

ah, yes. I think it'd better to fix this, too, in a separate pr.
Probably, I think zero-based indexing is more natural. For example, auto-generated column names in CSV read is baed on zero-based;

scala> spark.read.csv("test.csv").show
+---+---+---+                                                                   
|_c0|_c1|_c2|
+---+---+---+
|  1|  2|  3|
+---+---+---+

Could you check if we have other expressions having one-based indexing?

@jashgala
Copy link
Contributor Author

jashgala commented Feb 14, 2019

Have created a JIRA for tracking the consistency fix: https://issues.apache.org/jira/browse/SPARK-26879

Will update the other expressions having one-based indexing on the JIRA itself, so that it can continue as a separate investigation.

@SparkQA
Copy link

SparkQA commented Feb 22, 2019

Test build #102622 has finished for PR 23748 at commit e5751ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 22, 2019

Test build #102623 has finished for PR 23748 at commit 80c0649.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jashgala
Copy link
Contributor Author

@maropu I've added comments for Inline and Stack in the generators.scala file. I didn't see these in the functions.* files though.
@HyukjinKwon Have added comments in Python and R functions files as well as per the review comments above.

I see that there are a lot of functions (e.g. mathematical functions, etc.) in the functions.* files. Is there value in adding the column names in the comments for all of them or do should we make this change specifically for the functions that are specifically used to generate columns/rows (e.g. explode, etc.)?

@HyukjinKwon
Copy link
Member

I haven;t checked super closely but looks fine.

@SparkQA
Copy link

SparkQA commented Feb 22, 2019

Test build #102625 has finished for PR 23748 at commit d0dea9c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 22, 2019

Test build #102624 has finished for PR 23748 at commit d029d3c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@maropu
Copy link
Member

maropu commented Feb 22, 2019

@jashgala Yea, I think we would be nice to fix all the same issues in this pr where possible.

@SparkQA
Copy link

SparkQA commented Feb 22, 2019

Test build #102631 has finished for PR 23748 at commit d0dea9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jashgala
Copy link
Contributor Author

Cool... Will make the change and add more commits!

@jashgala jashgala changed the title [SPARK-23619][SQL] Added default column names for explode & posexplode in comments [WIP][SPARK-23619][SQL] Added default column names for explode & posexplode in comments Feb 25, 2019
@jashgala
Copy link
Contributor Author

jashgala commented Mar 4, 2019

@maropu
Just wanted to update... I'm going through each function, understanding it and then adding the comments... It's taking a bit of a while especially due to real life responsibilities, but I want to reassure I am working on it!

@HyukjinKwon
Copy link
Member

I took a quick look and looks fine. Let me leave it to @maropu since he's been actively reviewing this.

@maropu maropu changed the title [WIP][SPARK-23619][SQL] Added default column names for explode & posexplode in comments [SPARK-23619][SQL] Added default column names for explode & posexplode in comments Mar 6, 2019
@maropu
Copy link
Member

maropu commented Mar 6, 2019

Could you update the PR description; plz list up all the function you addressed in this PR?

@jashgala
Copy link
Contributor Author

jashgala commented Mar 7, 2019

So far I've addressed the following (also updated in PR description):

  • stack
  • inline
  • explode
  • posexplode
  • explode_outer
  • posexplode_outer

The others are still WIP

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Apr 26, 2019

Test build #104934 has finished for PR 23748 at commit d0dea9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-23619][SQL] Added default column names for explode & posexplode in comments [SPARK-23619][DOC] Add output description for some generator expressions / functions Apr 27, 2019
@HyukjinKwon HyukjinKwon changed the title [SPARK-23619][DOC] Add output description for some generator expressions / functions [SPARK-23619][DOCS] Add output description for some generator expressions / functions Apr 27, 2019
@HyukjinKwon
Copy link
Member

Merged to master.

@jashgala
Copy link
Contributor Author

Thanks.... I will fix the other documentation when I have some free time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants