Skip to content

Conversation

@petermaxlee
Copy link
Contributor

@petermaxlee petermaxlee commented Jun 30, 2016

What changes were proposed in this pull request?

This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.

How was this patch tested?

Added unit tests and end-to-end tests.

@petermaxlee
Copy link
Contributor Author

Can you guys take a look at this? @cloud-fan @dongjoon-hyun @gatorsmile

@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61530 has finished for PR 13991 at commit 66fae41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems the boolean and string one doesn't share same implementation?(xpathUtil.evalBoolean and xpathUtil.evalString)

@gatorsmile
Copy link
Member

Could you add some test cases for the boundary values? That means, trying the maximum and minimum values for each different data type. Also, the values that are out of the boundary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which Exception will we throw? What is the message we issue? When users see the message, can they understand the cause?

@gatorsmile
Copy link
Member

Could you add more test cases for corrupted xml and illegal inputs? So far, the test suite XPathExpressionSuite only covers the basic cases.

@gatorsmile
Copy link
Member

Normally, for each function or data type we implement in an enterprise product, we need to document the limit/restriction. That is why I am asking for the boundary values, negative cases and error handling.

@cloud-fan
Copy link
Contributor

+1 for @gatorsmile 's advices

@petermaxlee
Copy link
Contributor Author

Thanks for the feedback. I will update this later.

@rxin
Copy link
Contributor

rxin commented Jul 8, 2016

For this one I think we should consider supporting only foldable literals for the path component. It's probably satisfy 99.999% of the use cases, and simplify the code.

@rxin
Copy link
Contributor

rxin commented Jul 8, 2016

Also - rather than having concrete implementations for all of these, why don't we use RuntimeReplaceable?

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61947 has finished for PR 13991 at commit 2d48ae5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee
Copy link
Contributor Author

petermaxlee commented Jul 8, 2016

I pushed a new change to this. We now have better error messages and test coverage for those. These expressions also now require foldable paths.

I also changed the test values to make sure we are not restricting the range of values an expression can return (e.g. for XPathInt the test case makes sure it can return a value larger than short). However, I don't think it'd make sense to test overflow behavior because those are mostly undefined throughout Spark SQL.

Re: @rxin's suggestion on using RuntimeReplaceable -- I did try that, but I'm afraid the current implementation of RuntimeReplaceable isn't meant for stacking multiple expressions together, and type checking actually breaks if XPathInt replaces itself with Cast(XPathDouble). I looked into fixing that, but it seemed more complicated and it doesn't actually save much space (it actually takes more code to use RuntimeReplaceable).

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61953 has finished for PR 13991 at commit d30d891.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61954 has finished for PR 13991 at commit 02df488.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee
Copy link
Contributor Author

I just added the general xpath function that returns an array of string too.

@petermaxlee petermaxlee changed the title [SPARK-16318][SQL] Implement various xpath functions [SPARK-16318][SQL] Implement all remaining xpath functions Jul 8, 2016
@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61957 has finished for PR 13991 at commit d7d5f8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class XPathList(xml: Expression, path: Expression) extends XPathExtract

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61958 has finished for PR 13991 at commit 48311bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61967 has finished for PR 13991 at commit a86bc04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61975 has finished for PR 13991 at commit 24deac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee
Copy link
Contributor Author

@hvanhovell can you take a look at this?

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #61992 has finished for PR 13991 at commit fecc293.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 8, 2016

Test build #62000 has finished for PR 13991 at commit 0c60d87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParseUrl(children: Seq[Expression])

@SparkQA
Copy link

SparkQA commented Jul 9, 2016

Test build #3175 has finished for PR 13991 at commit 0c60d87.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParseUrl(children: Seq[Expression])

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM, pending jenkins

@petermaxlee
Copy link
Contributor Author

@cloud-fan Jenkins already ran twice successfully before.

@cloud-fan
Copy link
Contributor

The latest result is This patch does not merge cleanly., I just wanna double check it.

@petermaxlee
Copy link
Contributor Author

I guess whatever generates that message is buggy?

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62070 has finished for PR 13991 at commit 0c60d87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParseUrl(children: Seq[Expression])

@yhuai
Copy link
Contributor

yhuai commented Jul 11, 2016

As a follow-up task. Can you take a look at the following query files and add useful tests in your test? Thanks.

.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/describe_xpath.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/input_testxpath.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/input_testxpath2.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/input_testxpath3.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/input_testxpath4.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_boolean.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_double.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_float.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_int.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_long.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_short.q
.//sql/hive/src/test/resources/ql/src/test/queries/clientpositive/udf_xpath_string.q

@petermaxlee
Copy link
Contributor Author

Actually I created the unit tests based on those.

@petermaxlee
Copy link
Contributor Author

BTW I have to say Hive's test coverage in this area is very spotty, so I don't actually think it's great to follow, but I used those.

@yhuai
Copy link
Contributor

yhuai commented Jul 11, 2016

OK. Thanks. Then, it will be good to add more tests for cases that are not covered by those hive tests.

@asfgit asfgit closed this in 82f0874 Jul 11, 2016
@cloud-fan
Copy link
Contributor

Thanks, merging to master! This doesn't merge clearly to 2.0, @petermaxlee can you submit a new PR against 2.0? thanks!

@petermaxlee
Copy link
Contributor Author

@cloud-fan thanks for merging!

@yhuai I think the degree to which we want to add more tests also depend on how much we trust the library we are using. XPath (with Query) is almost as complicated as SQL itself.

petermaxlee added a commit to petermaxlee/spark that referenced this pull request Jul 11, 2016
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.

Added unit tests and end-to-end tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes apache#13991 from petermaxlee/SPARK-16318.
@petermaxlee
Copy link
Contributor Author

Here it is for branch-2.0.
#14131

asfgit pushed a commit that referenced this pull request Jul 11, 2016
## What changes were proposed in this pull request?
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.

This is based on #13991 but for branch-2.0.

## How was this patch tested?
Added unit tests and end-to-end tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14131 from petermaxlee/xpath-branch-2.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants