[SPARK-5364] [SQL] HiveQL transform doesn't support the non output clause #4158

chenghao-intel · 2015-01-22T08:22:01Z

This is a quick fix for queries (in HiveContext) such as:

SELECT transform(key + 1, value) USING '/bin/cat' FROM src;
SELECT a, b from (SELECT transform(key + 1, value) using '/bin/cat' as (a float, b string) from src) t where a = 239.0;

Ideally, ScriptTransformation should be refactored to support a custom SerDe for both the reader and the writer. I will do that in a follow-up.
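For readers unfamiliar with TRANSFORM, here is a minimal Python sketch of what the second query above does conceptually. It assumes Hive's default text format (columns joined by tabs, one row per line); the helper names `transform` and `with_schema` and the sample rows in `src` are hypothetical and only serve the illustration. This is not the ScriptTransformation implementation.

```python
import subprocess


def transform(rows, script=("/bin/cat",)):
    """Feed rows to `script` as tab-separated lines and read its stdout back."""
    payload = "".join("\t".join(str(col) for col in row) + "\n" for row in rows)
    result = subprocess.run(
        list(script), input=payload, capture_output=True, text=True, check=True
    )
    # Every field comes back as a string, split on tabs.
    return [line.split("\t") for line in result.stdout.splitlines()]


def with_schema(fields, casts):
    """Apply an output clause such as `as (a float, b string)` to one row."""
    return tuple(cast(field) for cast, field in zip(casts, fields))


# Hypothetical sample of the `src` table (key int, value string).
src = [(238, "val_238"), (86, "val_86")]

# SELECT transform(key + 1, value) USING '/bin/cat' AS (a float, b string) FROM src
out = transform([(key + 1, value) for key, value in src])
typed = [with_schema(fields, (float, str)) for fields in out]

# ... WHERE a = 239.0
matches = [(a, b) for a, b in typed if a == 239.0]
```

Because the script only ever sees and emits text, the cast to the declared output types happens after the script output is read back.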

SparkQA · 2015-01-22T08:22:42Z

Test build #25959 has started for PR 4158 at commit c8fe7fc.

This patch merges cleanly.

viirya · 2015-01-22T08:55:57Z

Hi @chenghao-intel, I already did this, along with support for custom field delimiters and SerDes, in PR #4014.

SparkQA · 2015-01-22T09:27:18Z

Test build #25959 has finished for PR 4158 at commit c8fe7fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-22T09:27:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25959/

chenghao-intel · 2015-01-23T00:36:17Z

Thank you @viirya. This is just a quick fix for my use case; I hope it gets merged soon. I will also leave some comments on your PR.

SparkQA · 2015-01-23T05:32:42Z

Test build #25996 has started for PR 4158 at commit a7b6989.

This patch merges cleanly.

chenghao-intel · 2015-01-23T05:36:21Z

@viirya I've updated the code. This is a blocking issue for our partner, so it would be great if you could review it for me. The TODOs I left there can definitely be done in your PR #4014.

SparkQA · 2015-01-23T06:36:18Z

Test build #25996 has finished for PR 4158 at commit a7b6989.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T06:36:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25996/

viirya · 2015-01-23T11:31:46Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala

According to the Hive manual, there should be only two output columns, key and value, when no output schema is defined. So I am not sure whether this is a bug, since it is explicitly described in the manual. I suppose it is a well-known and expected behavior?

chenghao-intel · 2015-01-26

Thanks for noticing that. I think this is probably a bug in Hive.
I ran these queries in the Hive CLI:

set hive.cli.print.header=true; select transform(key + 1, key - 1, key) using '/bin/cat' from src limit 4;

create table test2 as select transform(key + 1, key - 1, key) using '/bin/cat' from src limit 4;

And printed the contents of the table test2:

As you will see, the result is not the expected key and value; that's why I added default field names for more than 2 columns.

viirya · 2015-01-26

I think these are the expected results, as the Hive manual describes under 'Schema-less Map-reduce Scripts' for transform:

If there is no AS clause after USING my_script, Hive assumes that the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab.

So in your results, the value column gets all of the query output after the first tab. The contents of table test2 only look different because of an alignment problem caused by the tabs. It should follow the same rule too.

chenghao-intel · 2015-01-26

OK, I see. Thanks for the explanation. I will update that.
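The schema-less rule quoted from the Hive manual in this thread can be sketched in a few lines of Python. The function name `split_schemaless` is hypothetical, and returning `None` for a line that contains no tab is an assumption of this sketch rather than something stated in the thread.

```python
def split_schemaless(line):
    """Split one line of script output into (key, value) at the first tab only."""
    key, sep, rest = line.rstrip("\n").partition("\t")
    # Assumption: a line without any tab yields no value at all.
    return key, (rest if sep else None)


# transform(key + 1, key - 1, key) for key = 238, as emitted by '/bin/cat'
row = "\t".join(str(col) for col in (239, 237, 238))
key, value = split_schemaless(row)
```

Here `value` still contains an embedded tab (`"237\t238"`), which is why the printed result can look like three columns even though only key and value exist.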

viirya · 2015-01-23T11:34:38Z

@chenghao-intel Overall it looks good to me, except for some small comments.

SparkQA · 2015-01-26T01:12:48Z

Test build #26069 has started for PR 4158 at commit 5618fa7.

This patch merges cleanly.

SparkQA · 2015-01-26T02:20:23Z

Test build #26069 has finished for PR 4158 at commit 5618fa7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-26T02:20:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26069/

chenghao-intel · 2015-02-03T00:25:33Z

Closing this since #4014 has been merged.

fix bug of transform in HiveQL

c8fe7fc

Solve buf of the data type casting exception for transform

a7b6989

viirya reviewed Jan 23, 2015
using partial function instead of the IF_ELSE

5618fa7

chenghao-intel closed this Feb 3, 2015
