[SPARK-15694] Implement ScriptTransformation in sql/core (part 1) #14702

tejasapatil · 2016-08-18T17:03:33Z

What changes were proposed in this pull request?

Added ScriptTransformationExec, which runs the script operator in SQL mode (without Hive). Because it has to run without Hive, it does not support Hive serdes. ScriptTransformBase holds the code common to ScriptTransformationExec and HiveScriptTransformationExec.

Changes done:

Renamed ScriptTransformation to HiveScriptTransformationExec
Added ScriptTransformationExec, which runs the script operator in SQL mode (without Hive).
- The output of the script is read as a string, and column values are extracted by splitting on a delimiter (default: tab character)
ScriptTransformBase has the common code used across ScriptTransformationExec and HiveScriptTransformationExec
For the thread that writes data to the script, ScriptTransformationWriterThread has the core logic. HiveScriptTransformationWriterThread extends it with Hive-specific behavior.
- ScriptTransformationWriterThread will be used for Spark SQL. It only supports writing data to the script process by serializing column values as a tab-delimited string
- HiveScriptTransformationWriterThread additionally supports Hive serdes
Added a Strategy named Scripts, which emits ScriptTransformationExec in physical plans. This is used in non-Hive mode.
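
The delimiter-based exchange described in the list above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the object name `DelimitedScriptIO`, both function names, and the choice to write nulls as the literal `\N` are assumptions made for the example.

```scala
object DelimitedScriptIO {
  // Default field delimiter mentioned in the PR description: the tab character.
  val DefaultDelimiter: String = "\t"

  // Assumption for this sketch: nulls are written as the literal \N.
  val NullMarker: String = "\\N"

  /** Serialize one row's column values into a single delimited line for the script's stdin. */
  def serializeRow(values: Seq[Any], delimiter: String = DefaultDelimiter): String =
    values.map(v => if (v == null) NullMarker else v.toString).mkString(delimiter)

  /** Split one line of script output back into column values. */
  def parseLine(line: String, delimiter: String = DefaultDelimiter): Seq[String] =
    // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
    line.split(java.util.regex.Pattern.quote(delimiter), -1).toSeq
}
```

Note that `serializeRow` relies on `toString`, so it is only a reasonable stand-in for values whose string form is already what the script expects.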

Future TODOs:

Support some notion of Serde in ScriptTransformationExec
For Hive, by default only serdes must be used
Cleanup past hacks that are observed (and people suggest / report)
- Move LogicalPlanToSQLSuite out of hive module and put inside sql
Use code-gen projection to serialize rows to output stream (suggestion by @hvanhovell )

How was this patch tested?

Added ScriptTransformationExecSuite
Changed HiveScriptTransformationExecSuite to use HiveScriptTransformationExec

SparkQA · 2016-08-18T18:53:05Z

Test build #63995 has finished for PR 14702 at commit 9dde09d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2016-08-18T20:13:32Z

cc @rxin : who would be the best person to review this PR ?

rxin · 2016-08-19T05:58:56Z

Can you update the description to say more about what this pr includes, and what future todos are?

rxin · 2016-08-19T06:07:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ScriptTransformationExec.scala

can we create an execution.script package?

I suspect you will need to copy a bunch of things from Hive in follow-up prs. Might make sense to have a namespace for script related functionalities.

tejasapatil · 2016-08-23T22:55:18Z

@rxin : I have updated the description to include more info on changes done and future todos

SparkQA · 2016-08-23T23:02:11Z

Test build #64313 has finished for PR 14702 at commit 9863c7d.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ScriptTransformationExec(
- case class HiveScriptTransformationExec(
- class HiveScriptTransformationWriterThread(
- case class HiveScriptIOSchema (

SparkQA · 2016-08-24T00:18:20Z

Test build #64318 has finished for PR 14702 at commit 3e97ec3.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-24T02:34:43Z

Test build #64321 has finished for PR 14702 at commit 9afbd5e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-08-25T05:41:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/script/ScriptTransformationExec.scala

document what these parameters mean?

e.g. what does schemaLess mean, and what do the seqs of string tuples mean for inputRowFormat and outputRowFormat?

documented the params

rxin · 2016-08-25T05:42:54Z

This looks reasonable.

cc @hvanhovell to take a look.

hvanhovell · 2016-08-25T12:01:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/script/ScriptTransformationExec.scala

This would call toString() on our internal representation of a value. This will lead to unexpected results as soon as you would use a Date or a Timestamp. See the discussion (and potential solution) in PR #14279.

We could also do this using a (code generated) projection. The beauty there would be that we could make it return a single UTF8String which can be dumped straight into the outputstream.

@hvanhovell : for now I am not going the code-gen route as it will make the diff bigger and delay things. I am doing what PR #14279 did and adding code-gen as Future TODOs in PR description.
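
The concern about calling toString() on internal values can be illustrated with a small standalone sketch. It assumes, as in Spark SQL's internal row format, that a date is held as an Int of days since 1970-01-01 and a timestamp as a Long of microseconds since the epoch. The object and function names are hypothetical and do not come from this PR or from PR #14279.

```scala
import java.time.{Instant, LocalDate}
import java.time.temporal.ChronoUnit

object InternalValueToString {
  // Assumed internal encodings: days since epoch for dates, microseconds since epoch for timestamps.
  def internalDate(d: LocalDate): Int = d.toEpochDay.toInt
  def internalTimestamp(i: Instant): Long = ChronoUnit.MICROS.between(Instant.EPOCH, i)

  // Naive serialization: toString on the internal value yields a bare number.
  def naive(internal: Any): String = internal.toString

  // Type-aware rendering: convert the internal value back before printing.
  def renderDate(days: Int): String = LocalDate.ofEpochDay(days.toLong).toString
  def renderTimestamp(micros: Long): String =
    Instant.EPOCH.plus(micros, ChronoUnit.MICROS).toString
}
```

With these helpers, `naive(internalDate(LocalDate.of(2016, 8, 25)))` produces a day count rather than `2016-08-25`, which is the kind of surprise a script on the receiving end would see.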

SparkQA · 2016-09-07T01:07:51Z

Test build #65011 has finished for PR 14702 at commit 9afbd5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-09T02:52:06Z

Test build #65127 has finished for PR 14702 at commit f5256dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2016-09-14T17:06:03Z

@hvanhovell ping !!

hvanhovell · 2016-09-14T18:51:11Z

@tejasapatil I'll take another pass tomorrow (CET).

rxin · 2016-10-10T07:18:43Z

@tejasapatil I was checking with @hvanhovell. We should merge this one soon. Mind bringing it up to date?

tejasapatil · 2016-10-10T22:21:38Z

Jenkins test this please

SparkQA · 2016-10-11T00:06:19Z

Test build #66689 has finished for PR 14702 at commit c7741f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2016-10-11T02:45:02Z

jenkins test this please.

Failed test from earlier run was in KafkaSourceStressSuite which I don't see being related to this PR.

gatorsmile · 2016-10-11T05:58:51Z

retest this please

SparkQA · 2016-10-11T07:39:55Z

Test build #66723 has finished for PR 14702 at commit c7741f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-10-13T04:55:40Z

@tejasapatil looks like there is a legitimate failing test.

SparkQA · 2017-01-12T01:37:22Z

Test build #71231 has finished for PR 14702 at commit 704e4a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-01-12T16:48:08Z

can anyone please review this PR ?

SparkQA · 2017-01-22T23:15:14Z

Test build #71814 has finished for PR 14702 at commit a6e9e39.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-24T18:43:14Z

Test build #71941 has finished for PR 14702 at commit d9047f0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-01-24T20:02:15Z

jenkins retest please

The test failure from the build (HiveSparkSubmitSuite, "set hive.metastore.warehouse.dir") is unrelated to the change.

tejasapatil · 2017-01-28T23:09:46Z

Jenkins retest this please

SparkQA · 2017-01-29T01:36:08Z

Test build #72115 has finished for PR 14702 at commit d9047f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-02-03T01:44:34Z

can anyone please review this PR ?

gatorsmile · 2017-02-03T05:47:54Z

I will try to review it in the next few days. Thanks for working on it!

gatorsmile · 2017-02-05T01:22:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/script/ScriptTransformationExec.scala

+private[sql]
+object ScriptTransformIOSchema {
+  def apply(input: ScriptInputOutputSchema): ScriptTransformIOSchema = {
+    new ScriptTransformIOSchema(


case class ScriptInputOutputSchema(
    inputRowFormat: Seq[(String, String)],
    outputRowFormat: Seq[(String, String)],
    inputSerdeClass: Option[String],
    outputSerdeClass: Option[String],
    inputSerdeProps: Seq[(String, String)],
    outputSerdeProps: Seq[(String, String)],
    recordReaderClass: Option[String],
    recordWriterClass: Option[String],
    schemaLess: Boolean)

Except for inputRowFormat, outputRowFormat and schemaLess, we ignore all the other fields. I think we should not silently ignore them. For example, we do not respect any user-specified conf values of hive.script.recordreader and hive.script.recordwriter. Thus, could we throw an exception when users set them?
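
One way the suggested check could look is sketched below. The case class fields are copied from the quote above; everything else (the object name, `unsupportedFields`, `validate`, and the choice of `UnsupportedOperationException`) is a hypothetical illustration rather than code from the PR.

```scala
object ScriptSchemaValidation {
  // Field names copied from the ScriptInputOutputSchema quoted in the review comment.
  final case class ScriptInputOutputSchema(
      inputRowFormat: Seq[(String, String)],
      outputRowFormat: Seq[(String, String)],
      inputSerdeClass: Option[String],
      outputSerdeClass: Option[String],
      inputSerdeProps: Seq[(String, String)],
      outputSerdeProps: Seq[(String, String)],
      recordReaderClass: Option[String],
      recordWriterClass: Option[String],
      schemaLess: Boolean)

  /** Hypothetical helper: names of fields that are set but would be ignored without Hive. */
  def unsupportedFields(s: ScriptInputOutputSchema): Seq[String] = Seq(
    "inputSerdeClass" -> s.inputSerdeClass.isDefined,
    "outputSerdeClass" -> s.outputSerdeClass.isDefined,
    "inputSerdeProps" -> s.inputSerdeProps.nonEmpty,
    "outputSerdeProps" -> s.outputSerdeProps.nonEmpty,
    "recordReaderClass" -> s.recordReaderClass.isDefined,
    "recordWriterClass" -> s.recordWriterClass.isDefined
  ).collect { case (name, true) => name }

  /** Fail loudly instead of silently ignoring unsupported settings. */
  def validate(s: ScriptInputOutputSchema): Unit = {
    val bad = unsupportedFields(s)
    if (bad.nonEmpty) {
      throw new UnsupportedOperationException(
        s"Script transformation without Hive does not support: ${bad.mkString(", ")}")
    }
  }
}
```

Reporting every offending field in one message, rather than failing on the first, saves the user from fixing the query one field at a time.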

gatorsmile · 2017-02-05T01:25:40Z

I might need more time to review this PR. Will keep posting my comments in the next week.

gatorsmile · 2017-02-05T04:26:54Z

So far, we do not have any end-to-end test case for ScriptTransformation without enabling Hive support. The test cases we have are all in hive/test: SQLQuerySuite.scala. Could you please use our SQLQueryTestSuite testing framework and add the test cases by creating a new .sql file? For example, we are adding the tests for subqueries in #16798.

gatorsmile · 2017-02-05T04:33:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/script/ScriptTransformationExec.scala

+
+  def execute(sqlContext: SQLContext,
+              child: SparkPlan,
+              schema: StructType): RDD[InternalRow] = {


Nit: the indent issue:

def execute(
    sqlContext: SQLContext,
    child: SparkPlan,
    schema: StructType): RDD[InternalRow] = {

Can we replace sqlContext: SQLContext by hadoopConf: Configuration?

gatorsmile · 2017-02-05T05:10:47Z

Why are we still keeping this file: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala ?

Shouldn't your PR remove it or rename it to HiveScriptTransformationExec.scala?

HyukjinKwon · 2017-05-11T13:13:48Z

@tejasapatil gentle ping for the comments above.

tejasapatil · 2017-05-11T14:08:50Z

I don't see myself getting time to work on this. I will close the PR for now and revisit it in the future.

AngersZhuuuu · 2020-03-21T13:17:01Z

@HyukjinKwon
Can I continue this work? Is anyone else working on this now? I think I know enough about how Spark uses transform now, and I am willing to work on this.

@tejasapatil
Do you mind if I continue your work, building on your code?

tejasapatil · 2020-03-22T21:27:46Z

@AngersZhuuuu : Sure. I will be happy if you can take over the PR and make it better. Also, open for questions / reviews.

tejasapatil changed the title ~~[SPARK-15694] Implement ScriptTransformation in sql/core~~ [SPARK-15694] Implement ScriptTransformation in sql/core (part 1) Aug 18, 2016

rxin reviewed Aug 19, 2016

tejasapatil force-pushed the SPARK-15694_Transform branch from 9dde09d to 9863c7d Compare August 23, 2016 22:53

tejasapatil force-pushed the SPARK-15694_Transform branch from 3e97ec3 to 9afbd5e Compare August 24, 2016 01:01

rxin reviewed Aug 25, 2016

hvanhovell reviewed Aug 25, 2016

tejasapatil force-pushed the SPARK-15694_Transform branch from 9afbd5e to f5256dd Compare September 9, 2016 00:53

tejasapatil force-pushed the SPARK-15694_Transform branch from f5256dd to c7741f9 Compare October 10, 2016 22:21

tejasapatil force-pushed the SPARK-15694_Transform branch from 704e4a3 to a6e9e39 Compare January 22, 2017 23:04

tejasapatil added 9 commits January 24, 2017 08:59

SPARK-15694 Implement ScriptTransformation in sql/core

3da8330

remove exrta change

1e07095

Moved classes from execution to execution.script

dfd978c

compilation issues

07e7fed

review comments

52cdc6c

review comment : override `outputPartitioning

b9db1bd

review comment : override outputPartitioning

fb3448c

rebase

1fc7d18

build failure

d9047f0

tejasapatil force-pushed the SPARK-15694_Transform branch from a6e9e39 to d9047f0 Compare January 24, 2017 17:09

gatorsmile reviewed Feb 5, 2017


tejasapatil closed this May 11, 2017
