@tejasapatil
Contributor

@tejasapatil tejasapatil commented Aug 18, 2016

What changes were proposed in this pull request?

Added ScriptTransformationExec, which runs the script operator in SQL mode (without Hive). Because it has to run without Hive, it does not support Hive serdes. ScriptTransformBase holds the code shared by ScriptTransformationExec and HiveScriptTransformationExec.

Changes done:

  • Renamed ScriptTransformation to HiveScriptTransformationExec
  • Added ScriptTransformationExec, which runs the script operator in SQL mode (without Hive).
    • The output of the script is read as a string, and column values are extracted by splitting on a delimiter (default: tab character)
  • ScriptTransformBase holds the code shared by ScriptTransformationExec and HiveScriptTransformationExec
  • For the thread that writes data to the script, ScriptTransformationWriterThread has the core logic. HiveScriptTransformationWriterThread extends it with the Hive-specific parts.
    • ScriptTransformationWriterThread is used for Spark SQL. It only supports writing data to the script process by serializing column values as a tab-delimited string
    • HiveScriptTransformationWriterThread additionally supports Hive serdes
  • Added a Strategy named Scripts, which emits ScriptTransformationExec in physical plans. It is used in non-Hive mode.
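
To illustrate the delimiter-based I/O described above, here is a minimal sketch. It is not the code in this PR: the object and method names are hypothetical, and it assumes the column values have already been rendered as strings.

```scala
import java.util.regex.Pattern

// Hypothetical helper, for illustration only; not the PR's implementation.
object DelimitedRowSketch {
  // The description states that the default delimiter is the tab character.
  val DefaultDelimiter: String = "\t"

  /** Join already-stringified column values into one line for the script's stdin. */
  def serializeRow(columns: Seq[String], delimiter: String = DefaultDelimiter): String =
    columns.mkString(delimiter)

  /** Split one line of script output back into string column values. */
  def parseLine(line: String, delimiter: String = DefaultDelimiter): Seq[String] =
    // Quote the delimiter so it is not treated as a regex; limit -1 keeps trailing empty columns.
    line.split(Pattern.quote(delimiter), -1).toSeq
}
```

Splitting with a limit of -1 matters here: without it, trailing empty columns would be dropped and the parsed row would have fewer fields than the output schema expects.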

Future TODOs:

  • Support some notion of Serde in ScriptTransformationExec
  • For Hive, only serdes must be used by default
  • Clean up past hacks as they are observed (or as people suggest / report them)
    • Move LogicalPlanToSQLSuite out of hive module and put inside sql
  • Use code-gen projection to serialize rows to output stream (suggestion by @hvanhovell )

How was this patch tested?

  • Added ScriptTransformationExecSuite
  • Updated HiveScriptTransformationExecSuite to use HiveScriptTransformationExec

@SparkQA

SparkQA commented Aug 18, 2016

Test build #63995 has finished for PR 14702 at commit 9dde09d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil changed the title from "[SPARK-15694] Implement ScriptTransformation in sql/core" to "[SPARK-15694] Implement ScriptTransformation in sql/core (part 1)" Aug 18, 2016
@tejasapatil
Contributor Author

cc @rxin: who would be the best person to review this PR?

@rxin
Contributor

rxin commented Aug 19, 2016

Can you update the description to say more about what this PR includes, and what the future TODOs are?

Contributor

can we create an execution.script package?

I suspect you will need to copy a bunch of things from Hive in follow-up prs. Might make sense to have a namespace for script related functionalities.

Contributor Author

Done

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from 9dde09d to 9863c7d Compare August 23, 2016 22:53
@tejasapatil
Contributor Author

@rxin : I have updated the description to include more info on changes done and future todos

@SparkQA

SparkQA commented Aug 23, 2016

Test build #64313 has finished for PR 14702 at commit 9863c7d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ScriptTransformationExec(
    • case class HiveScriptTransformationExec(
    • class HiveScriptTransformationWriterThread(
    • case class HiveScriptIOSchema (

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64318 has finished for PR 14702 at commit 3e97ec3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from 3e97ec3 to 9afbd5e Compare August 24, 2016 01:01
@SparkQA

SparkQA commented Aug 24, 2016

Test build #64321 has finished for PR 14702 at commit 9afbd5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

document what these parameters mean?

e.g. what schemaLess means, and what the Seq of string tuples means for inputRowFormat and outputRowFormat?

Contributor Author

documented the params

@rxin
Contributor

rxin commented Aug 25, 2016

This looks reasonable.

cc @hvanhovell to take a look.

Contributor

This would call toString() on our internal representation of a value. That will lead to unexpected results as soon as you use a Date or a Timestamp. See the discussion (and potential solution) in PR #14279.

Contributor

@hvanhovell hvanhovell Aug 25, 2016

We could also do this using a (code generated) projection. The beauty there would be that we could make it return a single UTF8String which can be dumped straight into the outputstream.

Contributor Author

@hvanhovell: for now I am not going the code-gen route, as it will make the diff bigger and delay things. I am doing what PR #14279 did and adding code-gen to the Future TODOs in the PR description.
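
For context on the toString() concern in this thread: Spark SQL stores a DateType value internally as an Int (days since the Unix epoch) and a TimestampType value as a Long (microseconds since the epoch), so calling toString() on the internal value writes a bare number to the script. The sketch below only illustrates the problem; the helper names are hypothetical, it renders timestamps in UTC for simplicity, and it is not the fix used in PR #14279.

```scala
import java.time.{Instant, LocalDate}
import java.time.temporal.ChronoUnit

// Hypothetical helpers, for illustration only.
object InternalValueRenderingSketch {
  /** What a naive writer would send to the script: the raw internal number. */
  def naive(internalValue: Any): String = internalValue.toString

  /** Render an internal date (days since 1970-01-01) as an ISO date string. */
  def renderDate(daysSinceEpoch: Int): String =
    LocalDate.ofEpochDay(daysSinceEpoch.toLong).toString

  /** Render an internal timestamp (microseconds since the epoch) in UTC. */
  def renderTimestampUtc(microsSinceEpoch: Long): String =
    Instant.EPOCH.plus(microsSinceEpoch, ChronoUnit.MICROS).toString
}
```

For example, the internal date value 17038 corresponds to 2016-08-25, but naive(17038) yields the string "17038", which a downstream script has no way to interpret as a date.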

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65011 has finished for PR 14702 at commit 9afbd5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from 9afbd5e to f5256dd Compare September 9, 2016 00:53
@SparkQA

SparkQA commented Sep 9, 2016

Test build #65127 has finished for PR 14702 at commit f5256dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

tejasapatil commented Sep 14, 2016

@hvanhovell ping !!

@hvanhovell
Contributor

@tejasapatil I'll take another pass tomorrow (CET).

@rxin
Contributor

rxin commented Oct 10, 2016

@tejasapatil I was checking with @hvanhovell. We should merge this one soon. Mind bringing it up to date?

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from f5256dd to c7741f9 Compare October 10, 2016 22:21
@tejasapatil
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66689 has finished for PR 14702 at commit c7741f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

jenkins test this please.

The failed test from the earlier run was in KafkaSourceStressSuite, which does not look related to this PR.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66723 has finished for PR 14702 at commit c7741f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 13, 2016

@tejasapatil looks like there is a legitimate failing test.

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71231 has finished for PR 14702 at commit 704e4a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

can anyone please review this PR ?

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from 704e4a3 to a6e9e39 Compare January 22, 2017 23:04
@SparkQA

SparkQA commented Jan 22, 2017

Test build #71814 has finished for PR 14702 at commit a6e9e39.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil force-pushed the SPARK-15694_Transform branch from a6e9e39 to d9047f0 Compare January 24, 2017 17:09
@SparkQA

SparkQA commented Jan 24, 2017

Test build #71941 has finished for PR 14702 at commit d9047f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

jenkins retest please

The test failure from the build (HiveSparkSubmitSuite: set hive.metastore.warehouse.dir) is unrelated to this change.

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Jan 29, 2017

Test build #72115 has finished for PR 14702 at commit d9047f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

can anyone please review this PR ?

@gatorsmile
Member

I will try to review it in the next few days. Thanks for working on it!

private[sql]
object ScriptTransformIOSchema {
  def apply(input: ScriptInputOutputSchema): ScriptTransformIOSchema = {
    new ScriptTransformIOSchema(
Member

case class ScriptInputOutputSchema(
    inputRowFormat: Seq[(String, String)],
    outputRowFormat: Seq[(String, String)],
    inputSerdeClass: Option[String],
    outputSerdeClass: Option[String],
    inputSerdeProps: Seq[(String, String)],
    outputSerdeProps: Seq[(String, String)],
    recordReaderClass: Option[String],
    recordWriterClass: Option[String],
    schemaLess: Boolean)

Except for inputRowFormat, outputRowFormat and schemaLess, we ignore all the other fields. I think we should not silently ignore them. For example, we do not respect any user-specified conf values of hive.script.recordreader and hive.script.recordwriter. Thus, could we throw an exception when users set them?
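
A rough sketch of the fail-fast check suggested here, under stated assumptions: the case class is copied from the snippet above, while unsupportedFields, validate, and the choice of UnsupportedOperationException are hypothetical stand-ins for whatever exception type Spark would actually use. It only covers the case class fields, not the hive.script.* conf values.

```scala
// Hypothetical validation helper, for illustration only.
object ScriptSchemaValidationSketch {
  // Field list copied from the case class quoted in this review comment.
  case class ScriptInputOutputSchema(
      inputRowFormat: Seq[(String, String)],
      outputRowFormat: Seq[(String, String)],
      inputSerdeClass: Option[String],
      outputSerdeClass: Option[String],
      inputSerdeProps: Seq[(String, String)],
      outputSerdeProps: Seq[(String, String)],
      recordReaderClass: Option[String],
      recordWriterClass: Option[String],
      schemaLess: Boolean)

  /** Names of the fields that are set but would be ignored without Hive support. */
  def unsupportedFields(schema: ScriptInputOutputSchema): Seq[String] =
    Seq(
      "inputSerdeClass" -> schema.inputSerdeClass.isDefined,
      "outputSerdeClass" -> schema.outputSerdeClass.isDefined,
      "inputSerdeProps" -> schema.inputSerdeProps.nonEmpty,
      "outputSerdeProps" -> schema.outputSerdeProps.nonEmpty,
      "recordReaderClass" -> schema.recordReaderClass.isDefined,
      "recordWriterClass" -> schema.recordWriterClass.isDefined
    ).collect { case (name, true) => name }

  /** Fail loudly instead of silently ignoring user-specified settings. */
  def validate(schema: ScriptInputOutputSchema): Unit = {
    val ignored = unsupportedFields(schema)
    if (ignored.nonEmpty) {
      val names = ignored.mkString(", ")
      throw new UnsupportedOperationException(
        s"Script transform without Hive support does not support: $names")
    }
  }
}
```

Listing every offending field in a single error message lets the user fix the whole query in one pass rather than discovering the unsupported options one at a time.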

@gatorsmile
Member

I might need more time to review this PR. Will keep posting my comments in the next week.

@gatorsmile
Member

So far, we do not have any end-to-end test case for ScriptTransformation without enabling Hive support. The test cases we have are all in hive/test: SQLQuerySuite.scala. Could you please use our SQLQueryTestSuite testing framework and add the test cases by creating a new .sql file? For example, we are adding the tests for subqueries in #16798.


def execute(sqlContext: SQLContext,
child: SparkPlan,
schema: StructType): RDD[InternalRow] = {
Member

Nit: the indent issue:

  def execute(
      sqlContext: SQLContext,
      child: SparkPlan,
      schema: StructType): RDD[InternalRow] = {

Member

Can we replace sqlContext: SQLContext by hadoopConf: Configuration?

@gatorsmile
Member

gatorsmile commented Feb 5, 2017

Why are we still keeping this file: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala ?

Your PR needs to either remove it or rename it to HiveScriptTransformationExec.scala.

@HyukjinKwon
Member

HyukjinKwon commented May 11, 2017

@tejasapatil gentle ping for the comments above.

@tejasapatil
Contributor Author

I don't think I will get time to work on this. I will close the PR for now and revisit it in the future.

@AngersZhuuuu
Contributor

@HyukjinKwon
Can I continue this work? Is anyone else working on this now? I think I know enough about how Spark uses transform now, and I am willing to work on this.

@tejasapatil
Do you mind if I continue your work, building on your code?

@tejasapatil
Contributor Author

@AngersZhuuuu: Sure. I will be happy if you can take over the PR and make it better. I am also open to questions / reviews.
