-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15694] Implement ScriptTransformation in sql/core (part 1) #14702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #63995 has finished for PR 14702 at commit
|
|
cc @rxin : who would be the best person to review this PR ? |
|
Can you update the description to say more about what this pr includes, and what future todos are? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we create an execution.script package?
I suspect you will need to copy a bunch of things from Hive in follow-up prs. Might make sense to have a namespace for script related functionalities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
9dde09d to
9863c7d
Compare
|
@rxin : I have updated the description to include more info on changes done and future todos |
|
Test build #64313 has finished for PR 14702 at commit
|
|
Test build #64318 has finished for PR 14702 at commit
|
3e97ec3 to
9afbd5e
Compare
|
Test build #64321 has finished for PR 14702 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document what these parameters mean?
e.g. what schemaLess mean, and what the seq of string tuples mean for inputRowFormat and outputRowFormat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documented the params
|
This looks reasonable. cc @hvanhovell to take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would call toString() on our internal representation of a value. This will lead to unexpected results as soon as you would use a Date or a Timestamp. See the discussion (and potential solution) in PR #14279.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also do this using a (code generated) projection. The beauty there would be that we could make it return a single UTF8String which can be dumped straight into the outputstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell : for now I am not going the code-gen route as it will make the diff bigger and delay things. I am doing what PR #14279 did and adding code-gen as Future TODOs in PR description.
|
Test build #65011 has finished for PR 14702 at commit
|
9afbd5e to
f5256dd
Compare
|
Test build #65127 has finished for PR 14702 at commit
|
|
@hvanhovell ping !! |
|
@tejasapatil I'll take another pass tomorrow (CET). |
|
@tejasapatil I was checking with @hvanhovell. We should merge this one soon. Mind bringing it up to date? |
f5256dd to
c7741f9
Compare
|
Jenkins test this please |
|
Test build #66689 has finished for PR 14702 at commit
|
|
jenkins test this please. Failed test from earlier run was in KafkaSourceStressSuite which I don't see being related to this PR. |
|
retest this please |
|
Test build #66723 has finished for PR 14702 at commit
|
|
@tejasapatil looks like there is a legitimate failing test. |
|
Test build #71231 has finished for PR 14702 at commit
|
|
can anyone please review this PR ? |
704e4a3 to
a6e9e39
Compare
|
Test build #71814 has finished for PR 14702 at commit
|
a6e9e39 to
d9047f0
Compare
|
Test build #71941 has finished for PR 14702 at commit
|
|
jenkins retest please test failure from build HiveSparkSubmitSuite |
|
Jenkins retest this please |
|
Test build #72115 has finished for PR 14702 at commit
|
|
can anyone please review this PR ? |
|
I will try to review it in the next few days. Thanks for working on it! |
| private[sql] | ||
| object ScriptTransformIOSchema { | ||
| def apply(input: ScriptInputOutputSchema): ScriptTransformIOSchema = { | ||
| new ScriptTransformIOSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case class ScriptInputOutputSchema(
inputRowFormat: Seq[(String, String)],
outputRowFormat: Seq[(String, String)],
inputSerdeClass: Option[String],
outputSerdeClass: Option[String],
inputSerdeProps: Seq[(String, String)],
outputSerdeProps: Seq[(String, String)],
recordReaderClass: Option[String],
recordWriterClass: Option[String],
schemaLess: Boolean)Except inputRowFormat , outputRowFormat and schemaLess , we ignore all the other fields. I think we should not silently ignore them. For example, we do not respect any user-specified conf values of hive.script.recordreader and hive.script.recordwriter. Thus, could we issue an exception when users set them?
|
I might need more time to review this PR. Will keep posting my comments in the next week. |
|
So far, we do not have any end-to-end test case for |
|
|
||
| def execute(sqlContext: SQLContext, | ||
| child: SparkPlan, | ||
| schema: StructType): RDD[InternalRow] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the indent issue:
def execute(
sqlContext: SQLContext,
child: SparkPlan,
schema: StructType): RDD[InternalRow] = {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace sqlContext: SQLContext by hadoopConf: Configuration?
|
Why still keeping this file https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala ? Your PR needs to remove it or rename it to HiveScriptTransformationExec.scala ? |
|
@tejasapatil gentle ping for the comments above. |
|
I dont see I will be getting time to work on this. Will close the PR for now and revisit in future. |
|
@HyukjinKwon @tejasapatil |
|
@AngersZhuuuu : Sure. I will be happy if you can take over the PR and make it better. Also, open for questions / reviews. |
What changes were proposed in this pull request?
Added
ScriptTransformationExecwhich would run script operator in SQL mode (w/o Hive). Since this has to run w/o Hive, it does not support Hive serdes.ScriptTransformBasehas common code acrossScriptTransformationExecandHiveScriptTransformationExec.Changes done:
ScriptTransformationtoHiveScriptTransformationExecScriptTransformationExecwhich would run script operator in SQL mode (w/o Hive).ScriptTransformBasehas common code used acrossScriptTransformationExecandHiveScriptTransformationExecScriptTransformationWriterThreadhas the core logic.HiveScriptTransformationWriterThreadextends that for Hive specific stuff.ScriptTransformationWriterThreadwill be used for Spark SQL. It only supports writing data to script process by serializing column values as tab delimited stringHiveScriptTransformationWriterThreadadditionally supports Hive serdeStrategynamedScriptswhich would emitScriptTransformationExecin physical plans. This would be used in non-Hive mode.Future TODOs:
SerdeinScriptTransformationExecHow was this patch tested?
ScriptTransformationExecSuiteHiveScriptTransformationexecSuiteto useHiveScriptTransformationExec