@yanboliang
Contributor

Implements transforms that are defined by a SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__'
where '__THIS__' represents the underlying table of the input dataset.
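
As a rough illustration of the placeholder convention described above, here is a minimal Python sketch. It is not the code in this patch: `sqlite3` stands in for the SQL engine, and the function name `sql_transform` and the temporary table name `input_dataset` are hypothetical. It only shows the idea of registering the input under a temporary name and substituting that name for the placeholder before running the statement.

```python
import sqlite3


def sql_transform(rows, columns, statement, placeholder="__THIS__"):
    """Run `statement` against `rows`, exposing them as the placeholder table.

    Assumption: sqlite3 is only a stand-in engine for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    try:
        temp_table = "input_dataset"  # hypothetical temporary table name
        conn.execute(f"CREATE TABLE {temp_table} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {temp_table} VALUES ({marks})", rows)
        # Substitute the placeholder with the real table name, then execute.
        cursor = conn.execute(statement.replace(placeholder, temp_table))
        out_columns = [d[0] for d in cursor.description]
        return out_columns, cursor.fetchall()
    finally:
        conn.close()


cols, out = sql_transform(
    rows=[(0, 1.0, 3.0), (2, 2.0, 5.0)],
    columns=["id", "v1", "v2"],
    statement="SELECT id, v1 + v2 AS v3 FROM __THIS__",
)
```

The caller never needs to know the temporary table name; only the placeholder appears in the statement.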

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37623 has started for PR 7465 at commit 51eb9e7.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37623 has finished for PR 7465 at commit 51eb9e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLTransformer (override val uid: String) extends Transformer

@AmplabJenkins

Merged build finished. Test PASSed.

@yanboliang yanboliang changed the title [SPARK-8345] [ML] Add an SQL node as a feature transformer [WIP] [SPARK-8345] [ML] Add an SQL node as a feature transformer Jul 17, 2015
Contributor

I see the issue but I don't think this is the right solution. See my comments below.

@liancheng
Contributor

@mengxr I guess what we need here is essentially a helper that wraps a DataFrame => DataFrame function as a transformer, and a SQL statement is just a (questionably) more convenient way to express this function. One of the benefits of the DataFrame DSL over SQL is that you don't need a temporary table name.
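
The wrapper idea in this comment can be sketched roughly as follows. This is a toy Python illustration, not Spark code: a list of dicts stands in for a DataFrame, and the names `Table`, `FunctionTransformer`, and `add_sum` are all hypothetical.

```python
from typing import Callable, Dict, List

# Toy stand-in for a DataFrame: one dict per row (an assumption of this sketch).
Table = List[Dict[str, float]]


class FunctionTransformer:
    """Hypothetical wrapper that exposes a Table => Table function as a transformer."""

    def __init__(self, func: Callable[[Table], Table]) -> None:
        self.func = func

    def transform(self, dataset: Table) -> Table:
        # Delegate directly to the wrapped function; no temporary table name is needed.
        return self.func(dataset)


def add_sum(dataset: Table) -> Table:
    """Append a column v3 = v1 + v2 while keeping the existing columns."""
    return [{**row, "v3": row["v1"] + row["v2"]} for row in dataset]


transformer = FunctionTransformer(add_sum)
result = transformer.transform([{"v1": 1.0, "v2": 3.0}])
```

Under this view, a SQL-based transformer is one particular way of supplying the wrapped function.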

@yanboliang yanboliang changed the title [WIP] [SPARK-8345] [ML] Add an SQL node as a feature transformer [SPARK-8345] [ML] Add an SQL node as a feature transformer Jul 20, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37827 has started for PR 7465 at commit 0d4bb15.

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37827 has finished for PR 7465 at commit 0d4bb15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLTransformer (override val uid: String) extends Transformer

@AmplabJenkins

Merged build finished. Test PASSed.

@yanboliang
Contributor Author

@mengxr

Contributor

I think it is okay to return the DataFrame from sqlContext.sql directly. Users should use * if they want to keep the existing columns.

Contributor Author

This would behave differently from the other transformers in ml.feature, which return a DataFrame composed of the original columns plus the transformed ones. Here, if the user does not use *, the existing columns will not be kept in the output DataFrame.
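
The difference being discussed can be seen in a small Python sketch. It assumes `sqlite3` as a stand-in SQL engine rather than Spark, and the table name `this_table` and helper `output_columns` are hypothetical. It compares the output columns of a projection with and without `*`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE this_table (id, v1, v2)")  # hypothetical input table
conn.execute("INSERT INTO this_table VALUES (0, 1.0, 3.0)")


def output_columns(statement):
    """Return the column names produced by a SELECT statement."""
    cursor = conn.execute(statement)
    return [d[0] for d in cursor.description]


# With *, the existing columns are carried through alongside the new one.
kept = output_columns("SELECT *, v1 + v2 AS v3 FROM this_table")
# Without *, only the explicitly selected expressions appear in the output.
dropped = output_columns("SELECT v1 + v2 AS v3 FROM this_table")
```

Whether the input columns survive is therefore decided entirely by the statement the user writes.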

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Aug 9, 2015

Test build #40267 has started for PR 7465 at commit b403fcb.

@SparkQA

SparkQA commented Aug 9, 2015

Test build #40267 has finished for PR 7465 at commit b403fcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLTransformer (override val uid: String) extends Transformer

@AmplabJenkins

Merged build finished. Test PASSed.

@asfgit asfgit closed this in 8cad854 Aug 11, 2015
@mengxr
Contributor

mengxr commented Aug 11, 2015

LGTM. Merged into master. I think it is okay if the output columns do not contain all input columns. It is not a requirement for transformers in the pipeline API. Thanks for working on this!

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
Implements the transforms which are defined by SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__'
where '__THIS__' represents the underlying table of the input dataset.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#7465 from yanboliang/spark-8345 and squashes the following commits:

b403fcb [Yanbo Liang] address comments
0d4bb15 [Yanbo Liang] a better transformSchema() implementation
51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
@yanboliang yanboliang deleted the spark-8345 branch August 26, 2015 07:11