-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-8345] [ML] Add an SQL node as a feature transformer #7465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Merged build triggered. |
|
Merged build started. |
|
Test build #37623 has started for PR 7465 at commit |
|
Test build #37623 has finished for PR 7465 at commit
|
|
Merged build finished. Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the issue but I don't think this is the right solution. See my comments below.
|
@mengxr I guess what we need here is essentially a wrapper helper function which wraps a |
|
Merged build triggered. |
|
Merged build started. |
|
Test build #37827 has started for PR 7465 at commit |
|
Test build #37827 has finished for PR 7465 at commit
|
|
Merged build finished. Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is okay to return the DataFrame from sqlContext.sql directly. User should use * if they want to keep existing columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will have different behavior with other transformers in ml.feature. Other transformers will return the DataFrame which is composed of original DataFrame and transformed DataFrame. But here if user did not use *, he will not keep existing columns in the output DataFrame.
|
Merged build triggered. |
|
Merged build started. |
|
Test build #40267 has started for PR 7465 at commit |
|
Test build #40267 has finished for PR 7465 at commit
|
|
Merged build finished. Test PASSed. |
|
LGTM. Merged into master. I think it is okay if the output columns do not contain all input columns. It is not a requirement for transformers in the pipeline API. Thanks for working on this! |
Implements the transforms which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' where '__THIS__' represents the underlying table of the input dataset. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#7465 from yanboliang/spark-8345 and squashes the following commits: b403fcb [Yanbo Liang] address comments 0d4bb15 [Yanbo Liang] a better transformSchema() implementation 51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
Implements the transforms which are defined by SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM THIS'
where 'THIS' represents the underlying table of the input dataset.