[SPARK-16750] [ML] Fix GaussianMixture training failed due to feature column type mistake #14378
Test build #62917 has finished for PR 14378 at commit
cc @srowen
  @Since("2.0.0")
  override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
-    transformSchema(dataset.schema, logging = true)
+    transformSchema(dataset.schema)
Just wondering why you removed the logging flag here? I know it just adds some debug logging, but there are other similar calls that still have it set to true. Should those be changed as well?
It's a good question. Since MinMaxScaler overrides transformSchema without the logging argument, we should call that one rather than the method in the base class.
It seems the transformSchema(schema: StructType, logging: Boolean) method of the base class PipelineStage calls the overloaded transformSchema method without the logging param:
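A simplified, Spark-free sketch of that delegation, for illustration only. The `Schema` alias, `Stage`, `ToyScaler`, and `debugLog` are hypothetical names, not Spark's source; the sketch only assumes that the two-argument overload wraps the one-argument method with optional debug logging.

```scala
// Simplified stand-in for the delegation described above; not Spark's actual source.
object SchemaDelegationSketch {
  // Hypothetical toy schema: column name -> type name.
  type Schema = Map[String, String]

  abstract class Stage {
    // Messages are collected here instead of going to a real logger,
    // so the effect of the logging flag is observable.
    val debugLog = scala.collection.mutable.ArrayBuffer.empty[String]

    // The one-argument method that concrete stages override.
    def transformSchema(schema: Schema): Schema

    // The two-argument overload only adds logging around the one-argument call.
    def transformSchema(schema: Schema, logging: Boolean): Schema = {
      if (logging) debugLog += s"Input schema: $schema"
      val out = transformSchema(schema)
      if (logging) debugLog += s"Expected output schema: $out"
      out
    }
  }

  class ToyScaler extends Stage {
    override def transformSchema(schema: Schema): Schema = {
      require(schema.contains("features"), "Column features does not exist.")
      schema + ("scaled" -> "vector")
    }
  }
}
```

Under this assumption, calling the two-argument form with `logging = true` still runs the subclass's own validation, so keeping the flag costs nothing but the extra debug messages.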
Thanks for the reminder; I've updated the PR.
I just had a minor question, but LGTM
Test build #62970 has finished for PR 14378 at commit
Seems reasonable to me.
…column type mistake

## What changes were proposed in this pull request?

ML ```GaussianMixture``` training failed due to a feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but was ```mllib.linalg.VectorUDT``` by mistake.
See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
Why did the unit tests not catch this error? Because some estimators/transformers did not call ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. I will also add this call to all estimators/transformers that were missing it in this PR.

## How was this patch tested?

No new tests; existing ones should pass.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14378 from yanboliang/spark-16750.

(cherry picked from commit 0557a45)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/2.0
…ctorAssembler and fix failed tests.

## What changes were proposed in this pull request?

This is a follow-up for #14378. When we added ```transformSchema``` to all estimators and transformers, I found that tests failed for ```StringIndexer``` and ```VectorAssembler```, so I moved that work into this separate PR to make it easier to review.
The corresponding tests should throw ```IllegalArgumentException``` during schema validation after we add ```transformSchema```.
It is more efficient to throw the exception at the start of ```fit``` or ```transform``` rather than partway through the process.

## How was this patch tested?

Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14455 from yanboliang/transformSchema.

(cherry picked from commit 6cbde33)
Signed-off-by: Sean Owen <sowen@cloudera.com>
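To illustrate the fail-fast point, here is a minimal Spark-free sketch. `ToyIndexer`, the `Schema` alias, and `rowsProcessed` are hypothetical names used only for this example; it assumes nothing about how `StringIndexer` is actually implemented.

```scala
// Toy model of validating the schema before processing any rows; not Spark code.
object FailFastSketch {
  // Hypothetical toy schema: column name -> type name.
  type Schema = Map[String, String]

  class ToyIndexer(inputCol: String) {
    // Counts how many rows were touched, to show how much work preceded a failure.
    var rowsProcessed = 0

    def transformSchema(schema: Schema): Schema = {
      // require throws IllegalArgumentException when the condition is false.
      require(schema.contains(inputCol), s"Column $inputCol does not exist.")
      schema + (s"${inputCol}_indexed" -> "double")
    }

    def transform(schema: Schema, rows: Seq[Map[String, Any]]): Seq[Double] = {
      transformSchema(schema) // validate before touching any row
      rows.map { row =>
        rowsProcessed += 1
        // Placeholder "indexing": the length of the value's string form.
        row(inputCol).toString.length.toDouble
      }
    }
  }
}
```

With a schema that lacks the input column, `transform` throws `IllegalArgumentException` while `rowsProcessed` is still 0, which is the behavior the modified tests are described as expecting.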
## What changes were proposed in this pull request?

ML ```GaussianMixture``` training failed due to a feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but was ```mllib.linalg.VectorUDT``` by mistake.
See SPARK-16750 for how to reproduce this bug.
Why did the unit tests not catch this error? Because some estimators/transformers did not call ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. I will also add this call to all estimators/transformers that were missing it in this PR (except ```StringIndexer``` and ```VectorAssembler```, which will be addressed later in a follow-up PR).

## How was this patch tested?

No new tests; existing ones should pass.
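As a rough illustration of why calling ```transformSchema``` first matters, the sketch below models two distinct vector column types and an estimator whose ```fit``` validates the features column before doing anything else. Everything here (`ColumnType`, `MlVectorUDT`, `MllibVectorUDT`, `ToyGaussianMixture`) is a hypothetical stand-in, not Spark's classes; it only assumes that the schema check compares the column's type against the expected one.

```scala
// Toy model of a features-column type check; not Spark code.
object VectorTypeCheckSketch {
  // Hypothetical stand-ins for the two distinct vector column types.
  sealed trait ColumnType
  case object MlVectorUDT extends ColumnType    // plays the role of ml.linalg.VectorUDT
  case object MllibVectorUDT extends ColumnType // plays the role of mllib.linalg.VectorUDT

  type Schema = Map[String, ColumnType]

  class ToyGaussianMixture(featuresCol: String = "features") {
    def transformSchema(schema: Schema): Schema = {
      val actual = schema.getOrElse(featuresCol,
        throw new IllegalArgumentException(s"Column $featuresCol does not exist."))
      require(actual == MlVectorUDT,
        s"Column $featuresCol must be of type $MlVectorUDT but was actually $actual.")
      schema
    }

    // Validates first, so a type mismatch is reported before training starts.
    def fit(schema: Schema): String = {
      transformSchema(schema)
      "model" // placeholder for a trained model
    }
  }
}
```

If `fit` skipped the `transformSchema` call, a disagreement between the expected and the actual vector type would not surface in a test that only calls `fit`, which matches the explanation given above for why the unit tests stayed green.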