[SPARK-16750] [ML] Fix GaussianMixture training failed due to feature column type mistake #14378

yanboliang · 2016-07-27T09:52:49Z

What changes were proposed in this pull request?

ML GaussianMixture training failed because of a feature column type mistake: the feature column type should be ml.linalg.VectorUDT, but mllib.linalg.VectorUDT was used instead.
See SPARK-16750 for how to reproduce this bug.
Why did the unit tests not catch this error? Because some estimators/transformers did not call transformSchema(dataset.schema) first during fit or transform. This PR also adds that call to all estimators/transformers that were missing it (except StringIndexer and VectorAssembler, which will be addressed later in a follow-up PR).
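
As a rough illustration of why a skipped schema check can hide this kind of mistake, here is a minimal plain-Scala sketch. It does not use Spark: `MlVectorUDT`, `MllibVectorUDT`, `Schema`, and `BuggyEstimator` are hypothetical stand-ins invented for this example, and the wrong expected type is assumed to live in the estimator's own schema check.

```scala
// Simplified model, not Spark code: all names below are hypothetical stand-ins.
object SchemaCheckFirst {
  sealed trait ColumnType
  case object MlVectorUDT extends ColumnType    // stands in for ml.linalg.VectorUDT
  case object MllibVectorUDT extends ColumnType // stands in for mllib.linalg.VectorUDT

  type Schema = Map[String, ColumnType]

  // An estimator whose schema check expects the wrong vector type by mistake.
  class BuggyEstimator(validateInFit: Boolean) {
    def transformSchema(schema: Schema): Schema = {
      val actual = schema.getOrElse("features",
        throw new IllegalArgumentException("Column features does not exist."))
      // The mistake: the check should expect MlVectorUDT.
      require(actual == MllibVectorUDT,
        s"Column features must be of type $MllibVectorUDT but was actually $actual.")
      schema
    }

    // When validateInFit is false, fit never runs the check,
    // so a test that only calls fit cannot notice the mistake.
    def fit(schema: Schema): String = {
      if (validateInFit) transformSchema(schema)
      "model"
    }
  }

  def main(args: Array[String]): Unit = {
    val input: Schema = Map("features" -> MlVectorUDT)
    // Passes silently because the check is skipped.
    println(new BuggyEstimator(validateInFit = false).fit(input))
    // Surfaces the mistake as soon as fit validates the schema first.
    println(scala.util.Try(new BuggyEstimator(validateInFit = true).fit(input)).isFailure)
  }
}
```

Under these assumptions, calling the schema check at the top of fit is what turns a latent type mismatch into an immediate test failure.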

How was this patch tested?

No new tests, should pass existing ones.

SparkQA · 2016-07-27T10:44:16Z

Test build #62917 has finished for PR 14378 at commit a0a32ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-07-27T10:46:54Z

cc @srowen

BryanCutler · 2016-07-27T17:37:45Z

mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala

  @Since("2.0.0")
  override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
-    transformSchema(dataset.schema, logging = true)
+    transformSchema(dataset.schema)


Just wondering why you remove the logging flag here? I know it just adds some debug logging, but there are other similar calls that still have it set to true, should those be removed also?

yanboliang · 2016-07-28

It's a good question. Since MinMaxScaler overrides transformSchema without the logging argument, we should use that one rather than the function in the base class.

lins05 · 2016-07-28

It seems the transformSchema(schema: StructType, logging: Boolean) method of the base class PipelineStage calls the overloaded transformSchema method without the logging param:

https://github.com/apache/spark/blob/v2.0.0/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L70

yanboliang · 2016-07-28

Thanks for the reminder, updated the PR.
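
The dispatch discussed in this thread can be sketched in plain Scala. This is a simplified model rather than Spark's source: `Stage`, `Scaler`, `Schema`, and `debugLog` are hypothetical names, and the two-argument method is public here only so the example can call it from outside the class.

```scala
// Simplified model of the overload pattern; not Spark's actual PipelineStage.
object TransformSchemaDispatch {
  type Schema = Seq[String] // hypothetical: a schema is just a list of column names

  abstract class Stage {
    val debugLog = scala.collection.mutable.ArrayBuffer.empty[String]

    // Subclasses override only this one-argument method.
    def transformSchema(schema: Schema): Schema

    // The logging variant wraps the one-argument overload, so a subclass
    // override is still the code that performs the validation.
    def transformSchema(schema: Schema, logging: Boolean): Schema = {
      if (logging) debugLog += s"Input schema: ${schema.mkString(",")}"
      val out = transformSchema(schema)
      if (logging) debugLog += s"Expected output schema: ${out.mkString(",")}"
      out
    }
  }

  class Scaler extends Stage {
    override def transformSchema(schema: Schema): Schema = {
      require(schema.contains("features"), "Column features does not exist.")
      schema :+ "scaled"
    }
  }

  def main(args: Array[String]): Unit = {
    val scaler = new Scaler
    val viaWrapper = scaler.transformSchema(Seq("features"), logging = true)
    val direct = scaler.transformSchema(Seq("features"))
    println(viaWrapper == direct)  // true: both paths run the Scaler override
    println(scaler.debugLog.size)  // 2: only the wrapper call added log entries
  }
}
```

In this model, passing `logging = true` changes only the debug output, which is consistent with keeping the flag in fit.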

BryanCutler · 2016-07-27T17:38:24Z

I just had a minor question, but LGTM

SparkQA · 2016-07-28T14:28:29Z

Test build #62970 has finished for PR 14378 at commit 0663ad9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-07-28T21:39:34Z

Seems reasonable to me.

…column type mistake

## What changes were proposed in this pull request?

ML ```GaussianMixture``` training failed because of a feature column type mistake: the feature column type should be ```ml.linalg.VectorUDT```, but ```mllib.linalg.VectorUDT``` was used instead. See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug. Why did the unit tests not catch this error? Because some estimators/transformers did not call ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. This PR also adds that call to all estimators/transformers that were missing it.

## How was this patch tested?

No new tests, should pass existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14378 from yanboliang/spark-16750.

(cherry picked from commit 0557a45)
Signed-off-by: Sean Owen <sowen@cloudera.com>

srowen · 2016-07-29T11:40:42Z

Merged to master/2.0

…ctorAssembler and fix failed tests.

## What changes were proposed in this pull request?

This is a follow-up to #14378. While adding ```transformSchema``` to all estimators and transformers, I found that some tests failed for ```StringIndexer``` and ```VectorAssembler```, so I moved that part of the work into this separate PR to make it clearer to review. After ```transformSchema``` is added, the corresponding tests should throw ```IllegalArgumentException``` during schema validation. It is more efficient to throw the exception at the start of ```fit``` or ```transform``` than partway through the process.

## How was this patch tested?

Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14455 from yanboliang/transformSchema.

(cherry picked from commit 6cbde33)
Signed-off-by: Sean Owen <sowen@cloudera.com>
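
The fail-fast behaviour described in this commit message can be illustrated with a small plain-Scala sketch. It is not the StringIndexer implementation: `Indexer`, `rowsScanned`, and the column-set schema are hypothetical, chosen only to show that validation runs before any rows are touched.

```scala
// Simplified model, not Spark code: names and data shapes are hypothetical.
object FailFastFit {
  final class Indexer(inputCol: String) {
    var rowsScanned = 0 // counts how much data work fit has done

    def transformSchema(columns: Set[String]): Set[String] = {
      require(columns.contains(inputCol), s"Input column $inputCol does not exist.")
      columns + s"${inputCol}_indexed"
    }

    def fit(columns: Set[String], rows: Seq[Map[String, String]]): Map[String, Int] = {
      transformSchema(columns) // validate first; throws before any row is read
      rows.flatMap { r => rowsScanned += 1; r.get(inputCol) }
        .distinct
        .zipWithIndex
        .toMap
    }
  }

  def main(args: Array[String]): Unit = {
    val indexer = new Indexer("label")
    val result = scala.util.Try(indexer.fit(Set("other"), Seq(Map("other" -> "a"))))
    println(result.isFailure)    // true: the schema check rejected the input
    println(indexer.rowsScanned) // 0: no rows were processed before the failure
  }
}
```

A test written against this model would assert on the IllegalArgumentException rather than on a later runtime error, which mirrors the change to the unit tests described above.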


Fix GaussianMixture training failed due to feature column type mistake

a0a32ef

BryanCutler reviewed Jul 27, 2016

Change to use transformSchema(dataset.schema, logging = true)

0663ad9

asfgit closed this in 0557a45 Jul 29, 2016

yanboliang deleted the spark-16750 branch July 29, 2016 11:43

yanboliang mentioned this pull request Aug 2, 2016

[SPARK-16750] [Follow-up] [ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests. #14455

Closed

5 participants
