Skip to content

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Jul 27, 2016

What changes were proposed in this pull request?

ML GaussianMixture training failed due to feature column type mistake. The feature column type should be ml.linalg.VectorUDT but got mllib.linalg.VectorUDT by mistake.
See SPARK-16750 for how to reproduce this bug.
Why the unit tests did not complain this errors? Because some estimators/transformers missed calling transformSchema(dataset.schema) firstly during fit or transform. I will also add this function to all estimators/transformers (except StringIndexer and VectorAssembler which will be addressed later in a follow up PR) who missed in this PR.

How was this patch tested?

No new tests, should pass existing ones.

@SparkQA
Copy link

SparkQA commented Jul 27, 2016

Test build #62917 has finished for PR 14378 at commit a0a32ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

cc @srowen

@Since("2.0.0")
override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
transformSchema(dataset.schema, logging = true)
transformSchema(dataset.schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why you remove the logging flag here? I know it just adds some debug logging, but there are other similar calls that still have it set to true, should those be removed also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question. Since MinMaxScaler override transformSchema with no argument logging, we should use that one rather than the function in the base class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the transformSchema(schema: StructType, logging: Boolean) method of the base class PipelineStage would call the the overloaded transformSchema method without the logging param:

https://github.com/apache/spark/blob/v2.0.0/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L70

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your remind, updated the PR.

@BryanCutler
Copy link
Member

I just had a minor question, but LGTM

@SparkQA
Copy link

SparkQA commented Jul 28, 2016

Test build #62970 has finished for PR 14378 at commit 0663ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 28, 2016

Seems reasonable to me.

asfgit pushed a commit that referenced this pull request Jul 29, 2016
…column type mistake

## What changes were proposed in this pull request?
ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake.
See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
Why the unit tests did not complain this errors? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` firstly during ```fit``` or ```transform```. I will also add this function to all estimators/transformers who missed in this PR.

## How was this patch tested?
No new tests, should pass existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14378 from yanboliang/spark-16750.

(cherry picked from commit 0557a45)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen
Copy link
Member

srowen commented Jul 29, 2016

Merged to master/2.0

@asfgit asfgit closed this in 0557a45 Jul 29, 2016
@yanboliang yanboliang deleted the spark-16750 branch July 29, 2016 11:43
asfgit pushed a commit that referenced this pull request Aug 5, 2016
…ctorAssembler and fix failed tests.

## What changes were proposed in this pull request?
This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review.
The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process.

## How was this patch tested?
Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14455 from yanboliang/transformSchema.

(cherry picked from commit 6cbde33)
Signed-off-by: Sean Owen <sowen@cloudera.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 5, 2016
…ctorAssembler and fix failed tests.

## What changes were proposed in this pull request?
This is follow-up for apache#14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review.
The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process.

## How was this patch tested?
Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#14455 from yanboliang/transformSchema.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants