
Conversation

imatiach-msft (Contributor) commented Jan 29, 2019

This is a follow-up to PR:
#21632

## What changes were proposed in this pull request?

This PR tunes the tolerance used for deciding whether to add zero feature values to a value-count map (where the key is the feature value and the value is the weighted count of those feature values).
In the previous PR, the tolerance scaled with the square of the unweighted number of samples, which grows far too large when the number of samples is big. Unfortunately, using just `Utils.EPSILON * unweightedNumSamples` is not enough either, so I multiplied that value by a constant factor tuned via the testing procedure below.
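A minimal sketch of the change in scaling described above. The constant factor of 100 below is illustrative only, not necessarily the exact value chosen in the patch, and `EPSILON` stands in for Spark's `Utils.EPSILON`:

```scala
object ToleranceSketch {
  // Stand-in for Spark's Utils.EPSILON (machine epsilon for Double).
  val EPSILON: Double = 2.220446049250313e-16

  // Previous PR: tolerance grew with the square of the sample count,
  // which becomes far too loose for large datasets.
  def oldTolerance(unweightedNumSamples: Double): Double =
    EPSILON * unweightedNumSamples * unweightedNumSamples

  // This PR: scale linearly with the sample count, times a hand-tuned
  // constant factor (100 here is an illustrative assumption).
  def newTolerance(unweightedNumSamples: Double): Double =
    EPSILON * unweightedNumSamples * 100

  def main(args: Array[String]): Unit = {
    val n = 1e8 // 100 million samples
    // Quadratic scaling yields a tolerance on the order of 1.0 or more,
    // which would swallow real weighted counts; linear scaling stays tiny.
    println(f"old: ${oldTolerance(n)}%.6g")
    println(f"new: ${newTolerance(n)}%.6g")
  }
}
```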

## How was this patch tested?

This involved manually running the sample weight tests for decision tree regressor to see whether the tolerance was large enough to exclude zero feature values.

E.g., in SBT:

```
./build/sbt
> project mllib
> testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
```

For validation, I added a print statement inside the `if` branch in the code below and verified that the tolerance was large enough that we would not include zero feature values (which don't exist in that test):

```scala
      val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
        print("should not print this")
        partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
      } else {
        partValueCountMap
      }
```
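For context, here is a self-contained sketch (not the Spark implementation; all names are illustrative) of why the zero entry is reconstructed this way: sparse feature vectors store only nonzero values, so the weighted count of zeros must be inferred from the gap between the total weight and the weight accounted for by nonzero values, and the tolerance guards against treating floating-point rounding error as real zeros:

```scala
object ZeroCountSketch {
  // Build a value -> weighted-count map from sparse (featureValue, weight)
  // pairs. Zeros are not stored in the input, so their weighted count is
  // reconstructed as the leftover weight, but only when the gap exceeds
  // the tolerance (otherwise it is assumed to be rounding error).
  def valueCounts(nonzero: Seq[(Double, Double)],
                  weightedNumSamples: Double,
                  tolerance: Double): Map[Double, Double] = {
    val partValueCountMap =
      nonzero.groupBy(_._1).map { case (v, ws) => (v, ws.map(_._2).sum) }
    val partNumSamples = partValueCountMap.values.sum
    if (weightedNumSamples - partNumSamples > tolerance) {
      partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
    } else {
      partValueCountMap
    }
  }

  def main(args: Array[String]): Unit = {
    // Total weight is 5.0 but nonzero values only account for 3.0,
    // so 2.0 units of weight are attributed to the feature value 0.0.
    println(valueCounts(Seq((1.0, 2.0), (3.0, 1.0)), 5.0, 1e-6))
  }
}
```

Here `valueCounts(Seq((1.0, 2.0), (3.0, 1.0)), 5.0, 1e-6)` yields `Map(1.0 -> 2.0, 3.0 -> 1.0, 0.0 -> 2.0)`, while inputs whose nonzero weights already sum to the total produce no `0.0` entry.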

imatiach-msft (Contributor, Author) commented:

@srowen would you be able to review this minor follow-up fix for the tolerance? Thank you!


SparkQA commented Jan 29, 2019

Test build #101793 has finished for PR 23682 at commit 94335ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 29, 2019

Test build #101828 has finished for PR 23682 at commit b34d88e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented Jan 31, 2019

Merged to master

@srowen srowen closed this in b3b62ba Jan 31, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…es - fix tolerance

Closes apache#23682 from imatiach-msft/ilmat/sample-weights-tol.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
