
Conversation

imatiach-msft (Contributor) commented Jan 29, 2019

This is a follow-up to PR:
#21632

## What changes were proposed in this pull request?

This PR tunes the tolerance used for deciding whether to add zero feature values to a value-count map (where the key is the feature value and the value is the weighted count of those feature values).
In the previous PR, the tolerance scaled with the square of the unweighted number of samples, which grows far too large when the number of samples is big. Unfortunately, using just `Utils.EPSILON * unweightedNumSamples` is not enough either, so I multiplied that value by a constant factor tuned via the testing procedure below.
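A minimal sketch of the change in scaling described above. The constant factor of 100 below is illustrative only, not necessarily the exact value chosen in the patch, and `EPSILON` stands in for Spark's `Utils.EPSILON`:

```scala
object ToleranceSketch {
  // Stand-in for Spark's Utils.EPSILON (machine epsilon for Double).
  val EPSILON: Double = 2.220446049250313e-16

  // Previous PR: tolerance grew with the square of the sample count,
  // which becomes far too loose for large datasets.
  def oldTolerance(unweightedNumSamples: Double): Double =
    EPSILON * unweightedNumSamples * unweightedNumSamples

  // This PR: scale linearly with the sample count, times a hand-tuned
  // constant factor (100 here is an illustrative assumption).
  def newTolerance(unweightedNumSamples: Double): Double =
    EPSILON * unweightedNumSamples * 100

  def main(args: Array[String]): Unit = {
    val n = 1e8 // 100 million samples
    // Quadratic scaling yields a tolerance on the order of 1.0 or more,
    // which would swallow real weighted counts; linear scaling stays tiny.
    println(f"old: ${oldTolerance(n)}%.6g")
    println(f"new: ${newTolerance(n)}%.6g")
  }
}
```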

## How was this patch tested?

This involved manually running the sample weight tests for decision tree regressor to see whether the tolerance was large enough to exclude zero feature values.

E.g., in SBT:

```
./build/sbt
> project mllib
> testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
```

For validation, I added a print statement inside the `if` branch in the code below and verified that the tolerance was large enough that we would not include zero feature values (which don't exist in that test):

```scala
      val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
        print("should not print this")
        partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
      } else {
        partValueCountMap
      }
```
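For context, here is a self-contained sketch (not the Spark implementation; all names are illustrative) of why the zero entry is reconstructed this way: sparse feature vectors store only nonzero values, so the weighted count of zeros must be inferred from the gap between the total weight and the weight accounted for by nonzero values, and the tolerance guards against treating floating-point rounding error as real zeros:

```scala
object ZeroCountSketch {
  // Build a value -> weighted-count map from sparse (featureValue, weight)
  // pairs. Zeros are not stored in the input, so their weighted count is
  // reconstructed as the leftover weight, but only when the gap exceeds
  // the tolerance (otherwise it is assumed to be rounding error).
  def valueCounts(nonzero: Seq[(Double, Double)],
                  weightedNumSamples: Double,
                  tolerance: Double): Map[Double, Double] = {
    val partValueCountMap =
      nonzero.groupBy(_._1).map { case (v, ws) => (v, ws.map(_._2).sum) }
    val partNumSamples = partValueCountMap.values.sum
    if (weightedNumSamples - partNumSamples > tolerance) {
      partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
    } else {
      partValueCountMap
    }
  }

  def main(args: Array[String]): Unit = {
    // Total weight is 5.0 but nonzero values only account for 3.0,
    // so 2.0 units of weight are attributed to the feature value 0.0.
    println(valueCounts(Seq((1.0, 2.0), (3.0, 1.0)), 5.0, 1e-6))
  }
}
```

Here `valueCounts(Seq((1.0, 2.0), (3.0, 1.0)), 5.0, 1e-6)` yields `Map(1.0 -> 2.0, 3.0 -> 1.0, 0.0 -> 2.0)`, while inputs whose nonzero weights already sum to the total produce no `0.0` entry.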

imatiach-msft (Contributor, Author) commented:

@srowen would you be able to review this minor follow-up fix for the tolerance? Thank you!


SparkQA commented Jan 29, 2019

Test build #101793 has finished for PR 23682 at commit 94335ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 29, 2019

Test build #101828 has finished for PR 23682 at commit b34d88e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented Jan 31, 2019

Merged to master

@srowen srowen closed this in b3b62ba Jan 31, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…es - fix tolerance

Closes apache#23682 from imatiach-msft/ilmat/sample-weights-tol.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
