[SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer #15817

techaddict · 2016-11-09T00:23:06Z

What changes were proposed in this pull request?

Added the new handleInvalid param for these transformers to Python to maintain API parity.

How was this patch tested?

Existing tests; testing is also done with new doctests.

SparkQA · 2016-11-09T00:59:45Z

Test build #68378 has finished for PR 15817 at commit b4720aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GBTClassifierWrapperWriter(instance: GBTClassifierWrapper)
- class GBTClassifierWrapperReader extends MLReader[GBTClassifierWrapper]
- class GBTRegressorWrapperWriter(instance: GBTRegressorWrapper)
- class GBTRegressorWrapperReader extends MLReader[GBTRegressorWrapper]

techaddict · 2016-11-11T02:11:58Z

cc: @sethah @jkbradley

MLnick

A few minor things, otherwise looks good.

MLnick · 2016-11-11T13:20:59Z

python/pyspark/ml/feature.py

+                 handleInvalid="error"):
        """
-        __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001)
+        __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001,


I think this needs to be

__init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, \
         handleInvalid="error")

for API doc formatting
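
A minimal sketch of why the trailing backslash matters, assuming the goal is for the doc tooling to see the signature as one logical line. `Discretizer` is a hypothetical stand-in class, not the real transformer. In a regular (non-raw) Python string literal, a backslash followed by a newline is dropped when the literal is parsed, so the wrapped signature ends up as a single line in `__doc__`:

```python
class Discretizer(object):
    """Hypothetical stand-in class, used only to show the docstring layout."""

    def __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001,
                 handleInvalid="error"):
        """
        __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, \
                 handleInvalid="error")
        """
        self.handleInvalid = handleInvalid


# Collect the non-empty lines of the docstring. The backslash-newline pair is
# removed by the parser, so the signature occupies exactly one line.
doc_lines = [line.strip() for line in Discretizer.__init__.__doc__.splitlines()
             if line.strip()]
print(len(doc_lines))
print(doc_lines[0])
```

Without the backslash, the same docstring would contain two non-empty lines, and the second half of the signature would be rendered as ordinary body text.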

MLnick · 2016-11-11T13:22:08Z

python/pyspark/ml/feature.py

    @since("2.0.0")
-    def setParams(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001):
+    def setParams(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001,
+                  handleInvalid="error"):


Missing handleInvalid in the docstring below.

MLnick · 2016-11-11T13:23:45Z

python/pyspark/ml/feature.py

    @keyword_only
    @since("1.4.0")
-    def setParams(self, splits=None, inputCol=None, outputCol=None):
+    def setParams(self, splits=None, inputCol=None, outputCol=None, handleInvalid="error"):


Missing handleInvalid in doc string below.

SparkQA · 2016-11-11T14:38:04Z

Test build #68525 has finished for PR 15817 at commit 234d165.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-11-11T16:36:44Z

python/pyspark/ml/feature.py

              typeConverter=TypeConverters.toListFloat)

+    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " +
+                          "Options are skip (filter out rows with invalid values), " +


Can we put the options in single quotes, e.g. "Options are 'skip' ..."?

@techaddict I don't think you addressed this comment?

To be fair, we don't have it quoted in the Scala param description, so if we want to make this change we should probably also change it on the Scala side, just for consistency's sake.

Yeah, it's pretty minor. Maybe we can do it later in a follow-up.

Cool. Since we've already cut RC1, it would be nice to have these params in sooner rather than later, and @techaddict seems to be a bit busy, so I've created a follow-up JIRA (SPARK-18628) for this so that we can move ahead with this as is.

sethah · 2016-11-11T16:38:47Z

python/pyspark/ml/feature.py

+    ...     inputCol="values", outputCol="buckets", relativeError=0.01, handleInvalid="error")
    >>> qds.getRelativeError()
    0.01
+    >>> qds.getHandleInvalid()


We didn't add anything to the doctest of bucketizer. Actually, I think it would be nice in both places to set handleInvalid='skip' and then add an invalid value to the example data. That way we can show what we mean by invalid and prove that it works.

good idea! adding
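
To make the suggestion concrete, here is a rough pure-Python sketch of what such a doctest would demonstrate. It is not Spark's implementation: `bucketize` is a hypothetical helper, NaN is assumed to be the "invalid" value, and the 'keep' branch (invalid values placed in one extra bucket) reflects my understanding of the Scala param rather than anything shown in this thread.

```python
import bisect
import math


def bucketize(values, splits, handle_invalid="error"):
    """Return (value, bucket) pairs; NaN is treated as the invalid entry."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError("unsupported handle_invalid: %r" % (handle_invalid,))
    num_buckets = len(splits) - 1
    rows = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                raise ValueError("invalid (NaN) value encountered")
            if handle_invalid == "keep":
                # Assumed behaviour: invalid values go to one extra bucket.
                rows.append((v, float(num_buckets)))
            # "skip": the row is dropped.
            continue
        if v < splits[0] or v > splits[-1]:
            raise ValueError("value %r is outside the splits" % (v,))
        # Bucket i covers [splits[i], splits[i + 1]); the last bucket also
        # includes its upper bound.
        idx = min(bisect.bisect_right(splits, v) - 1, num_buckets - 1)
        rows.append((v, float(idx)))
    return rows


splits = [-float("inf"), 0.5, 1.4, float("inf")]
values = [0.1, 0.4, 1.2, 1.5, float("nan")]  # the last entry is the invalid one

skipped = bucketize(values, splits, handle_invalid="skip")
kept = bucketize(values, splits, handle_invalid="keep")
print([b for _, b in skipped])
print([b for _, b in kept])
```

With 'skip' the NaN row disappears from the output, while the default 'error' raises on the same input, which is the contrast the doctest is meant to show.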

SparkQA · 2016-11-11T20:09:22Z

Test build #68534 has finished for PR 15817 at commit d589515.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class AesCipher
- public class AesConfigMessage implements Encodable
- public class ByteArrayReadableChannel implements ReadableByteChannel

jkbradley · 2016-11-14T23:13:20Z

Can you please add "[ML]" to the PR title? Thanks!

jkbradley · 2016-11-15T02:34:14Z

Can you please implement the Param directly in Bucketizer and QuantileDiscretizer? Just like in Scala, HasHandleInvalid has a built-in Param doc that applies to existing use cases but not to Bucketizer and QuantileDiscretizer. It will be better to copy the Param, setter, and getter into Bucketizer and QuantileDiscretizer so that we can specialize the built-in Param doc.
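
A toy illustration of the trade-off, assuming nothing about the real pyspark classes: `Param`, `OtherTransformer`, and `ToyBucketizer` below are hypothetical stand-ins, and the doc strings are placeholder wording. A shared mixin gives every user the same doc text, whereas copying the Param, setter, and getter into the class lets that class carry its own wording.

```python
class Param(object):
    """Hypothetical, minimal stand-in for a Param: a name plus its doc text."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class HasHandleInvalid(object):
    """Shared mixin: every class that mixes it in gets the same generic doc."""

    handleInvalid = Param("handleInvalid", "generic doc shared by all users (illustrative)")

    def setHandleInvalid(self, value):
        self._handleInvalid = value
        return self

    def getHandleInvalid(self):
        return getattr(self, "_handleInvalid", "error")


class OtherTransformer(HasHandleInvalid):
    """Hypothetical existing user of the mixin; the generic doc fits it."""


class ToyBucketizer(object):
    """Copies the Param, setter, and getter so the doc can be specialized."""

    handleInvalid = Param("handleInvalid", "doc specific to bucketing (illustrative)")

    def setHandleInvalid(self, value):
        self._handleInvalid = value
        return self

    def getHandleInvalid(self):
        return getattr(self, "_handleInvalid", "error")


print(OtherTransformer.handleInvalid.doc)
print(ToyBucketizer.handleInvalid.doc)
```

The cost is some duplicated setter and getter code; the benefit is that the Param doc can describe what "invalid" means for these two transformers specifically.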

techaddict · 2016-11-15T02:51:09Z

@jkbradley done 👍

SparkQA · 2016-11-15T03:15:44Z

Test build #68649 has finished for PR 15817 at commit 6687d3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

techaddict · 2016-11-24T10:44:06Z

ping @davies @jkbradley

holdenk · 2016-11-26T13:25:39Z

Thanks for working on this, @techaddict. One super minor point: could you also update the PR description to mention that the testing is done with new doctests? This is really minor, but for people skimming the changelog, the PR description will end up as the commit message.

holdenk · 2016-11-28T19:07:41Z

OK, let's re-ping @MLnick / @sethah - I know we asked to update the docstring, but the current one is consistent with the Scala docstring, so maybe it makes sense as is (otherwise we should probably also update the Scala docstring).

MLnick · 2016-11-29T14:06:01Z

Jenkins retest this please

SparkQA · 2016-11-29T14:37:35Z

Test build #69331 has finished for PR 15817 at commit 6687d3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-11-29T18:44:03Z

LGTM, given our planned follow-up to update the documentation for both Python and Scala.

…iscretizer and Bucketizer

## What changes were proposed in this pull request?

added the new handleInvalid param for these transformers to Python to maintain API parity.

## How was this patch tested?

existing tests
testing is done with new doctests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #15817 from techaddict/SPARK-18366.

(cherry picked from commit fe854f2)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>

MLnick · 2016-11-30T09:35:28Z

Sorry for the delay - this LGTM. Given that it's been around for a while and that RC2 is likely to be cut, I've gone ahead and merged to master / branch-2.1. Thanks!

techaddict added 5 commits November 9, 2016 05:33

add handleInvalid to QuantileDiscretizer

0e41b36

fix lint issues

3b5133c

handleInvalid to Bucketizer

20bfd9b

fix lint error

1922472

Merge branch 'master' into SPARK-18366

b4720aa

MLnick reviewed Nov 11, 2016

techaddict added 2 commits November 11, 2016 19:41

Merge branch 'master' into SPARK-18366

67a666f

address comments

234d165

sethah reviewed Nov 11, 2016

techaddict added 5 commits November 12, 2016 00:46

commit

0327d8a

use HasHandleInvalid

af0d3f2

minor changes

a8dc962

set Default

7ff8ad3

Merge branch 'master' into SPARK-18366

d589515

techaddict changed the title ~~[SPARK-18366][PYSPARK] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer~~ [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer Nov 14, 2016

techaddict added 3 commits November 15, 2016 08:14

Merge branch 'master' into SPARK-18366

08f8945

Revert "use HasHandleInvalid"

36ecddb

This reverts commit af0d3f2.

bring back handleInvalid

6687d3c

asfgit closed this in fe854f2 Nov 30, 2016
