-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer #15817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #68378 has finished for PR 15817 at commit
|
|
cc: @sethah @jkbradley |
MLnick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor things, otherwise looks good.
python/pyspark/ml/feature.py
Outdated
| handleInvalid="error"): | ||
| """ | ||
| __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001) | ||
| __init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be
__init__(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, \
handleInvalid="error")
for API doc formatting
| @since("2.0.0") | ||
| def setParams(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001): | ||
| def setParams(self, numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, | ||
| handleInvalid="error"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing handleInvalid in doc string below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| @keyword_only | ||
| @since("1.4.0") | ||
| def setParams(self, splits=None, inputCol=None, outputCol=None): | ||
| def setParams(self, splits=None, inputCol=None, outputCol=None, handleInvalid="error"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing handleInvalid in doc string below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
Test build #68525 has finished for PR 15817 at commit
|
| typeConverter=TypeConverters.toListFloat) | ||
|
|
||
| handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " + | ||
| "Options are skip (filter out rows with invalid values), " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put the options in single quotes, e.g. "Options are 'skip' ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@techaddict I don't think you addressed this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be fair we don't have it quoted in the scala param description, so if we want to make this change we should probably also change it in the scala side just for consistencies sake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's pretty minor. Maybe we can do it later in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, since we've already cut RC1 and it would be nice to have these params in sooner rather than later and @techaddict seems to be a bit busy I've created a follow up JIRA ( SPARK-18628 ) for this so that we can maybe move ahead with this as is.
python/pyspark/ml/feature.py
Outdated
| ... inputCol="values", outputCol="buckets", relativeError=0.01, handleInvalid="error") | ||
| >>> qds.getRelativeError() | ||
| 0.01 | ||
| >>> qds.getHandleInvalid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't add anything to the doctest of bucketizer. Actually, I think it would be nice in both places to set handleInvalid='skip' and then add an invalid value to the example data. That way we can show what we mean by invalid and prove that it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! adding
|
Test build #68534 has finished for PR 15817 at commit
|
|
Can you please add "[ML]" to the PR title? Thanks! |
|
Can you please implement the Param directly in Bucketizer and QuantileDiscretizer? Just like in Scala, HasHandleInvalid has built-in Param doc which applies to existing use cases but not Bucketizer and QuantileDiscretizer. It will be better to copy the Param, setter, and getter into Bucketizer and QuantileDiscretizer so that we can specialize the built-in Param doc. |
|
@jkbradley done 👍 |
|
Test build #68649 has finished for PR 15817 at commit
|
|
ping @davies @jkbradley |
|
Thanks for working on this @techaddict - one super minor point , but could you also maybe update the PR description to mention the testing is done with new doctests? This is really minor but for people skimming the changelog the PR description will end up as the commit message. |
|
Jenkins retest this please |
|
Test build #69331 has finished for PR 15817 at commit
|
|
LGTM given our planned follow up to update the documentation for both Python and Scala. |
…iscretizer and Bucketizer ## What changes were proposed in this pull request? added the new handleInvalid param for these transformers to Python to maintain API parity. ## How was this patch tested? existing tests testing is done with new doctests Author: Sandeep Singh <sandeep@techaddict.me> Closes #15817 from techaddict/SPARK-18366. (cherry picked from commit fe854f2) Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
|
Sorry for delay - this LGTM. Given it's been around for a while and given RC2 is likely to be cut, I've gone ahead and merged to master / branch-2.1. Thanks! |
…iscretizer and Bucketizer ## What changes were proposed in this pull request? added the new handleInvalid param for these transformers to Python to maintain API parity. ## How was this patch tested? existing tests testing is done with new doctests Author: Sandeep Singh <sandeep@techaddict.me> Closes apache#15817 from techaddict/SPARK-18366.
…iscretizer and Bucketizer ## What changes were proposed in this pull request? added the new handleInvalid param for these transformers to Python to maintain API parity. ## How was this patch tested? existing tests testing is done with new doctests Author: Sandeep Singh <sandeep@techaddict.me> Closes apache#15817 from techaddict/SPARK-18366.
What changes were proposed in this pull request?
added the new handleInvalid param for these transformers to Python to maintain API parity.
How was this patch tested?
existing tests
testing is done with new doctests