-
Notifications
You must be signed in to change notification settings - Fork 520
[SYSTEMML-234] [SYSTEMML-208] Added mllearn library to support scikit-learn and MLPipeline #204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hey, the script looks great! But I may be biased as a sklearn user and developer. What are you planning to do with the previous PR? In other words, what are the use-cases of exposing SystemML datastructures to the not-so-advanced user? |
|
@MechCoder Please see #197 for answer to your question. Also, since you are biased sklearn user/developer, you will be the best person to critique the API of our library :) ... The high-level pitch we can make as SystemML community is that we support subset of sklearn algorithm and if your application is using these sklearn algorithms, you can replace |
| # traceback.print_exc() | ||
|
|
||
| def getNumCols(numPyArr): | ||
| if len(numPyArr.shape) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.ndim is the preferred way to do this.
(Also in sklearn, we deprecated the use of 1-D arrays (since it is ambiguous if it is a single sample with n_features or n_samples with a single feature. All data should be provided as a 2-D array)
|
Would you be able to add some minor tests? Thanks! |
|
could we please batch these comments? |
| numArgs = len(args) + 1 | ||
| if numArgs == 1: | ||
| return self._fit(X) | ||
| elif numArgs == 2 and (isinstance(X, np.ndarray) or isinstance(X, pd.core.frame.DataFrame)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just change the signature to fit(X, y=None) and remove args?
|
@mboehm7 Sorry, but what are batched comments? |
|
@mboehm7 Inline comments at specific lines of code in the PR are super useful for reviewing and discussing the code without any confusion. Unless you're "Watching" the entire repo, "Unsubscribing" from this particular PR should limit the inbox noise. :) |
| pdfX = X | ||
| else: | ||
| raise Exception('The input type not supported') | ||
| return pdfX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would refactor this entire method this way in any case
# Let Pandas handle the conversion error internally and allow other array-like formats
if not instance(X, pd.DataFrame):
return pd.DataFrame(X, columns=['C' + str(i) for i in range(numCols)])
return X|
thanks @dusenberrymw. |
Do you have recommendations of how we can add Python tests along with JUnit ?
I added that to test MLPipeline's CrossValidator. But, for some reason, couldn't get it working. Not sure if the CrossValidator passes the parameters through object or through fit's params.
Done.
Good point. |
| self.updateLog() | ||
| if y is None: | ||
| return self._fit(X) | ||
| elif y is not None and (isinstance(X, np.ndarray) or isinstance(X, pd.core.frame.DataFrame)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the check for X is done internally in convertToPandasDF
|
@niketanpansare Thanks for addressing my comments! I made a first pass. Hope to get back to it on Monday. |
|
Thanks @MechCoder for your help and suggestions 👍 |
LinearRegression Only scikit learn way of usage tested
|
@niketanpansare Is it possible to just keep one ML model (LogisticRegression) in this PR and keep the rest for another PR after this has been merged. So that the reviewing can be focused on just the design? |
|
@MechCoder I would prefer to add other ML models as well in this PR for three reasons:
Please note: the API is WIP, so you are welcome to modify the added ML models or add new ML models once this PR is in :) |
| elif isinstance(inputCols, list): | ||
| return inputCols | ||
| else: | ||
| raise Exception('inputCols should be of type pandas.indexes.base.Index or list') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole method is just list(inputCols)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
|
Before I proceed any further:
|
|
Added documentation as well as created BaseSystemMLClassifier and BaseSystemMLRegressor classes in Python. Let's address remaining comments in next PR. For now, I believe this API is in reasonably stable state. Also Spark 2.0 support is dependent on this. |
|
f02f7c0 closes this PR. |
Using SystemML's Logistic Regression (scikit-learn way):
Points to note:
Using SystemML's Logistic Regression (MLPipeline way):
Using SystemML's Linear Regression (scikit-learn way):
Using SystemML's SVM (scikit-learn way):
Using SystemML's SVM (MLPipeline way):
Using SystemML's Naive Bayes (Scikit learn way):