Conversation

@niketanpansare (Contributor) commented Aug 5, 2016:

  1. Fixed bugs in the Scala LogisticRegression wrapper (handling of raw predictions and passing dfam).
  2. Extended the Java MLContext to accept MatrixBlock, and added a utility function in the Python file. (Also created https://issues.apache.org/jira/browse/SYSTEMML-846; it should be a good starter task for learning the Python MLContext.)
  3. Added the mllearn classes to allow scikit-learn and MLPipeline users to use SystemML.

Using SystemML's Logistic Regression (scikit-learn way):

from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1
n_samples = len(X_digits)
n_train = int(.9 * n_samples)
X_train = X_digits[:n_train]
y_train = y_digits[:n_train]
X_test = X_digits[n_train:]
y_test = y_digits[n_train:]
import time
t1 = time.time()
logistic = sml.mllearn.LogisticRegression(sqlCtx)
print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
t2 = time.time()
# Convert to DataFrame for I/O: the current way to transfer data
logistic = sml.mllearn.LogisticRegression(sqlCtx, transferUsingDF=True)
print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
t3 = time.time()
print('Execution time: without DF: %f and with DF: %f' % (t2-t1, t3-t2))

Points to note:

  • The execution time without DF transfer is 1-2 seconds, whereas with DF transfer it is 18-20 seconds.
  • The current version only supports numeric labels/features.
  • The above interface is especially useful when the input/output data fits on a single node but the intermediate data does not.

Using SystemML's Logistic Regression (MLPipeline way):

from pyspark.ml import Pipeline
import SystemML as sml
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
training = sqlCtx.createDataFrame([
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 2.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 2.0),
    (4L, "b spark who", 1.0),
    (5L, "g d a y", 2.0),
    (6L, "spark fly", 1.0),
    (7L, "was mapreduce", 2.0),
    (8L, "e spark program", 1.0),
    (9L, "a e c l", 2.0),
    (10L, "spark compile", 1.0),
    (11L, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
lr = sml.mllearn.LogisticRegression(sqlCtx)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
test = sqlCtx.createDataFrame([
    (12L, "spark i j k"),
    (13L, "l m n"),
    (14L, "mapreduce spark"),
    (15L, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()

Using SystemML's Linear Regression (scikit-learn way):

import numpy as np
from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = sml.mllearn.LinearRegression(sqlCtx)
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

Using SystemML's SVM (scikit-learn way):

from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target 
n_samples = len(X_digits)
n_train = int(.9 * n_samples)
X_train = X_digits[:n_train]
y_train = y_digits[:n_train]
X_test = X_digits[n_train:]
y_test = y_digits[n_train:]
svm = sml.mllearn.SVM(sqlCtx, is_multi_class=True)
print('SVM score: %f' % svm.fit(X_train, y_train).score(X_test, y_test))

Using SystemML's SVM (MLPipeline way):

from pyspark.ml import Pipeline
import SystemML as sml
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
training = sqlCtx.createDataFrame([
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 2.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 2.0),
    (4L, "b spark who", 1.0),
    (5L, "g d a y", 2.0),
    (6L, "spark fly", 1.0),
    (7L, "was mapreduce", 2.0),
    (8L, "e spark program", 1.0),
    (9L, "a e c l", 2.0),
    (10L, "spark compile", 1.0),
    (11L, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
svm = sml.mllearn.SVM(sqlCtx, is_multi_class=True)
pipeline = Pipeline(stages=[tokenizer, hashingTF, svm])
model = pipeline.fit(training)
test = sqlCtx.createDataFrame([
    (12L, "spark i j k"),
    (13L, "l m n"),
    (14L, "mapreduce spark"),
    (15L, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()

Using SystemML's Naive Bayes (scikit-learn way):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import SystemML as sml
from sklearn import metrics
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
vectorizer = TfidfVectorizer()
# Both vectors and vectors_test are SciPy CSR matrices
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
nb = sml.mllearn.NaiveBayes(sqlCtx)
nb.fit(vectors, newsgroups_train.target)
pred = nb.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='weighted')

@MechCoder (Contributor):

Hey, the script looks great! But I may be biased as a sklearn user and developer.

What are you planning to do with the previous PR? In other words, what are the use cases for exposing SystemML data structures to the not-so-advanced user?

@niketanpansare (Contributor, Author):

@MechCoder Please see #197 for the answer to your question. Also, since you are a biased sklearn user/developer, you are the best person to critique the API of our library :) ... The high-level pitch we can make as the SystemML community is that we support a subset of the sklearn algorithms, and if your application uses these algorithms, you can replace the sk.* call with an sml.* call and everything should work as expected.
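
To make the pitch concrete, here is a minimal sketch of the intended drop-in swap, using only the constructor and methods shown in the examples above:

# sklearn version
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train, y_train)

# SystemML drop-in: same fit/predict/score surface, plus a SQLContext handle
import SystemML as sml
clf = sml.mllearn.LogisticRegression(sqlCtx).fit(X_train, y_train)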

# traceback.print_exc()

def getNumCols(numPyArr):
    if len(numPyArr.shape) == 1:
@MechCoder (Contributor) commented Aug 5, 2016:

np.ndim is the preferred way to do this.

(Also, in sklearn we deprecated the use of 1-D arrays, since it is ambiguous whether a 1-D array is a single sample with n_features or n_samples with a single feature. All data should be provided as a 2-D array.)
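
A minimal sketch of the suggested rewrite, reusing the function name from the diff above:

import numpy as np

def getNumCols(numPyArr):
    # np.ndim works uniformly on arrays, lists and scalars,
    # so there is no need to inspect .shape directly
    return 1 if np.ndim(numPyArr) == 1 else numPyArr.shape[1]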

@MechCoder (Contributor):

Would you be able to add some minor tests? Thanks!

@mboehm7 (Contributor) commented Aug 5, 2016:

Could we please batch these comments?

numArgs = len(args) + 1
if numArgs == 1:
    return self._fit(X)
elif numArgs == 2 and (isinstance(X, np.ndarray) or isinstance(X, pd.core.frame.DataFrame)):
Contributor:
Why not just change the signature to fit(X, y=None) and remove args?

@MechCoder (Contributor):

@mboehm7 Sorry, but what are batched comments?

@dusenberrymw (Contributor):

@mboehm7 Inline comments at specific lines of code in the PR are super useful for reviewing and discussing the code without any confusion. Unless you're "Watching" the entire repo, "Unsubscribing" from this particular PR should limit the inbox noise. :)

    pdfX = X
else:
    raise Exception('The input type not supported')
return pdfX
Contributor:
I would refactor this entire method this way in any case

# Let Pandas handle the conversion error internally and allow other array-like formats
if not isinstance(X, pd.DataFrame):
    return pd.DataFrame(X, columns=['C' + str(i) for i in range(numCols)])
return X

@mboehm7 (Contributor) commented Aug 5, 2016:

thanks @dusenberrymw.

@niketanpansare (Contributor, Author) commented Aug 6, 2016:

> Would you be able to add some minor tests? Thanks!

Do you have recommendations on how we can add Python tests alongside JUnit?

> Why update again?

I added that to test MLPipeline's CrossValidator but, for some reason, couldn't get it working. I'm not sure whether CrossValidator passes the parameters through the estimator object or through fit's params.
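
For reference, pyspark's CrossValidator passes the grid values to the estimator through fit's extra params map (est.fit(train, paramMap)) rather than by mutating the estimator object. A minimal sketch of the wiring, assuming the mllearn estimator exposed a pyspark Param such as regParam (hypothetical here):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# hypothetical: assumes lr exposes a pyspark Param named regParam
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(), numFolds=3)
cvModel = cv.fit(training)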

> Why not just change the signature to fit(X, y=None) and remove args?

Done.

> Let Pandas handle the conversion error internally and allow other array-like formats

Good point.

self.updateLog()
if y is None:
    return self._fit(X)
elif y is not None and (isinstance(X, np.ndarray) or isinstance(X, pd.core.frame.DataFrame)):
Contributor:

Seems like the check for X is done internally in convertToPandasDF
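
A minimal sketch of the suggested simplification (the two-argument form of _fit is assumed here):

def fit(self, X, y=None):
    if y is None:
        return self._fit(X)
    # convertToPandasDF already validates and coerces X,
    # so no separate isinstance check is needed here
    return self._fit(convertToPandasDF(X), y)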

@MechCoder (Contributor):

@niketanpansare Thanks for addressing my comments! I made a first pass. Hope to get back to it on Monday.

@niketanpansare (Contributor, Author):

Thanks @MechCoder for your help and suggestions 👍

@MechCoder (Contributor):

@niketanpansare Is it possible to keep just one ML model (LogisticRegression) in this PR and leave the rest for another PR after this has been merged, so that the review can focus on just the design?

@niketanpansare (Contributor, Author):

@MechCoder I would prefer to add the other ML models as well in this PR, for three reasons:

  1. All the other models (including LogisticRegression) inherit their implementation from BaseSystemMLEstimator, so the focus should be on the design of BaseSystemMLEstimator. If we are comfortable with that, it doesn't matter whether the implemented ML model is LogisticRegression, NaiveBayes, or any other.
  2. Due to overdesign of the initial implementation, very little progress has been made. As a side note, the Java LogisticRegression API has been there for almost a year and no model has been added since then (Java or Python).
  3. The hope for this API is to increase adoption, at least on the algorithm front. As a first step towards that, we need to update the algorithm documentation with interesting examples that motivate people to try SystemML. Adding just one algorithm (without a strong reason) seems to defeat that purpose.

Please note: the API is WIP, so you are welcome to modify the added ML models or add new ML models once this PR is in :)

elif isinstance(inputCols, list):
    return inputCols
else:
    raise Exception('inputCols should be of type pandas.indexes.base.Index or list')
Contributor:

This whole method is just list(inputCols)
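
A minimal sketch of that collapse (the method name is illustrative):

def convertColNamesToList(inputCols):
    # list() accepts both a pandas Index and a plain Python list
    return list(inputCols)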

Author:

Will do.

@MechCoder (Contributor) commented Aug 9, 2016:

Before I proceed any further:

  1. Can you run PEP8 on all the Python files?
  2. Can you add documentation to all of the methods? Please explicitly document the expected return types of all the methods (a sketch of what I mean follows this list).
  3. With regards to making it pip-installable, or even having a setup.py, I would postpone that to another PR. There is still some discussion (or rather non-discussion) about making pyspark pip-installable ([SPARK-1267][PYSPARK] Adds pip installer for pyspark, spark#8318). Right now I would just focus on organizing the project structure. We can just have it similar to pyspark's project structure.
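
For item 2, a sketch of the kind of per-method documentation being requested; the parameter and return types below are illustrative, based on the inputs accepted in the examples above:

def fit(self, X, y=None):
    """Invoke the appropriate SystemML training script on the given data.

    Parameters
    ----------
    X : numpy.ndarray, pandas.DataFrame, SciPy CSR matrix, or pyspark.sql.DataFrame
    y : numpy.ndarray or None
        Labels; None when X is a DataFrame that already contains the label column.

    Returns
    -------
    self : the fitted estimator, enabling fit(...).score(...) chaining
    """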

@niketanpansare (Contributor, Author):

Added documentation and created BaseSystemMLClassifier and BaseSystemMLRegressor classes in Python.

Let's address the remaining comments in the next PR. For now, I believe this API is in a reasonably stable state. Also, Spark 2.0 support depends on this.

@niketanpansare (Contributor, Author):

f02f7c0 closes this PR.
