@niketanpansare commented Jul 28, 2016

To make progress on the adoption front, I think we need to support an embedded Python DSL, so I have created this PR as an initial proposal. I have added a subset of built-in functions (namely full, rand.*, mlprint, and dot), binary/unary operators for matrices, and parfor as an example language-level construct.

Please use this PR to facilitate further discussion on this topic. Here are some initial points:

  1. What is the preferred usage scenario for a Python data scientist?
    a. A scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
    b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
    c. Embedded DSL approach.
  2. Even in the embedded DSL approach, there are two possible implementation choices:
    a. Execute code with a context (as in this PR).
    Pros: Simple and elegant push-down mechanism and no redundant computation.
    Cons: Difficult to implement mixed code.
    b. Add lazy data structures which are executed only when certain actions are invoked on them (see the sketch after this list).
  3. Push-down of language-level constructs (skipped in this commit):
    a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below.
    b. Modify the Python parser to support push-down. This requires users to rebuild our version of Python from scratch.
  4. Whether to add standalone support, as it requires reimplementing the Py4J bridge of PySpark.
  5. APIs of the built-in functions: should they be similar to PyDML?
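
A minimal sketch of the lazy-matrix idea in option (2.b), purely illustrative and not this PR's actual implementation (all names are hypothetical): operators only accumulate PyDML statements, and nothing executes until an action forces evaluation.

class LazyMatrix(object):
    _nextID = 0

    def __init__(self, dmlLines, varName):
        self.dmlLines = dmlLines   # PyDML statements accumulated so far
        self.varName = varName     # e.g. 'mVar3'

    @classmethod
    def _fresh(cls, dmlLines, expr):
        cls._nextID += 1
        name = 'mVar%d' % cls._nextID
        return cls(dmlLines + ['%s = %s' % (name, expr)], name)

    def __add__(self, other):
        # lazily record the addition instead of computing it
        return LazyMatrix._fresh(self.dmlLines + other.dmlLines,
                                 '%s + %s' % (self.varName, other.varName))

    def toNumPyArray(self):
        # action: only now is the accumulated script handed to the engine
        script = '\n'.join(self.dmlLines)
        raise NotImplementedError("engine execution omitted in this sketch")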

If, after the discussion, we decide to go in a different direction, I will delete this PR.

If we agree to go in this direction, I can create the following tasks on JIRA:

  1. Make the Python embedded DSL feature complete.
  2. Add documentation and examples for the Python embedded DSL.
  3. Create .py files for all our algorithms using the Python embedded DSL. We can keep the interface similar to scikit-learn.
  4. Add additional input/output mechanisms (other than DataFrame), for example: binary blocked RDDs, string RDDs, MLlib's BlockMatrix, NumPy arrays, etc.
  5. Add Py4J support to allow standalone usage and create a pip installer for SystemML.

Please note: I am using the older MLContext approach, and we can update it to the newer MLContext once it is delivered.

An example script supported by this commit:

# wget https://sparktc.ibmcloud.com/repo/latest/systemml-0.11.0-incubating-SNAPSHOT.jar
# pyspark --master local[*] --driver-class-path systemml-0.11.0-incubating-SNAPSHOT.jar

>>> import SystemML as sml
>>> import numpy as np
>>> sml.setSparkContext(sc)

Welcome to Apache SystemML!

>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> m2 = m1 * (m2 + m1)
>>> m4 = 1.0 - m2
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
mVar1 = load(" ", format="csv")
mVar2 = load(" ", format="csv")
mVar3 = mVar2 + mVar1
mVar4 = mVar1 * mVar3
mVar5 = 1.0 - mVar4
save(mVar5, " ")

<SystemML.defmatrix.matrix object>
>>> m2.eval()
>>> m2
# This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.
<SystemML.defmatrix.matrix object>
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
mVar4 = load(" ", format="csv")
mVar5 = 1.0 - mVar4
save(mVar5, " ")

<SystemML.defmatrix.matrix object>
>>> m4.sum(axis=1).toNumPyArray()
array([[-60.],
       [-60.],
       [-60.]])
>>>

@mboehm7 @frreiss @bertholdreinwald @nakul02 @dusenberrymw @deroneriksson

@dusenberrymw

@niketanpansare This is great! I'm definitely in favor of creating a Python DSL. Here are my thoughts on the proposed questions:

  1. What is the preferred usage scenario for a Python data scientist?
    a. A scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
    b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
    c. Embedded DSL approach. This is where we need to move to if we want data scientists to start using the project, since they will want to write code in Python itself.
  2. Even in the embedded DSL approach, there are two possible implementation choices:
    a. Execute code with a context (as in this PR).
    Pros: Simple and elegant push-down mechanism and no redundant computation.
    Cons: Difficult to implement mixed code.
    b. Add lazy data structures which are executed only when certain actions are invoked on them. Ideally, we will want the user to be able to mix in NumPy and other Python code seamlessly. Option (a) is still a decent approach.
  3. Push-down of language-level constructs:
    a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below. We definitely do not want to ask the user to create a custom Python build. We should simply support PySpark with Python 2 or 3.
    b. Modify the Python parser to support push-down. This requires users to rebuild our version of Python from scratch.
  4. Whether to add standalone support, as it requires reimplementing the Py4J bridge of PySpark. No, using PySpark on a laptop is already extremely simple.
  5. APIs of the built-in functions: should they be similar to PyDML? I would vote for them to be equivalent to NumPy.
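
For instance, a NumPy-equivalent surface would mirror np.full's (shape, fill_value) signature rather than PyDML-style rows/cols keywords; the sml lines below are hypothetical, not the current API:

import numpy as np

np.full((3, 3), 2.0)              # NumPy: shape tuple, then fill value
# sml.full((3, 3), 2.0)           # hypothetical NumPy-equivalent call
# sml.full(2.0, rows=3, cols=3)   # PyDML-style alternative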

@niketanpansare commented Jul 29, 2016

Thanks @dusenberrymw for your thoughts 👍

Here are my comments:

Add lazy data structures which will be executed only when certain actions are invoked on them. Option (a) is still a decent approach.

My only worry with Option (2.b) is that adding lazy data structures would make the Python API complex and hence make the embedded scripts difficult to debug. We can start by getting feedback on Option (2.a) from data scientists and move to Option (2.b) only if absolutely necessary.

Push-down of language-level constructs

I agree with you about Option (3.a). Also, I cleaned up the API for language-level constructs (and updated the example in my first comment), so Option (3.b) is not required.

Ideally, we will want the user to be able to mix in NumPy and other Python code seamlessly.

There are two options here:

A. We ask users to isolate the mixed code themselves:

with context(sqlCtx):
    m1 = random.uniform(rows=3, cols=2)
    m2 = random.normal(rows=2, cols=3)
    m3 = dot(m1, m2)
    registerOutput([m3])

# mixed code:
numpyM3 = m3.getNumPyArray()
numpyM4 = numpyM3 * 2  # e.g., arbitrary NumPy operations on the data
m4 = Matrix(numpyM4)

with context(sqlCtx):
    pass  # ... some DML that uses m4 ...

The above option requires writing appropriate input/output wrappers to convert binary blocked RDDs into NumPy arrays and vice versa.

B. We do callbacks to Python using external built-in functions. Here the user wraps the mixed code into a Python UDF, and we provide some construct to push it down as an external UDF. The external UDF does the necessary conversions and calls the Python UDF via the Py4J framework.
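
A rough sketch of what the user-facing side of option B could look like; externalUDF and its registration mechanics are hypothetical and not implemented:

import numpy as np
import SystemML as sml

def myNumpyOp(x):
    # arbitrary mixed NumPy code, invoked from DML via a Py4J callback
    return np.clip(x, 0.0, 1.0)

m1 = sml.matrix(np.ones((3, 3)))
m2 = sml.externalUDF(myNumpyOp)(m1)  # hypothetical construct, not implemented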

@iyounus commented Jul 30, 2016

I like the idea of writing everything in Python. My suggestion would be to make the API more Python-like, so that for a Python user it doesn't feel unnatural. I would like the syntax to be something like the following, but I don't know whether it is technically possible.

import numpy as np
import SystemML as sml

mlContext = sml.MLContext(sc)  # assuming MLContext is exposed by the SystemML package

m1 = sml.full(1, rows=3, cols=3)
m1 = m1 + 1

tmp = np.ones((3,3))  # numpy matrix
m2 = sml.matrix(tmp)  # ability to convert numpy to systemml matrix (???)
m2 += 2
m3 = m1 * m2
m3 /= m1
m1 = sml.random.uniform(rows=3, cols=2)
m2 = sml.random.normal(rows=2, cols=3)
m4 = sml.dot(m1, m2)

mlContext.registerOutput([m3, m4])
mlContext.execute()

m3.getDF().show()
m4.getDF().show()

@niketanpansare commented Aug 1, 2016

Thanks @iyounus ... your suggestions are much appreciated. I have delivered minor fixes and to/from NumPy array conversion in the current prototype. Also, I have updated the code in my first comment to reflect this.

        except AttributeError:
            pass
        self.numTabs = 0
        self.sqlCtx = sqlCtx
@dusenberrymw Aug 1, 2016

We can use self.sqlCtx = SQLContext.getOrCreate(sc), assuming a from pyspark.sql import SQLContext statement up top. That way, we can get rid of the need to explicitly create and pass around a SQLContext object (which is going away anyway in 2.0).
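
A minimal sketch of the suggested change, assuming the assignment lives in the context's constructor:

from pyspark.sql import SQLContext

class MLContext(object):
    def __init__(self, sc):
        # reuse the SQLContext PySpark already created, if any,
        # instead of constructing a second one
        self.sqlCtx = SQLContext.getOrCreate(sc)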

@niketanpansare (Contributor Author)

Good point, I will update my PR. For some reason self.sqlCtx = SQLContext.getOrCreate(sc) is throwing an error. Instead, I am creating self.sqlCtx = SQLContext(sc) for now, and we can replace it with the cleaner call later.

@dusenberrymw

@niketanpansare The latest update is awesome! I really like not having to use the with ... construct for the main part of the code, as well as not having to explicitly call any of the registerInput stuff.

@dusenberrymw

@niketanpansare Also, as a side note, can you replace the tabs in the file with spaces so that we can use this with Python 3? I made a commit on Friday addressing a couple of tabs that were accidentally added to the existing MLContext. Additionally, I made another slight change to executeScript on Friday for Python 3 as well.

@niketanpansare

@dusenberrymw Replaced tabs with spaces 👍

        self.df = df
        Matrix.systemmlVarID += 1
        self.ID = 'mVar' + str(Matrix.systemmlVarID)

    def getID(self):
Contributor

Stylistic comment: can you separate the methods with a blank line in between, so that it is easier to read?

@niketanpansare (Contributor Author)

Done 👍

@MechCoder commented Aug 1, 2016

I am not a SystemML developer but I am looking forward to testing this out.

@niketanpansare I still get errors when using this with Python 3:

  File "/home/manoj/incubator-systemml/SystemML.py", line 318
    try:
        ^
TabError: inconsistent use of tabs and spaces in indentation
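
One quick way to normalize the file, assuming the intended indentation is 4-space tabs (a sketch, not part of this PR):

# rewrite SystemML.py with tabs expanded to 4 spaces
with open('SystemML.py') as f:
    src = f.read()
with open('SystemML.py', 'w') as f:
    f.write(src.expandtabs(4))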

@MechCoder

Oh, I used the version before the last commit was made. Please ignore my previous comment.

@niketanpansare commented Aug 1, 2016

@MechCoder Can you please reply with your thoughts/preferences (as @dusenberrymw did in his first comment) on the questions in my first comment? Also, any additional comments about the high-level design, as well as usage of the embedded Python API, are welcome :)

@MechCoder

@niketanpansare I really like the script in the PR description, especially now that the with constructs are done away with. I think this is a good starting point.

        self.matrixOutput = []

    def stop(self, vals):
@MechCoder (Contributor)

Can this be stop(self, *vals) or stop(self, vals=None)?

It is odd that I need to provide an argument to stop the MLContext
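
A minimal sketch of the varargs variant suggested above (names illustrative):

class MLContext(object):
    def stop(self, *vals):
        # a bare ml.stop() is now legal; any matrices passed in are
        # registered as outputs before execution is triggered
        self.matrixOutput = list(vals)
        # ... execution of the accumulated script would happen here ...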

@dusenberrymw (Contributor)

Yeah I think it might also be interesting to think about removing the start(...) function, and renaming this stop(...) function to something like eval(...).

@niketanpansare (Contributor Author)

@MechCoder Good point. Will update the PR with your suggestion :)

@dusenberrymw The only reason I added start and stop is to specify the scope that SystemML will try to exploit, as in some cases it might have an adverse performance impact due to internal book-keeping. I will keep start for now, but will rename stop to eval 👍

Contributor

One possibility is to start with a "fresh" scope on creation of the original MLContext object, and then "restart" the scope on every call to eval(...).
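
An illustrative shape for that idea (not the PR's implementation; names hypothetical):

class MLContext(object):
    def __init__(self, sc):
        self.sc = sc
        self._resetScope()          # fresh scope on creation

    def _resetScope(self):
        self.matrixOutput = []      # forget previously tracked outputs

    def eval(self, *vals):
        self.matrixOutput.extend(vals)
        # ... execute the accumulated PyDML via the engine (omitted) ...
        self._resetScope()          # restart the scope for the next block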

@dusenberrymw

I agree with @MechCoder that this is a great starting point. In general, we should really continue to focus on solid design for this API so that it is correctly Pythonic, and extremely well integrated with NumPy.

@frreiss commented Aug 3, 2016

Is it appropriate to use SYSTEMML-451 as the locus for this work? It seems like the Python DSL is going to proceed more or less independently of work in Scala and Java, at least in the short term.

@dusenberrymw

By the way, let's keep this open for a bit as a WIP to continue thinking carefully about the overall design and trying it out. I also want to look around at other distributed NumPy-like projects.

@niketanpansare

Unless there are major concerns, I think we should merge this PR now and create new PRs for additional fixes, because any additional Python-related changes (e.g., improvements to MLContext, merging the Matrix and matrix classes, release-related changes (SystemML vs. systemml, pip installer), examples, etc.) depend on merging this PR.

@@ -1,482 +0,0 @@
#!/usr/bin/python
@dusenberrymw (Contributor)

I don't think we should remove this entire file in this PR, as it contains the old MLContext that was in the previous release. I would just remove the mllearn and mlpipeline stuff from it, as that was added after the last official release.

@dusenberrymw commented Aug 27, 2016

@niketanpansare I agree that we should get this PR in and then create a set of JIRAs/PRs to address a bunch of the other concerns. Before we merge though, I left comments for a few things that should be addressed now:

  1. I don't think we should completely remove the old src/main/java/org/apache/sysml/api/python/SystemML.py file in this PR, as it contains the old MLContext. We should just remove the mllearn and mlpipeline stuff from it, as those components were only added recently. That way we can deprecate both the Python & Java old MLContext APIs, and then remove them in the future.
  2. There are still several places in mlcontext.py in which the docstrings were messed up in the merge, including missing comments. Can you add those back in before merging?

@dusenberrymw commented Aug 27, 2016

For the next set of JIRAs/PRs, we should aim to complete the following before the upcoming official release:

  1. Update the package from SystemML to systemml. Since the entire src/main/python folder was added from scratch very recently with a new API that is not backwards compatible anyway, and the mllearn and mlpipeline components were also added after the previous release, we won't break anything, and we will be more Pythonic in the process.
  2. Remove MLResults.getDataFrame and replace usages of getDataFrame("matrixName") with get("matrixName").toDF(). This way, the API stays clean and simple for an end user, and it will help with the upcoming merge of Matrix and matrix. Keeping only get() makes it simple to get a mixture of datatypes out of a DML script at the same time. [See the example below.]
  3. Remove MLResults.getNumPyArray, add a Matrix.toArray() function on the Java and Python sides (returning a list in Python), add a Matrix.toNumPyArray() on the Python side, and replace usages of getNumPyArray("matrixName") with get("matrixName").toNumPyArray(). @deroneriksson should be able to assist with adding this on the Java side. Once again, this will keep the API clean, help the upcoming merge, and keeping only get() makes it simple to get a mixture of datatypes out of a DML script at the same time.
  4. Merge Matrix with matrix, keeping the latter name.

Overall, I'm excited for the direction the Python API is moving! 👍

Example of using get() for mixed datatypes (scalars, matrices, strings, etc):

from SystemML import MLContext, dml

rdd1 = sc.parallelize(["1.0,2.0", "3.0,4.0"])

sums = """
s1 = sum(m1)
m2 = m1 * 2
t = "whatever"
"""

ml = MLContext(sc)
script = dml(sums).input(m1=rdd1).out("s1", "m2", "t")
s1, m2, t = ml.execute(script).get("s1", "m2", "t")
s1, m2, t

Returns:

(10.0, Matrix, 'whatever')

Then,

m2.toDF()

Returns:

DataFrame[ID: double, C1: double, C2: double]

Additionally, the Matrix m2 can be used directly as input in a subsequent script.
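
For example, feeding the returned Matrix straight into a follow-up script might look like this (a sketch in the same proposed API, not tested):

script2 = dml("m3 = m2 + 1").input(m2=m2).out("m3")
m3 = ml.execute(script2).get("m3")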

@niketanpansare

+1 for all 4 points. Two comments on "Update the package from SystemML to systemml":

  • Let's send an email on the dev mailing list to notify everyone of this change once the PR is created.
  • The PR should also update the existing documentation and examples.

@niketanpansare

@dusenberrymw Updated mlcontext.py as per your suggestions. Please let me know if there are any more changes.

@dusenberrymw

@niketanpansare Awesome, LGTM. Let's merge this, and create JIRAs + PRs to cover the remaining 4 items above, and the remaining set of miscellaneous comments throughout this PR. After those two steps, let's focus on the important aspect of expanding and hardening each of the APIs (matrix, mllearn, mlpipeline, MLContext), i.e. getting complete feature parity with NumPy matrices, etc.

+1 for the documentation page you created too 👍

Also, when you merge this, can you append the line Closes #197. to the end of the commit message so that this PR and the commit will be automatically linked together?

@asfgit closed this in 48a7267 on Aug 28, 2016
asfgit pushed a commit that referenced this pull request Aug 28, 2016
- Added matrix class that supports lazy evaluation of elementary matrix
  operations.

- Updated documentation for Python users that explains usage of mllearn,
  matrix and mlcontext.

- Added a setup file for pip installer.

Closes #197
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017