@niketanpansare commented Jul 28, 2016

To make progress on the adoption front, I think we need to support an embedded Python DSL, so I have created this PR as an initial proposal. I have added a subset of built-in functions (namely full, rand.*, mlprint, and dot), binary/unary operators for matrices, and parfor as an example language-level construct.

Please use this PR to facilitate further discussion on this topic. Here are some initial points:

  1. What is the preferred usage scenario for a Python data scientist?
    a. A scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
    b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
    c. Embedded DSL approach.
  2. Even in the embedded DSL approach, there are two possible implementation choices:
    a. Execute code with a context (as in this PR).
    Pros: Simple and elegant push-down mechanism and no redundant computation.
    Cons: Difficult to implement mixed code.
    b. Add lazy data structures which are executed only when certain actions are invoked on them (see the sketch after this list).
  3. Push-down of language-level constructs (skipped in this commit):
    a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below.
    b. Modify the Python parser to support push-down. This requires users to rebuild our version of Python from scratch.
  4. Whether to add standalone support, as it requires reimplementing the Py4J bridge of PySpark.
  5. APIs of the built-in functions: should they be similar to PyDML?
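
A minimal sketch of the lazy-matrix idea in option (2.b), purely illustrative and not this PR's actual implementation (all names are hypothetical): operators only accumulate PyDML statements, and nothing executes until an action forces evaluation.

class LazyMatrix(object):
    _nextID = 0

    def __init__(self, dmlLines, varName):
        self.dmlLines = dmlLines   # PyDML statements accumulated so far
        self.varName = varName     # e.g. 'mVar3'

    @classmethod
    def _fresh(cls, dmlLines, expr):
        cls._nextID += 1
        name = 'mVar%d' % cls._nextID
        return cls(dmlLines + ['%s = %s' % (name, expr)], name)

    def __add__(self, other):
        # lazily record the addition instead of computing it
        return LazyMatrix._fresh(self.dmlLines + other.dmlLines,
                                 '%s + %s' % (self.varName, other.varName))

    def toNumPyArray(self):
        # action: only now is the accumulated script handed to the engine
        script = '\n'.join(self.dmlLines)
        raise NotImplementedError("engine execution omitted in this sketch")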

If, after the discussion, we decide to go in a different direction, I will delete this PR.

If we agree to go in this direction, I can create the following tasks on JIRA:

  1. Make the Python embedded DSL feature complete.
  2. Add documentation and examples for the Python embedded DSL.
  3. Create .py files for all our algorithms using the Python embedded DSL. We can keep the interface similar to scikit-learn.
  4. Add additional input/output mechanisms (other than DataFrame), for example: binary blocked RDDs, string RDDs, MLlib's BlockMatrix, NumPy arrays, etc.
  5. Add Py4J support to allow standalone usage and create a pip installer for SystemML.

Please note: I am using the older MLContext approach, and we can update it to the newer MLContext once it is delivered.

An example script supported by this commit:

# wget https://sparktc.ibmcloud.com/repo/latest/systemml-0.11.0-incubating-SNAPSHOT.jar
# pyspark --master local[*] --driver-class-path systemml-0.11.0-incubating-SNAPSHOT.jar

>>> import SystemML as sml
>>> import numpy as np
>>> sml.setSparkContext(sc)

Welcome to Apache SystemML!

>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> m2 = m1 * (m2 + m1)
>>> m4 = 1.0 - m2
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
mVar1 = load(" ", format="csv")
mVar2 = load(" ", format="csv")
mVar3 = mVar2 + mVar1
mVar4 = mVar1 * mVar3
mVar5 = 1.0 - mVar4
save(mVar5, " ")

<SystemML.defmatrix.matrix object>
>>> m2.eval()
>>> m2
# This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.
<SystemML.defmatrix.matrix object>
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
mVar4 = load(" ", format="csv")
mVar5 = 1.0 - mVar4
save(mVar5, " ")

<SystemML.defmatrix.matrix object>
>>> m4.sum(axis=1).toNumPyArray()
array([[-60.],
       [-60.],
       [-60.]])
>>>

@mboehm7 @frreiss @bertholdreinwald @nakul02 @dusenberrymw @deroneriksson

@dusenberrymw

@niketanpansare This is great! I'm definitely in favor of creating a Python DSL. Here are my thoughts on the proposed questions:

  1. What is the preferred usage scenario for a Python data scientist?
    a. A scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
    b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
    c. Embedded DSL approach. This is where we need to move to if we want data scientists to start using the project, since they will want to write code in Python itself.
  2. Even in the embedded DSL approach, there are two possible implementation choices:
    a. Execute code with a context (as in this PR).
    Pros: Simple and elegant push-down mechanism and no redundant computation.
    Cons: Difficult to implement mixed code.
    b. Add lazy data structures which are executed only when certain actions are invoked on them. Ideally, we will want the user to be able to mix in NumPy and other Python code seamlessly. Option (a) is still a decent approach.
  3. Push-down of language-level constructs:
    a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below. We definitely do not want to ask the user to create a custom Python build. We should simply support PySpark with Python 2 or 3.
    b. Modify the Python parser to support push-down. This requires users to rebuild our version of Python from scratch.
  4. Whether to add standalone support, as it requires reimplementing the Py4J bridge of PySpark. No, using PySpark on a laptop is already extremely simple.
  5. APIs of the built-in functions: should they be similar to PyDML? I would vote for them to be equivalent to NumPy.
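
For instance, a NumPy-equivalent surface would mirror np.full's (shape, fill_value) signature rather than PyDML-style rows/cols keywords; the sml lines below are hypothetical, not the current API:

import numpy as np

np.full((3, 3), 2.0)              # NumPy: shape tuple, then fill value
# sml.full((3, 3), 2.0)           # hypothetical NumPy-equivalent call
# sml.full(2.0, rows=3, cols=3)   # PyDML-style alternative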

@niketanpansare commented Jul 29, 2016

Thanks @dusenberrymw for your thoughts 👍

Here are my comments:

Add lazy data structures which will be executed only when certain actions are invoked on them. Option (a) is still a decent approach.

My only worry with Option (2.b) is that adding lazy data structures would make the Python API complex and hence make the embedded scripts difficult to debug. We can start by getting feedback on Option (2.a) from data scientists and move to Option (2.b) only if absolutely necessary.

Push-down of language-level constructs

I agree with you about Option (3.a). Also, I cleaned up the API for language-level constructs (and updated the example in my first comment), so Option (3.b) is not required.

Ideally, we will want the user to be able to mix in NumPy and other Python code seamlessly.

There are two options here:

A. We ask users to isolate the mixed code themselves:

with context(sqlCtx):
    m1 = random.uniform(rows=3, cols=2)
    m2 = random.normal(rows=2, cols=3)
    m3 = dot(m1, m2)
    registerOutput([m3])

# mixed code:
numpyM3 = m3.getNumPyArray()
numpyM4 = numpyM3 * 2  # e.g., arbitrary NumPy operations on the data
m4 = Matrix(numpyM4)

with context(sqlCtx):
    pass  # ... some DML that uses m4 ...

The above option requires writing appropriate input/output wrappers to convert binary blocked RDDs into NumPy arrays and vice versa.

B. We do callbacks to Python using external built-in functions. Here the user wraps the mixed code into a Python UDF, and we provide some construct to push it down as an external UDF. The external UDF does the necessary conversions and calls the Python UDF via the Py4J framework.
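
A rough sketch of what the user-facing side of option B could look like; externalUDF and its registration mechanics are hypothetical and not implemented:

import numpy as np
import SystemML as sml

def myNumpyOp(x):
    # arbitrary mixed NumPy code, invoked from DML via a Py4J callback
    return np.clip(x, 0.0, 1.0)

m1 = sml.matrix(np.ones((3, 3)))
m2 = sml.externalUDF(myNumpyOp)(m1)  # hypothetical construct, not implemented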

@iyounus commented Jul 30, 2016

I like the idea of writing everything in Python. My suggestion would be to make the API more Python-like, so that for a Python user it doesn't feel unnatural. I would like the syntax to be something like the following, but I don't know whether it is technically possible.

import numpy as np
import SystemML as sml

mlContext = sml.MLContext(sc)  # assuming MLContext is exposed by the SystemML package

m1 = sml.full(1, rows=3, cols=3)
m1 = m1 + 1

tmp = np.ones((3,3))  # numpy matrix
m2 = sml.matrix(tmp)  # ability to convert numpy to systemml matrix (???)
m2 += 2
m3 = m1 * m2
m3 /= m1
m1 = sml.random.uniform(rows=3, cols=2)
m2 = sml.random.normal(rows=2, cols=3)
m4 = sml.dot(m1, m2)

mlContext.registerOutput([m3, m4])
mlContext.execute()

m3.getDF().show()
m4.getDF().show()

@niketanpansare commented Aug 1, 2016

Thanks @iyounus ... your suggestions are much appreciated. I have delivered minor fixes and to/from NumPy array conversion in the current prototype. Also, I have updated the code in my first comment to reflect this.

        except AttributeError:
            pass
        self.numTabs = 0
        self.sqlCtx = sqlCtx
@dusenberrymw Aug 1, 2016

We can use self.sqlCtx = SQLContext.getOrCreate(sc), assuming a from pyspark.sql import SQLContext statement up top. That way, we can get rid of the need to explicitly create and pass around a SQLContext object (which is going away anyway in 2.0).
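
A minimal sketch of the suggested change, assuming the assignment lives in the context's constructor:

from pyspark.sql import SQLContext

class MLContext(object):
    def __init__(self, sc):
        # reuse the SQLContext PySpark already created, if any,
        # instead of constructing a second one
        self.sqlCtx = SQLContext.getOrCreate(sc)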

@niketanpansare (Contributor Author)

Good point, I will update my PR. For some reason self.sqlCtx = SQLContext.getOrCreate(sc) is throwing an error. Instead, I am creating self.sqlCtx = SQLContext(sc) for now, and we can replace it with the cleaner call later.

@dusenberrymw

@niketanpansare The latest update is awesome! I really like not having to use the with ... construct for the main part of the code, as well as not having to explicitly call any of the registerInput stuff.

@dusenberrymw

@niketanpansare Also, as a side note, can you replace the tabs in the file with spaces so that we can use this with Python 3? I made a commit on Friday addressing a couple of tabs that were accidentally added to the existing MLContext. Additionally, I made another slight change to executeScript on Friday for Python 3 as well.

@niketanpansare

@dusenberrymw Replaced tabs with spaces 👍

        self.df = df
        Matrix.systemmlVarID += 1
        self.ID = 'mVar' + str(Matrix.systemmlVarID)

    def getID(self):
Contributor

Stylistic comment: can you separate the methods with a blank line in between, so that it is easier to read?

@niketanpansare (Contributor Author)

Done 👍

@MechCoder commented Aug 1, 2016

I am not a SystemML developer but I am looking forward to testing this out.

@niketanpansare I still get errors when using this with Python 3:

  File "/home/manoj/incubator-systemml/SystemML.py", line 318
    try:
        ^
TabError: inconsistent use of tabs and spaces in indentation
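
One quick way to normalize the file, assuming the intended indentation is 4-space tabs (a sketch, not part of this PR):

# rewrite SystemML.py with tabs expanded to 4 spaces
with open('SystemML.py') as f:
    src = f.read()
with open('SystemML.py', 'w') as f:
    f.write(src.expandtabs(4))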

@MechCoder

Oh, I used the version before the last commit was made. Please ignore my previous comment.

@niketanpansare commented Aug 1, 2016

@MechCoder Can you please reply with your thoughts/preferences (as @dusenberrymw did in his first comment) on the questions in my first comment? Also, any additional comments about the high-level design, as well as usage of the embedded Python API, are welcome :)

@MechCoder

@niketanpansare I really like the script in the PR description, especially now that the with constructs are done away with. I think this is a good starting point.

        self.matrixOutput = []

    def stop(self, vals):
@MechCoder (Contributor)

Can this be stop(self, *vals) or stop(self, vals=None)?

It is odd that I need to provide an argument to stop the MLContext
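
A minimal sketch of the varargs variant suggested above (names illustrative):

class MLContext(object):
    def stop(self, *vals):
        # a bare ml.stop() is now legal; any matrices passed in are
        # registered as outputs before execution is triggered
        self.matrixOutput = list(vals)
        # ... execution of the accumulated script would happen here ...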

@dusenberrymw (Contributor)

Yeah I think it might also be interesting to think about removing the start(...) function, and renaming this stop(...) function to something like eval(...).

@niketanpansare (Contributor Author)

@MechCoder Good point. Will update the PR with your suggestion :)

@dusenberrymw The only reason I added start and stop is to specify the scope that SystemML will try to exploit, as in some cases it might have an adverse performance impact due to internal book-keeping. I will keep start for now, but will rename stop to eval 👍

Contributor

One possibility is to start with a "fresh" scope on creation of the original MLContext object, and then "restart" the scope on every call to eval(...).
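
An illustrative shape for that idea (not the PR's implementation; names hypothetical):

class MLContext(object):
    def __init__(self, sc):
        self.sc = sc
        self._resetScope()          # fresh scope on creation

    def _resetScope(self):
        self.matrixOutput = []      # forget previously tracked outputs

    def eval(self, *vals):
        self.matrixOutput.extend(vals)
        # ... execute the accumulated PyDML via the engine (omitted) ...
        self._resetScope()          # restart the scope for the next block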

@dusenberrymw

I agree with @MechCoder that this is a great starting point. In general, we should really continue to focus on solid design for this API so that it is correctly Pythonic, and extremely well integrated with NumPy.

@frreiss commented Aug 3, 2016

Is it appropriate to use SYSTEMML-451 as the locus for this work? It seems like the Python DSL is going to proceed more or less independently of work in Scala and Java, at least in the short term.

@dusenberrymw

By the way, let's keep this open for a bit as a WIP to continue thinking carefully about the overall design and trying it out. I also want to look around at other distributed NumPy-like projects.

@niketanpansare

Unless there are major concerns, I think we should merge this PR now and create new PRs for additional fixes, because any additional Python-related changes (e.g., improvements to MLContext, merging the Matrix and matrix classes, release-related changes (SystemML vs. systemml, pip installer), examples, etc.) depend on merging this PR.

@@ -1,482 +0,0 @@
#!/usr/bin/python
@dusenberrymw (Contributor)

I don't think we should remove this entire file in this PR, as it contains the old MLContext that was in the previous release. I would just remove the mllearn and mlpipeline stuff from it, as that was added after the last official release.

@dusenberrymw commented Aug 27, 2016

@niketanpansare I agree that we should get this PR in and then create a set of JIRAs/PRs to address a bunch of the other concerns. Before we merge though, I left comments for a few things that should be addressed now:

  1. I don't think we should completely remove the old src/main/java/org/apache/sysml/api/python/SystemML.py file in this PR, as it contains the old MLContext. We should just remove the mllearn and mlpipeline stuff from it, as those components were only added recently. That way we can deprecate both the Python & Java old MLContext APIs, and then remove them in the future.
  2. There are still several places in mlcontext.py in which the docstrings were messed up in the merge, including missing comments. Can you add those back in before merging?

@dusenberrymw commented Aug 27, 2016

For the next set of JIRAs/PRs, we should aim to complete the following before the upcoming official release:

  1. Update the package from SystemML to systemml. Since the entire src/main/python folder was added from scratch very recently with a new API that is not backwards compatible anyway, and the mllearn and mlpipeline components were also added after the previous release, we won't break anything, and we will be more Pythonic in the process.
  2. Remove MLResults.getDataFrame and replace usages of getDataFrame("matrixName") with get("matrixName").toDF(). This way, the API stays clean and simple for an end user, and it will help with the upcoming merge of Matrix and matrix. Keeping only get() makes it simple to get a mixture of datatypes out of a DML script at the same time. [See the example below.]
  3. Remove MLResults.getNumPyArray, add a Matrix.toArray() function on the Java and Python sides (returning a list in Python), add a Matrix.toNumPyArray() on the Python side, and replace usages of getNumPyArray("matrixName") with get("matrixName").toNumPyArray(). @deroneriksson should be able to assist with adding this on the Java side. Once again, this will keep the API clean, help the upcoming merge, and keeping only get() makes it simple to get a mixture of datatypes out of a DML script at the same time.
  4. Merge Matrix with matrix, keeping the latter name.

Overall, I'm excited for the direction the Python API is moving! 👍

Example of using get() for mixed datatypes (scalars, matrices, strings, etc):

from SystemML import MLContext, dml

rdd1 = sc.parallelize(["1.0,2.0", "3.0,4.0"])

sums = """
s1 = sum(m1)
m2 = m1 * 2
t = "whatever"
"""

ml = MLContext(sc)
script = dml(sums).input(m1=rdd1).out("s1", "m2", "t")
s1, m2, t = ml.execute(script).get("s1", "m2", "t")
s1, m2, t

Returns:

(10.0, Matrix, 'whatever')

Then,

m2.toDF()

Returns:

DataFrame[ID: double, C1: double, C2: double]

Additionally, the Matrix m2 can be used directly as input in a subsequent script.
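
For example, feeding the returned Matrix straight into a follow-up script might look like this (a sketch in the same proposed API, not tested):

script2 = dml("m3 = m2 + 1").input(m2=m2).out("m3")
m3 = ml.execute(script2).get("m3")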

@niketanpansare

+1 for all 4 points. Two comments on "Update the package from SystemML to systemml":

  • Let's send an email on the dev mailing list to notify everyone of this change once the PR is created.
  • The PR should also update the existing documentation and examples.

@niketanpansare

@dusenberrymw Updated mlcontext.py as per your suggestions. Please let me know if there are any more changes.

@dusenberrymw

@niketanpansare Awesome, LGTM. Let's merge this, and create JIRAs + PRs to cover the remaining 4 items above, and the remaining set of miscellaneous comments throughout this PR. After those two steps, let's focus on the important aspect of expanding and hardening each of the APIs (matrix, mllearn, mlpipeline, MLContext), i.e. getting complete feature parity with NumPy matrices, etc.

+1 for the documentation page you created too 👍

Also, when you merge this, can you append the line Closes #197. to the end of the commit message so that this PR and the commit will be automatically linked together?

@asfgit closed this in 48a7267 on Aug 28, 2016
asfgit pushed a commit that referenced this pull request Aug 28, 2016
- Added matrix class that supports lazy evaluation of elementary matrix
  operations.

- Updated documentation for Python users that explains usage of mllearn,
  matrix and mlcontext.

- Added a setup file for pip installer.

Closes #197
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017