[SYSTEMML-451] Python embedded DSL #197
Conversation
@niketanpansare This is great! I'm definitely in favor of creating a Python DSL. Here are my thoughts on the proposed questions:
Thanks @dusenberrymw for your thoughts 👍 Here are my comments:
My only worry with Option (2.b) is that adding a lazy data structure would make the Python API complex and hence make the embedded script difficult to debug. We can start by getting feedback on Option (2.a) from data scientists and move to Option (2.b) only if absolutely necessary.
I agree with you about Option (3.a). I have also cleaned up the API for language-level constructs (and updated the example in my first comment), so Option (3.b) is not required.
There are two options here:

A. We ask users to isolate the mixed code themselves:

```python
with context(sqlCtx):
    m1 = random.uniform(rows=3, cols=2)
    m2 = random.normal(rows=2, cols=3)
    m3 = dot(m1, m2)
    registerOutput([m3])

# mixed code:
numpyM3 = m3.getNumPyArray()
numpyM4 = ...  # do certain numpy operations
m4 = Matrix(numpyM4)

with context(sqlCtx):
    ...  # some DML that uses m4
```

The above option requires writing appropriate input/output wrappers to convert binary-blocked RDDs into NumPy arrays and vice versa.

B. We do callbacks to Python using external builtin functions. Here the user will have to wrap the mixed code into a Python UDF, and then we provide some construct to push it down as an external UDF. The external UDF will do the necessary conversions and call the Python UDF using the Py4J framework.
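A rough sketch of what Option B could look like from the user's side. Note that `mixed_code_udf` and `call_external_udf` are hypothetical names for illustration only; the real engine-side wrapper would perform the block-RDD/NumPy conversion and invoke the UDF over Py4J rather than calling it directly:

```python
import numpy as np

def mixed_code_udf(numpy_m3):
    # arbitrary NumPy operations that the DSL cannot push down
    return np.clip(numpy_m3 * 2.0, 0.0, 10.0)

def call_external_udf(udf, matrix_as_numpy):
    # stand-in for the engine-side external builtin function: it would
    # convert the binary-blocked RDD to a NumPy array, invoke the Python
    # UDF over Py4J, and convert the result back
    return udf(matrix_as_numpy)

m3 = np.full((3, 2), 4.0)
m4 = call_external_udf(mixed_code_udf, m3)
print(m4[0, 0])  # → 8.0
```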
I like the idea of writing everything in Python. My suggestion would be to make the API more Python-like, so that for a Python user it doesn't feel unnatural. I would like the syntax to be something like the following, but I don't know if it is technically possible:

```python
import numpy as np
import SystemML as sml
from pyspark.sql import SQLContext

mlContext = MLContext(sc)

m1 = sml.full(1, rows=3, cols=3)
m1 = m1 + 1

tmp = np.ones((3, 3))  # numpy matrix
m2 = sml.matrix(tmp)   # ability to convert numpy to systemml matrix (???)
m2 += 2

m3 = m1 * m2
m3 /= m1

m1 = sml.random.uniform(rows=3, cols=2)
m2 = sml.random.normal(rows=2, cols=3)
m4 = sml.dot(m1, m2)

mlContext.registerOutput([m3, m4])
mlContext.execute()
m3.getDF().show()
m4.getDF().show()
```
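For comparison, here is (roughly) the same computation written against plain NumPy, which is the mental model the proposed `sml` API mirrors:

```python
import numpy as np

m1 = np.full((3, 3), 1.0)   # analogous to sml.full(1, rows=3, cols=3)
m1 = m1 + 1

m2 = np.ones((3, 3))        # no conversion step needed in pure NumPy
m2 += 2

m3 = m1 * m2                # elementwise multiply
m3 /= m1                    # ((1+1) * 3) / (1+1) = 3.0 everywhere

m1 = np.random.uniform(size=(3, 2))
m2 = np.random.normal(size=(2, 3))
m4 = np.dot(m1, m2)

print(m4.shape)  # → (3, 3)
```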
Thanks @iyounus ... your suggestions are much appreciated. I have delivered minor fixes and conversion to/from NumPy arrays to the current prototype, and I have updated the code in my first comment to reflect this.
```python
        except AttributeError:
            pass
        self.numTabs = 0
        self.sqlCtx = sqlCtx
```
We can use self.sqlCtx = SQLContext.getOrCreate(sc), assuming a from pyspark.sql import SQLContext import statement up top. That way, we can get rid of the need to explicitly create and pass around a SQLContext object (which is going away anyways in 2.0).
Good point. I will update my PR. For some reason `self.sqlCtx = SQLContext.getOrCreate(sc)` is throwing an error. Instead, I am creating `self.sqlCtx = SQLContext(sc)` for now, and we can replace it with the cleaner call later.
@niketanpansare The latest update is awesome! I really like not having to use `with`.
@niketanpansare Also, as a side note, can you replace the tabs in the file with spaces so that we can use this with Python 3? I made a commit on Friday addressing this, with a couple of tabs accidentally added into the existing MLContext. Additionally, I made another slight change to
@dusenberrymw Replaced tabs with spaces 👍
```python
        self.df = df
        Matrix.systemmlVarID += 1
        self.ID = 'mVar' + str(Matrix.systemmlVarID)

    def getID(self):
```
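For readers following along, the ID scheme in the snippet above boils down to a class-level counter shared by all instances; a minimal self-contained sketch (simplified from the actual class):

```python
class Matrix:
    # class-level counter shared by all instances
    systemmlVarID = 0

    def __init__(self, df=None):
        self.df = df
        Matrix.systemmlVarID += 1
        self.ID = 'mVar' + str(Matrix.systemmlVarID)

    def getID(self):
        return self.ID

a, b = Matrix(), Matrix()
print(a.getID(), b.getID())  # → mVar1 mVar2
```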
Stylistic comment: can you space the methods out with a blank line in between, so that it is easier to read?
Done 👍
I am not a SystemML developer, but I am looking forward to testing this out. @niketanpansare I still get errors on using this with Python 3:

```
File "/home/manoj/incubator-systemml/SystemML.py", line 318
    try:
    ^
TabError: inconsistent use of tabs and spaces in indentation
```
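As an aside, the fix that was eventually applied (converting tabs to spaces) can be done mechanically with the standard `str.expandtabs`; a minimal sketch:

```python
# normalize a source string by expanding each tab to 4 spaces
src = "def f():\n\tx = 1\n\treturn x\n"
fixed = "\n".join(line.expandtabs(4) for line in src.split("\n"))
print("\t" in fixed)  # → False
```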
Oh, I used the version before the last commit was made. Please ignore my previous comment.
@MechCoder Can you please reply with your thoughts/preferences (as @dusenberrymw did in his first comment) on the questions in my first comment? Also, any additional comments about the high-level design as well as usage of the embedded Python API are welcome :)
@niketanpansare I really like the script in the PR description, especially now that the `with` blocks are done away with. I think this is a good starting point.
```python
        self.matrixOutput = []
        # ...

    def stop(self, vals):
```
Can this be `stop(self, *vals)` or `stop(self, vals=None)`? It is odd that I need to provide an argument to stop the MLContext.
Yeah I think it might also be interesting to think about removing the start(...) function, and renaming this stop(...) function to something like eval(...).
@MechCoder Good point. Will update the PR with your suggestion :)
@dusenberrymw The only reason I added start and stop is to specify the scope that SystemML will try to exploit, as in some cases it might have an adverse performance impact due to internal book-keeping. I will keep start for now, but will rename stop to eval 👍
One possibility is to start with a "fresh" scope on creation of the original MLContext object, and then "restart" the scope on every call to eval(...).
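That restart-on-eval idea could be sketched roughly as follows. All names here are illustrative, not the actual MLContext API:

```python
class MLContextSketch:
    def __init__(self):
        self._script = []          # "fresh" scope on creation

    def record(self, stmt):
        self._script.append(stmt)  # statements accumulate lazily

    def eval(self):
        dml = "\n".join(self._script)
        self._script = []          # "restart" the scope on every eval
        return dml                 # a real impl would hand this to the engine

ml = MLContextSketch()
ml.record("m1 = rand(rows=3, cols=3)")
ml.record("s1 = sum(m1)")
first = ml.eval()   # the accumulated script
second = ml.eval()  # → empty string: the scope was restarted
```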
I agree with @MechCoder that this is a great starting point. In general, we should really continue to focus on solid design for this API so that it is correctly Pythonic, and extremely well integrated with NumPy.
Is it appropriate to use SYSTEMML-451 as the locus for this work? It seems like the Python DSL is going to proceed more or less independently of work in Scala and Java, at least in the short term.
By the way, let's keep this open for a bit as a WIP to continue thinking carefully about the overall design and trying it out. I also want to look around at other distributed NumPy-like projects.
Updated the documentation: http://niketanpansare.github.io/incubator-systemml/beginners-guide-python.html |
Unless there are major concerns, I think we should merge this PR now and create new PRs for additional fixes. This is because any additional Python-related changes (e.g. improvements to MLContext, merging the Matrix and matrix classes, release-related changes (SystemML vs. systemml, pip installer), examples, etc.) are dependent on merging this PR.
```diff
@@ -1,482 +0,0 @@
-#!/usr/bin/python
```
I don't think we should remove this entire file in this PR, as it contains the old MLContext that was in the previous release. I would just remove the mllearn and mlpipeline stuff from it, as that was added after the last official release.
@niketanpansare I agree that we should get this PR in and then create a set of JIRAs/PRs to address a bunch of the other concerns. Before we merge though, I left comments for a few things that should be addressed now:
For the next set of JIRAs/PRs, we should aim to complete the following before the upcoming official release:
Overall, I'm excited for the direction the Python API is moving! 👍

Example of using `MLContext` and `dml`:

```python
from SystemML import MLContext, dml

rdd1 = sc.parallelize(["1.0,2.0", "3.0,4.0"])
sums = """
s1 = sum(m1)
m2 = m1 * 2
t = "whatever"
"""
ml = MLContext(sc)
script = dml(sums).input(m1=rdd1).out("s1", "m2", "t")
s1, m2, t = ml.execute(script).get("s1", "m2", "t")
s1, m2, t
```

Returns: `(10.0, Matrix, 'whatever')`

Then `m2.toDF()` returns `DataFrame[ID: double, C1: double, C2: double]`.

Additionally, the
+1 for all 4 points. Two comments on "Update the package from SystemML to systemml":
@dusenberrymw Updated mlcontext.py as per your suggestions. Please let me know if there are any more changes.
@niketanpansare Awesome, LGTM. Let's merge this, and create JIRAs + PRs to cover the remaining 4 items above, as well as the remaining set of miscellaneous comments throughout this PR. After those two steps, let's focus on the important aspect of expanding and hardening each of the APIs (matrix, mllearn, mlpipeline, MLContext), i.e. getting complete feature parity with NumPy matrices, etc. +1 for the documentation page you created too 👍 Also, when you merge this, can you append the line
- Added matrix class that supports lazy evaluation of elementary matrix operations.
- Updated documentation for Python users that explains usage of mllearn, matrix and mlcontext.
- Added a setup file for pip installer.

Closes #197
To make progress on the adoption front, I think we need to support an embedded Python DSL, so I have created this PR as an initial proposal. I have added a subset of builtin functions (namely full/rand.*/mlprint/dot), binary/unary operators for matrices, and parfor as an example language-level construct.
Please use this PR to facilitate further discussion on this topic. Here are some initial points:
a. Scikit-learn-like library. In this case, we can call the DML algorithms using MLContext.
b. External DSL approach: the data scientist writes her code in PyDML and then uses MLContext.
c. Embedded DSL approach.
a. Execute code with a context (as in this PR).
Pros: Simple and elegant push-down mechanism and no redundant computation.
Cons: Difficult to implement mixed code.
b. Add lazy data structures which will be executed only when certain actions are invoked on them.
a. No modification to the Python implementation, but slightly inelegant usage through functions (as in this PR). See the parfor invocation in the code below.
b. Modify the Python parser to support pushdown. This requires users to rebuild our version of Python from scratch.
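To make Option (2.b) above concrete, here is a toy sketch of a lazy matrix that records operations and only materializes when an action is invoked. The names and design are illustrative only, not the actual SystemML API:

```python
import numpy as np

class LazyMatrix:
    def __init__(self, thunk):
        self._thunk = thunk        # deferred computation

    def __add__(self, other):
        return LazyMatrix(lambda: self._thunk() + other._thunk())

    def dot(self, other):
        return LazyMatrix(lambda: self._thunk().dot(other._thunk()))

    def to_numpy(self):
        # the "action": only here does any computation actually run
        return self._thunk()

a = LazyMatrix(lambda: np.ones((2, 2)))
b = LazyMatrix(lambda: np.eye(2))
c = (a + b).dot(a)                 # nothing has been computed yet
print(c.to_numpy())                # → [[3. 3.] [3. 3.]]
```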
If, after the discussion, we decide to go in a different direction, I will delete this PR.
If we agree to go in this direction, I can create the following tasks on JIRA:
Please note: I am using the older MLContext approach, and we can update it to the newer MLContext once it is delivered.
An example script supported by this commit:
@mboehm7 @frreiss @bertholdreinwald @nakul02 @dusenberrymw @deroneriksson