[SPARK-13233][SQL] Python Dataset (basic version) #11347
Conversation
I'm quite confused about where to put the Python plans.
For the Scala ones, we have an object.scala under the org.apache.spark.sql.catalyst.plans.logical package which contains all encoder-related logical plans, and an objects.scala under the org.apache.spark.sql.execution package which contains all encoder-related physical plans.
For the Python ones, we currently put all of them under the org.apache.spark.sql.execution.python package, including logical plans, physical plans, rules, etc.
cc @rxin
org.apache.spark.sql.execution.python is probably OK, since there is no concept of Python in catalyst itself.
Test build #51878 has finished for PR 11347 at commit
Test build #51880 has finished for PR 11347 at commit
Test build #51924 has finished for PR 11347 at commit
retest this please
Test build #51925 has finished for PR 11347 at commit
Test build #51926 has finished for PR 11347 at commit
Test build #51975 has finished for PR 11347 at commit
Test build #52009 has finished for PR 11347 at commit
Test build #52018 has finished for PR 11347 at commit
cc @yhuai
cc @davies too
python/pyspark/sql/context.py
is there a test for backward compatibility?
@cloud-fan can we break this into multiple patches? I find the size a bit hard to review. Also, maybe let's not do the renaming for now, because it might make more sense in Python to have DataFrame be the main class and Dataset just be an alias (there is no practical difference in Python), since Python users are more familiar with data frames.
We should also add tests for the compatibility methods.
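For reference, a minimal sketch of what such a compatibility test might look like, assuming a unittest-style test case as in pyspark/sql/tests.py; the ReusedSQLTestCase base class, the sqlCtx fixture, and the Dataset alias checked at the end are assumptions, not the final API of this PR:

```python
# Hedged sketch of a backward-compatibility test. `ReusedSQLTestCase` and
# `self.sqlCtx` are assumed fixtures; the `Dataset` alias assertion reflects
# approach 1 (Dataset aliased to DataFrame) and may not match the final code.
from pyspark.sql import Row


class CompatibilityTests(ReusedSQLTestCase):

    def test_dataframe_methods_still_work(self):
        df = self.sqlCtx.createDataFrame([Row(key=1, value="a"), Row(key=2, value="b")])
        # Untyped operations should behave exactly as before this PR.
        self.assertEqual(df.select("key").count(), 2)
        self.assertEqual([r.key for r in df.sort("key").collect()], [1, 2])

    def test_dataset_is_alias_of_dataframe(self):
        from pyspark.sql import DataFrame, Dataset  # the Dataset alias is hypothetical
        self.assertIs(Dataset, DataFrame)
```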
Test build #52039 has finished for PR 11347 at commit
Test build #52041 has finished for PR 11347 at commit
Test build #52052 has finished for PR 11347 at commit
@cloud-fan Could you update the description of the PR to say which approach is taken in this PR? Thanks!
python/pyspark/sql/dataframe.py
# If the underlying Java DataFrame's output is pickled, the query engine
# doesn't know the real schema of the data and just keeps the pickled binary
# for each custom object (no batching), so we need to use a non-batched
# serializer here.
deserializer = PickleSerializer()
The overhead of PickleSerializer is pretty high; it will serialize the class for each row. Could you do some benchmarking to see what the difference is between non-batched and batched (both size and CPU time)?
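One way to get rough numbers, sketched below under the assumption that pyspark.serializers.PickleSerializer and BatchedSerializer behave as in the existing codebase; the row shape, row count, and batch size are arbitrary, and a real benchmark should also go through the Python/JVM round trip:

```python
# Micro-benchmark sketch: serialized size and CPU time for non-batched vs.
# batched pickling of the same records, using pyspark.serializers directly.
import io
import time

from pyspark.serializers import BatchedSerializer, PickleSerializer

rows = [{"id": i, "name": "user%d" % i} for i in range(100000)]


def measure(serializer, data):
    buf = io.BytesIO()
    start = time.time()
    serializer.dump_stream(iter(data), buf)   # write all records to the buffer
    return len(buf.getvalue()), time.time() - start


non_batched = measure(PickleSerializer(), rows)                        # one pickle per record
batched = measure(BatchedSerializer(PickleSerializer(), 1024), rows)   # 1024 records per pickle

print("non-batched: %d bytes, %.3f s" % non_batched)
print("batched:     %d bytes, %.3f s" % batched)
```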
In Scala it's clear that DataFrame is Dataset[Row]; some functions work with DataFrame and some may not, and the compiler can check the types. But in Python it's confusing to me: sometimes the record is a Row object, and sometimes the record is just an arbitrary object (for example, an int), especially when we create a new DataFrame. Before this PR, it was clear that a Python DataFrame always carries Rows with a known schema; for example, df.rdd returns an RDD of Rows.
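To illustrate the ambiguity described above, a hypothetical snippet; it assumes a sqlContext as in a PySpark shell, and that DataFrame.map behaves the way this PR proposes:

```python
# Hypothetical illustration: after a typed operation the records are no longer
# Row objects, so "what is a record?" depends on the history of the DataFrame.
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(key=1, value="a"), Row(key=2, value="b")])
df.rdd.first()    # Row(key=1, value=u'a') -- records are Rows with a known schema

mapped = df.map(lambda row: row.key)   # typed operation added by this PR (assumed semantics)
mapped.rdd.first()                     # 1 -- the record is now an arbitrary object (an int)
```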
Test build #52168 has finished for PR 11347 at commit
Hi @davies, I think this is a problem. We should find another way to overcome the issue that the binary data on the JVM side has the wrong row count, so that we can still use a batched serializer.
I think we already solved this problem for RDDs; you could take that as an example.
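For context, the RDD-side idea could be sketched roughly like this, assuming the existing pyspark.serializers classes; where exactly this would be wired into the DataFrame/Dataset code is not shown:

```python
# Sketch: keep batched pickling (cheap in size and CPU) and never trust the
# JVM-side row count; compute counts on the Python side after unpickling,
# the way RDD.count() effectively does via mapPartitions + sum.
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# Batched: the JVM only sees opaque byte blobs, one per batch of records.
deserializer = AutoBatchedSerializer(PickleSerializer())


def python_side_count(rdd):
    # The JVM knows only the number of batches, so count records in Python.
    return rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).sum()
```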
# planner should not crash without a join
broadcast(df1)._jdf.queryExecution().executedPlan()
def test_basic_typed_operations(self):
Maybe it will be easier to reason about test failures if we break it into multiple defs?
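Something along these lines, for instance; the base class, the self.ds fixture, and the exact typed methods shown are assumptions:

```python
# Hypothetical split of one big test into focused defs, so a failure points at
# the specific typed operation rather than at test_basic_typed_operations.
from pyspark.sql import Row


class TypedOperationTests(ReusedSQLTestCase):   # assumed base class providing self.sqlCtx

    def setUp(self):
        super(TypedOperationTests, self).setUp()
        self.ds = self.sqlCtx.createDataFrame([Row(key=i) for i in [1, 2, 3]])

    def test_map(self):
        self.assertEqual(self.ds.map(lambda row: row.key + 1).collect(), [2, 3, 4])

    def test_filter(self):
        self.assertEqual(self.ds.filter(lambda row: row.key > 1).count(), 2)

    def test_flat_map(self):
        self.assertEqual(self.ds.flatMap(lambda row: [row.key, row.key]).count(), 6)
```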
Will send a simpler version that just uses RDD.
What changes were proposed in this pull request?
This PR introduces a new API: the Python Dataset. Conceptually it's a combination of the Python DataFrame and the Python RDD; it supports both typed operations (e.g. map, flatMap, filter) and untyped operations (e.g. select, sort). This is a simpler version of #11117, without the aggregate part.
There are 3 ways (there can be more) to add the Dataset API to Python:
1. Extend DataFrame to support typed operations and alias it with Dataset.
2. Rename DataFrame to Dataset, and alias Dataset with DataFrame for compatibility (so conceptually we only have Dataset). We would also need to rename a lot of related stuff like createDataFrame, some documents, etc.
3. Keep DataFrame and make Dataset extend DataFrame. Typed operations will always return Dataset, structured operations will always return DataFrame, and Dataset has an extra API called applySchema to turn it into DataFrame.

This PR chooses the first one.
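A minimal sketch of what approach 1 could look like in pyspark/sql/dataframe.py; the helper name _create_pickled_dataframe is hypothetical, and the real change goes through the JVM-side python plans rather than plain Python:

```python
# Illustrative sketch only: typed operations are added to the existing
# DataFrame class and Dataset becomes an alias, so no user code breaks.
class DataFrame(object):

    # ... existing untyped operations: select, sort, groupBy, etc. ...

    def map(self, f):
        """Typed operation: apply f to every record; the resulting records are
        arbitrary pickled Python objects rather than Rows."""
        return _create_pickled_dataframe(self, lambda it: (f(x) for x in it))

    def flatMap(self, f):
        """Typed operation: apply f to every record and flatten the results."""
        return _create_pickled_dataframe(self, lambda it: (y for x in it for y in f(x)))


# Approach 1: Dataset is just an alias of DataFrame.
Dataset = DataFrame
```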
How was this patch tested?
New tests are added in pyspark/sql/tests.py.