[SPARK-24215][PySpark] Implement eager evaluation for DataFrame APIs in PySpark #21370
Conversation
Not sure who is the right reviewer, maybe @rdblue @gatorsmile?

Test build #90834 has finished for PR 21370 at commit

We will wait for the tests to be fixed first. @xuanyuanking, could you update the PR description to clarify which screenshot is "before" and which is "after"?
felixcheung left a comment:
This will need to escape the values to make sure the output is legal HTML too, right?
Why all these comment changes? Could you revert them?
Sorry for this, I'll revert the IDE changes.
python/pyspark/sql/dataframe.py
Outdated
Let's add all these to the documentation too.

Got it, I'll do it in the next commit. Thanks for the reminder.
When the output is truncated, does Jupyter handle that properly?

Or does that depend on the HTML output on L300?

Yes, the truncated string will be shown in the table row, controlled by spark.jupyter.default.truncate.
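To make the truncation behavior concrete, a toy Python sketch of show()-style cell truncation (the helper name is made up; the detail that the "..." suffix counts toward the limit is an assumption based on dataframe.show()'s behavior):

```python
def _truncate_cell(value, limit=20):
    # Cells longer than `limit` characters are cut and suffixed with "...",
    # with the suffix counting toward the limit (assumed, mirroring show()).
    s = str(value)
    return s if len(s) <= limit else s[:max(limit - 3, 0)] + "..."

print(_truncate_cell("a" * 30))  # -> 17 "a"s followed by "..."
```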
retest this please
docs/configuration.md
Outdated
Jupyter ..
Copy.
nit: Open -> Enable
docs/configuration.md
Outdated
HTML
Got it.
Yes, you're right; thanks for the guidance. The new patch handles the escaping and adds a new UT. @felixcheung
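For context, a minimal Python sketch of the kind of escaping being asked for when building the HTML table (html.escape and the helper are illustrative only; the PR's actual escaping lives in the Scala-side HTML generation):

```python
import html

def _build_html_row(cells, header=False):
    # Escape each cell so characters like <, > and & cannot break the markup.
    tag = "th" if header else "td"
    escaped = ["<{0}>{1}</{0}>".format(tag, html.escape(str(c))) for c in cells]
    return "<tr>%s</tr>\n" % "".join(escaped)

# A value containing markup is rendered literally, not interpreted:
print(_build_html_row(["id", "<b>name</b>"], header=True))
# <tr><th>id</th><th>&lt;b&gt;name&lt;/b&gt;</th></tr>
```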
We don't need .map(_.toString)?
We need this toString because, in the original logic, the appendRowString method handles both column names and data, and calls toString on the data part.
https://github.com/apache/spark/pull/21370/files/f2bb8f334631734869ddf5d8ef1eca1fa29d334a#diff-7a46f10c3cedbf013cf255564d9483cdL312

I know this is not your change, but rows is already Seq[Seq[String]] and row is Seq[String], so I think we can remove it.

I see, cell.toString has already been called here: https://github.com/apache/spark/pull/21370/files/f2bb8f334631734869ddf5d8ef1eca1fa29d334a#diff-7a46f10c3cedbf013cf255564d9483cdR271
Got it, I'll fix this in the next commit.
nit: add \n?

The "\n" is added in seperatedLine: https://github.com/apache/spark/pull/21370/files#diff-7a46f10c3cedbf013cf255564d9483cdR300

Hmm, the header looks okay, but won't the data section be one long line without \n? How about adding \n here and in the data section, and just using "" for seperatedLine?

Ah, I see your point. I'll add this in the next commit.

Thanks, done. feb5f4a.
ditto.
docs/configuration.md
Outdated
spark.jupyter.eagerEval.showRows or something?

Changed to spark.jupyter.eagerEval.showRows, thanks.
docs/configuration.md
Outdated
spark.jupyter.eagerEval.truncate or something?

Yep, changed to spark.jupyter.eagerEval.truncate.
python/pyspark/sql/dataframe.py
Outdated
What will be shown if spark.jupyter.eagerEval.enabled is false? Does it fall back to the original behavior automatically?

We need to return None if self._eager_eval is False.

What will be shown if spark.jupyter.eagerEval.enabled is false? Does it fall back to the original behavior automatically?

Yes, it will fall back to calling __repr__.

We need to return None if self._eager_eval is False.

Got it, that makes the code logic clearer.
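A minimal sketch of the fallback semantics being agreed on here (the class is a toy stand-in, not the PR's DataFrame):

```python
class EagerFrame:
    def __init__(self, rows, eager_eval):
        self._rows = rows
        self._eager_eval = eager_eval

    def __repr__(self):
        return "EagerFrame(%d rows)" % len(self._rows)

    def _repr_html_(self):
        # Returning None makes IPython/Jupyter fall back to __repr__.
        if not self._eager_eval:
            return None
        body = "".join("<tr><td>%s</td></tr>" % r for r in self._rows)
        return "<table>%s</table>" % body

print(EagerFrame([1, 2], eager_eval=False)._repr_html_())  # None -> fallback
print(EagerFrame([1, 2], eager_eval=True)._repr_html_())   # HTML table
```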
python/pyspark/sql/dataframe.py
Outdated
spark.jupyter.default.truncate?
My bad, sorry for this.
python/pyspark/sql/dataframe.py
Outdated
I guess we shouldn't hold these three values but define them as @property or read the conf each time in _repr_html_. Otherwise, we'll hit unexpected behavior, e.g.:
df = ...
spark.conf.set("spark.jupyter.eagerEval.enabled", True)
df
won't show the HTML.
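A sketch of the @property approach being suggested, so the conf is re-read on every access instead of being cached at construction time (the conf name matches this stage of the review; getConf usage mirrors the PR, the class is a stub):

```python
class Frame:
    def __init__(self, sql_ctx):
        self.sql_ctx = sql_ctx

    @property
    def _eager_eval(self):
        # Re-reading on each access means a later
        # spark.conf.set("spark.jupyter.eagerEval.enabled", True)
        # is picked up the next time _repr_html_ runs.
        return self.sql_ctx.getConf(
            "spark.jupyter.eagerEval.enabled", "false").lower() == "true"
```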
docs/configuration.md
Outdated
true instead of yes?
true is better since the default value is false.
Got it, thanks.
Test build #90872 has finished for PR 21370 at commit

Test build #90871 has finished for PR 21370 at commit

So one thing we might want to take a look at is application/vnd.dataresource+json for tables in the notebooks (see https://github.com/nteract/improved-spark-viz ).
docs/configuration.md
Outdated
nit: Open -> Enable
docs/configuration.md
Outdated
Do we usually include a JIRA ticket in user-facing documentation like this? I think we should just briefly describe the feature here.

Yeah, I felt the same, but there are a few similar instances on this page already.
docs/configuration.md
Outdated
true is better since the default value is false.
Hmm, should we do this HTML thing on the Python side?

We can do it on the Python side; I implemented it on the Scala side mainly to reuse the code and logic of show(). Maybe it is more natural, when showing a df as HTML, to call showString.

This should not be done in Dataset.scala.

Can we refactor showString to make it reusable for both show and repr_html?

+1 for reusing relevant parts of show.

Thanks for the guidance, I will do this in the next commit.

@viirya @gatorsmile @rdblue Sorry for the late commit; the refactor is done in 94f3414. I spent some time testing and implementing the transformation of rows between Python and Scala.
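The refactor idea, restated as a small Python sketch: split row collection/stringification from formatting so the console and HTML renderers share one core (names and shapes here are illustrative, not the actual Dataset.scala change):

```python
def get_rows(header, rows, num_rows=20, truncate=20):
    # Shared core: header plus at most num_rows rows, stringified and truncated.
    def cut(v):
        s = str(v)
        return s if truncate <= 0 or len(s) <= truncate else s[:max(truncate - 3, 0)] + "..."
    return [[cut(c) for c in header]] + [[cut(c) for c in r] for r in rows[:num_rows]]

def show_text(table):
    # Console rendering, roughly what show() does with the shared rows.
    return "\n".join("|".join(row) for row in table)

def repr_html(table):
    # HTML rendering over the same shared rows.
    head, *data = table
    out = "<tr>%s</tr>\n" % "".join("<th>%s</th>" % c for c in head)
    out += "".join("<tr>%s</tr>\n" % "".join("<td>%s</td>" % c for c in row) for row in data)
    return "<table>%s</table>" % out

table = get_rows(["id", "name"], [(1, "a"), (2, "b")])
print(show_text(table))
print(repr_html(table))
```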
Thanks for all the reviewers' comments; I addressed them all in this commit. Please have a look.
python/pyspark/sql/dataframe.py
Outdated
Add a simple test for _repr_html too?
No problem, I'll add it to SQLTests in the next commit.

Thanks, done. feb5f4a.
docs/configuration.md
Outdated
Should we add sql? E.g., spark.sql.jupyter.eagerEval.enabled? Because this is just for SQL Dataset. @HyukjinKwon @ueshin
Oh, yes, I'd prefer to add sql.
Got it, I'll fix it in the next commit.

Is this Jupyter-specific? Can we make it more general?

BTW, the config flag isn't Jupyter-specific.

Maybe rename to spark.sql.repl.eagerEval.enabled instead? It should also be documented that this is a hint. There's no guarantee that the REPL you're using supports eager evaluation.
Thanks, done. feb5f4a.
Test build #90896 has finished for PR 21370 at commit
python/pyspark/sql/dataframe.py
Outdated
Add the function description above this line.
Thanks, done. feb5f4a.
Can we also do something a bit more generic that works for non-Jupyter notebooks as well? For example, in IPython or just the plain Python REPL. This is in addition to the _repr_html_ stuff, which is really cool and useful.

@rxin,
I'm sorry, but I don't think we need the middle \n, it's okay with only the last one.
"<tr><th>", "</th><th>", "</th></tr>\n"
Got it, I'll change it.
I changed the format in the Python _repr_html_ in 94f3414.
ditto.
Test build #91022 has finished for PR 21370 at commit
Can we accept
I tested offline and it works in both the Python shell and Jupyter. If we agree on this approach, I'll add this support in the next commit together with the refactor of showString in the Scala Dataset.
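A rough sketch of what "works in the plain Python shell" means here: __repr__ returns the rendered top rows when eager evaluation is on, instead of the default object description (toy class, not the PR code):

```python
class ReplFrame:
    def __init__(self, rows, eager_eval=True, max_num_rows=20):
        self._rows = rows
        self._eager_eval = eager_eval
        self._max_num_rows = max_num_rows

    def __repr__(self):
        if not self._eager_eval:
            return object.__repr__(self)  # default "<ReplFrame object at 0x...>"
        # Render the top rows, show()-style, so typing `df` in any REPL
        # prints data instead of an object address.
        return "\n".join(str(r) for r in self._rows[:self._max_num_rows])

print(ReplFrame([(1, "a"), (2, "b")]))
```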
python/pyspark/sql/dataframe.py
Outdated
repr -> repl?
I think it works either way. REPL is better in my opinion because these settings should (ideally) apply when using any REPL.
Thanks, changed to REPL in 94f3414.
python/pyspark/sql/tests.py
Outdated
We don't have to compile the pattern each time here since it's not going to be reused. You could just put this into re.sub I believe.
Thanks! Fixed in 94f3414.
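For illustration, the simplification being suggested (the whitespace pattern is a stand-in, not the test's actual regex):

```python
import re

s = "<td>  padded  </td>"

# Compiling pays off only when the pattern is reused many times:
pat = re.compile(r"\s+")
once = pat.sub(" ", s)

# For a one-off substitution, pass the pattern straight to re.sub;
# the re module also caches recently compiled patterns internally.
same = re.sub(r"\s+", " ", s)
assert once == same
```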
docs/configuration.md
Outdated
I don't think it is obvious what "showRows" does. I would assume that it is a boolean, but it is a limit instead. What about calling this "limit" or "numRows"?
maxNumRows
Thanks, changed it in 94f3414.
docs/configuration.md
Outdated
What is the difference between this and showRows? Why are there two properties?
Maybe he wants to follow what dataframe.show does, which truncates each cell to a number of characters. That's useful for console output, but not so much for notebooks.

Yep, I just want to keep the same behavior as dataframe.show.
That's useful for console output, but not so much for notebooks.
Notebooks aren't afraid of too many characters within a cell, so should I just delete this?
python/pyspark/sql/dataframe.py
Outdated
I agree that it would be better to respect spark.sql.repr.eagerEval.enabled here as well.
Thanks for your reply; this is implemented in 94f3414.
python/pyspark/sql/dataframe.py
Outdated
use named arguments for boolean flags
Thanks, fixed in 94f3414.
Test build #91449 has finished for PR 21370 at commit
Merged to master

Thanks @HyukjinKwon and all reviewers.
<td>20</td>
<td>
  Default number of rows in eager evaluation output HTML table generated by <code>_repr_html_</code> or plain text,
  this only take effect when <code>spark.sql.repl.eagerEval.enabled</code> is set to true.
take -> takes
Got it, thanks.
<td>20</td>
<td>
  Default number of truncate in eager evaluation output HTML table generated by <code>_repr_html_</code> or
  plain text, this only take effect when <code>spark.sql.repl.eagerEval.enabled</code> set to true.
take -> takes
Got it, thanks; I'll address it in the next follow-up PR.
@xuanyuanking @HyukjinKwon Sorry for the delay. Super busy during the week of Spark Summit. Will carefully review this PR today or tomorrow.
finally:
    shutil.rmtree(path)

def test_repr_html(self):
This function only covers the most basic positive case. We also need to add more test cases, for example the results when spark.sql.repl.eagerEval.enabled is set to false.
Got it, the follow-up PR will enhance this test.
    truncate: Int,
    vertical: Boolean): Array[Any] = {
  EvaluatePython.registerPicklers()
  val numRows = _numRows.max(0).min(Int.MaxValue - 1)
This should also be part of the conf description.
Got it, will be fixed in another PR.
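Restating what the quoted clamp does, as a hedged Python equivalent (INT_MAX mirrors Scala's Int.MaxValue; the "numRows + 1" rationale is an assumption about why the cap sits one below the maximum):

```python
INT_MAX = 2**31 - 1

def clamp_num_rows(n):
    # Negative requests become 0; the upper bound stays one below
    # Int.MaxValue, presumably so an internal numRows + 1 cannot overflow.
    return min(max(n, 0), INT_MAX - 1)

assert clamp_num_rows(-5) == 0
assert clamp_num_rows(2**31 - 1) == 2**31 - 2
```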
self._support_repr_html = True
if self._eager_eval:
    max_num_rows = max(self._max_num_rows, 0)
    vertical = False
Any discussion about this?
Yes, the earlier discussion is linked below:
#21370 (comment)
and
#21370 (comment)
We want named arguments for boolean flags here, but named arguments can't be used when Python calls a _jdf function, so we use this workaround.
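A toy illustration of the constraint described above: the bridged JVM call behaves like a positional-only call, so the flag gets a named local instead of a keyword argument (the stand-in function below is not the real _jdf method):

```python
def get_rows_to_python(max_num_rows, truncate, vertical):
    """Stand-in for the JVM-bridged method, treated here as accepting
    positional arguments only."""
    return (max_num_rows, truncate, vertical)

# Instead of get_rows_to_python(20, 20, vertical=False) -- which the real
# bridged call would not accept -- bind the boolean to a named local so
# the call site still documents what the flag means:
vertical = False
print(get_rows_to_python(20, 20, vertical))
```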
  from JVM to Python worker for every task.
</td>
</tr>
<tr>
All the SQL configurations should follow what we did in the section of Spark SQL https://spark.apache.org/docs/latest/configuration.html#spark-sql.
These confs are not part of spark.sql("SET -v").show(numRows = 200, truncate = false).
Got it, I'll add these configurations into SQLConf.scala in the follow-up PR.
| """Returns true if the eager evaluation enabled. | ||
| """ | ||
| return self.sql_ctx.getConf( | ||
| "spark.sql.repl.eagerEval.enabled", "false").lower() == "true" |
Is it possible to avoid hard-coding these conf key values? cc @ueshin @HyukjinKwon
Probably we should access the SQLConf object. 1. I agree with not hardcoding it in general, but 2. IMHO I want to avoid Py4J JVM accesses in the tests because, to my knowledge, that tends to make the tests flakier (unlike on the Scala or Java side).
Maybe we should take a look at this hardcoding if we see more occurrences next time.
In the ongoing release, a nice-to-have refactoring is to move all the core confs into a single file, just like what we did with the Spark SQL conf: default values, boundary checking, types and descriptions. Thus, in PySpark, it would be better to start doing that now.
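A small sketch of the non-hardcoded direction being discussed: keep the key and default in one place and reference them from the accessor (constant and helper names are made up, not the eventual SQLConf plumbing):

```python
# Single source of truth for the conf key and its default value.
REPL_EAGER_EVAL_ENABLED = ("spark.sql.repl.eagerEval.enabled", "false")

def eager_eval_enabled(sql_ctx):
    key, default = REPL_EAGER_EVAL_ENABLED
    return sql_ctx.getConf(key, default).lower() == "true"
```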
  }
}

private[sql] def getRowsToPython(
In DataFrameSuite, we have multiple test cases for showString but not for getRows, which is introduced in this PR.
We also need unit test cases for getRowsToPython.
Got it, the follow-up PR I'm working on will add more tests for getRows and getRowsToPython.
@xuanyuanking Thanks for your contributions! Test coverage is most critical when we refactor existing code and add new features. When you submit new PRs in the future, could you also improve this part?
<td><code>spark.sql.repl.eagerEval.enabled</code></td>
<td>false</td>
<td>
  Enable eager evaluation or not. If true and the REPL you are using supports eager evaluation,
Just a question. When the REPL does not support eager evaluation, could we do anything better instead of silently ignoring the user inputs?
Maybe it's hard to do better, because from the Dataset side we can hardly detect whether the REPL supports eager evaluation.
I don't completely follow, actually - what makes a REPL "not support eager evaluation"?
In fact, this "eager evaluation" is just the object-rendering support built into the REPL, notebook, etc. (like print); it's not really designed to be "eager evaluation".
That is true. We are not adding eager execution to our Dataset/DataFrame; we just rely on the REPL to trigger it. We need to update the parameter description to emphasize this.
return int(self.sql_ctx.getConf(
    "spark.sql.repl.eagerEval.truncate", "20"))

def __repr__(self):
This PR also changed __repr__, so we need to update the PR title and description. A better PR title would be something like "Implement eager evaluation for DataFrame APIs in PySpark".
Done for this, and also the PR description per your comment. Sorry about that; I'll pay more attention next time the implementation changes during PR review. Is there any way to change the committed git log?
Of course, and sorry for the late reply; I'll do this in a follow-up PR and answer all of Xiao's questions tonight. Thanks for all your comments.
@xuanyuanking Just for your reference, for this PR, the PR description can be improved to something like
val toJava: (Any) => Any = EvaluatePython.toJava(_, ArrayType(ArrayType(StringType)))
val iter: Iterator[Array[Byte]] = new SerDeUtil.AutoBatchedPickler(
  rows.iterator.map(toJava))
PythonRDD.serveIterator(iter, "serve-GetRows")
PythonRDD.serveIterator(iter, "serve-GetRows") returns Int, but the return type of getRowsToPython is Array[Any]. How does it work? cc @xuanyuanking @HyukjinKwon
I think we return Array[Any] for PythonRDD.serveIterator too:
def serveIterator(items: Iterator[_], threadName: String): Array[Any] = {
Did I maybe miss something?
Same answer as @HyukjinKwon about the return type; the exact return type we need here is actually Array[Array[String]], as defined by the toJava function.
The return-type changes were made in the commit "[PYSPARK] Update py4j to version 0.10.7...". No PR was opened...
Yup ..
Re the py4j commit - there's a good reason for it, @gatorsmile.
Not sure if the change to the return type is required with the py4j change, though.
I think we'd better not talk about this here, though.



What changes were proposed in this pull request?
This PR is to add eager execution into the __repr__ and _repr_html_ of the DataFrame APIs in PySpark. When eager evaluation is enabled, _repr_html_ returns a rich HTML version of the top-K rows of the DataFrame output. If _repr_html_ is not called by the REPL, __repr__ will return the plain text of the top-K rows.

This PR adds three new external SQL confs for controlling the behavior of eager evaluation:

spark.sql.repl.eagerEval.enabled: Enables eager evaluation or not. When true, the top K rows of the Dataset will be displayed if and only if the REPL supports eager evaluation. Currently, eager evaluation is only supported in PySpark. For notebooks like Jupyter, the HTML table (generated by _repr_html_) will be returned. For the plain Python REPL, the returned outputs are formatted like dataframe.show().

spark.sql.repl.eagerEval.maxNumRows: The max number of rows that are returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.

spark.sql.repl.eagerEval.truncate: The max number of characters for each cell that is returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.

How was this patch tested?
New UT in DataFrameSuite and manual tests in Jupyter. Some screenshots below.
After:

Before:
