Conversation

@xuanyuanking xuanyuanking commented May 19, 2018

What changes were proposed in this pull request?

This PR is to add eager execution into the __repr__ and _repr_html_ of the DataFrame APIs in PySpark. When eager evaluation is enabled, _repr_html_ returns a rich HTML version of the top-K rows of the DataFrame output. If _repr_html_ is not called by the REPL, __repr__ will return the plain text of the top-K rows.

This PR adds three new external SQL confs for controlling the behavior of eager evaluation:

  • spark.sql.repl.eagerEval.enabled: Enables eager evaluation or not. When true, the top K rows of Dataset will be displayed if and only if the REPL supports the eager evaluation. Currently, the eager evaluation is only supported in PySpark. For the notebooks like Jupyter, the HTML table (generated by _repr_html_) will be returned. For plain Python REPL, the returned outputs are formatted like dataframe.show().

  • spark.sql.repl.eagerEval.maxNumRows: The max number of rows that are returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.

  • spark.sql.repl.eagerEval.truncate: The max number of characters for each cell that is returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.
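The two limits interact the same way dataframe.show() does. A minimal, Spark-free sketch of the row and cell limiting (the function name and the "..." suffix convention are assumptions modeled on show()'s behavior, not the actual implementation):

```python
def preview(rows, max_num_rows=20, truncate=20):
    """Return at most max_num_rows rows, cutting each cell to truncate chars.

    Mimics dataframe.show()-style truncation: cells longer than the limit
    are shortened and suffixed with "..." so the result stays truncate wide.
    """
    out = []
    for row in rows[:max_num_rows]:
        cells = []
        for cell in row:
            s = str(cell)
            if truncate > 3 and len(s) > truncate:
                s = s[:truncate - 3] + "..."
            cells.append(s)
        out.append(cells)
    return out
```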

How was this patch tested?

New unit tests in DataFrameSuite and manual testing in Jupyter. Screenshots below.

After: (screenshot)

Before: (screenshot)

@xuanyuanking
Member Author

Not sure who the right reviewer is, maybe @rdblue @gatorsmile?
Could you help check whether this is the right implementation for the discussion on the dev list?

@SparkQA

SparkQA commented May 19, 2018

Test build #90834 has finished for PR 21370 at commit ebc0b11.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

@HyukjinKwon @holdenk @ueshin

@felixcheung
Member

felixcheung commented May 19, 2018

we will wait for the tests to be fixed first.

@xuanyuanking could you update the PR description to clarify which screenshot is "before" which is "after"?

@felixcheung felixcheung left a comment

this will need to escape the values to make sure it is legal HTML too, right?

Member

why all these comment changes? could you revert them

Member Author

Sorry for this, I'll revert the IDE changes.

Member

let's add all these to documentation too

Member Author

Got it. Do it in next commit. Thanks for reminding.

Member

when the output is truncated, does jupyter handle that properly?

Member

or does that depend on the html output on L300?

Member Author

Yes, the truncated string will be shown in the table row, controlled by spark.jupyter.default.truncate

@HyukjinKwon
Member

retest this please

@HyukjinKwon HyukjinKwon May 21, 2018

Jupyter ..

Member Author

Copy.

Member

nit: Open -> Enable

Member

HTML

Member Author

Got it.

@xuanyuanking
Member Author

xuanyuanking commented May 21, 2018

this will need to escape the values to make sure it is legal html too right?

Yes, you're right, thanks for the guidance. The new patch handles the escaping and adds a new UT. @felixcheung
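Python's standard library already covers the escaping being discussed; a small illustration (not the actual patch code):

```python
import html

# A cell value containing characters that are unsafe to emit raw into HTML
cell = '<b>alice & bob</b>'
escaped = html.escape(cell)
print(escaped)  # &lt;b&gt;alice &amp; bob&lt;/b&gt;
```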

Member

We don't need .map(_.toString)?

@xuanyuanking xuanyuanking May 21, 2018

We need this toString because, in the original logic, the appendRowString method handles both the column names and the data, and calls toString on the data part.
https://github.com/apache/spark/pull/21370/files/f2bb8f334631734869ddf5d8ef1eca1fa29d334a#diff-7a46f10c3cedbf013cf255564d9483cdL312

Member

I know this is not your change, but the rows is already Seq[Seq[String]] and the row is Seq[String], so I think we can remove it.

Member

nit: add \n?

Member

Hmm, the header looks okay, but won't the data section be one long line without \n? How about adding \n here and in the data section, and just using "" for the seperatedLine?

Member Author

Ah, I understand your concern. I'll add this in the next commit.

Member Author

Thanks, done. feb5f4a.

Member

ditto.

Member

spark.jupyter.eagerEval.showRows or something?

Member Author

Changed to spark.jupyter.eagerEval.showRows, thanks.

Member

spark.jupyter.eagerEval.truncate or something?

Member Author

Yep, change to spark.jupyter.eagerEval.truncate

Member

What will be shown if spark.jupyter.eagerEval.enabled is false? Fall back to the original automatically?

Member

We need to return None if self._eager_eval is False.

@xuanyuanking xuanyuanking May 21, 2018

What will be shown if spark.jupyter.eagerEval.enabled is False? Fallback the original automatically?

Yes, it will fall back to calling __repr__.

We need to return None if self._eager_eval is False.

Got it, that makes the code logic clearer.

Member

spark.jupyter.default.truncate?

Member Author

My bad, sorry for this.

@ueshin ueshin May 21, 2018

I guess we shouldn't cache these three values, but define them as @property or read them each time in _repr_html_. Otherwise, we'll hit unexpected behavior, e.g.:

df = ...
spark.conf.set("spark.jupyter.eagerEval.enabled", True)
df

won't show the html.
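A sketch of the @property approach being suggested, with a stand-in conf object (class names are hypothetical; the conf key is the one this PR ended up with):

```python
class FakeConf:
    """Stand-in for SQLContext.getConf, for illustration only."""
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default):
        return self._store.get(key, default)


class Frame:
    def __init__(self, conf):
        self._conf = conf  # hold the conf object, not the conf values

    @property
    def _eager_eval(self):
        # re-read the conf on every access, so later conf.set() calls are seen
        value = self._conf.get("spark.sql.repl.eagerEval.enabled", "false")
        return value.lower() == "true"


conf = FakeConf()
df = Frame(conf)          # created before the conf is set
conf.set("spark.sql.repl.eagerEval.enabled", "true")
print(df._eager_eval)     # True: the property re-reads the conf
```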

Member Author

Yep, I'll fix it in the next commit.

Member

true instead of yes?

Member

true is better since the default value is false.

Member Author

Got it, thanks.

@SparkQA

SparkQA commented May 21, 2018

Test build #90872 has finished for PR 21370 at commit f2bb8f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2018

Test build #90871 has finished for PR 21370 at commit ebc0b11.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented May 21, 2018

So one thing we might want to take a look at is application/vnd.dataresource+json for tables in the notebooks (see https://github.com/nteract/improved-spark-viz ).

Member

nit: Open -> Enable

@viirya viirya May 21, 2018

Is it usual to include a JIRA ticket in a user document like this? I think we should just briefly describe the feature here.

Member

Yeah, I felt the same way, but there are already a few similar instances on this page.

Member

true is better since the default value is false.

@viirya viirya May 21, 2018

hmm, should we do this HTML generation on the Python side?

Member Author

We could do this on the Python side. I implemented it on the Scala side mainly to reuse the code and logic of show(); it seems more natural for showing a DataFrame as HTML to call showString.

Member

This should not be done in Dataset.scala.

Member

Can we refactor showString to make it reusable for both show and repr_html?

Contributor

+1 for reusing relevant parts of show.

Member Author

Thanks for the guidance, I will do this in the next commit.

Member Author

@viirya @gatorsmile @rdblue Sorry for the late commit; the refactor is done in 94f3414. I spent some time testing and implementing the transformation of rows between Python and Scala.

@xuanyuanking
Member Author

Thanks for all the reviewers' comments; I addressed them all in this commit. Please have a look.

Member

Add a simple test for _repr_html too?

Member Author

No problem, I'll add it to SQLTests in the next commit.

Member Author

Thanks, done. feb5f4a.

Member

Should we add sql? E.g., spark.sql.jupyter.eagerEval.enabled? Because this is just for SQL Dataset. @HyukjinKwon @ueshin

Member

Oh, yes, I'd prefer to add sql.

Member Author

Got it, fix it in next commit.

Member

Is this Jupyter specific? Can we make it more general?

Contributor

btw the config flag isn't jupyter specific.

Contributor

Maybe rename to spark.sql.repl.eagerEval.enabled instead? It should also be documented that this is a hint. There's no guarantee that the repl you're using supports eager evaluation.

Member Author

Thanks, done. feb5f4a.

@SparkQA

SparkQA commented May 21, 2018

Test build #90896 has finished for PR 21370 at commit a798cf2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile May 21, 2018

Add the function descriptions above this line,

Member Author

Thanks, done. feb5f4a.

@rxin
Contributor

rxin commented May 21, 2018

Can we also do something a bit more generic that works for non-Jupyter notebooks as well? For example, in IPython or just plain Python REPL. This is in addition to the repl_html stuff which is really cool and useful.

@rdblue
Contributor

rdblue commented May 21, 2018

@rxin, __repr__ is the equivalent for ipython and the python REPL. _repr_html_ is the convention used by jupyter to replicate __repr__ in notebooks with HTML output.
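The convention @rdblue describes can be shown with a plain Python class: Jupyter calls _repr_html_ when it exists, while IPython and the plain REPL use __repr__ (the class here is purely illustrative):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        # used by the plain Python REPL and IPython
        return "Point(x=%r, y=%r)" % (self.x, self.y)

    def _repr_html_(self):
        # used by Jupyter notebooks when rendering the object
        return "<table><tr><td>%s</td><td>%s</td></tr></table>" % (self.x, self.y)


p = Point(1, 2)
print(repr(p))          # Point(x=1, y=2)
print(p._repr_html_())  # <table><tr><td>1</td><td>2</td></tr></table>
```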

Member

I'm sorry, but I don't think we need the middle \n, it's okay with only the last one.
"<tr><th>", "</th><th>", "</th></tr>\n"

Member Author

Got it, I'll change it.

Member Author

I changed the format in the Python _repr_html_ in 94f3414.

Member

ditto.

@SparkQA

SparkQA commented May 23, 2018

Test build #91022 has finished for PR 21370 at commit feb5f4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Can we also do something a bit more generic that works for non-Jupyter notebooks as well?

Can we accept spark.sql.repl.eagerEval.enabled to control both __repr__ and _repr_html_?
The behavior would be as follows:

  1. If _repr_html_ is not supported and eagerEval.enabled is on, just call something like show, which triggers take internally.
  2. If _repr_html_ is supported, use the HTML output. (A small trick is needed here: we should add a variable to the Python DataFrame to track whether _repr_html_ was called; otherwise, in this mode both _repr_html_ and __repr__ would call showString.)

I tested offline and it works in both the plain Python shell and Jupyter. If we agree on this approach, I'll add the support in the next commit, together with the refactor of showString in the Scala Dataset.

(screenshots: plain Python shell and Jupyter)
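The "small trick" described above can be sketched as follows (a simplified model of the dispatch, not the actual PySpark code; the rendered strings are placeholders):

```python
class EagerFrame:
    def __init__(self, eager_eval=True):
        self._eager_eval = eager_eval
        self._support_repr_html = False

    def __repr__(self):
        if not self._support_repr_html and self._eager_eval:
            # plain REPL: nothing ever called _repr_html_, so render text here
            return "+---+\n|col|\n+---+"  # placeholder for showString output
        return "EagerFrame[...]"

    def _repr_html_(self):
        # remember that this front end supports HTML rendering
        self._support_repr_html = True
        if self._eager_eval:
            return "<table>...</table>"  # placeholder for the HTML table
        return None  # Jupyter falls back to __repr__ when None is returned


df = EagerFrame()
print(repr(df))   # plain REPL path: text preview
df._repr_html_()  # Jupyter path marks HTML support
print(repr(df))   # now __repr__ returns the short form
```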

Member

repr -> repl?

Contributor

I think it works either way. REPL is better in my opinion because these settings should (ideally) apply when using any REPL.

Member Author

Thanks, change to REPL in 94f3414.

Member

We don't have to compile the pattern each time here since it's not going to be reused. You could just put this into re.sub I believe.
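What the comment points out: re.sub accepts the pattern string directly (the re module caches compiled patterns internally), so an explicit re.compile is unnecessary for a one-off substitution:

```python
import re

# pattern passed inline; no separate re.compile step needed
collapsed = re.sub(r"\s+", " ", "a   b\t\nc")
print(collapsed)  # a b c
```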

Member Author

Thanks! Fix it in 94f3414.

Contributor

I don't think it is obvious what "showRows" does. I would assume that it is a boolean, but it is a limit instead. What about calling this "limit" or "numRows"?

Contributor

maxNumRows

Member Author

Thanks, change it in 94f3414.

Contributor

What is the difference between this and showRows? Why are there two properties?

Contributor

maybe he wants to follow what dataframe.show does, which truncates num characters within a cell. That's useful for console output, but not so much for notebooks.

@xuanyuanking xuanyuanking May 27, 2018

Yep, I just want to keep the same behavior as dataframe.show.

That's useful for console output, but not so much for notebooks.

Notebooks aren't bothered by too many characters within a cell, so should I just delete this?

Contributor

I agree that it would be better to respect spark.sql.repr.eagerEval.enabled here as well.

Member Author

Thanks for your reply; this is implemented in 94f3414.

Contributor

use named arguments for boolean flags
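The point of the review nit, in a generic example (function and parameter names are hypothetical, loosely modeled on the getRows signature):

```python
def get_rows(num_rows, truncate, vertical):
    """Toy stand-in for a function that takes a boolean flag."""
    return {"num_rows": num_rows, "truncate": truncate, "vertical": vertical}


# Hard to tell what the bare boolean means at the call site:
get_rows(20, 20, False)

# Naming the boolean flag makes the call self-documenting:
get_rows(20, 20, vertical=False)
```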

Member Author

Thanks, fix in 94f3414.

@SparkQA

SparkQA commented Jun 4, 2018

Test build #91449 has finished for PR 21370 at commit 597b8d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master

@asfgit asfgit closed this in dbb4d83 Jun 5, 2018
@xuanyuanking
Member Author

Thanks @HyukjinKwon and all reviewers.

<td>20</td>
<td>
Default number of rows in eager evaluation output HTML table generated by <code>_repr_html_</code> or plain text,
this only take effect when <code>spark.sql.repl.eagerEval.enabled</code> is set to true.
Member

take -> takes

Member Author

Got it, thanks.

<td>20</td>
<td>
Default number of truncate in eager evaluation output HTML table generated by <code>_repr_html_</code> or
plain text, this only take effect when <code>spark.sql.repl.eagerEval.enabled</code> set to true.
Member

take -> takes

Member Author

Got it, thanks; I'll address it in the next follow-up PR.

@gatorsmile
Member

@xuanyuanking @HyukjinKwon Sorry for the delay. Super busy in the week of Spark summit. Will carefully review this PR today or tomorrow.

finally:
shutil.rmtree(path)

def test_repr_html(self):
Member

This function only covers the most basic positive case. We also need to add more test cases, for example the results when spark.sql.repl.eagerEval.enabled is set to false.

Member Author

Got it, the follow up pr will enhance this test.

truncate: Int,
vertical: Boolean): Array[Any] = {
EvaluatePython.registerPicklers()
val numRows = _numRows.max(0).min(Int.MaxValue - 1)
Member

This should be also part of the conf description.

Member Author

Got it, will be fixed in another pr.

self._support_repr_html = True
if self._eager_eval:
max_num_rows = max(self._max_num_rows, 0)
vertical = False
Member

Any discussion about this?

Member Author

Yes, the earlier discussion is linked below:
#21370 (comment)
and
#21370 (comment)
We want named arguments for the boolean flags here, but named arguments can't be used when Python calls the _jdf function, so we do the workaround like this.

from JVM to Python worker for every task.
</td>
</tr>
<tr>
Member

All the SQL configurations should follow what we did in the section of Spark SQL https://spark.apache.org/docs/latest/configuration.html#spark-sql.

Member

These confs are not part of spark.sql("SET -v").show(numRows = 200, truncate = false).

Member Author

Got it, I'll add these configurations to SQLConf.scala in the follow-up PR.

"""Returns true if the eager evaluation enabled.
"""
return self.sql_ctx.getConf(
"spark.sql.repl.eagerEval.enabled", "false").lower() == "true"
Member

Is that possible we can avoid hard-coding these conf key values? cc @ueshin @HyukjinKwon

Member

Probably we should access the SQLConf object. 1. I agree with not hardcoding it in general, but 2. IMHO I want to avoid Py4J JVM accesses in tests, because those tend to make tests flakier (unlike on the Scala or Java side).

Maybe we should take a look at this hardcoding if we see more occurrences next time.

Member

In the ongoing release, a nice-to-have refactoring is to move all the Core Confs into a single file just like what we did in Spark SQL Conf. Default values, boundary checking, types and descriptions. Thus, in PySpark, it would be better to do it starting from now.
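The centralization being suggested resembles what SQLConf does on the Scala side; a minimal Python sketch of the idea (the ConfEntry class and constant names are hypothetical):

```python
class ConfEntry:
    """One place to define a conf's key, type, and default value."""
    def __init__(self, key, default, value_type=str):
        self.key = key
        self.default = default
        self.value_type = value_type

    def read(self, get_conf):
        # get_conf is any callable with a (key, default) signature
        raw = get_conf(self.key, str(self.default))
        if self.value_type is bool:
            return raw.lower() == "true"
        return self.value_type(raw)


# definitions live together instead of being hardcoded at call sites
REPL_EAGER_EVAL_ENABLED = ConfEntry("spark.sql.repl.eagerEval.enabled", False, bool)
REPL_EAGER_EVAL_MAX_NUM_ROWS = ConfEntry("spark.sql.repl.eagerEval.maxNumRows", 20, int)

conf = {"spark.sql.repl.eagerEval.enabled": "true"}
print(REPL_EAGER_EVAL_ENABLED.read(conf.get))       # True
print(REPL_EAGER_EVAL_MAX_NUM_ROWS.read(conf.get))  # 20
```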

}
}

private[sql] def getRowsToPython(
Member

In DataFrameSuite, we have multiple test cases for showString but not for getRows, which is introduced in this PR.

We also need unit test cases for getRowsToPython.

Member Author

Got it, the follow-up PR I'm working on will add more tests for getRows and getRowsToPython.

@gatorsmile
Member

@xuanyuanking Thanks for your contributions! Test coverage is the most critical when we refactor the existing code and add new features. Hopefully, when you submit new PRs in the future, could you also improve this part?

<td><code>spark.sql.repl.eagerEval.enabled</code></td>
<td>false</td>
<td>
Enable eager evaluation or not. If true and the REPL you are using supports eager evaluation,
Member

Just a question. When the REPL does not support eager evaluation, could we do anything better instead of silently ignoring the user inputs?

Member Author

It's probably hard to do better, because from inside the Dataset we can hardly detect that the REPL does not support eager evaluation.

@felixcheung felixcheung Jun 13, 2018

I don't completely follow, actually: what makes a REPL "not support eager evaluation"?
In fact, this "eager evaluation" is just object-rendering support built into the REPL, notebook, etc. (like print); it's not really designed to be "eager evaluation".

Member

That is true. We are not adding eager execution to our Dataset/DataFrame; we just rely on the REPL to trigger it. We need to update the parameter description to emphasize that.

return int(self.sql_ctx.getConf(
"spark.sql.repl.eagerEval.truncate", "20"))

def __repr__(self):
@gatorsmile gatorsmile Jun 11, 2018

This PR also changed __repr__. Thus, we need to update the PR title and description. A better PR title should be like Implement eager evaluation for DataFrame APIs in PySpark

Member Author

Done for this, and also for the PR description as you suggested. Sorry about that; I'll pay more attention next time when the implementation changes during PR review. Is there any way to change the committed git log?

@xuanyuanking
Member Author

xuanyuanking commented Jun 12, 2018

Test coverage is the most critical when we refactor the existing code and add new features. Hopefully, when you submit new PRs in the future, could you also improve this part?

Of course, and sorry for the late reply. I'll do this in a follow-up PR and answer all of Xiao's questions tonight. Thanks for all your comments.

@gatorsmile
Member

gatorsmile commented Jun 12, 2018

@xuanyuanking Just for your reference, for this PR, the PR description can be improved to something like

This PR is to add eager execution into the __repr__ and _repr_html_ of the DataFrame APIs in PySpark. When eager evaluation is enabled, repr_html returns a rich HTML version of the top-K rows of the DataFrame output. If _repr_html_ is not called by REPL, _repr_ will return the plain text of the top-K rows.

This PR adds three new external SQL confs for controlling the behavior of eager evaluation:

  • spark.sql.repl.eagerEval.enabled: Enables eager evaluation or not. When true, the top K rows of Dataset will be displayed if and only if the REPL supports the eager evaluation. Currently, the eager evaluation is only supported in PySpark. For the notebooks like Jupyter, the HTML table (generated by repr_html) will be returned. For plain Python REPL, the returned outputs are formatted like dataframe.show().
  • spark.sql.repl.eagerEval.maxNumRows: The max number of rows that are returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.
  • spark.sql.repl.eagerEval.truncate: The max number of characters of each row that is returned by eager evaluation. This only takes effect when spark.sql.repl.eagerEval.enabled is set to true.

val toJava: (Any) => Any = EvaluatePython.toJava(_, ArrayType(ArrayType(StringType)))
val iter: Iterator[Array[Byte]] = new SerDeUtil.AutoBatchedPickler(
rows.iterator.map(toJava))
PythonRDD.serveIterator(iter, "serve-GetRows")
Member

PythonRDD.serveIterator(iter, "serve-GetRows") returns Int, but the return type of getRowsToPython is Array[Any]. How does it work? cc @xuanyuanking @HyukjinKwon

Member

I think we return Array[Any] for PythonRDD.serveIterator too.

def serveIterator(items: Iterator[_], threadName: String): Array[Any] = {

Did I maybe miss something?

Member Author

Same answer as @HyukjinKwon about the return type; actually, the exact return type we need here is Array[Array[String]], which is defined by the toJava func.

Member

The changes of the return types were made in a commit [PYSPARK] Update py4j to version 0.10.7.... No PR was opened...

Member

Yup ..

Member

re the py4j commit - there's a good reason for it @gatorsmile
not sure if the change to return type is required with the py4j change though

Member

I think we better don't talk about this here though.

@xuanyuanking xuanyuanking changed the title [SPARK-24215][PySpark] Implement _repr_html_ for dataframes in PySpark [SPARK-24215][PySpark] Implement eager evaluation for DataFrame APIs in PySpark Jun 12, 2018