
Conversation

@nongli
Contributor

@nongli nongli commented Apr 7, 2016

What changes were proposed in this pull request?

This patch updates FileScanRDD to start reading the next file while the current file
is being processed, with the goal of better interleaving CPU and IO. It does this by
launching a future that asynchronously starts preparing the next file to be read. The
expectation is that the async task is IO intensive, while processing the current file
(which includes all the computation for the query plan) is CPU intensive. For some file
formats this just means opening the file and doing the initial setup; for formats like
Parquet it means doing all the IO for all of the columns.
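Roughly, the pattern looks like the following sketch (a simplified illustration with assumed names such as PartitionedFile and openFile, not the actual FileScanRDD internals):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Simplified sketch of the prefetching idea; PartitionedFile and openFile are
// placeholders standing in for the real FileScanRDD members.
case class PartitionedFile(path: String)

// Stand-in for the IO-heavy part: opening the file and doing the initial reads.
def openFile(file: PartitionedFile): Iterator[AnyRef] = Iterator.empty

def readAll(files: Iterator[PartitionedFile])(implicit ec: ExecutionContext): Unit = {
  // Start the IO for the first file in the background.
  var pending: Option[Future[Iterator[AnyRef]]] =
    if (files.hasNext) Some(Future(openFile(files.next()))) else None

  while (pending.isDefined) {
    // By the time the current file is needed, its IO has ideally already finished.
    val current = Await.result(pending.get, Duration.Inf)
    // Immediately kick off IO for the next file before the CPU-heavy processing.
    pending = if (files.hasNext) Some(Future(openFile(files.next()))) else None
    current.foreach(_ => ())  // the CPU-intensive query-plan work happens here
  }
}
```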

How was this patch tested?

Good coverage from existing tests. Added a new test for the flag. Cluster testing on TPC-DS queries.

} else {
  SqlNewHadoopRDDState.unsetInputFileName()
  false
nextFile = null
Contributor

So we are going to keep setting nextFile to null on every nextIterator call if asyncIO is false. Could we change this to:

if (asyncIO) {
  if (files.hasNext) {
    nextFile = prepareNextFile()
  } else {
    nextFile = null
  }
}

@holdenk
Contributor

holdenk commented Apr 7, 2016

This is just a question, but would it be simpler if, when we are in non-async IO mode, we just set the future to an already-completed value? That way the code is a bit simpler (or would this be more complicated?).
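For example (a minimal sketch with assumed names, not the actual code in this PR), the non-async path could return an already-completed future so both modes share the same downstream handling:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Sketch only: prepareNextFile stands in for whatever does the IO-heavy setup.
def scheduleNextFile[T](asyncIO: Boolean)(prepareNextFile: => T)
                       (implicit ec: ExecutionContext): Future[T] = {
  if (asyncIO) {
    Future(prepareNextFile)             // real background prefetch on the IO pool
  } else {
    Future.successful(prepareNextFile)  // evaluate inline and wrap in a completed future
  }
}
```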

@SparkQA

SparkQA commented Apr 7, 2016

Test build #55246 has finished for PR 12243 at commit cc6d98a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class NextFile(file: PartitionedFile, iter: Iterator[Object])

@nongli
Contributor Author

nongli commented Apr 7, 2016

@holdenk I tried to simplify the logic. Let me know your thoughts.

@SparkQA

SparkQA commented Apr 8, 2016

Test build #55268 has finished for PR 12243 at commit bc11dd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * such as starting up connections to open the file and any initial buffering. The expectation
 * is that `currentIterator` is CPU intensive and `nextFile` is IO intensive.
 */
val asyncIO = sqlContext.conf.filesAsyncIO
Contributor

Should we mark asyncIO and NextFile as private, since they seem like implementation details we might not want to expose?

@holdenk
Contributor

holdenk commented Apr 8, 2016

Since this is for a performance improvement, do we have any benchmarks that show this helps?


object FileScanRDD {
  private val ioExecutionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonCachedThreadPool("FileScanRDD", 16))
Contributor

we should set this to the total number of task slots on the executors, shouldn't we?

Contributor

Shouldn't it be the total number of cores the user is willing to dedicate to a single Job? This looks to be similar to an issue in ParquetRelation where a parallelize call can end up tying up all of the cores (defaultParallelism) on a single Job. While this PR should allow better progress to be made during that kind of blocking, I'm thinking that what we really need is to implement what was suggested a while ago in the scheduling pools: a max cores limit in addition to the current min cores. With that in place and the max cores value exposed to these large IO operations, users who care about not blocking concurrent Jobs can use pools that neither consume all the available cores nor oversubscribe the cores that the pool does have.

Contributor Author

It's difficult to model this as the total number of cores, because what this is intended to do is background IO that uses very little CPU. The async IO will still use some CPU, but the usage is expected to be very low, a small fraction of a core.

Member

Why did you choose 16? Why not 8, or 32?
Would it be better to record the reasoning behind this choice in a code comment?
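One way to surface the decision (sketched below with a hypothetical configuration key; the PR as written hard-codes 16 and uses Spark's ThreadUtils helper) would be to make the pool size configurable and document the default:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}

object AsyncIOPool {
  // Hypothetical key, for illustration only: it is not an existing Spark conf.
  private val numIoThreads: Int =
    sys.props.getOrElse("spark.sql.files.asyncIO.threads", "16").toInt

  // Background IO is expected to use little CPU, so the pool size bounds the
  // number of concurrent IO requests rather than matching the executor's core count.
  val ioExecutionContext: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(numIoThreads))
}
```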

@HyukjinKwon
Member

Hi @nongli, I just happened to look at this PR. It seems it has been inactive for a few months without any response to the review comments. Would it be better to close this for now?

@asfgit asfgit closed this in ed338f7 Feb 17, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs.

What I mean by "stale" here is that there are review comments from reviewers, but the author has been inactive, without answering them, for more than a month.

I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below.

The list below includes some PRs that were suggested to be closed and one PR opened against another branch, which seems obviously inappropriate.

Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested in them.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890

Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.