Branch 1.6 decision tree #16188
Closed
Conversation
Java mapWithState with a Function3 has a wrong conversion of a Java `Optional` to a Scala `Option`; the fixed code uses the same conversion used in the mapWithState call that takes a Function4 as input. `Optional.fromNullable(v.get)` fails if v is `None`; it is better to use `JavaUtils.optionToOptional(v)` instead. Author: Gabriele Nizzoli <mail@nizzoli.net> Closes #11007 from gabrielenizzoli/branch-1.6.
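A minimal sketch of the conversion issue described above, using hypothetical helper names (not Spark's actual internals): calling `.get` on a Scala Option throws when the Option is `None`, while a null-safe conversion handles both cases.

```scala
import com.google.common.base.Optional

// Blows up with NoSuchElementException when v is None,
// which is the failure mode described above.
def unsafeToOptional[T](v: Option[T]): Optional[T] =
  Optional.fromNullable(v.get)

// Null-safe conversion, mirroring the idea behind JavaUtils.optionToOptional.
def safeToOptional[T](v: Option[T]): Optional[T] =
  v.map(x => Optional.of(x)).getOrElse(Optional.absent[T]())
```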
…lumn name duplication Fixes the problem and verifies the fix with a test suite. Also adds an optional `nullable` (Boolean) parameter to SchemaUtils.appendColumn and deduplicates the SchemaUtils.appendColumn functions. Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes #10741 from grzegorz-chilkiewicz/master. (cherry picked from commit b1835d7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Jira: https://issues.apache.org/jira/browse/SPARK-13056 Create a map like { "a": "somestring", "b": null }, then query like SELECT col["b"] FROM t1; an NPE would be thrown. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #10964 from adrian-wang/npewriter. (cherry picked from commit 358300c) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
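A hedged reproduction sketch of the scenario above, written against the 1.6-era `SQLContext` shell API (the case class and table name `t1` are illustrative, not from the patch): selecting the key whose value is null should return NULL rather than throw.

```scala
// Assumes a spark-shell session where sqlContext is already available.
case class Record(col: Map[String, String])

val df = sqlContext.createDataFrame(Seq(Record(Map("a" -> "somestring", "b" -> null))))
df.registerTempTable("t1")

// Before the fix this lookup triggered a NullPointerException; it should yield NULL.
sqlContext.sql("""SELECT col["b"] FROM t1""").show()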
The example will throw an error like `<console>:20: error: not found: value StructType`. Need to add this line: `import org.apache.spark.sql.types._` Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com> Closes #10141 from swkimme/patch-1. (cherry picked from commit b377b03) Signed-off-by: Michael Armbrust <michael@databricks.com>
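A minimal sketch of the documentation fix, assuming a spark-shell session; without the import, the example fails with the error quoted above.

```scala
// Brings StructType, StructField, and the atomic types into scope.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
```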
https://issues.apache.org/jira/browse/SPARK-13122 A race condition can occur in MemoryStore's unrollSafely() method if two threads that return the same value for currentTaskAttemptId() execute this method concurrently. This change makes the operations of reading the initial amount of unroll memory used, performing the unroll, and updating the associated memory maps atomic in order to avoid this race condition. The initial proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID. Author: Adam Budde <budde@amazon.com> Closes #11012 from budde/master. (cherry picked from commit ff71261) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
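A simplified, self-contained sketch of the synchronization pattern described above (the class and field names are hypothetical, not the actual MemoryStore internals): the read of the current unroll memory and the map update happen under one lock, so two threads sharing a task attempt ID cannot interleave.

```scala
import scala.collection.mutable

class UnrollBookkeeping {
  private val unrollMemoryMap = mutable.Map.empty[Long, Long]

  // The read-modify-write of the bookkeeping map is atomic with respect to
  // other callers, mirroring the memoryManager.synchronized { } approach.
  def reserveUnrollMemoryForThisTask(taskAttemptId: Long, bytes: Long): Unit =
    this.synchronized {
      val current = unrollMemoryMap.getOrElse(taskAttemptId, 0L)
      unrollMemoryMap(taskAttemptId) = current + bytes
    }
}
```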
…uration columns I have clearly prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'. Author: Mario Briggs <mario.briggs@in.ibm.com> Author: mariobriggs <mariobriggs@in.ibm.com> Closes #11022 from mariobriggs/spark-12739. (cherry picked from commit e9eb248) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…ld not fail analysis of encoder Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check. Backport of #11035 to 1.6. Author: Wenchen Fan <wenchen@databricks.com> Closes #11042 from cloud-fan/branch-1.6.
…ot set but timeoutThreshold is defined Check the state's existence before calling `get`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11081 from zsxwing/SPARK-13195. (cherry picked from commit 8e2f296) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, which triggers spilling that frees the current page, and the current record we inserted becomes invalid. The root cause is that the JVM has less free memory than the MemoryManager thought, so it OOMs when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling. Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing). Author: Davies Liu <davies@databricks.com> Closes #11095 from davies/fix_expand.
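A hypothetical sketch (not the actual sorter code) of the catch-and-spill pattern described above: the OutOfMemoryError is caught, memory is freed by spilling, and the allocation is retried rather than the error being silently ignored.

```scala
object SpillOnOom {
  // Placeholder for releasing in-memory pages to disk.
  def spill(): Unit = println("spilling in-memory pages to free memory")

  def allocateWithSpill(bytes: Int): Array[Byte] =
    try {
      new Array[Byte](bytes)
    } catch {
      case _: OutOfMemoryError =>
        spill()                  // free memory held by current pages
        new Array[Byte](bytes)   // retry the allocation after spilling
    }
}
```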
…ters with Jackson 2.2.3 Patch to 1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation 2. Use maven antrun to verify the JAR has the renamed classes Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install` Author: Steve Loughran <stevel@hortonworks.com> Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle. (cherry picked from commit 34d0b70) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins, but we should use the soft prediction. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8734 from viirya/dt-soft-centroids. (cherry picked from commit 9267bc6) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
… SpecificParquetRecordReaderBase This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods. Author: Josh Rosen <joshrosen@databricks.com> Closes #11131 from JoshRosen/SPARK-12921-take-2.
…e system besides HDFS jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that, so exporting should support S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11151 from yu-iskw/SPARK-13265. (cherry picked from commit efb65e0) Signed-off-by: Xiangrui Meng <meng@databricks.com>
…n error
The PySpark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is currently no way of getting a Boolean that indicates whether a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <seth.hendrickson16@gmail.com>
Closes #10962 from sethah/SPARK-13047.
(cherry picked from commit b354673)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…alue parameter Fix this defect by checking whether a default value exists or not. yanboliang Please help to review. Author: Tommy YU <tummyyu@163.com> Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue. (cherry picked from commit d3e2e20) Signed-off-by: Xiangrui Meng <meng@databricks.com>
… Windows Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)? I confirm this is my own original work and license it to the Spark project under its open source license. Author: markpavey <mark.pavey@thefilter.com> Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix. (cherry picked from commit 374c4b2) Signed-off-by: Sean Owen <sowen@cloudera.com>
…ailed test JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10539 from viirya/fix-poweriter. (cherry picked from commit e3441e3) Signed-off-by: Xiangrui Meng <meng@databricks.com>
Looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not do the build without this dependency, so I added it to the steps. Author: Amit Dev <amitdev@gmail.com> Closes #11180 from amitdev/master. (cherry picked from commit 331293c) Signed-off-by: Sean Owen <sowen@cloudera.com>
…-guide Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project. Author: JeremyNixon <jnixon2@gmail.com> Closes #11199 from JeremyNixon/update_train_val_split_example. (cherry picked from commit adb5483) Signed-off-by: Sean Owen <sowen@cloudera.com>
There's a small typo in the SparseVector.parse docstring, which incorrectly says that it returns a DenseVector rather than a SparseVector. Author: Miles Yucht <miles@databricks.com> Closes #11213 from mgyucht/fix-sparsevector-docs. (cherry picked from commit 827ed1c) Signed-off-by: Sean Owen <sowen@cloudera.com>
This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2. Author: Sital Kedia <skedia@fb.com> Closes #11175 from sitalkedia/fix_stuck_driver. (cherry picked from commit 1e1e31e) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
…pares Option and String directly. ## What changes were proposed in this pull request? Fix some comparisons between unequal types that cause IJ warnings and, in at least one case, a likely bug (TaskSetManager). ## How was this patch tested? Running Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11253 from srowen/SPARK-13371. (cherry picked from commit 7856253) Signed-off-by: Andrew Or <andrew@databricks.com>
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs. The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms. As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more. Author: Michael Armbrust <michael@databricks.com> Closes #11308 from marmbrus/parquetWriteOOM. (cherry picked from commit 173aa94) Signed-off-by: Michael Armbrust <michael@databricks.com>
…ome special character ## What changes were proposed in this pull request? When there are some special characters (e.g., `"`, `\`) in `label`, the DAG will be broken. This patch just escapes `label` to avoid the DAG being broken by some special characters. ## How was this patch tested? Jenkins tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11309 from zsxwing/SPARK-13298. (cherry picked from commit a11b399) Signed-off-by: Andrew Or <andrew@databricks.com>
In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which will start another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value would be an instance of `SessionState`. See the exception below.
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9589 from adrian-wang/clicommand. (cherry picked from commit 5d80fac) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
…deps ## What changes were proposed in this pull request? Also update Hadoop 1 deps file to reflect Snappy 1.1.2.6 ## How was this patch tested? N/A Author: Sean Owen <sowen@cloudera.com> Closes #14992 from srowen/SPARK-17378.2.
… retrieve HiveConf ## What changes were proposed in this pull request? Right now, we rely on Hive's `SessionState.get()` to retrieve the HiveConf used by ClientWrapper. However, this conf is actually the HiveConf set with the `state`. There is a small chance that we are trying to use the Hive client in a new thread while the global client has not been created yet. In this case, `SessionState.get()` will return `null`, which causes an NPE when we call `SessionState.get().getConf`. Since the conf we want is actually the conf we set on `state`, I am changing the code to just call `state.getConf` (this is also what Spark 2.0 does). ## How was this patch tested? I have not figured out a good way to reproduce this. Author: Yin Huai <yhuai@databricks.com> Closes #14816 from yhuai/SPARK-17245.
…tion Client ## What changes were proposed in this pull request? If a user provides listeners inside the Hive Conf, the configuration for these listeners is passed to the Hive Execution Client as well. This may cause issues for two reasons: 1. The Execution Client will actually generate garbage. 2. The listener class needs to be in both the Spark classpath and the Hive classpath. This PR empties the listener configurations in HiveUtils.newTemporaryConfiguration so that the execution client will not contain the listener confs, but the metadata client will. ## How was this patch tested? Unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15087 from brkyvz/overwrite-hive-listeners.
…che.spark.storage.MemoryStore` may lead to memory leak ## What changes were proposed in this pull request? Expressions like `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be evaluated after the release-memory operation, whether `memoryToRelease` is > 0 or not. If the memory of a task has already been set to 0 when `releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask` is called, the key in the memory map corresponding to that task will never be removed from the hash map. See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465). Author: Xing SHI <shi-kou@indetail.co.jp> Closes #15022 from saturday-shi/SPARK-17465.
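A minimal sketch of the fix described above, with hypothetical field names: the zero-entry cleanup is evaluated regardless of whether any memory was actually released, so a task whose usage is already 0 still has its key removed from the map.

```scala
import scala.collection.mutable

class UnrollMemoryTracker {
  private val memoryMap = mutable.Map.empty[Long, Long]

  def releaseUnrollMemoryForThisTask(taskAttemptId: Long, memoryToRelease: Long): Unit =
    this.synchronized {
      if (memoryToRelease > 0) {
        memoryMap(taskAttemptId) = memoryMap.getOrElse(taskAttemptId, 0L) - memoryToRelease
      }
      // Previously this removal was only reachable when memoryToRelease > 0,
      // leaking the entry for tasks whose usage was already 0.
      if (memoryMap.getOrElse(taskAttemptId, 0L) == 0) {
        memoryMap.remove(taskAttemptId)
      }
    }
}
```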
SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <joshrosen@databricks.com> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer. (cherry picked from commit 5b8f737) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
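A hedged sketch of the cleanup pattern described above (the helper names are illustrative, not the shuffle writers' real API): the temporary file is deleted in a `finally` block whenever it has not been renamed to the final destination, so a failed write no longer leaks disk space.

```scala
import java.io.{File, IOException}

object TempFileCleanup {
  def writeAndCommit(tmp: File, dest: File)(write: File => Unit): Unit = {
    try {
      write(tmp)
      if (!tmp.renameTo(dest)) {
        throw new IOException(s"failed to rename $tmp to $dest")
      }
    } finally {
      // If the rename succeeded, tmp no longer exists; otherwise remove it.
      if (tmp.exists() && !tmp.delete()) {
        System.err.println(s"Error deleting temporary file $tmp")
      }
    }
  }
}
```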
…ult on double value
## What changes were proposed in this pull request?
Remainder(%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do the "%", and that loses precision.
This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.
### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
| a|
+---+
|0.0|
+---+
```
### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
| a|
+----+
|-6.0|
+----+
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #15171 from clockfly/SPARK-17617.
(cherry picked from commit 3977223)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…shed This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen <joshrosen@databricks.com> Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.
…ng entire job (branch-1.6 backport) This patch is a branch-1.6 backport of #15037: ## What changes were proposed in this pull request? In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring). In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable. Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`. ## How was this patch tested? Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method. I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master). Author: Josh Rosen <joshrosen@databricks.com> Closes #15186 from JoshRosen/SPARK-17485-branch-1.6-backport.
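A simplified sketch (not the real BlockManager signatures) of the behaviour change described above: remote-read failures surface as `None`, and the caller falls back to recomputing the block instead of failing the task.

```scala
// Hypothetical caller-side pattern: try a remote read that may fail, then recompute.
def getOrCompute[T](readRemote: () => Option[T], recompute: () => T): T =
  readRemote().getOrElse(recompute())

// Usage: a failed remote fetch (None) falls through to recomputation.
val block = getOrCompute(() => None, () => Seq(1, 2, 3))
```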
…nousListenerBus (branch 1.6) ## What changes were proposed in this pull request? Backport #15220 to 1.6. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15226 from zsxwing/SPARK-17649-branch-1.6.
… formats ## What changes were proposed in this pull request? This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618). The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison. Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time. This patch addresses the issue by adding the missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`. ## How was this patch tested? I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug. Author: Josh Rosen <joshrosen@databricks.com> Closes #15185 from JoshRosen/SPARK-17618.
From the original commit message: This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0). Author: Charles Allen <charles@allen-net.com> (cherry picked from commit 2eaeafe) Author: Charles Allen <charles@allen-net.com> Closes #15270 from vanzin/SPARK-17696.
…atrix with SparseVector Backport PR of changes relevant to mllib only, but otherwise identical to #15296 jkbradley Author: Bjarne Fruergaard <bwahlgreen@gmail.com> Closes #15311 from bwahlgreen/bugfix-spark-17721-1.6.
This backports 733cbaa to branch-1.6. It's a pretty simple patch, and would be nice to have for Spark 1.6.3. Tested with unit tests. Author: Burak Yavuz <brkyvz@gmail.com> Closes #15380 from brkyvz/bp-SPARK-15062. Signed-off-by: Michael Armbrust <michael@databricks.com>
## What changes were proposed in this pull request? This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default. ## How was this patch tested? The added test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15454 from zsxwing/SPARK-17850-1.6.
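A hedged usage sketch of the configuration named in the patch, set on a 1.6-style `SparkConf` (the app name and master are illustrative; `true` is the stated default):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ignore-corrupt-files-example")
  .setMaster("local[2]")
  .set("spark.files.ignoreCorruptFiles", "true") // skip corrupt files instead of failing the job

val sc = new SparkContext(conf)
```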
…cala-2.11 repl ## What changes were proposed in this pull request? Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" configuration, so user cannot set a fixed port number through "spark.replClassServer.port". ## How was this patch tested? N/A Author: jerryshao <sshao@hortonworks.com> Closes #15253 from jerryshao/SPARK-17678.
…m empty string to interval type ## What changes were proposed in this pull request? This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, then the isNull variable is set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) was throwing a NullPointerException for the above-mentioned reason. ## How was this patch tested? Added a test case in CastSuite.scala. JIRA entry for details: https://issues.apache.org/jira/browse/SPARK-17884 Author: prigarg <prigarg@adobe.com> Closes #15479 from priyankagargnitk/cast_empty_string_bug.
…ld not depends on local timezone ## What changes were proposed in this pull request? Back-port of #13784 to `branch-1.6` ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #15554 from srowen/SPARK-16078.
…executor loss ## What changes were proposed in this pull request? _This is the master branch-1.6 version of #15986; the original description follows:_ This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails. This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map. Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here. This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss. There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes. ## How was this patch tested? I added a new unit test to `TaskSchedulerImplSuite`. /cc kayousterhout and markhamstra, who reviewed #15986. Author: Josh Rosen <joshrosen@databricks.com> Closes #16070 from JoshRosen/fix-leak-following-total-executor-loss-1.6.
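A minimal, self-contained sketch (hypothetical types, not the scheduler's real ones) of the bookkeeping described above: tracking which task IDs are running on each executor so that every `taskId -> *` entry can be dropped when that executor is lost.

```scala
import scala.collection.mutable

class TaskTracking {
  private val taskIdToTaskSetManager = mutable.Map.empty[Long, String]
  private val executorIdToRunningTaskIds = mutable.Map.empty[String, mutable.Set[Long]]

  def taskLaunched(taskId: Long, executorId: String, taskSetName: String): Unit = {
    taskIdToTaskSetManager(taskId) = taskSetName
    executorIdToRunningTaskIds.getOrElseUpdate(executorId, mutable.Set.empty) += taskId
  }

  // Normal path: cleanup happens when a StatusUpdate arrives for the task.
  def statusUpdateFinished(taskId: Long, executorId: String): Unit = {
    taskIdToTaskSetManager.remove(taskId)
    executorIdToRunningTaskIds.get(executorId).foreach(_ -= taskId)
  }

  // Clean up every taskId -> * mapping for an executor that disappears
  // without sending a final StatusUpdate.
  def removeExecutor(executorId: String): Unit =
    executorIdToRunningTaskIds.remove(executorId)
      .foreach(_.foreach(taskIdToTaskSetManager.remove))
}
```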
…functions No tests were done for JDBCRDD#compileFilter. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10409 from maropu/AddTestsInJdbcRdd. (cherry picked from commit 8c1b867) Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #16124 from dongjoon-hyun/SPARK-12446-BRANCH-1.6.
Can one of the admins verify this patch?
Member
@zhuangxue close this PR please
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request on Nov 22, 2025:
Closes apache#15689 Closes apache#14640 Closes apache#15917 Closes apache#16188 Closes apache#16206
What algorithm is used in Spark's decision tree (ID3, C4.5, or CART)?