Branch 1.6 #12407
Closed
Conversation
We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```. cc mengxr jkbradley shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10324 from yanboliang/spark-12364. (cherry picked from commit 1a8b2a1) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
```
Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
```
Author: Andrew Or <andrew@databricks.com>
Closes #10334 from andrewor14/rpc-typo.
(cherry picked from commit 861549a)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception. This was suggested by mateiz on #7699. It may have already turned up an issue in "zero split job". Author: Imran Rashid <irashid@cloudera.com> Closes #8466 from squito/SPARK-10248. (cherry picked from commit 38d9795) Signed-off-by: Andrew Or <andrew@databricks.com>
…addShutdownHook() is called. SPARK-9886 fixed ExternalBlockStore.scala; this PR fixes the remaining references to Runtime.getRuntime.addShutdownHook(). Author: tedyu <yuzhihong@gmail.com> Closes #10325 from ted-yu/master. (cherry picked from commit f590178) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
…ry string when redirecting. Author: Rohit Agarwal <rohita@qubole.com> Closes #10180 from mindprince/SPARK-12186. (cherry picked from commit fdb3822) Signed-off-by: Andrew Or <andrew@databricks.com>
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10339 from vanzin/SPARK-12386. (cherry picked from commit d1508dd) Signed-off-by: Andrew Or <andrew@databricks.com>
This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference. Regarding the schema inference change, if we have something like ``` {"f1":1} [1,2,3] ``` originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`. When merging this PR, please make sure that the author is simplyianm. JIRA: https://issues.apache.org/jira/browse/SPARK-12057 Closes #10043 Author: Ian Macalinao <me@ian.pw> Author: Yin Huai <yhuai@databricks.com> Closes #10288 from yhuai/handleCorruptJson. (cherry picked from commit 9d66c42) Signed-off-by: Reynold Xin <rxin@databricks.com>
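A minimal repro sketch of the behavior described above, assuming a Spark 1.6 shell with the usual `sc` and `sqlContext` (variable names are illustrative):

```scala
// The second string is valid JSON but not a JSON object, so after this change it
// lands in the _corrupt_record column instead of producing an empty schema.
val rdd = sc.parallelize(Seq("""{"f1":1}""", """[1,2,3]"""))
val df = sqlContext.read.json(rdd)
df.printSchema()   // expected: _corrupt_record (string) and f1 (long)
df.show()
```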
This commit is to resolve SPARK-12396. Author: echo2mei <534384876@qq.com> Closes #10354 from echoTomei/master. (cherry picked from commit 5a514b6) Signed-off-by: Davies Liu <davies.liu@gmail.com>
…er." This reverts commit da7542f.
For the API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (null). The order of columns had been changed to match that of MySQL and PostgreSQL [1]. This PR also fixes the nullability of the output for outer joins. [1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html Author: Davies Liu <davies@databricks.com> Closes #10353 from davies/fix_join. (cherry picked from commit a170d34) Signed-off-by: Davies Liu <davies.liu@gmail.com>
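A small illustrative sketch of the API in question (toy data, not taken from the PR):

```scala
// left and right share the "id" column; before this fix the joined "id" values
// could come back null for right_outer / full_outer joins.
val left  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "l")
val right = sqlContext.createDataFrame(Seq((2, "x"), (3, "y"))).toDF("id", "r")
left.join(right, Seq("id"), "full_outer").show()
```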
Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the docs. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value. (cherry picked from commit 6e07716) Signed-off-by: Reynold Xin <rxin@databricks.com>
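A minimal sketch of the renamed column, assuming any plain-text file path of your own:

```scala
// The loaded DataFrame now exposes a single string column named "value".
val lines = sqlContext.read.text("examples/src/main/resources/people.txt")
lines.printSchema()   // root |-- value: string (nullable = true)
```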
… server Fix problem with #10332, this one should fix Cluster mode on Mesos Author: Iulian Dragos <jaguarul@gmail.com> Closes #10359 from dragos/issue/fix-spark-12345-one-more-time. (cherry picked from commit 8184568) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
No change in functionality is intended. This only changes internal API. Author: Andrew Or <andrew@databricks.com> Closes #10343 from andrewor14/clean-bm-serializer. Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
…are not found Point users to spark-packages.org to find them. Author: Reynold Xin <rxin@databricks.com> Closes #10351 from rxin/SPARK-12397. (cherry picked from commit e096a65) Signed-off-by: Michael Armbrust <michael@databricks.com>
…erInvariantEquals method org.apache.spark.streaming.Java8APISuite.java is failing because it tries to sort an immutable list in the assertOrderInvariantEquals method. Author: Evan Chen <chene@us.ibm.com> Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite.
…en recovering from checkpoint data Add a transient flag `DStream.restoredFromCheckpointData` to control the restore processing in DStream and avoid duplicate work: `DStream.restoreCheckpointData` checks this flag first and runs the restore process only when it is false. Author: jhu-chang <gt.hu.chang@gmail.com> Closes #9765 from jhu-chang/SPARK-11749. (cherry picked from commit f4346f6) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
I believe this fixes SPARK-12413. I'm currently running an integration test to verify. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #10366 from mgummelt/fix-zk-mesos. (cherry picked from commit 2bebaa3) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
…a Source filter API JIRA: https://issues.apache.org/jira/browse/SPARK-12218 When creating filters for Parquet/ORC, we should not push nested AND expressions partially. Author: Yin Huai <yhuai@databricks.com> Closes #10362 from yhuai/SPARK-12218. (cherry picked from commit 41ee7c5) Signed-off-by: Yin Huai <yhuai@databricks.com>
…Runtime.addShutdownHook() is called" This reverts commit 4af6438.
Now `StaticInvoke` receives `Any` as an object and `StaticInvoke` can be serialized, but sometimes the object passed is not serializable. For example, the following code raises an exception because `RowEncoder#extractorsFor`, invoked indirectly, makes a `StaticInvoke`. ``` case class TimestampContainer(timestamp: java.sql.Timestamp) val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(System.currentTimeMillis)) val df = rdd.toDF val ds = df.as[TimestampContainer] val rdd2 = ds.rdd <----------------- invokes extractorsFor indirectly ``` I'll add test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Michael Armbrust <michael@databricks.com> Closes #10357 from sarutak/SPARK-12404. (cherry picked from commit 6eba655) Signed-off-by: Michael Armbrust <michael@databricks.com>
``` [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes #10417 from andrewor14/fix-harmless-npe. (cherry picked from commit d655d37) Signed-off-by: Andrew Or <andrew@databricks.com>
…icationMaster ## What changes were proposed in this pull request? This patch fixes a race condition in ApplicationMaster when receiving a signal. In the current implementation, if a signal is received and no exception is thrown, the application finishes with a successful state in YARN and there is no further attempt. In reality the application was killed by a signal at runtime, so another attempt is expected. This patch adds a signal handler: if a signal is received, the application is marked as finished with failure rather than success. ## How was this patch tested? This patch was tested in the following situations: the application finishes normally; the application finishes by calling System.exit(n); the application is killed by a yarn command; the ApplicationMaster is killed by a "SIGTERM" sent by the kill pid command; the ApplicationMaster is killed by the NM with "SIGTERM" in case of NM failure. Author: jerryshao <sshao@hortonworks.com> Closes #11690 from jerryshao/SPARK-13642-1.6-backport.
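For illustration only, a minimal sketch of registering a SIGTERM handler with the internal `sun.misc` API; this is not the ApplicationMaster's code, just the general mechanism:

```scala
import sun.misc.{Signal, SignalHandler}

// On SIGTERM, record that the application is failing instead of letting it
// look like a normal, successful exit.
Signal.handle(new Signal("TERM"), new SignalHandler {
  override def handle(sig: Signal): Unit = {
    System.err.println(s"Received SIG${sig.getName}, marking application as failed")
  }
})
```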
…b to install intr package. ## What changes were proposed in this pull request? In dev/lint-r.R, `install_github` makes our builds depend on an unstable source. This may cause unexpected test failures and then break the build. This PR adds a specific commit sha1 ID to `install_github` to get a stable source. ## How was this patch tested? dev/lint-r Author: Sun Rui <rui.sun@intel.com> Closes #11913 from sun-rui/SPARK-14074. (cherry picked from commit 7d11750) Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request? GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets a default value. ## How was this patch tested? Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors. Author: sethah <seth.hendrickson16@gmail.com> Closes #11944 from sethah/SPARK-14107. (cherry picked from commit 5850977) Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request? We ran into a problem today debugging a class loading problem during deserialization, and the JVM was masking the underlying exception, which made it very difficult to debug. We can, however, log the exceptions ourselves using try/catch in serialization/deserialization. The good thing is that all these methods already use Utils.tryOrIOException, so we can put the try/catch and logging in a single place. ## How was this patch tested? A logging change with a manual test. Author: Reynold Xin <rxin@databricks.com> Closes #11951 from rxin/SPARK-14149. (cherry picked from commit 70a6f0b) Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request? Fix incorrect use of binarySearch in SparseMatrix ## How was this patch tested? Unit test added. Author: Chenliang Xu <chexu@groupon.com> Closes #11992 from luckyrandom/SPARK-14187. (cherry picked from commit c838829) Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request? This patch ensures that we trim all paths set in yarn.nodemanager.local-dirs and that the scheme is removed so the LevelDB can be created. ## How was this patch tested? Manual tests. Author: nfraison <nfraison@yahoo.fr> Closes #11475 from ashangit/level_db_creation_issue. (cherry picked from commit ff3bea3)
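A hypothetical sketch of the normalization described above (the helper name and details are illustrative, not the patch's code):

```scala
// Trim each configured dir and strip a possible scheme (e.g. "file:///data/yarn")
// so that a plain local filesystem path remains.
def normalizeLocalDirs(confValue: String): Array[String] =
  confValue.split(",").map(_.trim).filter(_.nonEmpty).map { d =>
    val p = new java.net.URI(d).getPath
    if (p == null || p.isEmpty) d else p
  }

// normalizeLocalDirs(" file:///data/1 , /data/2 ")  ->  Array("/data/1", "/data/2")
```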
## What changes were proposed in this pull request?
Currently, `GraphOps.pickRandomVertex()` falls into infinite loops for graphs having only one vertex. This PR fixes it by modifying the following termination-checking condition.
```scala
- if (selectedVertices.count > 1) {
+ if (selectedVertices.count > 0) {
```
## How was this patch tested?
Pass the Jenkins tests (including new test case).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12021 from dongjoon-hyun/SPARK-14219-2.
…askEnd avoiding driver OOM ## What changes were proposed in this pull request? We have a streaming job using `FlumePollInputStream` that always OOMs the driver after a few days. Here is part of the driver heap dump before the OOM:
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
   ...
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver OOMs in the end. After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`. To fix the issue, I use `onBlockUpdated` instead of `onTaskEnd`, so we can update the block information (add blocks, drop blocks from memory to disk, and delete blocks) in time. ## How was this patch tested? Existing unit tests and manual tests. Author: jeanlyn <jeanlyn92@gmail.com> Closes #12028 from jeanlyn/fixoom1.6.
…r is removed with a multiple line reason. ## What changes were proposed in this pull request? The event timeline doesn't show on the job page if an executor is removed with a multi-line reason. This PR replaces all newline characters in the reason string with spaces. ## How was this patch tested? Verified on the Web UI. Author: Carson Wang <carson.wang@intel.com> Closes #12029 from carsonwang/eventTimeline. (cherry picked from commit 15c0b00) Signed-off-by: Andrew Or <andrew@databricks.com>
jira: https://issues.apache.org/jira/browse/SPARK-11507 "In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem." Root cause: colPtr.last does NOT always equal values.length in a breeze CSCMatrix, which fails the require in SparseMatrix. Easy steps to repro: ``` val m1: BM[Double] = new CSCMatrix[Double] (Array (1.0, 1, 1), 3, 3, Array (0, 1, 2, 3), Array (0, 1, 2) ) val m2: BM[Double] = new CSCMatrix[Double] (Array (1.0, 2, 2, 4), 3, 3, Array (0, 0, 2, 4), Array (1, 2, 1, 2) ) val sum = m1 + m2 Matrices.fromBreeze(sum) ``` Solution: by checking the code in [CSCMatrix](https://github.com/scalanlp/breeze/blob/28000a7b901bc3cfbbbf5c0bce1d0a5dda8281b0/math/src/main/scala/breeze/linalg/CSCMatrix.scala), CSCMatrix in breeze can have extra zeros at the end of the data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead as the actual compact operation is only performed when necessary. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9520 from hhbyyh/matricesFromBreeze. (cherry picked from commit ca45861) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> Conflicts: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
…xceed JVM size limit for cached DataFrames
## What changes were proposed in this pull request?
This PR reduces the Java bytecode size of methods in ```SpecificColumnarIterator``` by using two approaches:
1. Generate and call ```getTYPEColumnAccessor()``` for each type that is actually used, to instantiate accessors
2. Group a large number of method calls (more than 4000) into a separate method
## How was this patch tested?
Added a new unit test to ```InMemoryColumnarQuerySuite```
Here is the generated code:
```java
/* 033 */ private org.apache.spark.sql.execution.columnar.CachedBatch batch = null;
/* 034 */
/* 035 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor;
/* 036 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor1;
/* 037 */
/* 038 */ public SpecificColumnarIterator() {
/* 039 */ this.nativeOrder = ByteOrder.nativeOrder();
/* 030 */ this.mutableRow = new MutableUnsafeRow(rowWriter);
/* 041 */ }
/* 042 */
/* 043 */ public void initialize(Iterator input, DataType[] columnTypes, int[] columnIndexes,
/* 044 */ boolean columnNullables[]) {
/* 044 */ this.input = input;
/* 046 */ this.columnTypes = columnTypes;
/* 047 */ this.columnIndexes = columnIndexes;
/* 048 */ }
/* 049 */
/* 050 */
/* 051 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor getIntColumnAccessor(int idx) {
/* 052 */ byte[] buffer = batch.buffers()[columnIndexes[idx]];
/* 053 */ return new org.apache.spark.sql.execution.columnar.IntColumnAccessor(ByteBuffer.wrap(buffer).order(nativeOrder));
/* 054 */ }
/* 055 */
/* 056 */
/* 057 */
/* 058 */
/* 059 */
/* 060 */
/* 061 */ public boolean hasNext() {
/* 062 */ if (currentRow < numRowsInBatch) {
/* 063 */ return true;
/* 064 */ }
/* 065 */ if (!input.hasNext()) {
/* 066 */ return false;
/* 067 */ }
/* 068 */
/* 069 */ batch = (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
/* 070 */ currentRow = 0;
/* 071 */ numRowsInBatch = batch.numRows();
/* 072 */ accessor = getIntColumnAccessor(0);
/* 073 */ accessor1 = getIntColumnAccessor(1);
/* 074 */
/* 075 */ return hasNext();
/* 076 */ }
/* 077 */
/* 078 */ public InternalRow next() {
/* 079 */ currentRow += 1;
/* 080 */ bufferHolder.reset();
/* 081 */ rowWriter.zeroOutNullBytes();
/* 082 */ accessor.extractTo(mutableRow, 0);
/* 083 */ accessor1.extractTo(mutableRow, 1);
/* 084 */ unsafeRow.setTotalSize(bufferHolder.totalSize());
/* 085 */ return unsafeRow;
/* 086 */ }
```
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #11984 from kiszk/SPARK-14138.
…case unit. ## What changes were proposed in this pull request? This fix addresses an issue in PySpark where `spark.python.worker.memory` could only be configured with a lower-case unit (`k`, `m`, `g`, `t`). This fix allows upper-case units (`K`, `M`, `G`, `T`) to be used as well, so as to conform to the JVM memory string as specified in the documentation. ## How was this patch tested? This fix adds an additional test to cover the changes. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12163 from yongtang/SPARK-14368. (cherry picked from commit 7db5624) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
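A tiny sketch of the setting this touches; with the fix, both spellings of the unit should be accepted:

```scala
import org.apache.spark.SparkConf

// "512m" already worked; "512M" previously tripped up PySpark's unit parsing.
val conf = new SparkConf().set("spark.python.worker.memory", "512M")
```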
…locks
## What changes were proposed in this pull request?
This patch tries to update `updatedBlockStatuses` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses`.
## How was this patch tested?
test("updated block statuses") in BlockManagerSuite.scala
Author: jeanlyn <jeanlyn92@gmail.com>
Closes #12150 from jeanlyn/updataBlock1.6.
…Optimizer ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14322 OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate. This can cause scalability issues. This should be an easy fix. This is also a bug since it modifies the first argument to reduce, so we should use aggregate or treeAggregate. See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452 and a few lines below it. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #12106 from hhbyyh/ldaTreeReduce. (cherry picked from commit 8cffcb6) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
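An illustrative sketch of the reduce-vs-treeAggregate point (toy vectors, not the LDAOptimizer code): treeAggregate merges partial results in a tree pattern and supplies its own zero value, so user code never mutates the first argument of a reduce.

```scala
import breeze.linalg.DenseVector

val vecs = sc.parallelize(1 to 1000).map(i => DenseVector.fill(10)(i.toDouble))

// Sum the vectors with treeAggregate: the driver only merges a logarithmic
// number of partial results instead of one result per partition.
val sum = vecs.treeAggregate(DenseVector.zeros[Double](10))(
  (acc, v) => acc + v,   // fold a record into the per-partition accumulator
  (a, b) => a + b)       // merge accumulators across partitions
```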
## What changes were proposed in this pull request? `OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468). Before: `OutputCommitCoordinator` is enabled only if speculation is enabled. After: `OutputCommitCoordinator` is always enabled. Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't... ## How was this patch tested? `OutputCommitCoordinator*Suite` Author: Andrew Or <andrew@databricks.com> Closes #12244 from andrewor14/always-occ. (cherry picked from commit 3e29e37) Signed-off-by: Andrew Or <andrew@databricks.com>
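A minimal sketch of the escape hatch mentioned above (not recommended):

```scala
import org.apache.spark.SparkConf

// With this change the coordinator is always on; this flag turns it back off.
val conf = new SparkConf().set("spark.hadoop.outputCommitCoordination.enabled", "false")
```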
…ied exception ## What changes were proposed in this pull request? When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception. ## How was this patch tested? Added a test suite for the component that extracts the root cause of the error. Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application that would quite often fail due to the CommitDeniedException. Author: Jason Moore <jasonmoore2k@outlook.com> Closes #12228 from jasonmoore2k/SPARK-14357. (cherry picked from commit 22014e6) Signed-off-by: Andrew Or <andrew@databricks.com>
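A hedged sketch of the general idea (not Spark's actual helper): walk the Throwable cause chain to see whether the failure was ultimately a CommitDeniedException.

```scala
// Returns true if `name` matches the class of the error or of any of its causes.
@scala.annotation.tailrec
def causedBy(t: Throwable, name: String): Boolean = t match {
  case null => false
  case e if e.getClass.getName == name => true
  case e => causedBy(e.getCause, name)
}

// e.g. causedBy(err, "org.apache.spark.executor.CommitDeniedException")
```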
…emory copy in Netty's tran… ## What changes were proposed in this pull request? When netty transfers data that is not a `FileRegion`, the data is in the form of a `ByteBuf`. If the data is large, a significant performance issue occurs because of the underlying memory copy in `sun.nio.ch.IOUtil.write`: the CPU is 100% used while the network throughput is very low. In this PR, if the data size is large, we split it into small chunks for the `WritableByteChannel.write()` calls, so that we avoid the wasteful memory copy; since the data can't be written in a single call, `transferTo` is invoked multiple times. ## How was this patch tested? Spark unit test and manual test. Manual test: `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length` For more details, please refer to [SPARK-14290](https://issues.apache.org/jira/browse/SPARK-14290) Author: Zhang, Liye <liye.zhang@intel.com> Closes #12296 from liyezhang556520/apache-branch-1.6-spark-14290.
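An illustrative sketch of the chunking idea in plain NIO terms (this is not the Netty transport code): write a large buffer in bounded slices so each underlying write only copies a small chunk.

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

def writeInChunks(channel: WritableByteChannel, data: ByteBuffer, chunkSize: Int = 256 * 1024): Unit = {
  while (data.hasRemaining) {
    // Expose at most chunkSize bytes to the channel per write call.
    val slice = data.duplicate()
    slice.limit(math.min(data.position() + chunkSize, data.limit()))
    val written = channel.write(slice)
    data.position(data.position() + written)
  }
}
```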
…failed Backports #12234 to 1.6. Original description below: ## What changes were proposed in this pull request? This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance, here is the code in a catch block before this change in `WriterContainer.scala`: ```scala logError("Aborting task.", cause) // call failure callbacks first, so we could have a chance to cleanup the writer. TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause) if (currentWriter != null) { currentWriter.close() } abortTask() throw new SparkException("Task failed while writing rows.", cause) ``` If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes the problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exceptions thrown within the catch block and rethrows the original exception. ## How was this patch tested? No new functionality added. Author: Sameer Agarwal <sameer@databricks.com> Closes #12272 from sameeragarwal/fix-exception-1.6.
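A hedged sketch of the pattern; the name and signature below are illustrative, not necessarily the utility's exact shape. If the cleanup inside a catch block throws, the cleanup failure is attached as suppressed so the original cause is still the one that propagates.

```scala
def tryWithSafeCatch[T](originalCause: Throwable)(cleanup: => T): T = {
  try {
    cleanup
  } catch {
    case t: Throwable =>
      // Keep the real failure as the primary exception; record the cleanup
      // failure alongside it instead of losing either one.
      originalCause.addSuppressed(t)
      throw originalCause
  }
}
```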
…n archive.apache.org [archive.apache.org](https://archive.apache.org/) is undergoing maintenance, breaking our `build/mvn` script: > We are in the process of relocating this service. To save on the immense bandwidth that this service outputs, we have put it in maintenance mode, disabling all downloads for the next few days. We expect the maintenance to be complete no later than the morning of Monday the 11th of April, 2016. This patch fixes this issue by updating the script to use the regular mirror network to download Maven. (This is a backport of #12262 to 1.6) Author: Josh Rosen <joshrosen@databricks.com> Closes #12307 from JoshRosen/fix-1.6-mvn-download.
## What changes were proposed in this pull request? In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpointing by setting ```checkpointInterval = -1```, but we did not actually handle this situation for LDA; we should fix this bug. ## How was this patch tested? Existing tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12089 from yanboliang/spark-14298. (cherry picked from commit 56af8e8) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
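A minimal usage sketch of the documented behavior this fixes (spark.ml API):

```scala
import org.apache.spark.ml.clustering.LDA

// checkpointInterval = -1 should now genuinely disable checkpointing for LDA.
val lda = new LDA().setK(10).setMaxIter(20).setCheckpointInterval(-1)
```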
…decoder ## What changes were proposed in this pull request? In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE` instead of the default size (which is 16) when allocating `compositeBuffer` in `TransportFrameDecoder`, because `compositeBuffer` introduces too many underlying memory copies with the default `maxNumComponents` when the frame size is large (which results in many transport messages). For details, please refer to [SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242). ## How was this patch tested? Spark unit tests and manual tests. For manual tests, we can reproduce the performance issue with the following code: `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length` It's easy to see the performance gain, both from the running time and CPU usage. Author: Zhang, Liye <liye.zhang@intel.com> Closes #12038 from liyezhang556520/spark-14242.
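For illustration, a small Netty sketch of the knob being changed (not the TransportFrameDecoder code itself): a CompositeByteBuf created with a large maxNumComponents can keep accumulating components without consolidating (copying) them once the default limit of 16 components is reached.

```scala
import io.netty.buffer.Unpooled

// With maxNumComponents = Integer.MAX_VALUE the composite never consolidates.
val composite = Unpooled.compositeBuffer(Integer.MAX_VALUE)
composite.addComponent(Unpooled.wrappedBuffer("part-1".getBytes("UTF-8")))
composite.addComponent(Unpooled.wrappedBuffer("part-2".getBytes("UTF-8")))
// Note: addComponent does not advance the writer index by itself.
composite.writerIndex(composite.capacity())
```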
…ransformer ## What changes were proposed in this pull request? Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are: * It doesn't work under HiveContext (in Spark 1.6) * Race conditions ## How was this patch tested? * Manual test with HiveContext. * Added a unit test for `transformSchema` to improve coverage. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #12330 from mengxr/SPARK-14563. (cherry picked from commit 1995c2e) Signed-off-by: Xiangrui Meng <meng@databricks.com>
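A minimal usage sketch of SQLTransformer for context; `__THIS__` is the placeholder for the input DataFrame in the statement, and the fix only changes the temporary table name used behind it:

```scala
import org.apache.spark.ml.feature.SQLTransformer

val df = sqlContext.createDataFrame(Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
```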
## What changes were proposed in this pull request? This PR improves the performance of the SQL UI by: 1) removing the details column on the all-executions page (the first page in the SQL tab); the details can be checked by entering an execution's page; 2) switching from break-all to break-word, since break-all has recently been super slow in Chrome; 3) using "display: none" to hide a block; 4) using one js closure for all the executions, not one for each; 5) removing the height limitation of details, so there is no need to scroll it in a tiny window. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12311 from davies/ui_perf.
Fix a memory leak in the Sorter. When the UnsafeExternalSorter spills data to disk, it does not free the underlying pointer array. As a result, we see a lot of executor OOMs and also memory under-utilization. This is a regression partially introduced in PR #9241. Tested by running a job and observed around 30% speedup after this change. Author: Sital Kedia <skedia@fb.com> Closes #12285 from sitalkedia/executor_oom. (cherry picked from commit d187e7d) Signed-off-by: Davies Liu <davies.liu@gmail.com> Conflicts: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java
## What changes were proposed in this pull request? In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs. ## How was this patch tested? no tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12377 from jkbradley/regeval-doc. (cherry picked from commit bf65c87) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Can one of the admins verify this patch?
@thinkborm can you please close the pull request? Thanks.