[pull] master from apache:master #61 (Merged)
### What changes were proposed in this pull request?
In the PR, I propose to add a new error class for unsupported method calls, and remove similar legacy error classes. The new `apply()` method of `SparkUnsupportedOperationException` extracts the method and class name from stack traces automatically, and places them into the error class parameters.

### Why are the changes needed?
To improve code maintenance by avoiding boilerplate code (the class and method names are extracted automatically), and to clean up `error-classes.json`.

### Does this PR introduce _any_ user-facing change?
Yes, it can if a user's code depends on the error class or message format of `SparkUnsupportedOperationException`.

### How was this patch tested?
By running the new test:
```
$ build/sbt "test:testOnly *QueryCompilationErrorsSuite"
```
and the affected test suites:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *ShuffleSpecSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44757 from MaxGekk/unsupported_call-error-class.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
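A minimal sketch of the stack-trace extraction idea described above (the object and message format below are illustrative, not the actual Spark implementation):

```scala
object UnsupportedCall {
  def apply(): UnsupportedOperationException = {
    // Frame 0 is Thread.getStackTrace, frame 1 is this apply(), frame 2 is the caller.
    val caller = Thread.currentThread().getStackTrace()(2)
    val className = caller.getClassName
    val methodName = caller.getMethodName
    new UnsupportedOperationException(
      s"[UNSUPPORTED_CALL] Cannot call the method `$methodName` of the class `$className`.")
  }
}
```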
…n Apple Silicon for db2 and oracle with third-party docker environments

### What changes were proposed in this pull request?
SPARK-46525 makes docker-integration-tests pass most tests on Apple Silicon except DB2 and Oracle. This PR modifies DockerJDBCIntegrationSuite to make it compatible with some of the third-party docker environments, such as the Colima docker environment. Developers can quickly bring up these tests for local testing after a simple pre-setup process.

### Why are the changes needed?
Make it possible to test and debug locally for developers on Apple Silicon platforms.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Passed OracleIntegrationSuite on my Apple Silicon Mac locally
```
[info] OracleIntegrationSuite:
[info] - SPARK-33034: ALTER TABLE ... add new columns (21 seconds, 839 milliseconds)
[info] ....
[info] Run completed in 3 minutes, 16 seconds.
[info] Total number of tests run: 26
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 26, failed 0, canceled 0, ignored 11, pending 0
[info] All tests passed.
[success] Total time: 326 s (05:26), completed Jan 5, 2024, 7:10:36 PM
```
- Containers ran normally on my Apple Silicon Mac locally during and after the tests above
```
5ea7cc54165b ibmcom/db2:11.5.6.0a "/var/db2_setup/lib/…" 36 minutes ago Up 36 minutes 22/tcp, 55000/tcp, 60006-60007/tcp, 0.0.0.0:57898->50000/tcp, :::57898->50000/tcp strange_ritchie
d31122b8a504 gvenzl/oracle-free:23.3 "container-entrypoin…" About an hour ago Up About an hour 0.0.0.0:64193->1521/tcp, :::64193->1521/tcp priceless_wright
75f9943fd4b6 mariadb:10.5.12 "/docker-entrypoint/…" 2 hours ago Up 2 hours 0.0.0.0:55052->3306/tcp, :::55052->3306/tcp angry_ganguly
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44612 from yaooqinn/SPARK-46525-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR attaches the codec extension to avro datasource files.
```
part-00000-2d4a2c78-a62a-4f7d-a286-5572dcdefade-c000.zstandard.avro
part-00000-74c04de5-c991-4a40-8740-8d472f4ce2ec-c000.avro
part-00000-965d0e93-9f86-40f9-8544-d71d14cc9787-c000.xz.avro
part-00002-965d0e93-9f86-40f9-8544-d71d14cc9787-c000.snappy.avro
```

### Why are the changes needed?
Feature parity with the parquet and orc file sources; it is useful to differentiate the compression codecs of Avro files.

### Does this PR introduce _any_ user-facing change?
No, this more likely belongs to the underlying data storage layer.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44770 from yaooqinn/SPARK-46746.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
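A small usage sketch (spark-shell style; the output path and codec choice are illustrative) showing how the codec now appears in the produced file names:

```scala
// Write Avro data with the zstandard codec; output files are expected to
// carry the codec in their names, e.g. part-00000-...-c000.zstandard.avro
spark.range(10).write
  .format("avro")
  .option("compression", "zstandard")
  .save("/tmp/avro_zstd_example")
```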
…thon UDF profiler

### What changes were proposed in this pull request?
Basic support of SparkSession based Python UDF profiler. To enable the profiler, use a SQL conf `spark.sql.pyspark.udf.profiler`:

- `"perf"`: enable cProfiler
- `"memory"`: enable memory-profiler (TODO: [SPARK-46687](https://issues.apache.org/jira/browse/SPARK-46687))

```py
from pyspark.sql.functions import *

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # enable cProfiler

@udf("string")
def f(x):
    return str(x)

df = spark.range(10).select(f(col("id")))
df.collect()

@pandas_udf("string")
def g(x):
    return x.astype("string")

df = spark.range(10).select(g(col("id")))

spark.conf.unset("spark.sql.pyspark.udf.profiler")  # disable
df.collect()  # won't profile

spark.showPerfProfiles()  # show the result for only the first collect.
```

### Why are the changes needed?
The existing UDF profilers are SparkContext based, which can't support Spark Connect. We should introduce SparkSession based profilers and support Spark Connect.

### Does this PR introduce _any_ user-facing change?
Yes, SparkSession-based UDF profilers will be available.

### How was this patch tested?
Added the related tests, manually, and existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44697 from ueshin/issues/SPARK-46686/profiler.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. unify the imports;
2. delete unused helper functions and variables.

### Why are the changes needed?
Code clean up.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44771 from zhengruifeng/py_df_cleanup.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
This PR aims to document the following three environment variables for `Spark Standalone` cluster:

- SPARK_LOG_DIR
- SPARK_LOG_MAX_FILES
- SPARK_PID_DIR

### Why are the changes needed?
So far, the users need to look at the `spark-env.sh.template` or `spark-daemon.sh` files to see the descriptions and the default values. We had better document it officially.

https://github.com/apache/spark/blob/9a2f39318e3af8b3817dc5e4baf52e548d82063c/conf/spark-env.sh.template#L67-L69

https://github.com/apache/spark/blob/9a2f39318e3af8b3817dc5e4baf52e548d82063c/sbin/spark-daemon.sh#L25-L28

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Generate HTML docs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44774 from dongjoon-hyun/SPARK-46749.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As a part of SPARK-45869 (Revisit and Improve Spark Standalone Cluster), this PR aims to remove `*slave*` scripts from the `sbin` directory for the Apache Spark 4.0.0 codebase and binary distributions.
```
spark-3.5.0-bin-hadoop3:$ ls -al sbin/*slave*
-rwxr-xr-x 1 dongjoon staff 981 Sep 8 19:08 sbin/decommission-slave.sh
-rwxr-xr-x 1 dongjoon staff 957 Sep 8 19:08 sbin/slaves.sh
-rwxr-xr-x 1 dongjoon staff 967 Sep 8 19:08 sbin/start-slave.sh
-rwxr-xr-x 1 dongjoon staff 969 Sep 8 19:08 sbin/start-slaves.sh
-rwxr-xr-x 1 dongjoon staff 965 Sep 8 19:08 sbin/stop-slave.sh
-rwxr-xr-x 1 dongjoon staff 967 Sep 8 19:08 sbin/stop-slaves.sh
```

### Why are the changes needed?
The `*slave*.sh` scripts were deprecated at Apache Spark 3.1.0 (March 2021) via SPARK-32004 (July 2020).

### Does this PR introduce _any_ user-facing change?
Yes, but:
- these are only wrapper scripts for legacy environments and were removed from all documents.
- the new alternative corresponding scripts have been documented instead and used for the last 3 years.
- we can simplify the `sbin` directory of binary distributions for a better UX.
- Apache Spark 4.0.0 is a good, and the last, chance to clean these up.

### How was this patch tested?
Pass the CI and manual review.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44773 from dongjoon-hyun/SPARK-46748.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…nstalled

### What changes were proposed in this pull request?
This PR proposes to skip `test_datasource` if PyArrow is not installed because it requires `mapInArrow` that needs PyArrow.

### Why are the changes needed?
To make the build pass with the env that does not have PyArrow installed. Currently the scheduled job fails (with PyPy3): https://github.com/apache/spark/actions/runs/7557652490/job/20577472214

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Scheduled jobs should test it out. I also manually tested it.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44776 from HyukjinKwon/SPARK-46751.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…benchmarks

### What changes were proposed in this pull request?
This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?
Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

And, this PR aims to change the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:        OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:    .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual review.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
A previous refactor mistakenly used `isValid` for add. Since `defaultValidValue` was always `0`, this didn't cause any correctness issues. What we really want to do for add (and merge) is `if (isZero) _value = 0`. This also removes `isValid`, since it is redundant if `defaultValidValue` is always `0`.

### Why are the changes needed?
There are no correctness errors, but this is confusing and error-prone. A negative `defaultValidValue` was intended to allow creating optional metrics. With the previous behavior this would incorrectly add the sentinel value. `defaultValidValue` is supposed to determine what value is exposed to the user.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Running the tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44649 from davintjong-db/sql-metric-add-fix.

Authored-by: Davin Tjong <davin.tjong@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
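A hedged sketch of the intended accumulator behavior (the class and field names below mirror the discussion above but are illustrative, not the exact Spark source):

```scala
class SQLMetricSketch(initValue: Long = -1L, defaultValidValue: Long = 0L) {
  private var _value: Long = initValue

  // The metric is "zero" while it still holds the sentinel initial value.
  def isZero: Boolean = _value == initValue

  def add(v: Long): Unit = {
    if (isZero) _value = 0  // drop the sentinel so it is never folded into the sum
    _value += v
  }

  // What is reported to the user: the sentinel maps to the default valid value.
  def value: Long = if (isZero) defaultValidValue else _value
}
```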
…e running spark-sql raise error
### What changes were proposed in this pull request?
Fix a build issue: when building a runnable distribution from master code, running spark-sql raises an error:
```
Caused by: java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 58 more
```
The problem is due to a guava dependency in the spark-connect-common POM that **conflicts** with the shade plugin of the parent pom:
- spark-connect-common contains the `connect.guava.version` version of guava, and it is relocated as `${spark.shade.packageName}.guava`, not `${spark.shade.packageName}.connect.guava`;
- spark-network-common also contains guava-related classes; it has likewise been relocated to `${spark.shade.packageName}.guava`, but with guava version `${guava.version}`;
- As a result, different versions of the relocated `org.sparkproject.guava.xx` classes end up on the classpath.

In addition, after investigation, it seems that the spark-connect-common module does not actually need guava, so we can remove the guava dependency from spark-connect-common.
### Why are the changes needed?
A runnable distribution built from master code currently fails to run.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I manually ran the build command to produce a runnable distribution package for the tests.
Build command:
```
./dev/make-distribution.sh --name ui --pip --tgz -Phive -Phive-thriftserver -Pyarn -Pconnect
```
Test result:
<img width="1276" alt="image" src="https://github.com/apache/spark/assets/51110188/aefbc433-ea5c-4287-8ebd-367806043ac8">
I also checked the `org.sparkproject.guava.cache.LocalCache` from jars dir;
Before:
```
➜ jars grep -lr 'org.sparkproject.guava.cache.LocalCache' ./
.//spark-connect_2.13-4.0.0-SNAPSHOT.jar
.//spark-network-common_2.13-4.0.0-SNAPSHOT.jar
.//spark-connect-common_2.13-4.0.0-SNAPSHOT.jar
```
Now:
```
➜ jars grep -lr 'org.sparkproject.guava.cache.LocalCache' ./
.//spark-network-common_2.13-4.0.0-SNAPSHOT.jar
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #43436 from Yikf/SPARK-45593.
Authored-by: yikaifei <yikaifei@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
Purge pip cache in dockerfile.

### Why are the changes needed?
To save 4~5G disk space.

Before: https://github.com/zhengruifeng/spark/actions/runs/7541725028/job/20530432798
```
#45 [39/39] RUN df -h
#45 0.090 Filesystem Size Used Avail Use% Mounted on
#45 0.090 overlay 84G 70G 15G 83% /
#45 0.090 tmpfs 64M 0 64M 0% /dev
#45 0.090 shm 64M 0 64M 0% /dev/shm
#45 0.090 /dev/root 84G 70G 15G 83% /etc/resolv.conf
#45 0.090 tmpfs 7.9G 0 7.9G 0% /proc/acpi
#45 0.090 tmpfs 7.9G 0 7.9G 0% /sys/firmware
#45 0.090 tmpfs 7.9G 0 7.9G 0% /proc/scsi
#45 DONE 2.0s
```

After: https://github.com/zhengruifeng/spark/actions/runs/7549204209/job/20552796796
```
#48 [42/43] RUN python3.12 -m pip cache purge
#48 0.670 Files removed: 392
#48 DONE 0.7s
#49 [43/43] RUN df -h
#49 0.075 Filesystem Size Used Avail Use% Mounted on
#49 0.075 overlay 84G 65G 19G 79% /
#49 0.075 tmpfs 64M 0 64M 0% /dev
#49 0.075 shm 64M 0 64M 0% /dev/shm
#49 0.075 /dev/root 84G 65G 19G 79% /etc/resolv.conf
#49 0.075 tmpfs 7.9G 0 7.9G 0% /proc/acpi
#49 0.075 tmpfs 7.9G 0 7.9G 0% /sys/firmware
#49 0.075 tmpfs 7.9G 0 7.9G 0% /proc/scsi
```

### Does this PR introduce _any_ user-facing change?
No, infra-only.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44768 from zhengruifeng/infra_docker_cleanup.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Reduce the number of layers of the testing dockerfile.

### Why are the changes needed?
To address #44768 (review).

### Does this PR introduce _any_ user-facing change?
No, infra-only.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44781 from zhengruifeng/infra_docker_layers.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…rror

### What changes were proposed in this pull request?
This PR improves the error messages for the `DATA_SOURCE_NOT_FOUND` error.

### Why are the changes needed?
To make the error messages more user-friendly and up to date.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44620 from allisonwang-db/spark-46618-not-found-err.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…e definition and write options

### What changes were proposed in this pull request?
This PR fixes the case sensitivity of 'compression' in the avro table definition and the write options, in order to make it consistent with other file sources. Also, the current logic for dealing with invalid codec names is unreachable.

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
Yes, 'compression'='Xz' and 'compression'='XZ' now work as well as 'compression'='xz'.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44780 from yaooqinn/SPARK-46754.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…eFormatter

### What changes were proposed in this pull request?
This PR proposes to replace `SimpleDateFormat` with `DateTimeFormatter`.

### Why are the changes needed?
According to the javadoc of `SimpleDateFormat`, it is recommended to use `DateTimeFormatter` instead. In addition, `DateTimeFormatter` has better performance than `SimpleDateFormat`.

Note: `SimpleDateFormat` and `DateTimeFormatter` are not completely compatible; for example, the formats supported by `parse` are not exactly the same.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA and manual test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44616 from beliefer/replace-sdf-with-dtf.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
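An illustrative before/after sketch of the replacement described above (the pattern and usage are examples, not the specific call sites touched by the PR):

```scala
import java.text.SimpleDateFormat
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import java.util.Date

// Before: SimpleDateFormat is mutable and not thread-safe, so instances
// cannot be shared across threads without extra care.
val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val before = sdf.format(new Date())

// After: DateTimeFormatter is immutable and thread-safe, and is generally faster.
val dtf = DateTimeFormatter
  .ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneId.systemDefault())
val after = dtf.format(Instant.now())
```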
…occur after variable declarations

### What changes were proposed in this pull request?
In ResourceProfileManager, function calls should occur after variable declarations.

### Why are the changes needed?
As the title suggests, in `ResourceProfileManager`, function calls should be made after variable declarations. When `isSupport` is determined, all variables are still uninitialized, with booleans defaulting to false and objects to null. While the end result is correct, the evaluation process is abnormal.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Through existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44705 from lyy-pineapple/SPARK-46696.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
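A simplified illustration of the initialization-order pitfall described above (the class and field names are made up for this note, not the actual ResourceProfileManager code):

```scala
class InitOrderExample {
  // This call runs during construction, before `dynamicEnabled` below is assigned,
  // so the method observes the JVM default (false) rather than the declared value.
  private val supported: Boolean = isSupported()

  private val dynamicEnabled: Boolean = true

  private def isSupported(): Boolean = dynamicEnabled
}

// new InitOrderExample() leaves `supported` as false even though `dynamicEnabled`
// is declared as true; declaring the vals before the call avoids the surprise.
```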
…edicate pushdown

### What changes were proposed in this pull request?
This PR adds the field `throwable` to `Expression`. If an expression is marked as throwable, we will avoid pushing filters containing these expressions through joins, filters, and aggregations (i.e. operators that filter input).

### Why are the changes needed?
For predicate pushdown, currently it is possible that we push down a filter that ends up being evaluated on more rows than before it was pushed down (e.g. if we push the filter through a selective join). In this case, it is possible that we now evaluate the filter on a row that will cause a runtime error to be thrown, when prior to pushing this would not have happened.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44716 from kelvinjian-db/SPARK-46707-throwable.

Authored-by: Kelvin Jiang <kelvin.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
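A hedged example of the scenario motivating the change (the data and query are invented for illustration; whether the division actually throws depends on settings such as ANSI mode):

```scala
import org.apache.spark.sql.functions.expr

// `dims` only contains keys 1..9, so the join drops the row with value = 0.
val facts = spark.range(0, 10).withColumnRenamed("id", "value")
val dims  = spark.range(1, 10).withColumnRenamed("id", "key")
val joined = facts.join(dims, facts("value") === dims("key"))

// If this filter were pushed below the join, `1 / value` could be evaluated
// on the value = 0 row that the join would otherwise have filtered out,
// turning a safe query into one that can fail at runtime.
val result = joined.filter(expr("1 / value > 0"))
```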
### What changes were proposed in this pull request?
Remove an unused TODO.

### Why are the changes needed?
[SPARK-40546](https://issues.apache.org/jira/browse/SPARK-40546) is not a problem, and was already marked as resolved.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44785 from zhengruifeng/SPARK_40546.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…comment

### What changes were proposed in this pull request?
This change adds logic to generate correct DDL for nested fields in STRUCT. In particular, instead of generating a list of fields with data type names, it will add the `NOT NULL` qualifier when necessary and the field comment when present.

For a table:
```
CREATE TABLE t(field STRUCT<one: STRING NOT NULL, two: DOUBLE NOT NULL COMMENT 'comment'>);
SHOW CREATE TABLE t;
```

Before:
```
CREATE TABLE t(field STRUCT<one: STRING, two: DOUBLE>)
```

After:
```
CREATE TABLE t(field STRUCT<one: STRING NOT NULL, two: DOUBLE NOT NULL COMMENT 'comment'>)
```

Closes #41016

### Why are the changes needed?
Generate correct DDL.

### Does this PR introduce _any_ user-facing change?
No, we do not document the behavior of this command for the struct case.

### How was this patch tested?
New unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44644 from vitaliili-db/SPARK-46629.

Authored-by: Vitalii Li <vitalii.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ql.avro.compression.codec

### What changes were proposed in this pull request?
Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec.

### Why are the changes needed?
Docfix.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc build.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44783 from yaooqinn/avro_minor.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…evel for avro files

### What changes were proposed in this pull request?
This PR introduces 2 keys in the form of 'spark.sql.avro.$codecName.level', just like the existing 'spark.sql.avro.deflate.level', for the zstandard and xz codecs. With this patch, users are able to tune the trade-off between speed and compression ratio when they use AVRO compressed by zstd or xz.

### Why are the changes needed?
Avro supports compression levels for deflate, xz and zstd, but we have only supported deflate so far.

### Does this PR introduce _any_ user-facing change?
Yes, new configurations added.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44786 from yaooqinn/SPARK-46759.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
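A usage sketch under the naming pattern described above (the codec-specific key name, e.g. `spark.sql.avro.zstandard.level`, is assumed here by analogy with `spark.sql.avro.deflate.level`):

```scala
// Pick the Avro codec and a higher compression level, then write as usual.
spark.conf.set("spark.sql.avro.compression.codec", "zstandard")
spark.conf.set("spark.sql.avro.zstandard.level", "9")  // assumed key name

spark.range(1000).write.format("avro").save("/tmp/avro_zstd_level9")
```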
…ata source save mode

### What changes were proposed in this pull request?
This PR is a follow up for #44576 to change the single quotes to double quotes for the data source name.

### Why are the changes needed?
To make the error message format consistent.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44790 from allisonwang-db/spark-46576-follow-up.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove an unneeded method override in the definition of `ProductionTag`.

### Why are the changes needed?
It's just noise.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I built the docs and confirmed that this script block shows up only when `PRODUCTION=1`:

https://github.com/apache/spark/blob/71468ebcc85e2694935086dcf0b01bfe2bff745f/docs/_layouts/global.html#L35-L52

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44788 from nchammas/production-tagblock.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nonicalization of the plan

### What changes were proposed in this pull request?
This PR proposes to fix the bug in canonicalizing a plan which contains the physical node of dropDuplicatesWithinWatermark (`StreamingDeduplicateWithinWatermarkExec`).

### Why are the changes needed?
Canonicalization of the plan replaces the expressions (including attributes) to remove cosmetic differences, including names and metadata; the metadata is what denotes the event time column marker. StreamingDeduplicateWithinWatermarkExec assumes that the input attributes of the child node contain the event time column, and this is determined at the initialization of the node instance. Once canonicalization is triggered, the child node loses the notion of the event time column from its attributes, and a copy of StreamingDeduplicateWithinWatermarkExec is performed which instantiates a new `StreamingDeduplicateWithinWatermarkExec` with the new child node, which no longer has an event time column, hence the instantiation fails.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44688 from HeartSaVioR/SPARK-46676.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
… `seed`
### What changes were proposed in this pull request?
Make `shuffle` specify the datatype of `seed`
### Why are the changes needed?
The `shuffle` function may fail with an extremely low probability (~ 2e-10):
`shuffle` requires a Long-typed `seed` in an unregistered function, and this Long value is extracted in the Planner.
In the Scala client, `SparkClassUtils.random.nextLong` guarantees the type;
while in Python, `lit(random.randint(0, sys.maxsize))` may return an Integer Literal instead of a Long Literal.
```
In [26]: from pyspark.sql import functions as sf
In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+
In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+
In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345
```
Another case is `uuid`, but it is not supported in Python due to namespace conflicts.
I don't find other similar cases.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Manually checked.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #44793 from zhengruifeng/py_shuffle_long.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
This PR aims to remove legacy `docker-for-desktop` logic in favor of `docker-desktop`.

### Why are the changes needed?
- Docker Desktop switched the underlying node name and context to `docker-desktop` in 2020.
  - docker/for-win#5089 (comment)
- Since Apache Spark 3.2.2, we have been hiding it from the documentation via SPARK-38272 and now we can delete it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs and manually test with Docker Desktop.
```
$ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,local -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test"
...
[info] KubernetesSuite:
[info] - SPARK-42190: Run SparkPi with local[*] (12 seconds, 759 milliseconds)
[info] - Run SparkPi with no resources (13 seconds, 747 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (19 seconds, 688 milliseconds)
[info] - Run SparkPi with a very long application name. (12 seconds, 436 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (17 seconds, 411 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (12 seconds, 352 milliseconds)
[info] - Run SparkPi with an argument. (17 seconds, 481 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (12 seconds, 375 milliseconds)
[info] - All pods have the same service account by default (17 seconds, 375 milliseconds)
[info] - Run extraJVMOptions check on driver (9 seconds, 362 milliseconds)
[info] - SPARK-42474: Run extraJVMOptions JVM GC option check - G1GC (12 seconds, 319 milliseconds)
[info] - SPARK-42474: Run extraJVMOptions JVM GC option check - Other GC (9 seconds, 280 milliseconds)
[info] - SPARK-42769: All executor pods have SPARK_DRIVER_POD_IP env variable (12 seconds, 404 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (18 seconds, 198 milliseconds)
[info] - Run SparkPi with env and mount secrets. (19 seconds, 463 milliseconds)
[info] - Run PySpark on simple pi.py example (18 seconds, 373 milliseconds)
[info] - Run PySpark to test a pyfiles example (14 seconds, 435 milliseconds)
[info] - Run PySpark with memory customization (17 seconds, 334 milliseconds)
[info] - Run in client mode. (5 seconds, 235 milliseconds)
[info] - Start pod creation from template (12 seconds, 447 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (17 seconds, 351 milliseconds)
[info] - Test basic decommissioning (45 seconds, 365 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (49 seconds, 679 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 52 seconds)
[info] - Test decommissioning timeouts (50 seconds, 379 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 17 seconds)
[info] - Run SparkR on simple dataframe.R example (19 seconds, 453 milliseconds)
[info] YuniKornSuite:
[info] Run completed in 14 minutes, 39 seconds.
[info] Total number of tests run: 27
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1078 s (17:58), completed Jan 19, 2024, 12:12:23 AM
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44796 from dongjoon-hyun/SPARK-46770.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…-jre

### What changes were proposed in this pull request?
This PR aims to upgrade Guava used by the `connect` module from 32.0.1-jre to 33.0-jre, and at the same time, upgrade `failureaccess`, which is used in conjunction with Guava, from version 1.01 to 1.02.

### Why are the changes needed?
The new version brings some changes as follows:
- net: Optimized InternetDomainName construction. (google/guava@3a1d18f, google/guava@eaa62eb)
- util.concurrent: Changed our implementations to avoid eagerly initializing loggers during class loading. This can help performance. (google/guava@4fe1df5)

The full release notes are as follows:
- https://github.com/google/guava/releases/tag/v32.1.2
- https://github.com/google/guava/releases/tag/v32.1.3
- https://github.com/google/guava/releases/tag/v33.0.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44795 from LuciferYang/upgrade-connect-guava.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR adds ZSTD Buffer Pool support for AVRO datasource writing with the zstd compression codec.

### Why are the changes needed?
Enable a tuning technique for users.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new configuration.

### How was this patch tested?
Passing the existing CI shall be sufficient.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44792 from yaooqinn/SPARK-46766.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
pull bot pushed a commit that referenced this pull request on Nov 22, 2024:
…ead pool

### What changes were proposed in this pull request?
This PR aims to use a meaningful class name prefix for the REST Submission API thread pool instead of the default value of Jetty QueuedThreadPool, `"qtp"+super.hashCode()`.

https://github.com/dekellum/jetty/blob/3dc0120d573816de7d6a83e2d6a97035288bdd4a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L64

### Why are the changes needed?
This is helpful during JVM investigation.

**BEFORE (4.0.0-preview2)**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28217 | grep qtp
"qtp1925630411-52" #52 daemon prio=5 os_prio=31 cpu=0.07ms elapsed=19.06s tid=0x0000000134906c10 nid=0xde03 runnable [0x0000000314592000]
"qtp1925630411-53" #53 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134ac6810 nid=0xc603 runnable [0x000000031479e000]
"qtp1925630411-54" #54 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x000000013491ae10 nid=0xdc03 runnable [0x00000003149aa000]
"qtp1925630411-55" #55 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=19.06s tid=0x0000000134ac9810 nid=0xc803 runnable [0x0000000314bb6000]
"qtp1925630411-56" #56 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134ac9e10 nid=0xda03 runnable [0x0000000314dc2000]
"qtp1925630411-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134aca410 nid=0xca03 runnable [0x0000000314fce000]
"qtp1925630411-58" #58 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134acaa10 nid=0xcb03 runnable [0x00000003151da000]
"qtp1925630411-59" #59 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x0000000134acb010 nid=0xcc03 runnable [0x00000003153e6000]
"qtp1925630411-60-acceptor-0108e9815-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.11ms elapsed=19.06s tid=0x00000001317ffa10 nid=0xcd03 runnable [0x00000003155f2000]
"qtp1925630411-61-acceptor-11d90f2aa-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.10ms elapsed=19.06s tid=0x00000001314ed610 nid=0xcf03 waiting on condition [0x00000003157fe000]
```

**AFTER**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28317 | grep StandaloneRestServer
"StandaloneRestServer-52" #52 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284a8e10 nid=0xdb03 runnable [0x000000032cfce000]
"StandaloneRestServer-53" #53 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284acc10 nid=0xda03 runnable [0x000000032d1da000]
"StandaloneRestServer-54" #54 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284ae610 nid=0xd803 runnable [0x000000032d3e6000]
"StandaloneRestServer-55" #55 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284aec10 nid=0xd703 runnable [0x000000032d5f2000]
"StandaloneRestServer-56" #56 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284af210 nid=0xc803 runnable [0x000000032d7fe000]
"StandaloneRestServer-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284af810 nid=0xc903 runnable [0x000000032da0a000]
"StandaloneRestServer-58" #58 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284afe10 nid=0xcb03 runnable [0x000000032dc16000]
"StandaloneRestServer-59" #59 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284b0410 nid=0xcc03 runnable [0x000000032de22000]
"StandaloneRestServer-60-acceptor-04aefbaa8-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.13ms elapsed=60.05s tid=0x000000015cda1a10 nid=0xcd03 runnable [0x000000032e02e000]
"StandaloneRestServer-61-acceptor-148976251-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.12ms elapsed=60.05s tid=0x000000015cd1c810 nid=0xce03 waiting on condition [0x000000032e23a000]
```

### Does this PR introduce _any_ user-facing change?
No, the thread names are only accessed during debugging.

### How was this patch tested?
Manual review.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48924 from dongjoon-hyun/SPARK-50385.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: panbingkun <panbingkun@apache.org>
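A hedged sketch of the kind of change involved, naming a Jetty `QueuedThreadPool` so its threads get a meaningful prefix (the wiring below is illustrative, not the exact Spark REST server code):

```scala
import org.eclipse.jetty.util.thread.QueuedThreadPool

val threadPool = new QueuedThreadPool(200)
threadPool.setName("StandaloneRestServer")  // threads show up as StandaloneRestServer-<n> in jstack
threadPool.setDaemon(true)
```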
pull bot pushed a commit that referenced this pull request on Jul 21, 2025:
…ingBuilder`

### What changes were proposed in this pull request?
This PR aims to improve `toString` by using `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?
Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java
- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+   Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
public java.lang.String toString();
  Code:
     0: new           #39  // class org/apache/commons/lang3/builder/ToStringBuilder
     3: dup
     4: aload_0
     5: getstatic     #41  // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
     8: invokespecial #47  // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
    11: ldc           #50  // String appId
    13: aload_0
    14: getfield      #7   // Field appId:Ljava/lang/String;
    17: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    20: ldc           #55  // String execId
    22: aload_0
    23: getfield      #13  // Field execId:Ljava/lang/String;
    26: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    29: ldc           #56  // String blockIds
    31: aload_0
    32: getfield      #16  // Field blockIds:[Ljava/lang/String;
    35: invokestatic  #57  // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    38: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    41: invokevirtual #61  // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
    44: areturn
```

**AFTER**
```
public java.lang.String toString();
  Code:
     0: aload_0
     1: getfield      #7     // Field appId:Ljava/lang/String;
     4: aload_0
     5: getfield      #13    // Field execId:Ljava/lang/String;
     8: aload_0
     9: getfield      #16    // Field blockIds:[Ljava/lang/String;
    12: invokestatic  #39    // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    15: invokedynamic #43, 0 // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
    20: areturn
```

### Does this PR introduce _any_ user-facing change?
No. This is a `toString` implementation improvement.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>