Fix conflicts and DataFrameFunctionsSuite #21
…ROR_TEMP_2075`: `UNSUPPORTED_FEATURE.WRITE_FOR_BINARY_SOURCE`
### What changes were proposed in this pull request?
This PR proposes to integrate `_LEGACY_ERROR_TEMP_2075` into `UNSUPPORTED_FEATURE.WRITE_FOR_BINARY_SOURCE`.
### Why are the changes needed?
To improve the error message by assigning a proper error condition and SQLSTATE.
### Does this PR introduce _any_ user-facing change?
No, only the user-facing error message is improved.
### How was this patch tested?
Updated the existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48780 from itholic/LEGACY_2075.
Lead-authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <haejoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ROR_TEMP_2058`: `INVALID_PARTITION_VALUE`
### What changes were proposed in this pull request?
This PR proposes to integrate `_LEGACY_ERROR_TEMP_2058` into `INVALID_PARTITION_VALUE`.
### Why are the changes needed?
To improve the error message by assigning a proper error condition and SQLSTATE.
### Does this PR introduce _any_ user-facing change?
No, only the user-facing error message is improved.
### How was this patch tested?
Updated the existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48778 from itholic/LEGACY_2058.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ROR_TEMP_2167`: `INVALID_JSON_RECORD_TYPE`
### What changes were proposed in this pull request?
This PR proposes to integrate `_LEGACY_ERROR_TEMP_2167` into `INVALID_JSON_RECORD_TYPE`.
### Why are the changes needed?
To improve the error message by assigning a proper error condition and SQLSTATE.
### Does this PR introduce _any_ user-facing change?
No, only the user-facing error message is improved.
### How was this patch tested?
Updated the existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48775 from itholic/LEGACY_2167.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
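These three migrations all follow the same pattern. Below is a minimal, hedged sketch of the idea; the registry plumbing and the message templates are invented for illustration (Spark's real implementation lives in its error-conditions definitions and `SparkThrowable`), and only the condition names come from the PR titles:
```scala
// Illustrative only: a legacy numbered error becomes a named condition with a
// proper SQLSTATE, so the user-facing message is structured and documented.
final case class ErrorCondition(name: String, sqlState: String, template: String)

object ErrorConditions {
  private val registry = Map(
    "UNSUPPORTED_FEATURE.WRITE_FOR_BINARY_SOURCE" ->
      ErrorCondition("UNSUPPORTED_FEATURE.WRITE_FOR_BINARY_SOURCE", "0A000",
        "Write is not supported for the binary file data source."),
    "INVALID_PARTITION_VALUE" ->
      ErrorCondition("INVALID_PARTITION_VALUE", "22023",
        "Failed to cast value <value> for partition column <columnName>.")
  )

  def raise(condition: String, params: Map[String, String]): Nothing = {
    val c = registry(condition)
    // Substitute named parameters into the message template.
    val msg = params.foldLeft(c.template) { case (t, (k, v)) => t.replace(s"<$k>", v) }
    // A named condition plus SQLSTATE replaces an opaque _LEGACY_ERROR_TEMP_XXXX id.
    throw new IllegalStateException(s"[${c.name}] $msg SQLSTATE: ${c.sqlState}")
  }
}
```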
…re SortMergeJoin is forced
### What changes were proposed in this pull request?
I propose extending the existing tests in `CollationSuite` by adding cases where `SortMergeJoin` is forced and tested for correctness and for the use of `CollationKey`.
### Why are the changes needed?
These changes are needed to properly test the behavior of joins on collated data when different configs are enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The change is a test itself.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48774 from vladanvasi-db/vladanvasi-db/collation-suite-test-extension.
Authored-by: Vladan Vasić <vladan.vasic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
… Java exceptions
### What changes were proposed in this pull request?
`MakeDTInterval` and `MakeYMInterval` do not catch Java exceptions in `nullSafeEval` the way `MakeInterval` does, so we make their behavior consistent.
### Why are the changes needed?
To show users a readable error message instead of a raw Java exception.
### Does this PR introduce _any_ user-facing change?
Improved error message.
### How was this patch tested?
There were already a few tests checking this behavior; I just changed the expected error type.
### Was this patch authored or co-authored using generative AI tooling?
Yes, Copilot was used.
Closes apache#48773 from gotocoding-DB/SPARK-50226-overflow-error.
Authored-by: Ruzel Ibragimov <ruzel.ibragimov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
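A minimal, self-contained sketch of the pattern, assuming nothing beyond the description above; these are not Spark's actual `MakeDTInterval` internals, and the error wrapper is a stand-in:
```scala
// Interval-construction arithmetic can overflow; wrap it so the user sees a
// readable message instead of a bare java.lang.ArithmeticException.
object IntervalEvalSketch {
  private val MICROS_PER_SECOND = 1000000L

  def makeDayTimeIntervalMicros(days: Int, hours: Int, mins: Int, secs: Double): Long =
    try {
      var micros = Math.multiplyExact(days.toLong, 24L * 3600L * MICROS_PER_SECOND)
      micros = Math.addExact(micros, Math.multiplyExact(hours.toLong, 3600L * MICROS_PER_SECOND))
      micros = Math.addExact(micros, Math.multiplyExact(mins.toLong, 60L * MICROS_PER_SECOND))
      Math.addExact(micros, (secs * MICROS_PER_SECOND).toLong)
    } catch {
      case e: ArithmeticException =>
        // Re-throw with a user-facing message carrying the offending inputs.
        throw new IllegalArgumentException(
          s"Interval overflow: $days days, $hours hours, $mins mins, $secs secs", e)
    }

  def main(args: Array[String]): Unit = {
    println(makeDayTimeIntervalMicros(1, 2, 3, 4.5))          // 93784500000
    println(makeDayTimeIntervalMicros(Int.MaxValue, 0, 0, 0)) // readable error, not a raw overflow
  }
}
```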
…timeReplaceable)
### What changes were proposed in this pull request?
The PR aims to add `Codegen` support for `schema_of_xml`.
### Why are the changes needed?
- Improve codegen coverage.
- Simplify code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA & existing UTs (e.g. `XmlFunctionsSuite#*schema_of_xml*`).
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48594 from panbingkun/SPARK-50066.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…rn null on no result for country and language
### What changes were proposed in this pull request?
It was noticed that the collations TVF returns null for country and language when the collation is UTF8_*, but returns an empty string when the information is missing in ICU.
### Why are the changes needed?
Making the behaviour consistent.
### Does this PR introduce _any_ user-facing change?
No, this is all in Spark 4.0, so this TVF has not been released yet.
### How was this patch tested?
Existing test.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48835 from mihailom-db/fix-collations-table.
Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…dbc code stack
### What changes were proposed in this pull request?
I propose reverting the PR that changed pattern matching of `StringType` in the JDBC code stack, since it may lead to a collated column being mapped to an uncollated column in some dialects. For the time being, this is not the correct behavior.
### Why are the changes needed?
These changes are needed in order to preserve proper datatype-mapping behavior in the dialects.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No testing was needed.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48833 from vladanvasi-db/vladanvasi-db/jdbc-refactor-revert.
Authored-by: Vladan Vasić <vladan.vasic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
… checking StateMessage_pb2.py and StateMessage_pb2.pyi
### What changes were proposed in this pull request?
This PR includes the following changes:
1. Refactor the `dev/connect-gen-protos.sh` script to support generating `.py` files from `.proto` files for both the `connect` and `streaming` modules, and rename it to `dev/gen-protos.sh`. To maintain compatibility with previous development practices, this PR also introduces `dev/connect-gen-protos.sh` and `dev/streaming-gen-protos.sh` as wrappers around `dev/gen-protos.sh`. After this PR, you can use:
```
dev/gen-protos.sh connect
dev/gen-protos.sh streaming
```
or
```
dev/connect-gen-protos.sh
dev/streaming-gen-protos.sh
```
to regenerate the corresponding `.py` files for the respective modules.
2. Refactor the `dev/connect-check-protos.py` script to check the generated results for both the `connect` and `streaming` modules, and rename it to `dev/check-protos.py`. Additionally, update the invocation of the check script in `build_and_test.yml`.
### Why are the changes needed?
Provide tools for regenerating and checking `StateMessage_pb2.py` and `StateMessage_pb2.pyi`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GitHub Actions.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48815 from LuciferYang/streaming-gen-protos.
Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…P_2000
### What changes were proposed in this pull request?
Introducing two new error classes to replace `_LEGACY_ERROR_TEMP_2000`:
- DATETIME_FIELD_OUT_OF_BOUNDS
- INVALID_INTERVAL_WITH_MICROSECONDS_ADDITION
### Why are the changes needed?
We want to assign names to all existing error classes.
### Does this PR introduce _any_ user-facing change?
Yes, the error message changed.
### How was this patch tested?
Existing tests cover error raising.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48332 from mihailom-db/invalid_date_argument_value.
Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR aims to use the `mirror host` instead of `archive.apache.org`.
### Why are the changes needed?
Currently, Apache Spark CI is flaky due to checksum download failures like the following. It took over 9 minutes and failed eventually.
- https://github.com/apache/spark/actions/runs/11818847971/job/32927380452
- https://github.com/apache/spark/actions/runs/11818847971/job/32927382179
```
exec: curl --retry 3 --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz?action=download
exec: curl --retry 3 --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz.sha512
curl: (28) Failed to connect to archive.apache.org port 443 after 135199 ms: Connection timed out
curl: (28) Failed to connect to archive.apache.org port 443 after 134166 ms: Connection timed out
curl: (28) Failed to connect to archive.apache.org port 443 after 135213 ms: Connection timed out
curl: (28) Failed to connect to archive.apache.org port 443 after 135260 ms: Connection timed out
Verifying checksum from /home/runner/work/spark/spark/build/apache-maven-3.9.9-bin.tar.gz.sha512
shasum: /home/runner/work/spark/spark/build/apache-maven-3.9.9-bin.tar.gz.sha512: no properly formatted SHA checksum lines found
Bad checksum from https://archive.apache.org/dist/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz.sha512
Error: Process completed with exit code 2.
```
**BEFORE**
```
$ build/mvn clean
exec: curl --retry 3 --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz?action=download
exec: curl --retry 3 --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz.sha512
```
**AFTER**
```
$ build/mvn clean
exec: curl --retry 3 --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz?action=download
exec: curl --retry 3 --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.9/binaries/apache-maven-3.9.9-bin.tar.gz.sha512?action=download
```
### Does this PR introduce _any_ user-facing change?
No, this is a dev-only change.
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48836 from dongjoon-hyun/SPARK-50300.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to remove `(any|empty).proto` from the RAT exclusion.
### Why are the changes needed?
`(any|empty).proto` files were never a part of the Apache Spark repository. Those files were only used in the initial `Connect` PR and removed before merging.
- apache#37710
  - Added: apache@45c7bc5
  - Excluded from RAT check: apache@cf6b19a
  - Removed: apache@4971980
### Does this PR introduce _any_ user-facing change?
No. This is a dev-only change.
### How was this patch tested?
Pass the CIs or manual check.
```
$ ./dev/check-license
Ignored 0 lines in your exclusion files as comments or empty lines.
RAT checks passed.
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48837 from dongjoon-hyun/SPARK-50304.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to note Python 3.13 in the `pyspark-connect` package as a supported version.
### Why are the changes needed?
To officially support Python 3.13.
### Does this PR introduce _any_ user-facing change?
Yes, in the `pyspark-connect` package, Python 3.13 will be explicitly noted as a supported Python version.
### How was this patch tested?
CI passed at https://github.com/apache/spark/actions/runs/11824865909
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48839 from HyukjinKwon/SPARK-50306.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… scopes
### What changes were proposed in this pull request?
We are introducing checks for unique label names.
New rules for label names:
- Labels can't have the same name as any of the labels in the scopes surrounding them.
- Labels can have the same name as other labels in the same scope (sibling scopes).
**Valid** code:
```
BEGIN
lbl: BEGIN
SELECT 1;
END;
lbl: BEGIN
SELECT 2;
END;
BEGIN
lbl: WHILE 1=1 DO
LEAVE lbl;
END WHILE;
END;
END
```
**Invalid** code:
```
BEGIN
lbl: BEGIN
lbl: BEGIN
SELECT 1;
END;
END;
END
```
#### Design explanation:
Even though there are _Listeners_ with `enterRule` and `exitRule` methods that could check labels before visiting a node and remove them from `seenLabels` afterwards, we favor this approach because it requires minimal changes, keeps the code more compact, and avoids dependency issues.
Additionally, generating the label text would otherwise need to be done in two places, and we wanted to avoid duplicating that logic:
- `enterRule`
- `visitRule`
A minimal sketch of the visit-time check is shown below.
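The sketch uses an invented toy AST; only the `seenLabels` idea comes from the description above:
```scala
import scala.collection.mutable

// Toy AST: a labeled block containing nested statements.
sealed trait Stmt
final case class Block(label: Option[String], body: Seq[Stmt]) extends Stmt
final case class Leaf(sql: String) extends Stmt

object LabelChecker {
  // Labels currently in scope on the path from the root to the visited node.
  private val seenLabels = mutable.Set.empty[String]

  def check(stmt: Stmt): Unit = stmt match {
    case Block(label, body) =>
      label.foreach { l =>
        require(!seenLabels.contains(l),
          s"Label '$l' duplicates a label of a surrounding scope")
        seenLabels += l
      }
      body.foreach(check)
      // Leaving the node frees the name, so sibling scopes may reuse it.
      label.foreach(seenLabels -= _)
    case Leaf(_) => ()
  }
}

// Valid: two sibling blocks both labeled "lbl".
// Invalid: nested blocks both labeled "lbl" (the require above fires).
```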
### Why are the changes needed?
It will be needed in the future when we release Local Scoped Variables for SQL Scripting, so users can target variables from outer scopes if they are shadowed.
### How was this patch tested?
New unit tests in 'SqlScriptingParserSuite.scala'.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48795 from miland-db/milan-dankovic_data/unique_labels_scripting.
Authored-by: Milan Dankovic <milan.dankovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…o 5.11.3
### What changes were proposed in this pull request?
This PR aims to upgrade `jupiter-interface` from 0.13.0 to 0.13.1 and JUnit 5 to the latest version (Platform 1.11.3 + Jupiter 5.11.3).
### Why are the changes needed?
The new version of `jupiter-interface` brings two fixes:
- sbt/sbt-jupiter-interface#122
- sbt/sbt-jupiter-interface#116
It also upgrades the JUnit dependencies to Platform 1.11.3 + Jupiter 5.11.3:
- sbt/sbt-jupiter-interface#119
The full release notes of `jupiter-interface`:
- https://github.com/sbt/sbt-jupiter-interface/releases/tag/v0.13.1
The full release notes between JUnit 5.11.0 and 5.11.3:
- https://junit.org/junit5/docs/5.11.3/release-notes/#release-notes-5.11.3
- https://junit.org/junit5/docs/5.11.3/release-notes/#release-notes-5.11.2
- https://junit.org/junit5/docs/5.11.3/release-notes/#release-notes-5.11.1
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GitHub Actions.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48834 from LuciferYang/junit5-5.11.3.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ithState`
### What changes were proposed in this pull request?
This is a follow-up of apache#47133 to add the missing API ref docs.
### Why are the changes needed?
Provide a proper API ref doc for `transformWithState`.
### Does this PR introduce _any_ user-facing change?
No API changes, but the user-facing API ref docs will include the new API.
### How was this patch tested?
The existing doc build in CI should pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48840 from itholic/SPARK-48755-followup.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ional arrays
### What changes were proposed in this pull request?
There is a bug introduced in PR apache#46006. That PR fixed the behaviour of the PostgreSQL connector for multidimensional arrays, since we had mapped all arrays to 1-D arrays; however, it broke one case. The following scenario is broken:
- The user has a table t1 on Postgres and runs a CTAS command to create table t2 with the same data. PR apache#46006 resolves the dimensionality of a column by reading the metadata from the `pg_attribute` table and its `attndims` column.
- This query returns the correct dimensionality for table t1, but for table t2 created via CTAS it always returns 0. This leads to all arrays being mapped to a 0-D array, which is the element type itself (for example, int). This is a bug on the Postgres side.
- As a solution, we can query the `array_ndims` function on the given column, which returns the dimension of the column. It works for CTAS-created tables too. We can get the result of this function on the first row of the table.
This change issues an additional query to the PG table to find the dimension of an array column instead of querying the metadata table as before. It might be more expensive, but we send `LIMIT 1` in the query.
Also, there is one caveat. In PG, there is no fixed dimensionality of an array, as all arrays are effectively 1-D arrays (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if there is a table with a 2-D array column, it is fine from the PG side to insert a 1-D or 3-D array into this column. This makes the read path in Spark problematic: we may get a dimension of, say, 1 because the first record is a 1-D array, and then fail when reading a 3-D array later on (and vice versa: getting dimension 3 and reading a 1-D array later on). The proposed change is fine with this scenario since it already doesn't work in Spark. What my change implies is that the user can get a different error message depending on the dimensionality of the first record in the table (namely, for one table they can get an error message that the expected type is ARRAY<ARRAY<INT>>, and for the other that it is ARRAY\<INT\>).
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No. It just fixes one case that doesn't work currently.
### How was this patch tested?
New tests.
### Was this patch authored or co-authored using generative AI tooling?
Closes apache#48625 from PetarVasiljevic-DB/fix_postgres_multidimensional_arrays.
Authored-by: Petar Vasiljevic <petar.vasiljevic@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
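A hedged sketch of the probe described above; the connection handling is simplified and the table/column names are placeholders, but `array_ndims` is the real Postgres function:
```scala
import java.sql.DriverManager

object ArrayNdimsProbe {
  // Returns the dimensionality reported for the first row, or 0 for an empty table.
  // Unlike pg_attribute.attndims, this also works for CTAS-created tables.
  def arrayNdims(jdbcUrl: String, table: String, column: String): Int = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.createStatement()
      // LIMIT 1 keeps the extra metadata query cheap, as noted in the description.
      val rs = stmt.executeQuery(s"""SELECT array_ndims("$column") FROM "$table" LIMIT 1""")
      if (rs.next()) rs.getInt(1) else 0
    } finally conn.close()
  }
}
```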
…aFrame in Spark Classic
### What changes were proposed in this pull request?
The PR targets Spark Classic only. Spark Connect will be handled in a follow-up PR.
The `verifySchema` parameter of createDataFrame decides whether to verify the data types of every row against the schema. Now it only takes effect for createDataFrame with regular Python instances.
The PR proposes to make it work with createDataFrame with:
- `pyarrow.Table`
- `pandas.DataFrame` with Arrow optimization
- `pandas.DataFrame` without Arrow optimization
By default, the `verifySchema` parameter is `pyspark._NoValue`; if not provided, createDataFrame with:
- `pyarrow.Table` uses **verifySchema = False**
- `pandas.DataFrame` with Arrow optimization uses **verifySchema = spark.sql.execution.pandas.convertToArrowArraySafely**
- `pandas.DataFrame` without Arrow optimization uses **verifySchema = True**
- regular Python instances use **verifySchema = True** (existing behavior)
### Why are the changes needed?
The change makes schema validation consistent across all input formats, improving data integrity and helping prevent errors. It also enhances flexibility by allowing users to choose schema verification regardless of the input type. Part of [SPARK-50146](https://issues.apache.org/jira/browse/SPARK-50146).
### Does this PR introduce _any_ user-facing change?
Setup:
```py
>>> import pyarrow as pa
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> data = {
...     "id": [1, 2, 3],
...     "value": [100000000000, 200000000000, 300000000000]
... }
>>> schema = StructType([StructField("id", IntegerType(), True), StructField("value", IntegerType(), True)])
```
Usage - createDataFrame with `pyarrow.Table`
```py
>>> table = pa.table(data)
>>> spark.createDataFrame(table, schema=schema).show()  # verifySchema defaults to False
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
```
Usage - createDataFrame with `pandas.DataFrame` without Arrow optimization
```py
>>> pdf = pd.DataFrame(data)
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to True
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000
>>> spark.createDataFrame(table, schema=schema, verifySchema=False).show()
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
```
Usage - createDataFrame with `pandas.DataFrame` with Arrow optimization
```py
>>> pdf = pd.DataFrame(data)
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
>>> spark.conf.get("spark.sql.execution.pandas.convertToArrowArraySafely")
'false'
>>> spark.createDataFrame(pdf, schema=schema).show()  # verifySchema defaults to "spark.sql.execution.pandas.convertToArrowArraySafely"
+---+-----------+
| id|      value|
+---+-----------+
|  1| 1215752192|
|  2|-1863462912|
|  3| -647710720|
+---+-----------+
>>> spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", True)
>>> spark.createDataFrame(pdf, schema=schema).show()
...
pyspark.errors.exceptions.base.PySparkValueError: [VALUE_OUT_OF_BOUNDS] Value for `obj` must be between -2147483648 and 2147483647 (inclusive), got 100000000000
>>> spark.createDataFrame(table, schema=schema, verifySchema=True).show()
...
pyarrow.lib.ArrowInvalid: Integer value 100000000000 not in range: -2147483648 to 2147483647
```
### How was this patch tested?
Unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48677 from xinrong-meng/arrowSafe.
Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ationNameToId` outside of cases
### What changes were proposed in this pull request?
In this PR, the UTF8_BINARY performance regression first identified in apache#48721 is addressed. The regression is traced back to apache#48222, when it first occurred; however, that PR isn't the actual source of the performance degradation.
### Why are the changes needed?
The PR apache#48222 caused the regression because it changed the `collationNameToId` function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However, this function should be called a fixed number of times per query, and from the benchmark framework at most once. This was not the case, and it was the largest contributor to the performance regression. This PR changes the benchmarking framework to call this function once per test case instead of at each expression.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing testing surface, benchmarks.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48804 from stevomitric/stevomitric/fix-utf8_binary-regression.
Authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
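The shape of the fix, as a hedged illustration; the harness below is invented (it is not Spark's benchmark framework), and the point is only where the lookup happens:
```scala
object BenchmarkShape {
  // Stand-in for the (now slightly slower) name-to-id lookup.
  def collationNameToId(name: String): Int = name.toUpperCase.hashCode

  // Regression shape: the lookup runs for every value under measurement.
  def perRow(values: Seq[String], collation: String): Long =
    values.map(v => compare(v, collationNameToId(collation))).sum

  // Fixed shape: the lookup is hoisted and runs once per test case.
  def perCase(values: Seq[String], collation: String): Long = {
    val collationId = collationNameToId(collation)
    values.map(v => compare(v, collationId)).sum
  }

  private def compare(v: String, collationId: Int): Long = v.length.toLong + collationId
}
```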
…ryExecutionMetrics`'s logs clearer
### What changes were proposed in this pull request?
The PR aims to add a `name` to `RuleExecutor` to make the printed `QueryExecutionMetrics` logs clearer. Otherwise, output like the following is meaningless, since there is no way to tell which `RuleExecutor`'s metrics are being reported:
```shell
24/10/29 15:12:33 WARN PlanChangeLogger:
=== Metrics of Executed Rules ===
Total number of runs: 100
Total time: 0.8585 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms

24/10/29 15:12:33 WARN PlanChangeLogger:
=== Metrics of Executed Rules ===
Total number of runs: 196
Total time: 0.78946 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms
```
### Why are the changes needed?
There are many similar outputs printed in the log, and it is difficult for Spark developers to know which `RuleExecutor` generated them.
- Before:
```shell
=== Metrics of Executed Rules ===
Total number of runs: 199
Total time: 1.394873 ms
Total number of effective runs: 2
Total time of effective runs: 0.916459 ms

=== Metrics of Executed Rules ===
Total number of runs: 196
Total time: 0.525134 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms

=== Metrics of Executed Rules ===
Total number of runs: 1
Total time: 0.00175 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms

=== Metrics of Executed Rules ===
Total number of runs: 166
Total time: 0.876414 ms
Total number of effective runs: 1
Total time of effective runs: 0.130166 ms

=== Metrics of Executed Rules ===
Total number of runs: 1
Total time: 0.007375 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms
```
- After:
```shell
=== Metrics of Executed Rules org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 ===
Total number of runs: 199
Total time: 32.982158 ms
Total number of effective runs: 2
Total time of effective runs: 32.067459 ms

=== Metrics of Executed Rules org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 ===
Total number of runs: 196
Total time: 0.630705 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms

=== Metrics of Executed Rules org.apache.spark.sql.catalyst.expressions.codegen.package$ExpressionCanonicalizer ===
Total number of runs: 1
Total time: 0.105459 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms

=== Metrics of Executed Rules org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 ===
Total number of runs: 166
Total time: 2.308457 ms
Total number of effective runs: 1
Total time of effective runs: 1.22025 ms

=== Metrics of Executed Rules org.apache.spark.sql.catalyst.expressions.codegen.package$ExpressionCanonicalizer ===
Total number of runs: 1
Total time: 0.009166 ms
Total number of effective runs: 0
Total time of effective runs: 0.0 ms
```
### Does this PR introduce _any_ user-facing change?
Yes. When Spark developers observe the logs printed by `PlanChangeLogger#logMetrics`, their meaning becomes clearer.
### How was this patch tested?
Manual check.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48688 from panbingkun/SPARK-50153.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
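A simplified sketch of the change; this is not Spark's actual `RuleExecutor`, and the logging shape is condensed to the part that matters here:
```scala
abstract class RuleExecutorSketch {
  // New: identifies which executor produced the metrics; defaults to the class name.
  def name: String = this.getClass.getName

  def formatMetrics(runs: Long, totalMs: Double, effRuns: Long, effMs: Double): String =
    s"""=== Metrics of Executed Rules $name ===
       |Total number of runs: $runs
       |Total time: $totalMs ms
       |Total number of effective runs: $effRuns
       |Total time of effective runs: $effMs ms""".stripMargin
}
```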
### What changes were proposed in this pull request?
This PR continues the work from apache#43064 and apache#45801 to support Hive Metastore Server 4.0. CHAR/VARCHAR type partition filter pushdown is not included in this PR, as it requires further investment.
### Why are the changes needed?
Enhance the multiple Hive metastore server support feature.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passing HiveClient*Suites w/ 4.0.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48823 from yaooqinn/SPARK-45265.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade ORC to 2.0.3 for Apache Spark 4.0.0.
### Why are the changes needed?
To bring the latest bug fixes:
- https://github.com/apache/orc/releases/tag/v2.0.3
- https://orc.apache.org/news/2024/11/14/ORC-2.0.3/
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48846 from dongjoon-hyun/SPARK-50317.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…cate code between interpreted and codegen
### What changes were proposed in this pull request?
It's a refactoring of existing code: add `makeYearMonthInterval` and remove duplicated code from `MakeDTInterval`.
### Why are the changes needed?
Better code quality.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Using existing tests.
### Was this patch authored or co-authored using generative AI tooling?
Yes, Copilot was used.
Closes apache#48848 from gotocoding-DB/deduplicate-code.
Authored-by: Ruzel Ibragimov <ruzel.ibragimov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rror when kerberos is true
### What changes were proposed in this pull request?
When Kerberos is enabled and SparkThriftServer is started, Hadoop authentication errors occur because the keytab and principal arguments are passed to `createServer` in swapped order: the call passes `(principal, keytab)` while the signature expects `(keytabFile, principalConf)`.
```java
saslServer = ShimLoader.getHadoopThriftAuthBridge().createServer(principal, keytab);
```
```java
public Server createServer(String keytabFile, String principalConf) throws TTransportException {
  return new Server(keytabFile, principalConf);
}
```
### Why are the changes needed?
Failed to start SparkThriftServer when kerberos is true
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually verified.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#48855 from CuiYanxiang/SPARK-50312.
Authored-by: cuiyanxiang <kaer@startdt.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Add a `reportDriverMetrics` method to the `Write` API and post custom metrics from the driver after a v2 write commits.
### Why are the changes needed?
apache#37205 supported reporting custom driver metrics when reading from a v2 table. This is to support the same when writing to a v2 table.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
UT.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48573 from manuzhang/v2write-metrics.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Zhang, Manu <tianlzhang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
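A hedged sketch of what the new hook enables; the trait below mimics, but is not, Spark's `org.apache.spark.sql.connector` interfaces, and the exact method signature is assumed from the description:
```scala
// Mimics the DSv2 metric shape for illustration; not the real Spark interfaces.
trait CustomTaskMetric { def name(): String; def value(): Long }

class ExampleWrite {
  private var committedFiles = 0L

  def commit(): Unit = {
    committedFiles = 42L // driver-side bookkeeping performed during commit
  }

  // Assumed per the PR: called on the driver after the v2 write commits, so the
  // connector can surface driver-side numbers alongside executor task metrics.
  def reportDriverMetrics(): Array[CustomTaskMetric] = Array(
    new CustomTaskMetric {
      override def name(): String = "numCommittedFiles"
      override def value(): Long = committedFiles
    }
  )
}
```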
…ROR_TEMP_2138-9`: `CIRCULAR_CLASS_REFERENCE`
### What changes were proposed in this pull request?
This PR proposes to integrate `_LEGACY_ERROR_TEMP_2138-9` into `CIRCULAR_CLASS_REFERENCE`.
### Why are the changes needed?
To improve the error message by assigning a proper error condition and SQLSTATE.
### Does this PR introduce _any_ user-facing change?
No, only the user-facing error message is improved.
### How was this patch tested?
Updated the existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48769 from itholic/LEGACY_2139.
Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Fix some code style issues found in `if/for/while` statements.
### Why are the changes needed?
Fix code style for `if/for/while` statements.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48425 from exmy/fix-style.
Authored-by: exmy <xumovens@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ngle-pass Analyzer
### What changes were proposed in this pull request?
Factor out the alias resolution code to the `AliasResolution` object.
### Why are the changes needed?
Some Analyzer code will be used in both the fixed-point and single-pass Analyzers. Also, Analyzer.scala is 4K+ lines long, so it makes sense to gradually split it. Context: https://issues.apache.org/jira/browse/SPARK-49834
### Does this PR introduce _any_ user-facing change?
No. It's a pure refactoring.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48857 from vladimirg-db/vladimirg-db/refactor-alias-resolution.
Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to postpone parameter resolution until `UnresolvedWithCTERelations` is resolved.
### Why are the changes needed?
To fix the query failure:
```sql
execute immediate 'with v1 as (select * from tt1 where 1 = (Select * from identifier(:tab))) select * from v1' using 'tt1' as tab;

[UNBOUND_SQL_PARAMETER] Found the unbound parameter: tab. Please, fix `args` and provide a mapping of the parameter to either a SQL literal or collection constructor functions such as `map()`, `array()`, `struct()`. SQLSTATE: 42P02
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running the new test:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.ParametersSuite"
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48847 from MaxGekk/fix-parameter-subquery.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
… single-pass Analyzer
### What changes were proposed in this pull request?
Factor out the function resolution code to the `FunctionResolution` object.
### Why are the changes needed?
Some Analyzer code will be used in both the fixed-point and single-pass Analyzers. Also, Analyzer.scala is 4K+ lines long, so it makes sense to gradually split it. Context: https://issues.apache.org/jira/browse/SPARK-49834
### Does this PR introduce _any_ user-facing change?
No. It's a pure refactoring.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48858 from vladimirg-db/vladimirg-db/refactor-function-resolution.
Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Fix a missing semicolon in the CREATE TABLE example SQL.
### Why are the changes needed?
The CREATE TABLE example SQL in the docs is missing a semicolon.
### Does this PR introduce _any_ user-facing change?
Yes, the patch fixes the missing semicolon in the docs SQL.
### How was this patch tested?
Manually, by inspecting the generated docs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48916 from camilesing/fix_docs_miss_semicolon.
Authored-by: camilesing <camilesing@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…e rebase mode settings
### What changes were proposed in this pull request?
The current default values for these settings have been changed from EXCEPTION to CORRECTED.
### Why are the changes needed?
Doc fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passing existing CI.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48919 from yaooqinn/minor.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `commons-io` from `2.17.0` to `2.18.0`.
### Why are the changes needed?
The full release notes: https://commons.apache.org/proper/commons-io/changes-report.html#a2.18.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48910 from panbingkun/SPARK-50375.
Authored-by: panbingkun <panbingkun@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
… PB file
### What changes were proposed in this pull request?
The PR aims to:
- extract the common logic for reading the descriptor of a PB file to one place;
- align the Spark error condition thrown when the PB file is not found or the read fails, whether the `from_protobuf`/`to_protobuf` functions are used from `connect-client` or from `spark-sql` (or `spark-shell`).
### Why are the changes needed?
I found that the logic for reading the descriptor of a PB file is scattered in various places in the Spark code repository, e.g.:
https://github.com/apache/spark/blob/a01856de20013e5551d385ee000772049a0e1bc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/toFromProtobufSqlFunctions.scala#L37-L48
https://github.com/apache/spark/blob/a01856de20013e5551d385ee000772049a0e1bc0/sql/api/src/main/scala/org/apache/spark/sql/protobuf/functions.scala#L304-L315
https://github.com/apache/spark/blob/a01856de20013e5551d385ee000772049a0e1bc0/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala#L231-L241
- I think we should gather it together to reduce the cost of maintenance.
- Aligning the Spark error condition improves consistency in the end-user experience.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48874 from panbingkun/SPARK-50334.
Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
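A hedged sketch of the consolidation; the helper and condition names below are illustrative, not the PR's actual identifiers:
```scala
import java.io.IOException
import java.nio.file.{Files, NoSuchFileException, Paths}

object ProtobufDescriptorLoader {
  // One shared routine reads the .desc file and maps failures to a single,
  // named error condition, replacing the three scattered copies linked above.
  def readDescriptorFileContent(filePath: String): Array[Byte] =
    try Files.readAllBytes(Paths.get(filePath))
    catch {
      case e: NoSuchFileException =>
        throw new IllegalArgumentException(
          s"[PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND] Descriptor file not found: $filePath", e)
      case e: IOException =>
        throw new IllegalArgumentException(
          s"[CANNOT_LOAD_PROTOBUF_DESCRIPTOR] Error reading descriptor file: $filePath", e)
    }
}
```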
… MsSqlServer and future connectors
### What changes were proposed in this pull request?
This PR proposes to propagate the `isPredicate` info in `V2ExpressionBuilder` and wrap the children of a CASE WHEN expression (only `Predicate`s) with `IIF(<>, 1, 0)` for MsSqlServer. This is done to force returning an int instead of a boolean, as SqlServer cannot handle boolean expressions as a return type in CASE WHEN. E.g.:
```
CASE WHEN ... ELSE a = b END
```
Old behavior:
```
CASE WHEN ... ELSE a = b END = 1
```
New behavior: since in SqlServer a `= 1` is appended to the CASE WHEN, the THEN and ELSE blocks must return an int. Therefore, the final expression becomes:
```
CASE WHEN ... ELSE IIF(a = b, 1, 0) END = 1
```
### Why are the changes needed?
A user cannot work with MsSqlServer data using CASE WHEN clauses or IF clauses if they wish to return a boolean value.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests to MsSqlServerIntegrationSuite.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48621 from andrej-db/SPARK-50087-CaseWhen.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: andrej-db <andrej.gobeljic@databricks.com>
Co-authored-by: Andrej Gobeljić <andrej.gobeljic@databricks.com>
Co-authored-by: andrej-gobeljic_data <andrej.gobeljic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
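A simplified model of the rewrite, using toy expression classes rather than Spark's `V2ExpressionBuilder`: when a CASE WHEN branch result is itself a predicate, the MsSqlServer dialect wraps it as `IIF(p, 1, 0)` so every branch yields an int.
```scala
sealed trait Expr { def sql: String }
final case class Col(name: String) extends Expr { def sql: String = name }
final case class Lit(value: Int) extends Expr { def sql: String = value.toString }
// The only predicate modeled here: something that evaluates to a boolean.
final case class EqualTo(left: Expr, right: Expr) extends Expr {
  def sql: String = s"${left.sql} = ${right.sql}"
}

object MsSqlCaseWhen {
  // Predicates cannot be a CASE WHEN return value on SqlServer; wrap them as ints.
  private def branch(e: Expr): String = e match {
    case p: EqualTo => s"IIF(${p.sql}, 1, 0)"
    case other      => other.sql
  }

  def build(cond: EqualTo, thenE: Expr, elseE: Expr): String =
    s"CASE WHEN ${cond.sql} THEN ${branch(thenE)} ELSE ${branch(elseE)} END = 1"
}

// MsSqlCaseWhen.build(EqualTo(Col("x"), Lit(1)), Lit(0), EqualTo(Col("a"), Col("b")))
// => "CASE WHEN x = 1 THEN 0 ELSE IIF(a = b, 1, 0) END = 1"
```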
### What changes were proposed in this pull request?
This PR aims to support `spark.master.rest.maxThreads`.
### Why are the changes needed?
To provide users a way to control the maximum number of threads of the REST API. Previously, Apache Spark used a default constructor whose value is fixed to `200` always.
https://github.com/apache/spark/blob/2e1c3dc8004b4f003cde8dfae6857f5bef4bb170/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala#L94
https://github.com/jetty/jetty.project/blob/5dfc59a691b748796f922208956bd1f2794bcd16/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L118-L121
### Does this PR introduce _any_ user-facing change?
No, the default value of the new configuration is identical to the previously used Jetty default.
### How was this patch tested?
Pass the CIs with a newly added test case.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48921 from dongjoon-hyun/SPARK-50381.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
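A minimal sketch of the wiring; the surrounding server bootstrap is simplified (the config key comes from the PR, the rest is illustrative). The point is constructing Jetty's `QueuedThreadPool` with the configured maximum instead of the no-arg constructor's fixed 200:
```scala
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.util.thread.QueuedThreadPool

object RestServerSketch {
  // maxThreads would be read from `spark.master.rest.maxThreads` (default 200).
  def newServer(maxThreads: Int): Server = {
    val pool = new QueuedThreadPool(maxThreads) // was: new QueuedThreadPool(), fixed at 200
    pool.setDaemon(true)
    new Server(pool)
  }
}
```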
cloud-fan pushed a commit that referenced this pull request on Jul 9, 2025:
…pressions in `buildAggExprList`
### What changes were proposed in this pull request?
Trim aliases before matching Sort/Having/Filter expressions with a semantically equal expression from the Aggregate below in `buildAggExprList`.
### Why are the changes needed?
For a query like:
```
SELECT course, year, GROUPING(course) FROM courseSales GROUP BY CUBE(course, year) ORDER BY GROUPING(course)
```
the plan after `ResolveReferences` and before `ResolveAggregateFunctions` looks like:
```
!Sort [cast((shiftright(tempresolvedcolumn(spark_grouping_id#18L, spark_grouping_id, false), 1) & 1) as tinyint) AS grouping(course)#22 ASC NULLS FIRST], true
+- Aggregate [course#19, year#20, spark_grouping_id#18L], [course#19, year#20, cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint) AS grouping(course)#21 AS grouping(course)#15]
....
```
Because the aggregate list has an `Alias(Alias(cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint)))` entry, the expression from `SortOrder` won't get matched as semantically equal, and this results in adding an unnecessary `Project`. By stripping inner aliases from the aggregate list (they would get removed anyway in `CleanupAliases`), we can match the `SortOrder` expression and resolve it as `grouping(course)#15`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#51339 from mihailotim-db/mihailotim-db/fix_inner_aliases_semi_structured.
Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
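A simplified model of the fix, with toy expression classes; it mirrors the alias-stripping idea (Spark has a similar `trimAliases` helper in its Catalyst `AliasHelper`), not the actual `buildAggExprList` code:
```scala
sealed trait Expression
final case class Alias(child: Expression, name: String) extends Expression
final case class Grouping(column: String) extends Expression

object AliasTrim {
  // Recursively remove alias wrappers; they don't affect semantics here.
  def trimAliases(e: Expression): Expression = e match {
    case Alias(child, _) => trimAliases(child)
    case other           => other
  }

  def semanticallyEqual(a: Expression, b: Expression): Boolean =
    trimAliases(a) == trimAliases(b)
}

// A doubly-aliased aggregate-list entry now matches the bare ORDER BY expression,
// so no extra Project is added:
// AliasTrim.semanticallyEqual(
//   Alias(Alias(Grouping("course"), "grouping(course)"), "grouping(course)"),
//   Grouping("course"))  // => true
```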