[pull] master from apache:master #43
Merged
…ases ### What changes were proposed in this pull request? This PR aims to add JVM GC option test coverage to K8s Integration Suite. To reuse the existing code, `isG1GC` variable is moved from `MemoryManager` to `Utils`. ### Why are the changes needed? To provide more test coverage for JVM Options. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs ``` [info] KubernetesSuite: [info] - SPARK-42190: Run SparkPi with local[*] (4 seconds, 990 milliseconds) [info] - Run SparkPi with no resources (7 seconds, 101 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (7 seconds, 27 milliseconds) [info] - Run SparkPi with a very long application name. (7 seconds, 100 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (7 seconds, 947 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (6 seconds, 932 milliseconds) [info] - Run SparkPi with an argument. (9 seconds, 47 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (6 seconds, 969 milliseconds) [info] - All pods have the same service account by default (6 seconds, 916 milliseconds) [info] - Run extraJVMOptions check on driver (3 seconds, 964 milliseconds) [info] - Run extraJVMOptions JVM GC option check - G1GC (3 seconds, 948 milliseconds) [info] - Run extraJVMOptions JVM GC option check - Other GC (4 seconds, 51 milliseconds) ... ``` Closes #40062 from dongjoon-hyun/SPARK-42474. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
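For the GC test coverage change above, here is a hedged sketch of what a G1 detection helper like the relocated `isG1GC` flag can look like. It is not the actual Spark code; it only assumes the standard JMX garbage collector beans and Scala 2.13 collection converters, in a spark-shell style session.
```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Hedged sketch only: detect a G1 collector from the GC MXBean names
// (e.g. "G1 Young Generation"). The real `Utils.isG1GC` may differ in detail.
lazy val isG1GC: Boolean =
  ManagementFactory.getGarbageCollectorMXBeans.asScala
    .exists(_.getName.contains("G1"))
```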
### What changes were proposed in this pull request? This PR adds the following functions to Spark Connect Scala Client: - Sort Functions - Aggregate Functions - Misc Functions - Math Functions ### Why are the changes needed? We want the Spark Connect Scala Client to reach parity with the original functions API. ### Does this PR introduce _any_ user-facing change? Yes, it adds a lot of functions. ### How was this patch tested? Added tests for all functions and their significant variations. Closes #40050 from hvanhovell/SPARK-42461. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
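For orientation, a minimal sketch of how these function groups are exercised from the Scala client; it assumes an active `spark` session, and the column choices are illustrative only.
```scala
import org.apache.spark.sql.functions._

val df = spark.range(10).toDF("id")
df.sort(desc("id")).show()                          // sort function
df.select(sum_distinct(col("id"))).show()           // aggregate function
df.select(sqrt(col("id")), hash(col("id"))).show()  // math and misc functions
```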
### What changes were proposed in this pull request? Adding more APIs to `agg`, including max, min, mean, count, avg, and sum. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40070 from amaliujia/rw-agg2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
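A small sketch of the `agg` shorthands described above, assuming an active `spark` session; the key/value schema is made up for the example.
```scala
import org.apache.spark.sql.functions._

val df = spark.range(10).selectExpr("id % 2 AS key", "id AS value")

// String-based aggregate shorthands...
df.groupBy("key").agg("value" -> "max", "value" -> "avg", "value" -> "count").show()
// ...and the equivalent Column-based form.
df.groupBy("key").agg(max(col("value")), avg(col("value")), count(col("value"))).show()
```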
### What changes were proposed in this pull request? This PR cleans up unused declarations in the Hive module: - Input parameter `dataTypes` of the `HiveInspectors#wrap` method: the input parameter `dataTypes` was introduced by SPARK-9354, but after SPARK-17509 the implementation of `HiveInspectors#wrap` no longer needs `dataTypes` to be passed explicitly, so it became unused, and `inputDataTypes` in `HiveSimpleUDF` also becomes unused after this PR - `UNLIMITED_DECIMAL_PRECISION` and `UNLIMITED_DECIMAL_SCALE` in `HiveShim`: these two `val`s were introduced by SPARK-6909 for unlimited decimals, but SPARK-9069 removed unlimited precision support for DecimalType and SPARK-14877 deleted `object HiveMetastoreTypes` in favor of `.catalogString`, so these two `val`s are not used anymore. ### Why are the changes needed? Code clean up. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #40053 from LuciferYang/sql-hive-unused. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: huaxingao <huaxin_gao@apple.com>
### What changes were proposed in this pull request? This aims to regenerate benchmark results on the `master` branch as a baseline for Spark 3.5.0 and a way of comparing against the Apache Spark 3.4.0 branch. ### Why are the changes needed? These are reference values with minor changes. ``` - OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure + OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure - OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure + OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure - OpenJDK 64-Bit Server VM 17.0.5+8 on Linux 5.15.0-1023-azure + OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. Closes #40072 from dongjoon-hyun/SPARK-42483. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR adds a new error class `INVALID_TEMP_OBJ_REFERENCE` and replaces two existing error classes with this new one: - _LEGACY_ERROR_TEMP_1283 - _LEGACY_ERROR_TEMP_1284 ### Why are the changes needed? To improve the error messages. ### Does this PR introduce _any_ user-facing change? Yes, the PR changes a user-facing error message. ### How was this patch tested? Existing unit tests. Closes #39910 from allisonwang-db/spark-42337-persistent-over-temp-err. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ANSI interval types ### What changes were proposed in this pull request? As #40005 (review) pointed out, the Java doc for data types recommends using the factory methods provided in org.apache.spark.sql.types.DataTypes. Since the docs for the ANSI interval types also miss the `DataTypes` factory methods, this PR revises their docs as well. ### Why are the changes needed? Unify the data type doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Local preview <img width="826" alt="image" src="https://user-images.githubusercontent.com/1097932/219821685-321c2fd1-6248-4930-9c61-eec68f0dcb50.png"> Closes #40074 from gengliangwang/reviseNTZDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? This PR proposes to assign the name "UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY" to _LEGACY_ERROR_TEMP_2332. ### Why are the changes needed? We should assign proper names to _LEGACY_ERROR_TEMP_* errors. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"` Closes #39977 from itholic/LEGACY_2332. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? Implemented the basic Dataset#write API to allow users to write the DataFrame into tables, CSV files, etc. ### Why are the changes needed? Basic write operation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Integration tests. Closes #40061 from zhenlineo/write. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
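A hedged sketch of the basic write API described above; the path and table name are hypothetical, and an active `spark` session is assumed.
```scala
val df = spark.range(5).toDF("id")

// Write to a file-based source...
df.write.format("csv").mode("overwrite").save("/tmp/example_csv")
// ...or save as a table in the current catalog.
df.write.mode("overwrite").saveAsTable("example_table")
```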
…ataFrame.write ### What changes were proposed in this pull request? Enables a doctest for `DataFrame.write`. ### Why are the changes needed? Now that `DataFrame.write.saveAsTable` has been fixed, we can enable the doctest. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled the doctest. Closes #40071 from ueshin/issues/SPARK-41818/doctest. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This is the scala version of #39925. We introduce a plan_id that is both used for each plan created by the scala client, and by the columns created when calling `Dataframe.col(..)` and `Dataframe.apply(..)`. This way we can later properly resolve the columns created for a specific Dataframe. ### Why are the changes needed? Joining columns created using Dataframe.apply(...) does not work when the column names are ambiguous. We should be able to figure out where a column comes from when they are created like this. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated golden files. Added test case to ClientE2ETestSuite. Closes #40156 from hvanhovell/SPARK-41823. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
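To illustrate the kind of query the plan_id change above enables, here is a sketch of a join whose column names are ambiguous unless each column remembers which Dataframe it came from; the data and names are made up, and an active `spark` session is assumed.
```scala
val left = spark.range(5).toDF("id")
val right = spark.range(3).toDF("id")

// Both sides expose a column named "id"; the plan id carried by left("id") and
// right("id") is what lets the server resolve each side unambiguously.
left.join(right, left("id") === right("id")).show()
```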
…ocksDB state store instance ``` "23/02/23 23:57:44 INFO Executor: Running task 2.0 in stage 57.1 (TID 363) "23/02/23 23:58:44 ERROR RocksDB StateStoreId(opId=0,partId=3,name=default): RocksDB instance could not be acquired by [ThreadId: Some(49), task: 3.0 in stage 57, TID 363] as it was not released by [ThreadId: Some(51), task: 3.1 in stage 57, TID 342] after 60002 ms. ``` We are seeing those error messages for a testing query. The `taskId != partitionId` but we fail to be clear on this in the error log. It's confusing when we see those logs: the second log entry seems to talk about `task 3.0` (it's actually partition 3 and retry attempt 0), but the `TID 363` is already occupied by `task 2.0 in stage 57.1`. Also, it's unclear at which stage retry attempt, the lock is acquired (or fails to be acquired) ### What changes were proposed in this pull request? * add `partition ` after `task: ` in the log message for clarification * add stage attempt to distinguish different stage retries. ### Why are the changes needed? improve the log message for a better debuggability ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? only log message change Closes #40161 from huanliwang-db/rocksdb. Authored-by: Huanli Wang <huanli.wang@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request? Implement `DataFrame.mapInPandas` and enable parity tests to vanilla PySpark. A proto message `FrameMap` is introduced for `mapInPandas` and `mapInArrow` (to be implemented next). ### Why are the changes needed? To reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. `DataFrame.mapInPandas` is supported. An example is as shown below. ```py >>> df = spark.range(2) >>> def filter_func(iterator): ... for pdf in iterator: ... yield pdf[pdf.id == 1] ... >>> df.mapInPandas(filter_func, df.schema) DataFrame[id: bigint] >>> df.mapInPandas(filter_func, df.schema).show() +---+ | id| +---+ | 1| +---+ ``` ### How was this patch tested? Unit tests. Closes #40104 from xinrong-meng/mapInPandas. Lead-authored-by: Xinrong Meng <xinrong@apache.org> Co-authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org>
### What changes were proposed in this pull request? The PR aims to: > 1. Fix the typo 'WriteOperaton -> WriteOperation' for `SaveModeConverter`. > 2. Update the comment. ### Why are the changes needed? Fix typos. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #40158 from panbingkun/MINOR_CONNECT_TYPO. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? Implements `SparkSession.conf`. Took #39995 over. ### Why are the changes needed? `SparkSession.conf` is a missing feature. ### Does this PR introduce _any_ user-facing change? Yes, `SparkSession.conf` will be available. ### How was this patch tested? Added/enabled related tests. Closes #40150 from ueshin/issues/SPARK-41834/conf. Lead-authored-by: Takuya UESHIN <ueshin@databricks.com> Co-authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
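For context, the runtime config surface this exposes mirrors the classic session API; a minimal sketch, with the config key chosen only as an example and an active `spark` session assumed.
```scala
// Set and read back a runtime SQL config through SparkSession.conf.
spark.conf.set("spark.sql.shuffle.partitions", "10")
println(spark.conf.get("spark.sql.shuffle.partitions"))
```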
…s properly while planning ### What changes were proposed in this pull request? Fixes `SparkConnectStreamHandler` to handle configs properly while planning. The whole process should be done in `session.withActive` to take the proper `SQLConf` into account. ### Why are the changes needed? Some components for planning need to check configs in `SQLConf.get` while building the plan, but currently they are unavailable. For example, `spark.sql.legacy.allowNegativeScaleOfDecimal` needs to be checked when constructing `DecimalType`, but it's not set while planning, thus it causes an error when trying to cast to `DecimalType(1, -1)` with the config set to `"true"`: ``` [INTERNAL_ERROR] Negative scale is not allowed: -1. Set the config "spark.sql.legacy.allowNegativeScaleOfDecimal" to "true" to allow it. ``` ### Does this PR introduce _any_ user-facing change? The configs will take effect while planning. ### How was this patch tested? Enabled a related test. Closes #40165 from ueshin/issues/SPARK-42568/withActive. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? Support Pivot with provided pivot column values. Pivot without provided column values is not supported, because that requires a max-values check, which depends on the implementation of Spark configuration in Spark Connect. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40145 from amaliujia/rw-pivot. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
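A sketch of the supported form, pivoting on explicitly provided values; the schema and values are hypothetical, and an active `spark` session with implicits is assumed.
```scala
import spark.implicits._

val sales = Seq((2022, "dotNET", 100), (2022, "Java", 200), (2023, "dotNET", 150))
  .toDF("year", "course", "earnings")

// Pivot values are provided up front, so no max-values check is required.
sales.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings").show()
```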
…fter getting input iterator from inputRDD The current behavior of the `compute` method in both `StateStoreRDD` and `ReadStateStoreRDD` is: we first get the state store instance and then get the input iterator for the inputRDD. For RocksDB state store, the running task will acquire and hold the lock for this instance. The retried task or speculative task will fail to acquire the lock and eventually abort the job if there are some network issues. For example, When we shrink the executors, the alive one will try to fetch data from the killed ones because it doesn't know the target location (prefetched from the driver) is dead until it tries to fetch data. The query might be hanging for a long time as the executor will retry `spark.shuffle.io.maxRetries=3` times and for each retry wait for `spark.shuffle.io.connectionTimeout` (default value is 120s) before timeout. In total, the task could be hanging for about 6 minutes. And the retried or speculative tasks won't be able to acquire the lock in this period. Making lock acquisition happen after retrieving the input iterator should be able to avoid this situation. ### What changes were proposed in this pull request? Making lock acquisition happen after retrieving the input iterator. ### Why are the changes needed? Avoid the failure like the following when there is a network issue ``` java.lang.IllegalStateException: StateStoreId(opId=0,partId=3,name=default): RocksDB instance could not be acquired by [ThreadId: Some(47), task: 3.1 in stage 57, TID 793] as it was not released by [ThreadId: Some(51), task: 3.1 in stage 57, TID 342] after 60003 ms. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing UT should be good enough Closes #40162 from huanliwang-db/rocksdb-2. Authored-by: Huanli Wang <huanli.wang@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…og warning if it exceeds threshold ### What changes were proposed in this pull request? Track load time for state store provider and log warning if it exceeds threshold ### Why are the changes needed? We have seen that the initial state store provider load can be blocked by external factors such as filesystem initialization. This log enables us to track cases where this load takes too long and we log a warning in such cases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Augmented some of the tests to verify the logging is working as expected. Sample logs: ``` 14:58:51.784 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Loaded state store provider in loadTimeMs=2049 for storeId=StateStoreId[ checkpointRootLocation=file:/Users/anish.shrigondekar/spark/spark/target/tmp/streaming.metadata-1f2ff296-1ece-4a0c-b4b4-48aa0e909b49/ state, operatorId=0, partitionId=2, storeName=default ] and queryRunId=a4063603-3929-4340-9920-eca206ebec36 14:58:53.838 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Loaded state store provider in loadTimeMs=2046 for storeId=StateStoreId[ checkpointRootLocation=file:/Users/anish.shrigondekar/spark/spark/target/tmp/streaming.metadata-1f2ff296-1ece-4a0c-b4b4-48aa0e909b49/ state, operatorId=0, partitionId=3, storeName=default ] and queryRunId=a4063603-3929-4340-9920-eca206ebec36 14:58:55.885 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Loaded state store provider in loadTimeMs=2044 for storeId=StateStoreId[ checkpointRootLocation=file:/Users/anish.shrigondekar/spark/spark/target/tmp/streaming.metadata-1f2ff296-1ece-4a0c-b4b4-48aa0e909b49/ state, operatorId=0, partitionId=4, storeName=default ] and queryRunId=a4063603-3929-4340-9920-eca206ebec36 ``` Closes #40163 from anishshri-db/task/SPARK-42567. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
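The underlying pattern in the change above is simply timing the load and warning past a threshold; a self-contained sketch follows, where the helper name and threshold are placeholders rather than Spark's actual code.
```scala
// Hypothetical helper illustrating the pattern: time a load and warn past a threshold.
def timedLoad[T](label: String, warnThresholdMs: Long = 2000L)(load: => T): T = {
  val start = System.currentTimeMillis()
  val result = load
  val elapsedMs = System.currentTimeMillis() - start
  if (elapsedMs > warnThresholdMs) {
    println(s"WARN: loaded $label in loadTimeMs=$elapsedMs")
  }
  result
}

// Example: a deliberately slow "load" that triggers the warning.
val provider = timedLoad("state store provider") { Thread.sleep(2100); "provider" }
```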
…l major client APIs ### What changes were proposed in this pull request? Make binary compatibility check for SparkSession/Dataset/Column/functions etc. ### Why are the changes needed? Help us to have a good understanding of the current API coverage of the Scala client. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #40168 from zhenlineo/comp-it. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
…nnectFunSuite ### What changes were proposed in this pull request? Make all client tests extend ConnectFunSuite to avoid `// scalastyle:ignore funsuite` when extending directly from `AnyFunSuite`. ### Why are the changes needed? Simple dev work. ### Does this PR introduce _any_ user-facing change? No. Test only. ### How was this patch tested? n/a Closes #40169 from zhenlineo/life-savor. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? Add temp view API to Dataset ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40167 from amaliujia/add_temp_view_api. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
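A small usage sketch of the temp view API above; the view name is arbitrary, and an active `spark` session is assumed.
```scala
val df = spark.range(3).toDF("id")

// Register the Dataset as a temporary view and query it through SQL.
df.createOrReplaceTempView("ids")
spark.sql("SELECT * FROM ids WHERE id > 0").show()
```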
… API ### What changes were proposed in this pull request? Match https://github.com/apache/spark/blob/6a2433070e60ad02c69ae45706a49cdd0b88a082/python/pyspark/sql/connect/dataframe.py#L1500 to throw unsupported exceptions in the Scala client. ### Why are the changes needed? Better indicates that an API is not supported yet. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A Closes #40164 from amaliujia/unsupported_op. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
… types ### What changes were proposed in this pull request? This PR aims to add support for more types in the `sql.functions#lit` function, including: - Decimal - Instant - Timestamp - LocalDateTime - Date - Duration - Period - CalendarInterval ### Why are the changes needed? Make the `sql.functions#lit` function support more types ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Add new test - Manually checked new cases with Scala 2.13 Closes #40143 from LuciferYang/functions-lit. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
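A hedged sketch covering several of the newly supported literal types; the values are arbitrary and an active `spark` session is assumed.
```scala
import java.time.{Duration, Instant, LocalDateTime, Period}
import org.apache.spark.sql.functions._

spark.range(1).select(
  lit(BigDecimal("1.23")),   // Decimal
  lit(Instant.now()),        // Instant -> TIMESTAMP
  lit(LocalDateTime.now()),  // LocalDateTime -> TIMESTAMP_NTZ
  lit(Duration.ofDays(1)),   // Duration -> day-time interval
  lit(Period.ofMonths(2))    // Period -> year-month interval
).show(truncate = false)
```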
… source ### What changes were proposed in this pull request? Fixes `DataFrameReader` to use the default source. ### Why are the changes needed? ```py spark.read.load(path) ``` should work and use the default source without specifying the format. ### Does this PR introduce _any_ user-facing change? The `format` doesn't need to be specified. ### How was this patch tested? Enabled related tests. Closes #40166 from ueshin/issues/SPARK-42570/reader. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
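In practice this means a load without a format falls back to the default source configured by `spark.sql.sources.default` (Parquet unless overridden); a sketch with a hypothetical path.
```scala
// No .format(...) call: the default source is used.
val df = spark.read.load("/tmp/example_parquet")
df.printSchema()
```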
### What changes were proposed in this pull request? Add `groupBy(col1: String, cols: String*)` to Scala client Dataset API. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40173 from amaliujia/2nd_groupby. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
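Usage sketch for the string-based `groupBy` overload above; column names are hypothetical and an active `spark` session is assumed.
```scala
val df = spark.range(10).selectExpr("id % 2 AS a", "id % 3 AS b", "id AS v")

// Group by one or more column names given as strings.
df.groupBy("a", "b").count().show()
```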
…ion/order in subquery
### What changes were proposed in this pull request?
Extend the CollapseWindow rule to collapse adjacent Window nodes when one Window is in a subquery.
### Why are the changes needed?
```
select a, b, c, row_number() over (partition by a order by b) as d from
( select a, b, rank() over (partition by a order by b) as c from t1) t2
== Optimized Logical Plan ==
before
Window [row_number() windowspecdefinition(a#11, b#12 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS d#26], [a#11], [b#12 ASC NULLS FIRST]
+- Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS c#25], [a#11], [b#12 ASC NULLS FIRST]
+- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [_1#6 AS a#11, _2#7 AS b#12]
+- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#7]
+- *(1) MapElements org.apache.spark.sql.DataFrameSuite$$Lambda$1517/16288483683a479fda, obj#5: scala.Tuple2
+- *(1) DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#0L, true, false, true), obj#4: java.lang.Long
+- *(1) Range (0, 10, step=1, splits=2)
after
Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS c#25, row_number() windowspecdefinition(a#11, b#12 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS d#26], [a#11], [b#12 ASC NULLS FIRST]
+- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [_1#6 AS a#11, _2#7 AS b#12]
+- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#7]
+- *(1) MapElements org.apache.spark.sql.DataFrameSuite$$Lambda$1518/19280286724d7a64ca, obj#5: scala.Tuple2
+- *(1) DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#0L, true, false, true), obj#4: java.lang.Long
+- *(1) Range (0, 10, step=1, splits=2)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #40115 from zml1206/SPARK-42525.
Authored-by: zml1206 <zhuml1206@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
…mn names
### What changes were proposed in this pull request?
Fixes `DataFrame.toPandas` to handle duplicated column names.
### Why are the changes needed?
Currently
```py
spark.sql("select 1 v, 1 v").toPandas()
```
fails with the error:
```py
Traceback (most recent call last):
...
File ".../python/pyspark/sql/connect/dataframe.py", line 1335, in toPandas
return self._session.client.to_pandas(query)
File ".../python/pyspark/sql/connect/client.py", line 548, in to_pandas
pdf = table.to_pandas()
File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
columns = _deserialize_column_index(table, all_columns, column_indexes)
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
columns = _flatten_single_level_multiindex(columns)
File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1186, in _flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```
Similar to #28210.
### Does this PR introduce _any_ user-facing change?
Duplicated column names will be available when calling `toPandas()`.
### How was this patch tested?
Enabled related tests.
Closes #40170 from ueshin/issues/SPARK-42574/toPandas.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…rsist ### What changes were proposed in this pull request? Follow up #40164 to also throw an unsupported operation exception for `persist`. Right now it is fine to depend on the `StorageLevel` in the core module, but in the future that shall be refactored and moved to a common module. ### Why are the changes needed? Better way to indicate a non-supported API. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A Closes #40172 from amaliujia/unsupported_op_2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… Connect Column API ### What changes were proposed in this pull request? This PR proposes to migrate `TypeError` into error framework for Spark Connect Column API. ### Why are the changes needed? To improve errors by leveraging the PySpark error framework for Spark Connect Column APIs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fixed & added UTs. Closes #39991 from itholic/SPARK-42419. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…n.time ### What changes were proposed in this pull request? The PR aims to implement `SparkSession.version` and `SparkSession.time`. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new UT. Closes #40176 from panbingkun/SPARK-42564. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
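Both methods mirror the classic session API; a small usage sketch, assuming an active `spark` session.
```scala
// Report the Spark version the session is connected to.
println(spark.version)

// Time an action and print the elapsed wall-clock time.
spark.time {
  spark.range(1000000).selectExpr("sum(id)").collect()
}
```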
…in `connect` module tests ### What changes were proposed in this pull request? This PR aims to use `wrapper versions` for SBT and Maven in the `connect` test module's exceptions and comments. ### Why are the changes needed? To clarify the versions we use. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #40180 from dongjoon-hyun/SPARK-42587. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR adds ColumnName for the Spark Connect Scala Client. This is a stepping stone toward implementing the SQLImplicits. ### Why are the changes needed? API parity with the current API. ### Does this PR introduce _any_ user-facing change? Yes. It adds a new API. ### How was this patch tested? Added tests to `ColumnTestSuite`. Closes #40179 from hvanhovell/SPARK-42560. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
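For orientation, ColumnName is what backs the `$"..."` column syntax once SQLImplicits lands; a hedged sketch of the intended end state, assuming the session implicits import and a made-up schema.
```scala
import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// $"..." builds a ColumnName under the hood once the implicits are in scope.
df.select($"name", $"age" + 1).show()
```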
### What changes were proposed in this pull request? This is a follow-up of #40180. ### Why are the changes needed? In the previous PR, `Scalastyle` was checked but `scalafmt` was missed in the last commit. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass all CI linter jobs. Closes #40183 from dongjoon-hyun/SPARK-42587-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Reimplement `PercentileHeap` such that: - the percentile value is always in the `topHeap`, which speeds up `percentile` access - the heaps are rebalanced more efficiently by checking which heap should grow due to the new insertion and rebalancing based on target heap sizes - the heaps are Java PriorityQueues *without* comparators. Comparator call overhead slows down `poll`/`offer` by more than 2x. Instead, implement a max-heap by `poll`/`offer` on the negated domain of numbers. ### Why are the changes needed? `PercentileHeap` is heavyweight enough to cause scheduling delays if inserted inside the scheduler loop. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added more extensive unit tests. Closes #40121 from alkis/faster-percentile-heap. Lead-authored-by: Alkis Evlogimenos <alkis.evlogimenos@databricks.com> Co-authored-by: Alkis Evlogimenos <alkis@evlogimenos.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
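The comparator-free max-heap trick mentioned above amounts to inserting negated values into a natural-order min-heap; a standalone sketch, not the actual `PercentileHeap` code.
```scala
import java.util.PriorityQueue

// A PriorityQueue with natural ordering is a min-heap; storing negated values
// makes poll() return the largest original value without any comparator.
val maxHeap = new PriorityQueue[java.lang.Double]()
Seq(3.0, 1.0, 4.0, 1.5).foreach(v => maxHeap.offer(-v))
println(-maxHeap.poll().doubleValue()) // 4.0
```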
…s in create/replace table statements
### What changes were proposed in this pull request?
Enables creating generated columns in CREATE/REPLACE TABLE statements by specifying a generation expression for a column with GENERATED ALWAYS AS expr.
For example the following will be supported:
```sql
CREATE TABLE default.example (
time TIMESTAMP,
date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
)
```
To be more specific this PR
1. Adds parser support for `GENERATED ALWAYS AS expr` in create/replace table statements. Generation expressions are temporarily stored in the field's metadata, and then are parsed/verified in `DataSourceV2Strategy` and used to instantiate v2 [Column](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java).
4. Adds `TableCatalog::capabilities()` and `TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS` This will be used to determine whether to allow specifying generation expressions or whether to throw an exception.
### Why are the changes needed?
`GENERATED ALWAYS AS` is SQL standard. These changes will allow defining generated columns in create/replace table statements in Spark SQL.
### Does this PR introduce _any_ user-facing change?
Using `GENERATED ALWAYS AS expr` in CREATE/REPLACE table statements will no longer throw a parsing error. When used with a supporting table catalog the query should progress; when used with a non-supporting catalog there will be an analysis exception.
Previous behavior:
```
spark-sql> CREATE TABLE default.example (
> time TIMESTAMP,
> date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
> )
> ;
Error in query:
Syntax error at or near 'GENERATED'(line 3, pos 14)
== SQL ==
CREATE TABLE default.example (
time TIMESTAMP,
date DATE GENERATED ALWAYS AS (CAST(time AS DATE))
--------------^^^
)
```
New behavior:
```
AnalysisException: [UNSUPPORTED_FEATURE.TABLE_OPERATION] The feature is not supported: Table `my_tab` does not
support creating generated columns with GENERATED ALWAYS AS expressions. Please check the current catalog and
namespace to make sure the qualified table name is expected, and also check the catalog implementation which is
configured by "spark.sql.catalog".
```
### How was this patch tested?
Adds unit tests
Closes #38823 from allisonport-db/parser-support.
Authored-by: Allison Portis <allison.portis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…lizable JobID in FileWriterFactory ### What changes were proposed in this pull request? Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory. ### Why are the changes needed? [SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization issue, since JobID is non-serializable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #40064 from Yikf/write-job-id. Authored-by: Yikf <yikaifei@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
pull bot pushed a commit that referenced this pull request on Jul 21, 2025
…ingBuilder` ### What changes were proposed in this pull request? This PR aims to improve `toString` by `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression. ### Why are the changes needed? Since Java 9, `String Concatenation` has been handled better by default. | ID | DESCRIPTION | | - | - | | JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) | For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly. **CODE CHANGE** ```java - return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE) - .append("appId", appId) - .append("execId", execId) - .append("blockIds", Arrays.toString(blockIds)) - .toString(); + return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" + + Arrays.toString(blockIds) + "]"; ``` **BEFORE** ``` public java.lang.String toString(); Code: 0: new #39 // class org/apache/commons/lang3/builder/ToStringBuilder 3: dup 4: aload_0 5: getstatic #41 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle; 8: invokespecial #47 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V 11: ldc #50 // String appId 13: aload_0 14: getfield #7 // Field appId:Ljava/lang/String; 17: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder; 20: ldc #55 // String execId 22: aload_0 23: getfield #13 // Field execId:Ljava/lang/String; 26: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder; 29: ldc #56 // String blockIds 31: aload_0 32: getfield #16 // Field blockIds:[Ljava/lang/String; 35: invokestatic #57 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String; 38: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder; 41: invokevirtual #61 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String; 44: areturn ``` **AFTER** ``` public java.lang.String toString(); Code: 0: aload_0 1: getfield #7 // Field appId:Ljava/lang/String; 4: aload_0 5: getfield #13 // Field execId:Ljava/lang/String; 8: aload_0 9: getfield #16 // Field blockIds:[Ljava/lang/String; 12: invokestatic #39 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String; 15: invokedynamic #43, 0 // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String; 20: areturn ``` ### Does this PR introduce _any_ user-facing change? No. This is an `toString` implementation improvement. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51572 from dongjoon-hyun/SPARK-52880. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
See Commits and Changes for more details. Created by pull[bot].