
[SPARK-49911][SQL] Fix semantic of support binary equality #1

Open
wants to merge 251 commits into
base: master

Conversation

jovanpavl-db
Owner

What changes were proposed in this pull request?

With the introduction of trim collations, the former supportsBinaryEquality flag no longer has a single meaning; it is split into isUtf8BinaryType and usesTrimCollation so that each call site can express the correct semantics.
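
A minimal sketch of the intended semantics, assuming simple name-based checks (the real flags live in CollationFactory and are derived from the collation id, not the name):

```scala
// Hedged sketch only: illustrates how the two new flags relate to the old
// supportsBinaryEquality. Helper implementations are assumptions, not the
// actual CollationFactory code.
final case class Collation(name: String) {
  // Base collation is UTF8_BINARY, regardless of any trim specifier.
  def isUtf8BinaryType: Boolean = name.startsWith("UTF8_BINARY")

  // The collation carries a trim specifier such as RTRIM.
  def usesTrimCollation: Boolean = name.endsWith("_RTRIM")

  // Old flag: a trim collation is never binary-equal, even if its base collation is.
  def supportsBinaryEquality: Boolean = isUtf8BinaryType && !usesTrimCollation
}

// Collation("UTF8_BINARY").supportsBinaryEquality       == true
// Collation("UTF8_BINARY_RTRIM").supportsBinaryEquality == false
// Collation("UTF8_BINARY_RTRIM").isUtf8BinaryType       == true
```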

Why are the changes needed?

A trim collation such as UTF8_BINARY_RTRIM does not support binary equality even though its base collation UTF8_BINARY does, so a single supportsBinaryEquality flag can no longer express the correct semantics for every code path; splitting it into isUtf8BinaryType and usesTrimCollation restores them.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Everything is covered by existing tests; no new functionality is added.

Was this patch authored or co-authored using generative AI tooling?

No.

zhengruifeng and others added 30 commits September 18, 2024 09:28
…umn names

### What changes were proposed in this pull request?
Function `substring` should accept column names

### Why are the changes needed?
Bug fix:

```
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('Spark', 2, 3)], ['s', 'p', 'l'])
>>> df.select('*', sf.substring('s', 'p', 'l')).show()
```

works in PySpark Classic, but fails in Connect with:
```
NumberFormatException                     Traceback (most recent call last)
Cell In[2], line 1
----> 1 df.select('*', sf.substring('s', 'p', 'l')).show()

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1170, in DataFrame.show(self, n, truncate, vertical)
   1169 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
-> 1170     print(self._show_string(n, truncate, vertical))

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:927, in DataFrame._show_string(self, n, truncate, vertical)
    910     except ValueError:
    911         raise PySparkTypeError(
    912             errorClass="NOT_BOOL",
    913             messageParameters={
   (...)
    916             },
    917         )
    919 table, _ = DataFrame(
    920     plan.ShowString(
    921         child=self._plan,
    922         num_rows=n,
    923         truncate=_truncate,
    924         vertical=vertical,
    925     ),
    926     session=self._session,
--> 927 )._to_table()
    928 return table[0][0].as_py()

File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1844, in DataFrame._to_table(self)
   1842 def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
   1843     query = self._plan.to_proto(self._session.client)
-> 1844     table, schema, self._execution_info = self._session.client.to_table(
   1845         query, self._plan.observations
   1846     )
   1847     assert table is not None
   1848     return (table, schema)

File ~/Dev/spark/python/pyspark/sql/connect/client/core.py:892, in SparkConnectClient.to_table(self, plan, observations)
    890 req = self._execute_plan_request_with_metadata()
    891 req.plan.CopyFrom(plan)
--> 892 table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(req, observations)
    894 # Create a query execution object.
    895 ei = ExecutionInfo(metrics, observed_metrics)

File ~/Dev/spark/python/pyspark/sql/connect/client/core.py:1517, in SparkConnectClient._execute_and_fetch(self, req, observations, self_destruct)
   1514 properties: Dict[str, Any] = {}
   1516 with Progress(handlers=self._progress_handlers, operation_id=req.operation_id) as progress:
-> 1517     for response in self._execute_and_fetch_as_iterator(
   1518         req, observations, progress=progress
   1519     ):
   1520         if isinstance(response, StructType):
   1521             schema = response

File ~/Dev/spark/python/pyspark/sql/connect/client/core.py:1494, in SparkConnectClient._execute_and_fetch_as_iterator(self, req, observations, progress)
   1492     raise kb
   1493 except Exception as error:
-> 1494     self._handle_error(error)

File ~/Dev/spark/python/pyspark/sql/connect/client/core.py:1764, in SparkConnectClient._handle_error(self, error)
   1762 self.thread_local.inside_error_handling = True
   1763 if isinstance(error, grpc.RpcError):
-> 1764     self._handle_rpc_error(error)
   1765 elif isinstance(error, ValueError):
   1766     if "Cannot invoke RPC" in str(error) and "closed" in str(error):

File ~/Dev/spark/python/pyspark/sql/connect/client/core.py:1840, in SparkConnectClient._handle_rpc_error(self, rpc_error)
   1837             if info.metadata["errorClass"] == "INVALID_HANDLE.SESSION_CHANGED":
   1838                 self._closed = True
-> 1840             raise convert_exception(
   1841                 info,
   1842                 status.message,
   1843                 self._fetch_enriched_error(info),
   1844                 self._display_server_stack_trace(),
   1845             ) from None
   1847     raise SparkConnectGrpcException(status.message) from None
   1848 else:

NumberFormatException: [CAST_INVALID_INPUT] The value 'p' of the type "STRING" cannot be cast to "INT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
...
```

### Does this PR introduce _any_ user-facing change?
Yes, function `substring` in Connect can now properly handle column names.

### How was this patch tested?
new doctests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48135 from zhengruifeng/py_substring_fix.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…R_TEMP_13[44-46]

### What changes were proposed in this pull request?

Rename the error classes _LEGACY_ERROR_TEMP_13[44-46]: 1344 is removed, 1345 becomes DEFAULT_UNSUPPORTED, and 1346 becomes ADD_DEFAULT_UNSUPPORTED.

### Why are the changes needed?

Replace the legacy error class names with semantically explicit ones.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Re-ran the unit test classes modified in this PR (org.apache.spark.sql.sources.InsertSuite and org.apache.spark.sql.types.StructTypeSuite).

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46320 from PaysonXu/SPARK-47263.

Authored-by: xuping <13289341606@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Continue the discussion from apache#47425 to this PR because I can't push to Yuchen's account

### What changes were proposed in this pull request?
The built-in Protobuf connector already supports recursive schema references: users specify the option “recursive.fields.max.depth”, and at the start of execution the recursive fields are unrolled to that depth. This turns a per-row dynamic schema into a fixed schema, which Spark supports, and Avro can adopt the same approach. This PR adds an option "recursiveFieldMaxDepth" to both the Avro data source and the from_avro function, so that Spark can support recursive Avro schemas up to a certain depth.

### Why are the changes needed?
A recursive reference denotes the case where the type of a field is defined earlier in one of its parent nodes. A simple example is:
```
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users rely on schemas like this, so we should support it.
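
A hedged usage sketch for the schema above; only the option name `recursiveFieldMaxDepth` comes from this description, while the reader setup and path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-recursive-demo").getOrCreate()

// Unroll the recursive `next` field of LongList up to 3 levels.
// With the default (option unset / 0), Spark keeps throwing the existing error.
val df = spark.read
  .format("avro")
  .option("recursiveFieldMaxDepth", "3")
  .load("/tmp/longlist.avro")   // hypothetical input path

df.printSchema()
```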

### Does this PR introduce any user-facing change?
Yes. Previously, Spark threw an error on recursive schemas like the one above. With this change it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.

### How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Co-authored-by: Wei Liu <wei.liu@databricks.com>

Closes apache#48043 from WweiL/yuchen-avro-recursive-schema.

Lead-authored-by: Yuchen Liu <yuchen.liu@databricks.com>
Co-authored-by: Wei Liu <wei.liu@databricks.com>
Co-authored-by: Yuchen Liu <170372783+eason-yuchen-liu@users.noreply.github.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…teral date and datetime

### What changes were proposed in this pull request?
Refine the string representation of literal date and datetime

### Why are the changes needed?
1. We should not represent those literals with their internal values.
2. The string representation should be consistent with PySpark Classic where possible (the representations cannot always be identical because Connect only holds an unresolved expression, but we try our best to match).

### Does this PR introduce _any_ user-facing change?
yes

before:
```
In [3]: lit(datetime.date(2024, 7, 10))
Out[3]: Column<'19914'>

In [4]: lit(datetime.datetime(2024, 7, 10, 1, 2, 3, 456))
Out[4]: Column<'1720544523000456'>
```

after:
```
In [3]: lit(datetime.date(2024, 7, 10))
Out[3]: Column<'2024-07-10'>

In [4]: lit(datetime.datetime(2024, 7, 10, 1, 2, 3, 456))
Out[4]: Column<'2024-07-10 01:02:03.000456'>
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48137 from zhengruifeng/py_connect_lit_dt.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE

### What changes were proposed in this pull request?
Increases the default `maxBatchSize` from 4MiB * 0.7 to 128MiB (=
CONNECT_GRPC_MAX_MESSAGE_SIZE) * 0.7. This makes better use of the allowed maximum message size.
This limit is used when creating Arrow batches for the `SqlCommandResult` in the `SparkConnectPlanner` and for `ExecutePlanResponse.ArrowBatch` in `processAsArrowBatches`. This, for example, lets us return much larger `LocalRelations` in the `SqlCommandResult` (i.e., for the `SHOW PARTITIONS` command) while still staying within the GRPC message size limit.

### Why are the changes needed?
There are `SqlCommandResults` that exceed 0.7 * 4MiB.

### Does this PR introduce _any_ user-facing change?
`SqlCommandResults` up to 0.7 * 128 MiB are now supported instead of only up to 0.7 * 4 MiB, and `ExecutePlanResponse`s now make better use of the 128 MiB limit.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48122 from dillitz/increase-sql-command-batch-size.

Authored-by: Robert Dillitz <robert.dillitz@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…e` failure

### What changes were proposed in this pull request?

Add a short wait loop to ensure that the test pre-condition is met. To be specific, VerifyEvents.executeHolder is set asynchronously by MockSparkListener.onOtherEvent whereas the test assumes that VerifyEvents.executeHolder is always available.
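
A minimal sketch of such a wait loop, with hypothetical names and timeouts (the actual fix lives in the suite's test utilities):

```scala
// Poll until the listener has published the value instead of assuming it is
// already set when the assertion runs. Names and timeouts are assumptions.
def awaitSet[T](read: () => Option[T], timeoutMs: Long = 10000L, intervalMs: Long = 50L): T = {
  val deadline = System.currentTimeMillis() + timeoutMs
  var result = read()
  while (result.isEmpty && System.currentTimeMillis() < deadline) {
    Thread.sleep(intervalMs)
    result = read()
  }
  result.getOrElse(throw new IllegalStateException("test pre-condition not met within timeout"))
}

// e.g. val holder = awaitSet(() => Option(verifyEvents.executeHolder))
```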

### Why are the changes needed?

For a smoother development experience.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SparkConnectServiceSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48142 from changgyoopark-db/SPARK-49688.

Authored-by: Changgyoo Park <changgyoo.park@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes the self type parameter from Dataset. This turned out to be a bit noisy. The self type is replaced by a combination of covariant return types and abstract types. Abstract types are used when a method takes a Dataset (or a KeyValueGroupedDataset) as an argument.
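
A simplified, hedged illustration of the pattern, not the actual sql/api interfaces: covariant return types replace the self type for results, and an abstract type stands in where a Dataset is taken as an argument.

```scala
trait Dataset[T] {
  // Abstract type used when a method takes "a Dataset of the same flavor" as an argument.
  type DS[U] <: Dataset[U]

  // Covariant return type: implementations narrow this to their own Dataset.
  def filter(f: T => Boolean): Dataset[T]
  def union(other: DS[T]): Dataset[T]
}

final class LocalDataset[T](val rows: Seq[T]) extends Dataset[T] {
  type DS[U] = LocalDataset[U]

  // Narrowed return types, so callers of LocalDataset keep getting LocalDataset.
  override def filter(f: T => Boolean): LocalDataset[T] = new LocalDataset(rows.filter(f))
  override def union(other: LocalDataset[T]): LocalDataset[T] = new LocalDataset(rows ++ other.rows)
}
```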

### Why are the changes needed?
The self type made using the classes in sql/api a bit noisy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48146 from hvanhovell/SPARK-49568.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
… managers

### What changes were proposed in this pull request?

Eliminate the use of global locks in the session and execution managers. Those locks residing in the streaming query manager cannot be easily removed because the tag and query maps seemingly need to be synchronised.
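
A hedged, heavily simplified sketch of the direction (names are hypothetical): replace a globally locked map with a concurrent map so that session lookups and inserts do not serialize behind one lock.

```scala
import java.util.concurrent.ConcurrentHashMap

final class SessionHolder(val sessionId: String)

final class SessionManager {
  private val sessions = new ConcurrentHashMap[String, SessionHolder]()

  // Atomic get-or-create without a global lock.
  def getOrCreate(sessionId: String): SessionHolder =
    sessions.computeIfAbsent(sessionId, id => new SessionHolder(id))

  def remove(sessionId: String): Option[SessionHolder] =
    Option(sessions.remove(sessionId))
}
```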

### Why are the changes needed?

In order to achieve true scalability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48131 from changgyoopark-db/SPARK-49684.

Authored-by: Changgyoo Park <changgyoo.park@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds `Dataset.groupByKey(..)` to the shared interface. I forgot to add it in the previous PR.

### Why are the changes needed?
The shared interface needs to support all functionality.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48147 from hvanhovell/SPARK-49422-follow-up.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
…Hub Pages publication action"

This reverts commit 7de71a2.
…r branch via Live GitHub Pages Updates"

This reverts commit b180709.
### What changes were proposed in this pull request?
Fix accent-sensitive and case-sensitive column results.

### Why are the changes needed?
When the initial PR was introduced, the ICU collation listing generated the columns in a different order, so the results were wrong.

### Does this PR introduce _any_ user-facing change?
No, as Spark 4.0 has not been released yet.

### How was this patch tested?
Existing test in CollationSuite.scala, which was wrong in the first place.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48152 from mihailom-db/tvf-collations-followup.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This PR adds support for executing procedures in catalogs.

### Why are the changes needed?

These changes are needed per [discussed and voted](https://lists.apache.org/thread/w586jr53fxwk4pt9m94b413xyjr1v25m) SPIP tracked in [SPARK-44167](https://issues.apache.org/jira/browse/SPARK-44167).

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds CALL commands.

### How was this patch tested?

This PR comes with tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47943 from aokolnychyi/spark-48782.

Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…essionBuilder to Expression Walker

### What changes were proposed in this pull request?
Adds new expressions to the expression walker. This PR also improves the descriptions of methods in the suite.

### Why are the changes needed?
It was noticed while debugging that startsWith, endsWith, and contains are not tested by this suite, even though these expressions are at the core of collation testing.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test only.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48162 from mihailom-db/expressionwalkerfollowup.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…se StringSearch

### What changes were proposed in this pull request?

In this PR, I propose to disallow `CS_AI` collated strings in expressions that use `StringSearch` in their implementation. These expressions are `trim`, `startswith`, `endswith`, `locate`, `instr`, `str_to_map`, `contains`, `replace`, `split_part` and `substring_index`.

Currently, these expressions accept all possible collations; however, they do not work properly with `CS_AI` collations, because ICU's `StringSearch` class, which is used to implement them, has no support for `CS_AI` search. As a result, the expressions misbehave when used with `CS_AI` collations (e.g. currently `startswith('hOtEl' collate unicode_ai, 'Hotel' collate unicode_ai)` returns `true`).
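
A hedged illustration of the reported behavior (the SQL comes from the example above; how the rejection surfaces after this change is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cs-ai-demo").getOrCreate()

// Before this change: silently wrong result under a case-sensitive, accent-insensitive collation.
spark.sql(
  "SELECT startswith('hOtEl' COLLATE UNICODE_AI, 'Hotel' COLLATE UNICODE_AI)"
).show()  // previously returned true even though the case differs

// After this change, CS_AI collations are expected to be rejected by these
// StringSearch-based expressions rather than returning an incorrect result.
```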

### Why are the changes needed?

Proposed changes are necessary in order to achieve correct behavior of the expressions mentioned above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This patch was tested by adding a test in the `CollationSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48121 from vladanvasi-db/vladanvasi-db/cs-ai-collations-expressions-disablement.

Authored-by: Vladan Vasić <vladan.vasic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nctions

### What changes were proposed in this pull request?
Make the function parity test ignore private functions.

### Why are the changes needed?
The existing test is based on `java.lang.reflect.Modifier`, which cannot properly handle `private[xxx]` visibility.
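
A hedged sketch of one way to detect qualified-private members with Scala reflection (the actual test's approach may differ): `private[xxx]` members compile to public JVM methods, so `java.lang.reflect.Modifier` misses them, while Scala reflection exposes them via `privateWithin`.

```scala
import scala.reflect.runtime.{universe => ru}

// True for `private` as well as qualified-private (`private[sql]`) members.
def isEffectivelyPrivate(sym: ru.Symbol): Boolean =
  sym.isPrivate || sym.privateWithin != ru.NoSymbol

// Public (non-private[xxx]) method names of a type, e.g. publicMethodNames[SomeApi.type].
def publicMethodNames[T: ru.TypeTag]: Set[String] =
  ru.typeOf[T].decls.collect {
    case m if m.isMethod && !isEffectivelyPrivate(m) => m.name.decodedName.toString
  }.toSet
```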

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48163 from zhengruifeng/df_func_test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…imedelta`

### What changes were proposed in this pull request?
Refine the string representation of `timedelta` by following the ISO 8601 duration format.
Note that the units used on the JVM side (`Duration`) and in Pandas are different.

### Why are the changes needed?
We should not leak the raw data

### Does this PR introduce _any_ user-facing change?
yes

PySpark Classic:
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'PT24H1S'>
```

PySpark Connect (before):
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'86401000000'>
```

PySpark Connect (after):
```
In [1]: from pyspark.sql import functions as sf

In [2]: import datetime

In [3]: sf.lit(datetime.timedelta(1, 1))
Out[3]: Column<'P1DT0H0M1S'>
```

### How was this patch tested?
added test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48159 from zhengruifeng/pc_lit_delta.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
Make `UUID` and `SHUFFLE` accept integer `seed`

### Why are the changes needed?
In most cases, `seed` accepts both int and long, but `UUID` and `SHUFFLE` only accept a long seed:

```py
In [1]: spark.sql("SELECT RAND(1L), RAND(1), SHUFFLE(array(1, 20, 3, 5), 1L), UUID(1L)").show()
+------------------+------------------+---------------------------+--------------------+
|           rand(1)|           rand(1)|shuffle(array(1, 20, 3, 5))|              uuid()|
+------------------+------------------+---------------------------+--------------------+
|0.6363787615254752|0.6363787615254752|              [20, 1, 3, 5]|1ced31d7-59ef-4bb...|
+------------------+------------------+---------------------------+--------------------+

In [2]: spark.sql("SELECT UUID(1)").show()
...
AnalysisException: [INVALID_PARAMETER_VALUE.LONG] The value of parameter(s) `seed` in `UUID` is invalid: expects a long literal, but got "1". SQLSTATE: 22023; line 1 pos 7
...

In [3]: spark.sql("SELECT SHUFFLE(array(1, 20, 3, 5), 1)").show()
...
AnalysisException: [INVALID_PARAMETER_VALUE.LONG] The value of parameter(s) `seed` in `shuffle` is invalid: expects a long literal, but got "1". SQLSTATE: 22023; line 1 pos 7
...
```

### Does this PR introduce _any_ user-facing change?
yes

after this fix:
```py
In [2]: spark.sql("SELECT SHUFFLE(array(1, 20, 3, 5), 1L), SHUFFLE(array(1, 20, 3, 5), 1), UUID(1L), UUID(1)").show()
+---------------------------+---------------------------+--------------------+--------------------+
|shuffle(array(1, 20, 3, 5))|shuffle(array(1, 20, 3, 5))|              uuid()|              uuid()|
+---------------------------+---------------------------+--------------------+--------------------+
|              [20, 1, 3, 5]|              [20, 1, 3, 5]|1ced31d7-59ef-4bb...|1ced31d7-59ef-4bb...|
+---------------------------+---------------------------+--------------------+--------------------+
```

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48166 from zhengruifeng/int_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… plot

### What changes were proposed in this pull request?
- Update the documentation for barh plot to clarify the difference between axis interpretation in Plotly and Matplotlib.
- Test multiple columns as value axis.

The parameter difference is demonstrated below.
```py
>>> df = ps.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})
>>> df.plot.barh(x='val', y='lab').show()  # plot1

>>> ps.set_option('plotting.backend', 'matplotlib')
>>> import matplotlib.pyplot as plt
>>> df.plot.barh(x='lab', y='val')
>>> plt.show()  # plot2
```

plot1
![newplot (5)](https://github.com/user-attachments/assets/f1b6fabe-9509-41bb-8cfb-0733f65f1643)

plot2
![Figure_1](https://github.com/user-attachments/assets/10e1b65f-6116-4490-9956-29e1fbf0c053)

### Why are the changes needed?
The barh plot’s x and y axis behavior differs between Plotly and Matplotlib, which may confuse users. The updated documentation and tests help ensure clarity and prevent misinterpretation.

### Does this PR introduce _any_ user-facing change?
No. Doc change only.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48161 from xinrong-meng/ps_barh.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Switch `Scatter` plot to sampled data

### Why are the changes needed?
When the data distribution is correlated with the row order, the first n rows will not be representative of the whole dataset.

for example:
```
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)

pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)

psdf.plot.scatter(x='B', y='A')
```

all 10k datapoints:
![image](https://github.com/user-attachments/assets/72cf7e97-ad10-41e0-a8a6-351747d5285f)

before (first 1k datapoints):
![image](https://github.com/user-attachments/assets/1ed50d2c-7772-4579-a84c-6062542d9367)

after (sampled 1k datapoints):
![image](https://github.com/user-attachments/assets/6c684cba-4119-4c38-8228-2bedcdeb9e59)

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
ci and manually test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48164 from zhengruifeng/ps_scatter_sampling.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add a script to clean up PySpark temp files

### Why are the changes needed?
Sometimes I encounter weird issues caused by an outdated `pyspark.zip` file, and removing it restores the expected behavior.
So I think we can add such a script.

### Does this PR introduce _any_ user-facing change?
no, dev-only

### How was this patch tested?
manually test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48167 from zhengruifeng/py_infra_cleanup.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR aims to upgrade `protobuf-java` to 3.25.5.

### Why are the changes needed?

To bring the latest bug fixes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48170

Closes apache#48171 from dongjoon-hyun/SPARK-49721.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…t number arguments

### What changes were proposed in this pull request?
1, Make function `count_min_sketch` accept number arguments;
2, Make argument `seed` optional;
3, fix the type hints of `eps/confidence/seed` from `ColumnOrName` to `Column`, because they require a foldable value and actually do not accept column name:
```
In [3]: from pyspark.sql import functions as sf

In [4]: df = spark.range(10000).withColumn("seed", sf.lit(1).cast("int"))

In [5]: df.select(sf.hex(sf.count_min_sketch("id", sf.lit(0.5), sf.lit(0.5), "seed")))
...
AnalysisException: [DATATYPE_MISMATCH.NON_FOLDABLE_INPUT] Cannot resolve "count_min_sketch(id, 0.5, 0.5, seed)" due to data type mismatch: the input `seed` should be a foldable "INT" expression; however, got "seed". SQLSTATE: 42K09;
'Aggregate [unresolvedalias('hex(count_min_sketch(id#1L, 0.5, 0.5, seed#2, 0, 0)))]
+- Project [id#1L, cast(1 as int) AS seed#2]
   +- Range (0, 10000, step=1, splits=Some(12))
...
```

### Why are the changes needed?
1, seed is optional in other similar functions;
2, existing type hint is `ColumnOrName` which is misleading since column name is not actually supported

### Does this PR introduce _any_ user-facing change?
Yes, it supports number arguments.

### How was this patch tested?
updated doctests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48157 from zhengruifeng/py_fix_count_min_sketch.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…nd forceSnapshot flag is also reset

### What changes were proposed in this pull request?
Ensure that changelog files are written on commit and forceSnapshot flag is also reset

### Why are the changes needed?
Without these changes, we are not writing the changelog files per batch, and we are also trying to upload a full snapshot each time since the flag is not being reset correctly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests

Before:
```
[info] Run completed in 3 seconds, 438 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
```

After:
```
[info] Run completed in 4 seconds, 155 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48125 from anishshri-db/task/SPARK-49677.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?

Disable https://github.com/apache/spark/actions/runs/10951008649/ via:

> adding a .nojekyll file to the root of your source branch will bypass the Jekyll build process and deploy the content directly.

https://docs.github.com/en/pages/quickstart

### Why are the changes needed?

restore ci

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

no

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48176 from yaooqinn/action.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ternal data source

### What changes were proposed in this pull request?
Change `sqlState` to KD010.

### Why are the changes needed?
Necessary modification for the Databricks error class space.

### Does this PR introduce _any_ user-facing change?
Yes, the new error message is now updated to KD010.

### How was this patch tested?
Existing tests (updated).

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48165 from uros-db/external-data-source-fix.

Authored-by: Uros Bojanic <uros.bojanic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Fix rat check for .nojekyll

### Why are the changes needed?

CI fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

```
dev/check-license
Ignored 1 lines in your exclusion files as comments or empty lines.
RAT checks passed.
```

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#48178 from yaooqinn/f.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
} else if (CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
} else if (CollationFactory.fetchCollation(collationId).isUtf8LcaseType) {

same goes here: lowercaseEquality != UTF8_LCASE
(although at this moment it's lowercaseEquality <=> UTF8_LCASE)

@@ -154,12 +154,25 @@ public static class Collation {
*/
public final boolean supportsLowercaseEquality;



remove

@uros-db uros-db left a comment

not sure about the renaming... I would prefer keeping supportsBinaryEquality

however, UTF8_BINARY_RTRIM doesn't support B.E. (while UTF8_BINARY does)
this is something that we should keep in mind when handling expression execution, as you mentioned in a comment in your previous PR - one possible solution is to normalize the collationId in a way that "transforms" UTF8_BINARY_RTRIM to UTF8_BINARY after applying the trimming policy to the appropriate arguments, and then proceeding with the execution (this way, UTF8_BINARY_RTRIM will lead to the same execution code path as UTF8_BINARY; UNICODE_CI_RTRIM will lead to the same execution code path as UNICODE_CI; etc.)

in any case, please open a PR towards apache/spark master, and let's continue the review there
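
A minimal, self-contained sketch of the normalization idea above, using a hypothetical bit layout for the collation id (the real encoding in CollationFactory differs):

```scala
object CollationNormalization {
  // Hypothetical bit marking the RTRIM variant of a collation id.
  val TrimBit: Int = 1 << 18

  def usesTrimCollation(collationId: Int): Boolean = (collationId & TrimBit) != 0

  // "Transforms" UTF8_BINARY_RTRIM into UTF8_BINARY, UNICODE_CI_RTRIM into UNICODE_CI, etc.
  def normalize(collationId: Int): Int = collationId & ~TrimBit

  // Apply the trimming policy to the arguments, then execute with the base collation's code path.
  def normalizeArgs(collationId: Int, args: Seq[String]): Seq[String] =
    if (usesTrimCollation(collationId)) args.map(_.replaceAll("\\s+$", "")) else args
}
```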

xinrong-meng and others added 3 commits October 15, 2024 08:31
### What changes were proposed in this pull request?
Support box plots with plotly backend on both Spark Connect and Spark classic.

### Why are the changes needed?
While Pandas on Spark supports plotting, PySpark currently lacks this feature. The proposed API will enable users to generate visualizations. This will provide users with an intuitive, interactive way to explore and understand large datasets directly from PySpark DataFrames, streamlining the data analysis workflow in distributed environments.

See more at [PySpark Plotting API Specification](https://docs.google.com/document/d/1IjOEzC8zcetG86WDvqkereQPj_NGLNW7Bdu910g30Dg/edit?usp=sharing) in progress.

Part of https://issues.apache.org/jira/browse/SPARK-49530.

### Does this PR introduce _any_ user-facing change?
Yes. Box plots are supported as shown below.

```py
>>> data = [
...             ("A", 50, 55),
...             ("B", 55, 60),
...             ("C", 60, 65),
...             ("D", 65, 70),
...             ("E", 70, 75),
...             # outliers
...             ("F", 10, 15),
...             ("G", 85, 90),
...             ("H", 5, 150),
...         ]
>>> columns = ["student", "math_score", "english_score"]
>>> sdf = spark.createDataFrame(data, columns)
>>> fig1 = sdf.plot.box(column=["math_score", "english_score"])
>>> fig1.show()  # see below
>>> fig2 = sdf.plot(kind="box", column="math_score")
>>> fig2.show()  # see below
```

fig1:
![newplot (17)](https://github.com/user-attachments/assets/8c36c344-f6de-47e3-bd63-c0f3b57efc43)

fig2:
![newplot (18)](https://github.com/user-attachments/assets/9b7b60f6-58ec-4eff-9544-d5ab88a88631)

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48447 from xinrong-meng/box.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM from `9.7` to `9.7.1`.

### Why are the changes needed?
- xbean-asm9-shaded 4.26 was upgraded to use `ASM 9.7.1`, and `ASM 9.7.1` adds support for `Java 24`: apache/geronimo-xbean#41

- https://asm.ow2.io/versions.html
  <img width="809" alt="image" src="https://github.com/user-attachments/assets/6ca57af9-2b03-467f-9a31-31b6d7eb4d53">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48465 from panbingkun/SPARK-49965.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

This PR improves a unit test case, in which JSON strings with duplicate keys are prohibited, by checking the cause of the exception instead of just the top-level exception.
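
A hedged sketch of the tightened assertion; the error class name is taken from this description, while the query and exception-handling details are assumptions:

```scala
import org.apache.spark.{SparkException, SparkThrowable}
import org.scalatest.Assertions._

// Assert on the cause of the exception, not just the top-level error class.
val e = intercept[SparkException] {
  spark.sql("""SELECT parse_json('{"a": 1, "a": 2}')""").collect()
}
val cause = e.getCause
assert(cause.isInstanceOf[SparkThrowable])
assert(cause.asInstanceOf[SparkThrowable].getErrorClass == "VARIANT_DUPLICATE_KEY")
```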

### Why are the changes needed?

Earlier, the test only checked the top-level error class but not the cause of the error, which should be `VARIANT_DUPLICATE_KEY`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

NA

Closes apache#48464 from harshmotw-db/harshmotw-db/minor_test_fix.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@jovanpavl-db jovanpavl-db changed the base branch from implement_hashing to master October 15, 2024 08:01
@jovanpavl-db
Owner Author

jovanpavl-db commented Oct 15, 2024

> not sure about the renaming... I would prefer keeping supportsBinaryEquality
>
> however, UTF8_BINARY_RTRIM doesn't support B.E. (while UTF8_BINARY does) this is something that we should keep in mind when handling expression execution, as you mentioned in a comment in your previous PR - one possible solution is to normalize the collationId in a way that "transforms" UTF8_BINARY_RTRIM to UTF8_BINARY after applying the trimming policy to the appropriate arguments, and then proceeding with the execution (this way, UTF8_BINARY_RTRIM will lead to the same execution code path as UTF8_BINARY; UNICODE_CI_RTRIM will lead to the same execution code path as UNICODE_CI; etc.)
>
> in any case, please open a PR towards apache/spark master, and let's continue the review there

supportsBinaryEquality will remain in use in almost all places; my mistake in the PR was that I left isUtf8BinaryType in most of them. Will fix it here: https://github.com/apache/spark/pull/48472/files#diff-640c14aa5d7473df79b2435ce5a327dffcc16ca29354b153956b4f8d19fdb16c

Yes, the idea is to keep using supportsBinaryEquality, but we can't always "pretend" that UTF8_BINARY_RTRIM supports binary equality, i.e. (as you said) it's not always possible to use exactly the same execution code path. Take a look at the hash expression (screenshot below): we hash the unsafe bytes directly (and not via the hash function I set in CollationFactory, which normalizes the string first) if the collation supportsBinaryEquality.

So the idea with this change is:

1. Always use supportsBinaryEquality (as we did before).
2. When collations with specifiers (for example RTRIM) should be treated like the binary collation (for example in pass-through expressions), use isUtf8BinaryType.

Let's continue the review here: https://github.com/apache/spark/pull/48472/files#diff-640c14aa5d7473df79b2435ce5a327dffcc16ca29354b153956b4f8d19fdb16c

(screenshot of the hash expression code)
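
A self-contained, hedged sketch of that distinction, with stand-in types (the real code lives in Spark's hash expressions and CollationFactory):

```scala
// Stand-ins for CollationFactory.Collation and the two hash paths.
final case class CollationInfo(
    isUtf8BinaryType: Boolean,
    usesTrimCollation: Boolean,
    hashFunction: String => Long)   // collation-aware hash, normalizes (e.g. rtrims) first

def hashCollated(s: String, c: CollationInfo, hashBytes: Array[Byte] => Long): Long =
  if (c.isUtf8BinaryType && !c.usesTrimCollation) {
    hashBytes(s.getBytes("UTF-8"))  // fast path, analogous to hashing the unsafe bytes directly
  } else {
    c.hashFunction(s)               // UTF8_BINARY_RTRIM & co. must go through normalization
  }
```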

@jovanpavl-db
Owner Author

> General question: why do we want to do this in this way? If I understood correctly, we want the trim specifier to make the collator and comparator trim spaces and then compare; in general that makes the collation binary-equal. I might be missing something.

You are mostly right; there are cases where we would like to treat UTF8_BINARY_RTRIM as UTF8_BINARY and cases where that does not hold (for example, take a look at the hash example I pasted in Uros' comment). Anyway, this was a fairly raw change; I will refactor it a bit here:
https://github.com/apache/spark/pull/48472/files#diff-640c14aa5d7473df79b2435ce5a327dffcc16ca29354b153956b4f8d19fdb16c
