[SPARK-54629][CONNECT][TEST] Refactor fully-qualified class references to imports in SparkConnectJdbcDataTypeSuite #53371
Conversation
…ith BINARY type (UTF-8)
assert(bytes.sameElements(testBytes))
assert(!rs.wasNull)

val stringValue = rs.getString(1)
It should use the label instead of the ordinal, since this case is "get binary type by column label".
- val stringValue = rs.getString(1)
+ val stringValue = rs.getString("test_binary")
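For readers less familiar with the JDBC API, here is a minimal standalone sketch of ordinal vs. label access on a `ResultSet`; the in-memory H2 database and the `test_binary` alias are assumptions made only for this illustration, not part of the actual suite.

```scala
import java.sql.{DriverManager, ResultSet}

// Assumes an H2 driver on the classpath; the query and alias are hypothetical.
val conn = DriverManager.getConnection("jdbc:h2:mem:demo")
val rs: ResultSet = conn.createStatement()
  .executeQuery("SELECT X'DEADBEEF' AS test_binary")
rs.next()
val byOrdinal = rs.getString(1)            // access by column ordinal
val byLabel = rs.getString("test_binary")  // access by column label
assert(byOrdinal == byLabel)
conn.close()
```

A test named "get binary type by column label" only exercises the label path when it uses the second form.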
Thank you for your code review; I've changed it.
assert(!rs.wasNull)

val stringValue = rs.getString(1)
val expectedString = new String(testBytes, java.nio.charset.StandardCharsets.UTF_8)
Prefer to import it at the beginning of the file. There are many such anti-patterns here; maybe you can change them in bulk ...
import java.nio.charset.StandardCharsets
...
val expectedString = new String(testBytes, StandardCharsets.UTF_8)
Thank you for your suggestion. I've moved all fully-qualified package paths to imports at the beginning of the file.
…ith BINARY type (UTF-8)
package org.apache.spark.sql.connect.client.jdbc

import java.sql.{ResultSet, SQLException, Types}
import java.math.BigDecimal
You should either use `import java.math.{BigDecimal => JBigDecimal}` or keep the full package name, to distinguish it from `scala.BigDecimal`.
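A minimal sketch of the rename-import approach, assuming a plain Scala file with no other `BigDecimal` imports in scope:

```scala
import java.math.{BigDecimal => JBigDecimal}

// JBigDecimal always refers to java.math.BigDecimal, so the bare name
// BigDecimal still resolves to scala.BigDecimal with no ambiguity.
val javaValue = new JBigDecimal("123.45")
val scalaValue = BigDecimal("123.45")
assert(scalaValue == BigDecimal(javaValue))
```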
done
…ith BINARY type (UTF-8)
pan3793 left a comment:
LGTM, please update the PR title and description to reflect the final changes of this PR.
> How was this patch tested?
> Self-test

These words are unconvincing. Instead, you should write detailed instructions on how you tested, for example:
$ build/sbt -Phive "connect-client-jdbc/testOnly *SparkConnectJdbcDataTypeSuite"
...
[info] Run completed in 13 seconds, 53 milliseconds.
[info] Total number of tests run: 33
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
You can find more useful commands for developers at https://spark.apache.org/developer-tools.html
…required interfaces to `Evolving`

### What changes were proposed in this pull request?
This PR aims to promote `RequiresDistributionAndOrdering` and its required interfaces from `Experimental` to `Evolving`.

### Why are the changes needed?
Since Apache Spark 3.2.0, `RequiresDistributionAndOrdering` and its required interfaces have served stably for over 5 years, as shown below. In Apache Spark 4.2.0, we should promote these to `Evolving` because they are no longer `Experimental`.
- apache#30706

| Interface | Description |
| - | - |
| org.apache.spark.sql.connector.distributions.ClusteredDistribution | Unchanged |
| org.apache.spark.sql.connector.distributions.Distribution | Unchanged |
| org.apache.spark.sql.connector.distributions.Distributions | Unchanged |
| org.apache.spark.sql.connector.distributions.OrderedDistribution | Unchanged |
| org.apache.spark.sql.connector.distributions.UnspecifiedDistribution | Unchanged |
| org.apache.spark.sql.connector.expressions.NullOrdering | No Functional Change |
| org.apache.spark.sql.connector.expressions.SortDirection | No Functional Change |
| org.apache.spark.sql.connector.expressions.SortOrder | Stably Evolving |
| org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering | Stably Evolving |

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs and manual review.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53361 from dongjoon-hyun/SPARK-54622.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
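For context, a hedged sketch of what a DSv2 `Write` implementing the promoted interface can look like; the class name and the clustering column `bucket_col` are illustrative only and not taken from this PR:

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// Requests that input rows be clustered by "bucket_col" before the write,
// with no required ordering.
class ClusteredWrite extends Write with RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.identity("bucket_col")))

  override def requiredOrdering(): Array[SortOrder] = Array.empty
}
```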
…regate UDF

### What changes were proposed in this pull request?
- `spark.python.profile`: Add `SQL_GROUPED_AGG_ARROW_ITER_UDF` to the profiler warning list in `udf.py` so that when `spark.python.profile` is enabled, users will see appropriate warnings consistent with other iterator-based UDFs.
- `spark.sql.pyspark.udf.profiler`: No changes needed. This UDF type already works correctly because it returns a scalar (not an iterator), so it uses the non-iterator profiler branch in `wrap_perf_profiler` and `wrap_memory_profiler`.

### Why are the changes needed?
To make profiler support for `SQL_GROUPED_AGG_ARROW_ITER_UDF` consistent with other UDFs.

### Does this PR introduce _any_ user-facing change?
Yes. When users enable `spark.python.profile` with `SQL_GROUPED_AGG_ARROW_ITER_UDF`, they will now see a warning message consistent with other iterator-based UDFs.

### How was this patch tested?
Added a test case `test_perf_profiler_arrow_udf_grouped_agg_iter` to verify that `spark.sql.pyspark.udf.profiler` works correctly with this UDF type. Also verified that the `spark.python.profile` profiler warning is triggered correctly in `test_unsupported`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53374 from Yicong-Huang/SPARK-54631/feat/add-profiler-support-for-arrow-grouped-agg-iter-udf.

Authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… agg UDF partial consumption

### What changes were proposed in this pull request?
Fix test case `test_iterator_grouped_agg_partial_consumption` to use count and sum instead of mean for testing partial consumption. Use the same value for all data points to avoid ordering issues.

### Why are the changes needed?
Fixes test flakiness by ensuring test data points have the same value and using count/sum metrics that properly validate partial consumption behavior, making the test robust against ordering variations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53372 from Yicong-Huang/SPARK-53615/fix/fix-test-partial-consumption-ordering.

Authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove all unnecessary explicit arguments in super().
```python
class A:
    def __init__(self):
        # from super(A, self).__init__() to
        super().__init__()
```
### Why are the changes needed?
This seems like a cosmetic change, but it's more than that. Even though the old syntax works, it has plenty of caveats and is hence considered bad practice:
1. It's verbose.
2. Most people probably don't know what it means.
3. Readers need to compare the argument with the class name to confirm whether it actually changes the MRO order.
4. When the class name is changed, the call could break.
5. When the class is copied, it's very easy to miss the argument and silently skip a method in the MRO chain.
6. It's very easy to establish a [cargo cult effect](https://en.wikipedia.org/wiki/Cargo_cult) where people just use it without knowing why they need to.

There are actually very few explicit usages in our code base that use a different class than the current class - which is when this explicit convention is supposed to be used. However, I suspect that some of them are actually the result of the neglect described in the reasons listed above.
I will take a look at those usages later, but in the meantime, we should use the pure `super()` when it's equivalent.
BTW, this is how Python 2 worked, which is why some people still do it; it no longer makes sense in Python 3.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
There should not be any behavior change, so we wait for the CI.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#53369 from gaogaotiantian/remove-super-args.
Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove redundant `self._timezone` and `self._safecheck` assignments in three serializer classes (`ArrowStreamAggArrowUDFSerializer`, `ArrowStreamAggArrowIterUDFSerializer`, `ArrowStreamAggPandasUDFSerializer`). Also improve parameter passing by directly passing `assign_cols_by_name` and `arrow_cast` to the parent class instead of passing fixed values and then overriding them.

### Why are the changes needed?
Improves code clarity and consistency by removing redundant code that was already set by parent classes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53368 from Yicong-Huang/SPARK-54627/refactor/remove-redundant-initializations.

Authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Coverage tests are added for the null checker in UDTF.

### Why are the changes needed?
Improve test coverage; I need to refactor the code later, so we need to make sure there's no regression.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The test passed locally.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#53362 from gaogaotiantian/test-null-checker.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Always pass runnerConf to the python worker, even if it's not used.

### Why are the changes needed?
This is part of the effort to consolidate our protocol from the JVM to the worker. We have different ways to pass the runner conf now, and sometimes we just don't pass it. It makes the worker side code a bit messy - we need to determine whether to read the conf based on eval type. However, reading an empty conf is super cheap and we can just do it regardless. With this infra, vanilla python udfs can also pass some runner conf in the future. We can do some refactoring on our JVM worker code in the future.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`pyspark-sql` passed locally. Running the rest on CI.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#53353 from gaogaotiantian/always-pass-runnerconf.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the `test_to_feather` test failure with PyArrow 22.0.0 by filtering non-serializable attrs (`metrics`, `observed_metrics`) before writing to feather format.

**Changes:**
1. Modified `to_feather()` in `pyspark/pandas/frame.py` to filter out non-serializable attrs before passing to PyArrow
2. Removed the `unittest.skipIf` workaround from `test_to_feather`
3. Added `to_dict()` methods to `MetricValue`, `PlanMetrics`, and `PlanObservedMetrics` for future utility (not used in the fix, but useful additions)

### Why are the changes needed?
PyArrow 22.0.0 changed its behavior to serialize pandas `DataFrame.attrs` to JSON metadata when writing Feather files. PySpark Spark Connect stores `PlanMetrics` and `PlanObservedMetrics` objects in `pdf.attrs`, which are not JSON serializable, causing:

TypeError: Object of type PlanMetrics is not JSON serializable

### Does this PR introduce any user-facing change?
No. The fix filters internal Spark metadata (`metrics`, `observed_metrics`) from attrs only when writing to feather format. Code that directly accesses `pdf.attrs["metrics"]` (like `test_observe`) continues to work with the original objects.

### How was this patch tested?
- Verified that `pdf.attrs["metrics"][0].name` still works (backward compatibility)
- Verified that feather write succeeds with PyArrow 22.0.0 when attrs are filtered
- Removed the `unittest.skipIf` workaround so `test_to_feather` now runs on all versions
- All existing tests pass, including `test_observe`, which accesses attrs directly
- Removed the `unittest.skipIf(not has_arrow_21_or_below, "SPARK-54068")` workaround so the test now runs on all PyArrow versions

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53377 from ashrithb/SPARK-54068-pyarrow-feather-planmetrics-fix.

Authored-by: ashrithb <ashrithlb@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Commons Lang3 is unused for network-shuffle after SPARK-54607.

### Why are the changes needed?
Reduce unnecessary deps; this also reduces the YARN ESS jar size.
```
-rw-r--r-- 1 chengpan staff 109249563 Dec 8 12:23 spark-4.2.0-SNAPSHOT-yarn-shuffle.jar -- master branch
-rw-r--r-- 1 chengpan staff 108549934 Dec 8 12:41 spark-4.2.0-SNAPSHOT-yarn-shuffle.jar -- SPARK-54635
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GHA. Also manually verified integration with Hadoop YARN (v3.4.2), with ESS enabled.
<img width="1585" height="983" alt="image" src="https://github.com/user-attachments/assets/4ddc7967-b883-44b7-96f0-3c69676b8d9b" />

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53380 from pan3793/SPARK-54635.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…n task killed message should not special handling OOM

### What changes were proposed in this pull request?
Follow comment apache#52792 (comment)

### Why are the changes needed?
Follow comment

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#53379 from AngersZhuuuu/SPARK-54087-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…arkConnectJdbcDataTypeSuite`

Improve `SparkConnectJdbcDataTypeSuite` to reuse the statement instances in `withExecuteQuery` if possible

use less resources in tests

no

existing tests

cursor 2.1.50

Closes apache#53385 from cloud-fan/test.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
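For context, a purely illustrative shape of a statement-reusing helper; the actual `withExecuteQuery` in `SparkConnectJdbcDataTypeSuite` may differ:

```scala
import java.sql.{ResultSet, Statement}

// Runs a query on an already-created Statement and closes only the ResultSet,
// so the same Statement can be reused for the next query.
def withExecuteQuery[T](stmt: Statement, sql: String)(body: ResultSet => T): T = {
  val rs = stmt.executeQuery(sql)
  try body(rs) finally rs.close()
}
```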
### What changes were proposed in this pull request?
Use `select.poll` to replace `select.select` in `worker.py` on UNIX systems.

### Why are the changes needed?
`select.select` has a known issue that it won't work with fd > 1024. We can reach this limit on heavy load systems.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`test_udf` passes locally, the rest will run on CI.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#53388 from gaogaotiantian/replace-select-with-poll.

Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `ammonite` to 3.0.5.

### Why are the changes needed?
To be ready to support Scala 2.13.18.
- https://github.com/com-lihaoyi/Ammonite/releases/tag/3.0.5
- https://repo1.maven.org/maven2/com/lihaoyi/ammonite_2.13.18/3.0.5/

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53389 from dongjoon-hyun/SPARK-54641.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…A EVOLUTION clause

### What changes were proposed in this pull request?
Keep the existing behavior for MERGE INTO without the SCHEMA EVOLUTION clause for UPDATE SET * and INSERT *, as well as UPDATE struct or INSERT struct: throw an exception if the source and target schemas are not exactly the same.

### Why are the changes needed?
As aokolnychyi tested this feature, he mentioned that as of Spark 4.1 the behavior is changed for MERGE INTO even without the SCHEMA EVOLUTION clause. In particular:
- Source has fewer columns/nested fields than target => we fill with NULL or DEFAULT for inserts, and the existing value for updates (though we disabled this for nested structs by default in [SPARK-54525](https://issues.apache.org/jira/browse/SPARK-54525), [apache/spark#53229](https://github.com/apache/spark/pull/53229)).
- Source has more columns/fields than target => we drop the extra fields.

Initially, I thought it's a good improvement of MERGE INTO and is not related to SCHEMA EVOLUTION exactly, because the schema is not altered. But Anton has a good point that it may be a surprise to some users. So it may be better for now to be more conservative and keep the exact same behavior without the SCHEMA EVOLUTION clause.

Note: this behavior is still enabled if SCHEMA EVOLUTION is specified, as the user is then more explicit about the decision. See the sketch after this description.

### Does this PR introduce _any_ user-facing change?
No, this keeps behavior exactly the same as 4.0 without the SCHEMA EVOLUTION clause.

### How was this patch tested?
Added a test and changed existing test output to expect the exception if SCHEMA EVOLUTION is not specified.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#53363 from szehon-ho/SPARK-54595-reapply.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
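To illustrate the opt-in, a hedged sketch assuming a `SparkSession` named `spark` and tables `target`/`source` backed by a connector that supports MERGE; the clause placement follows the SQL reference and is not taken from this PR:

```scala
// With the explicit clause, source/target schema differences may be reconciled;
// without it, mismatched schemas raise an exception again, as in Spark 4.0.
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t
    |USING source s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET *
    |WHEN NOT MATCHED THEN INSERT *
    |""".stripMargin)
```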
### What changes were proposed in this pull request?
This PR aims to upgrade `AWS SDK v2` to 2.35.4.

### Why are the changes needed?
To be ready for the Apache Hadoop 3.4.3 and 3.5.0 releases:
- [HADOOP-19654 Upgrade AWS SDK to 2.35.4](https://issues.apache.org/jira/browse/HADOOP-19654)

### Does this PR introduce _any_ user-facing change?
No Spark behavior change.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53392 from dongjoon-hyun/SPARK-54642.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…res connector

### What changes were proposed in this pull request?
Add a new API for `beforeFetch` in `JdbcDialect` that accepts options as `JDBCOptions` instead of `Map[String, String]`, and deprecate the old API starting from Spark version `4.2.0`. [Spark docs](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option) state that options are case insensitive, so this will make it easier for dialects to respect that. Even if we have some edge case where we need the original casing, we can access the original map inside the `JDBCOptions` object.

### Why are the changes needed?
The option `fetchsize` requires another option, `autocommit`, to be set to `false` for the Postgres connector. We have logic for this:
https://github.com/apache/spark/blob/415f50511463a8e73af77cf9f70cba4292c1331d/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L203-L205
However, this logic is case-sensitive and will only work for lowercased `fetchsize`. When passing `fetchSize`, for example, the correct value for the fetch size will be set on the Postgres driver, but it won't have `autocommit -> false`, so the fetch size will be ignored.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New test: `PostgresDialectSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Yes, partly generated-by: claude code.

Closes apache#53308 from marko-sisovic-db/msisovic/postgres-fetchsize-fix.

Authored-by: Marko Sisovic <marko.sisovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
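A hypothetical sketch of how a dialect might use the `JDBCOptions`-based hook; the method signature and the dialect below are assumptions for illustration and may not match the merged API exactly:

```scala
import java.sql.Connection

import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
import org.apache.spark.sql.jdbc.JdbcDialect

// JDBCOptions stores keys case-insensitively, so "fetchsize" and "fetchSize"
// resolve to the same value before deciding whether to disable autocommit.
object CaseInsensitiveFetchDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  def beforeFetch(connection: Connection, options: JDBCOptions): Unit = {
    if (options.fetchSize > 0) {
      connection.setAutoCommit(false)
    }
  }
}
```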
What changes were proposed in this pull request?
Refactor fully-qualified class references (for example `java.nio.charset.StandardCharsets` and `java.math.BigDecimal`) into imports in `SparkConnectJdbcDataTypeSuite`, as suggested in the review above; a small illustration follows.
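A small before/after illustration of the refactor, following the review thread above; `testBytes` stands in for the suite's actual test fixture:

```scala
import java.nio.charset.StandardCharsets

val testBytes: Array[Byte] = "hello".getBytes(StandardCharsets.UTF_8)

// Before: fully-qualified class reference inline.
val before = new String(testBytes, java.nio.charset.StandardCharsets.UTF_8)
// After: plain reference backed by the import at the top of the file.
val after = new String(testBytes, StandardCharsets.UTF_8)
assert(before == after)
```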
Why are the changes needed?
Test coverage
Does this PR introduce any user-facing change?
No
How was this patch tested?
./build/sbt -Phive "connect-client-jdbc/testOnly *SparkConnectJdbcDataTypeSuite"
[info] Run completed in 27 seconds, 163 milliseconds.
[info] Total number of tests run: 33
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
Was this patch authored or co-authored using generative AI tooling?
No