
Conversation


@ueshin ueshin commented Jun 17, 2017

What changes were proposed in this pull request?

This PR introduces a prototype of codegen for `ArrowWriter`.
Currently it simply moves the write-path code into generated code.
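
A hedged sketch of the idea (illustrative names, not the actual generated code): instead of a generic loop that dispatches to each field writer at runtime, the generated writer emits one specialized write statement per column.

```scala
// Illustrative only: the shape of specialized per-column write code this
// prototype aims for. Helper and writer names are assumptions, not the
// actual ArrowWriter codegen.
def genWriteMethod(fieldTypes: Seq[String]): String = {
  val writes = fieldTypes.zipWithIndex
    .map { case (t, i) => s"  fieldWriter$i.write(row.get${t.capitalize}($i));" }
    .mkString("\n")
  s"public void write(InternalRow row) {\n$writes\n}"
}

// genWriteMethod(Seq("int", "long")) emits calls to row.getInt(0) and
// row.getLong(1), avoiding per-row type dispatch.
```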

How was this patch tested?

I ran a benchmark in my local environment:

before: 418 ms ± 43.4 ms
after: 390 ms ± 12.9 ms


ueshin commented Jun 19, 2017

I'll close this for now.

@ueshin ueshin closed this Jun 19, 2017
@ueshin ueshin deleted the wip/arrow branch August 28, 2017 03:47
cloud-fan added a commit that referenced this pull request Mar 13, 2019
* insert query stages dynamically

* add comment

* address comments
cloud-fan pushed a commit that referenced this pull request Sep 2, 2019
…enkins's test results

### What changes were proposed in this pull request?

See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/

![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
    ...
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
    </testcase>
    <system-out/>
    <system-err/>
</testsuite>
```

The root cause seems to be a bug in SBT: it truncates the test name at the last dot.

sbt/sbt#2949
https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79
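
A minimal sketch of the assumed truncation (simplified; see the sbt source linked above for the actual code):

```scala
// Simplified illustration of the assumed behavior: the JUnit XML listener
// derives the reported test name from the substring after the last dot,
// so "inner-join.sql - Scala UDF" is reported as "sql - Scala UDF" and
// the file name is lost.
def reportedName(testName: String): String = {
  val i = testName.lastIndexOf('.')
  if (i < 0) testName else testName.substring(i + 1)
}

assert(reportedName("inner-join.sql - Scala UDF") == "sql - Scala UDF")
```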

I tried to find a better way but couldn't find one. Therefore, this PR proposes a workaround: add the test file name to the assertion log:

```diff
  [info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
+ [info]   inner-join.sql
  [info]   Expected "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[]", but got "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[	b]" Result did not match for query #6
  [info]   SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
```
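
A hedged sketch of the workaround (the helper name is illustrative, not the exact `SQLQueryTestSuite` code): put the test file name on the first line of the assertion message so it survives the truncation.

```scala
// Illustrative only: prefix the comparison failure with the test file
// name so the truncated JUnit report still identifies the file.
def assertQueryResult(testFile: String, expected: String, actual: String): Unit = {
  assert(expected == actual,
    s"$testFile\nExpected \"$expected\", but got \"$actual\"")
}
```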

This will at least save us from searching the full logs to identify which test file failed after clicking a failed test.

Note that this PR does not fully fix the issue; it only fixes the logs for failed tests.

### Why are the changes needed?
To make Jenkins logs easier to debug. Otherwise, we have to open the full logs and search for the test that failed.

### Does this PR introduce any user-facing change?
It will print out the file name of failed tests in Jenkins' test reports.

### How was this patch tested?
Manually tested, but Jenkins test runs on this PR are also required.

Now it at least shows which file it is:

![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)

Closes apache#25630 from HyukjinKwon/SPARK-28894-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Nov 5, 2020
### What changes were proposed in this pull request?

This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions whose literal maps are backed by `ArrayBasedMapData` and `GenericArrayData`.

### Why are the changes needed?

This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```

Apache Spark 2.x through 3.0.1 raises a `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```

**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```

**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```
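
A hedged illustration of the failure mode (simplified stand-in classes, not Spark's actual `ArrayBasedMapData` and `GenericArrayData`): the two literal maps hold equal contents, but without a value-based `equals` on the backing data, reference equality is used and `semanticEquals` treats the two `GetMapValue` expressions as different.

```scala
// Stand-in classes for illustration: neither overrides equals, so two
// structurally identical map literals compare unequal, and GROUP BY
// cannot match the grouping expression against the aggregate output.
class GenericArray(val values: Array[Any])
class MapData(val keys: GenericArray, val vals: GenericArray)

val m1 = new MapData(new GenericArray(Array("k1")), new GenericArray(Array("v1")))
val m2 = new MapData(new GenericArray(Array("k1")), new GenericArray(Array("v1")))
assert(m1 != m2)  // same contents, different identity => lookup fails
```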

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Nov 11, 2020
### What changes were proposed in this pull request?

This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions whose literal maps are backed by `ArrayBasedMapData` and `GenericArrayData`.

### Why are the changes needed?

This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```

Apache Spark 2.x through 3.0.1 raises a `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```

**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```

**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 42c0b17)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Mar 17, 2021
### What changes were proposed in this pull request?

Skip capture maven repo config for views.

### Why are the changes needed?

Due to a bad network, we always use a third-party Maven repo to run tests, e.g.:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxxxx
```

It fails with an error message such as:
```
[info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
[info] show-tblproperties.sql
[info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
[info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories xxxxx]" Result did not match for query #6
[info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
```

It's not necessary to capture the Maven config in the view since it's a session-level config.
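
A hedged sketch of the fix (the helper name is illustrative, not the exact view-capture code): filter session-level keys out of the SQL configs stored with a view.

```scala
// Illustrative only: exclude session-level keys, such as the Maven repo
// setting, when capturing SQL configs into a view's properties.
val sessionOnlyKeys = Set("spark.sql.maven.additionalRemoteRepositories")

def capturedViewConfigs(sqlConfs: Map[String, String]): Map[String, String] =
  sqlConfs.filterNot { case (key, _) => sessionOnlyKeys.contains(key) }
```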
 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test passes:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxx
```

Closes apache#31856 from ulysses-you/skip-maven-config.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
cloud-fan pushed a commit that referenced this pull request Apr 19, 2021
backport [apache#31856](apache#31856) for branch-3.1

### What changes were proposed in this pull request?

Skip capture maven repo config for views.

### Why are the changes needed?

Due to a bad network, we always use a third-party Maven repo to run tests, e.g.:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxxxx
```

It fails with an error message such as:
```
[info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
[info] show-tblproperties.sql
[info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
[info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories xxxxx]" Result did not match for query #6
[info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
```

It's not necessary to capture the Maven config in the view since it's a session-level config.
 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test passes:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxx
```

Closes apache#31856 from ulysses-you/skip-maven-config.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>

Closes apache#31879 from ulysses-you/SPARK-34766-3-1.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jul 30, 2024
…to the `hive-thriftserver` module to fix the Maven daily test

### What changes were proposed in this pull request?
This PR adds bouncycastle-related test dependencies to the `hive-thriftserver` module to fix the Maven daily test.

### Why are the changes needed?
`sql-on-files.sql` added the following statement in apache#47480, which caused the Maven daily test to fail:

https://github.com/apache/spark/blob/2363aec0c14ead24ade2bfa23478a4914f179c00/sql/core/src/test/resources/sql-tests/inputs/sql-on-files.sql#L10

- https://github.com/apache/spark/actions/runs/10094638521/job/27943309504
- https://github.com/apache/spark/actions/runs/10095571472/job/27943298802

```
- sql-on-files.sql *** FAILED ***
  "" did not contain "Exception" Exception did not match for query #6
  CREATE TABLE sql_on_files.test_orc USING ORC AS SELECT 1, expected: , but got: java.sql.SQLException
  org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8542.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8542.0 (TID 8594) (localhost executor driver): java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
    at test.org.apache.spark.sql.execution.datasources.orc.FakeKeyProvider$Factory.createProvider(FakeKeyProvider.java:127)
    at org.apache.hadoop.crypto.key.KeyProviderFactory.get(KeyProviderFactory.java:96)
    at org.apache.hadoop.crypto.key.KeyProviderFactory.getProviders(KeyProviderFactory.java:68)
    at org.apache.orc.impl.HadoopShimsCurrent.createKeyProvider(HadoopShimsCurrent.java:97)
    at org.apache.orc.impl.HadoopShimsCurrent.getHadoopKeyProvider(HadoopShimsCurrent.java:131)
    at org.apache.orc.impl.CryptoUtils$HadoopKeyProviderFactory.create(CryptoUtils.java:158)
    at org.apache.orc.impl.CryptoUtils.getKeyProvider(CryptoUtils.java:141)
    at org.apache.orc.impl.WriterImpl.setupEncryption(WriterImpl.java:1015)
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:164)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:1078)
    at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:49)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:89)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:180)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:165)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:391)
    at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:107)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
    at org.apache.spark.scheduler.Task.run(Task.scala:146)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
  Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    ... 32 more
```

Because we have configured `hadoop.security.key.provider.path` as `test:///` in the parent `pom.xml`,

https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/pom.xml#L3165-L3166

`KeyProviderFactory#getProviders` will use `FakeKeyProvider$Factory` to create instances of `FakeKeyProvider`.

https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/sql/core/src/test/resources/META-INF/services/org.apache.hadoop.crypto.key.KeyProviderFactory#L18

During the initialization of `FakeKeyProvider`, it first initializes its superclass `org.apache.hadoop.crypto.key.KeyProvider`, which leads to loading the `BouncyCastleProvider` class. Therefore, we need to add bouncycastle-related test dependencies to the `hive-thriftserver` module.
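
A minimal sketch of the dependency requirement (the forced load below is illustrative; the real failure happens implicitly during superclass static initialization):

```scala
// Illustrative only: the class that must be resolvable on the
// hive-thriftserver test classpath for KeyProvider initialization.
val required = "org.bouncycastle.jce.provider.BouncyCastleProvider"

try {
  Class.forName(required)
  println(s"$required found on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println(s"$required missing: add the bouncycastle test dependencies")
}
```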

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested with this PR:

```
build/mvn -Phive -Phive-thriftserver clean install -DskipTests
build/mvn -Phive -Phive-thriftserver clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -pl sql/hive-thriftserver
```

```
Run completed in 6 minutes, 52 seconds.
Total number of tests run: 243
Suites: completed 2, aborted 0
Tests: succeeded 243, failed 0, canceled 0, ignored 20, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47496 from LuciferYang/thrift-bouncycastle.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
cloud-fan pushed a commit that referenced this pull request Feb 6, 2025
…anRelationPushDown

### What changes were proposed in this pull request?

Add the timezone information to a cast expression when the destination type requires it.

### Why are the changes needed?

When `current_timestamp()` is materialized as a string (e.g., `2024-12-27 10:26:27.684158`), the timezone information is gone, which prevents further optimization rules from being applied to the affected data source.

For example,

```
Project [1735900357973433#10 AS current_timestamp()#6]
+- 'Project [cast(2025-01-03 10:32:37.973433#11 as timestamp) AS 1735900357973433#10]
   +- RelationV2[2025-01-03 10:32:37.973433#11] xxx
```

This query fails to execute because the injected cast expression lacks timezone information.
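
A hedged sketch of the fix (simplified; `Cast`'s optional `timeZoneId` parameter is part of Catalyst, but the rule code here is illustrative): attach the session time zone when the injected cast targets a timezone-aware type.

```scala
// Illustrative only (assumes Spark's internal Catalyst API): give the
// injected cast a time zone when the destination type requires one.
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{DataType, TimestampType}

def castWithTimeZone(e: Expression, to: DataType, sessionTz: String): Expression =
  to match {
    case TimestampType => Cast(e, to, Some(sessionTz)) // timezone required
    case _             => Cast(e, to)
  }
```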

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49549 from changgyoopark-db/SPARK-50870.

Authored-by: changgyoopark-db <changgyoo.park@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Feb 10, 2025
…anRelationPushDown

### What changes were proposed in this pull request?

Add the timezone information to a cast expression when the destination type requires it.

### Why are the changes needed?

When `current_timestamp()` is materialized as a string (e.g., `2024-12-27 10:26:27.684158`), the timezone information is gone, which prevents further optimization rules from being applied to the affected data source.

For example,

```
Project [1735900357973433#10 AS current_timestamp()#6]
+- 'Project [cast(2025-01-03 10:32:37.973433#11 as timestamp) AS 1735900357973433#10]
   +- RelationV2[2025-01-03 10:32:37.973433#11] xxx
```

This query fails to execute because the injected cast expression lacks timezone information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49549 from changgyoopark-db/SPARK-50870.

Authored-by: changgyoopark-db <changgyoo.park@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 24abb0f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Mar 6, 2025
…anRelationPushDown

### What changes were proposed in this pull request?

Add the timezone information to a cast expression when the destination type requires it.

### Why are the changes needed?

When `current_timestamp()` is materialized as a string (e.g., `2024-12-27 10:26:27.684158`), the timezone information is gone, which prevents further optimization rules from being applied to the affected data source.

For example,

```
Project [1735900357973433#10 AS current_timestamp()#6]
+- 'Project [cast(2025-01-03 10:32:37.973433#11 as timestamp) AS 1735900357973433#10]
   +- RelationV2[2025-01-03 10:32:37.973433#11] xxx
```

This query fails to execute because the injected cast expression lacks timezone information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49549 from changgyoopark-db/SPARK-50870.

Authored-by: changgyoopark-db <changgyoo.park@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 24abb0f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>