
Fix Spark session timezone format #3958

Closed
wants to merge 2 commits

Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Dec 10, 2022

Why are the changes needed?

The Spark session supports setting the time zone through `spark.sql.session.timeZone` and formats values according to that time zone, but `timestamp` values are not rendered with the session time zone, producing incorrect results.

How was this patch tested?

  • Add some test cases that check the changes thoroughly, including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run tests locally before making a pull request

@cxzl25
Contributor Author

cxzl25 commented Dec 10, 2022

```
set spark.sql.session.timeZone=UTC;
select from_utc_timestamp(from_unixtime(1670404535000/1000,'yyyy-MM-dd HH:mm:ss'),'GMT+08:00') as time_utc8;
set spark.sql.session.timeZone=GMT+8;
select from_utc_timestamp(from_unixtime(1670404535000/1000,'yyyy-MM-dd HH:mm:ss'),'GMT+08:00') as time_utc8;
```

Current
```
2022-12-08 01:15:35.0
2022-12-08 01:15:35.0
```

Fix
```
2022-12-07 17:15:35.0
2022-12-08 01:15:35.0
```
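The underlying issue can be sketched outside Spark with plain `java.time` (a minimal illustration, not the Kyuubi implementation; the class and method names here are hypothetical): rendering the same instant with different session time zones must yield different strings.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class SessionTimezoneDemo {
    // Format an epoch-millis timestamp using an explicit session time zone,
    // instead of implicitly using the JVM default zone.
    static String format(long epochMilli, String sessionZone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneId.of(sessionZone))
                .format(Instant.ofEpochMilli(epochMilli));
    }

    public static void main(String[] args) {
        // Same instant, two session zones: the rendered strings differ by 8 hours.
        System.out.println(format(1670404535000L, "UTC"));    // 2022-12-07 09:15:35
        System.out.println(format(1670404535000L, "GMT+8"));  // 2022-12-07 17:15:35
    }
}
```

A formatter that ignores the session zone would print the same string in both cases, which is the symptom shown above.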

@cxzl25 cxzl25 force-pushed the fix_session_timezone branch from 6252a1f to e2fd90a Compare December 10, 2022 05:13
@codecov-commenter

codecov-commenter commented Dec 10, 2022

Codecov Report

Merging #3958 (3f2e375) into master (7aa3445) will decrease coverage by 0.02%.
The diff coverage is 87.50%.

```
@@             Coverage Diff              @@
##             master    #3958      +/-   ##
============================================
- Coverage     51.96%   51.94%   -0.02%
  Complexity       13       13
============================================
  Files           522      522
  Lines         28870    28876       +6
  Branches       3864     3864
============================================
  Hits          15001    15001
- Misses        12498    12500       +2
- Partials       1371     1375       +4
```
| Impacted Files | Coverage Δ |
|---|---|
| `...ain/scala/org/apache/kyuubi/util/RowSetUtils.scala` | 83.33% <85.71%> (ø) |
| `...org/apache/kyuubi/engine/spark/schema/RowSet.scala` | 91.17% <100.00%> (ø) |
| `...g/apache/kyuubi/operation/BatchJobSubmission.scala` | 78.43% <0.00%> (-1.31%) ⬇️ |
| `.../engine/spark/session/SparkSQLSessionManager.scala` | 78.48% <0.00%> (-1.27%) ⬇️ |
| `.../org/apache/kyuubi/session/KyuubiSessionImpl.scala` | 77.16% <0.00%> (-0.79%) ⬇️ |
| `...n/scala/org/apache/kyuubi/engine/ProcBuilder.scala` | 80.74% <0.00%> (-0.63%) ⬇️ |
| `...he/kyuubi/ha/client/etcd/EtcdDiscoveryClient.scala` | 68.50% <0.00%> (-0.56%) ⬇️ |
| `...org/apache/kyuubi/operation/ExecuteStatement.scala` | 76.25% <0.00%> (ø) |
| `...che/kyuubi/server/KyuubiTHttpFrontendService.scala` | 60.68% <0.00%> (+0.68%) ⬆️ |


@cxzl25 cxzl25 self-assigned this Dec 10, 2022
@cxzl25 cxzl25 modified the milestones: v1.7.0, v1.6.2 Dec 10, 2022
Contributor

@cfmcgrady cfmcgrady left a comment


Arrow encoding has the same problem; we can fix it in a separate PR.

@turboFei turboFei closed this in 4efd4d0 Dec 11, 2022
turboFei pushed a commit that referenced this pull request Dec 11, 2022
### _Why are the changes needed?_
The Spark session supports setting the time zone through `spark.sql.session.timeZone` and formats values according to that time zone, but `timestamp` values are not rendered with the session time zone, producing incorrect results.

### _How was this patch tested?_
- [x] Add some test cases that check the changes thoroughly, including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #3958 from cxzl25/fix_session_timezone.

Closes #3958

3f2e375 [sychen] ut
e2fd90a [sychen] session timezone format

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: fwang12 <fwang12@ebay.com>
(cherry picked from commit 4efd4d0)
Signed-off-by: fwang12 <fwang12@ebay.com>
@turboFei
Member

Thanks, merged to master and branch-1.6.

pan3793 added a commit to pan3793/kyuubi that referenced this pull request Feb 13, 2023
pan3793 added a commit that referenced this pull request Feb 14, 2023
### _Why are the changes needed?_

This PR proposes to use `org.apache.spark.sql.execution#toHiveString` to replace `org.apache.kyuubi.engine.spark.schema#toHiveString`, to get results consistent with `spark-sql` and STS.

Because of [SPARK-32006](https://issues.apache.org/jira/browse/SPARK-32006), it only works with Spark 3.1 and above.

The patch takes effect on both the thrift and arrow result formats.

### _How was this patch tested?_
- [x] Add some test cases that check the changes thoroughly, including negative and positive cases if possible

- [x] Add screenshots for manual tests if appropriate
```
➜  ~ beeline -u 'jdbc:hive2://0.0.0.0:10009/default'
Connecting to jdbc:hive2://0.0.0.0:10009/default
Connected to: Spark SQL (version 3.3.1)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://0.0.0.0:10009/default> select to_timestamp('2023-02-08 22:17:33.123456789');
+----------------------------------------------+
| to_timestamp(2023-02-08 22:17:33.123456789)  |
+----------------------------------------------+
| 2023-02-08 22:17:33.123456                   |
+----------------------------------------------+
1 row selected (0.415 seconds)
```

- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request
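The truncation in the beeline output above can be sketched with `java.time` (a hypothetical illustration, not Spark's actual `toHiveString`): Spark timestamps carry microsecond precision, so the nine-digit nanosecond literal is cut to six digits rather than rounded.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class HiveStringPrecisionDemo {
    // Drop nanosecond digits beyond microsecond precision (truncate, not round),
    // then render with a fixed six-digit fractional second.
    static String toMicrosString(String isoTimestamp) {
        LocalDateTime ts = LocalDateTime.parse(isoTimestamp);
        LocalDateTime micros = ts.withNano(ts.getNano() / 1000 * 1000);
        return micros.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS"));
    }

    public static void main(String[] args) {
        System.out.println(toMicrosString("2023-02-08T22:17:33.123456789"));
        // 2023-02-08 22:17:33.123456
    }
}
```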

Closes #4318 from pan3793/hive-string.

Closes #4316

ba9016f [Cheng Pan] nit
8be774b [Cheng Pan] nit
bd696fe [Cheng Pan] nit
b5cf051 [Cheng Pan] fix
dd6b702 [Cheng Pan] test
63edd34 [Cheng Pan] nit
37cc70a [Cheng Pan] Fix python ut
c66ad22 [Cheng Pan] [KYUUBI #4316] Fix returned Timestamp values may lose precision
41d9444 [Cheng Pan] Revert "[KYUUBI #3958] Fix Spark session timezone format"

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 added a commit that referenced this pull request Feb 14, 2023
(cherry picked from commit 8fe7947)
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 pushed a commit that referenced this pull request Feb 18, 2023
…ed result format

### _Why are the changes needed?_

1. This PR introduces a new configuration, `kyuubi.operation.result.arrow.timestampAsString`; when true, arrow-based rowsets convert timestamp-type columns to strings for transmission.

2. `kyuubi.operation.result.arrow.timestampAsString` defaults to false for better transmission performance.

3. This PR fixes the timezone issue in the arrow-based result format described in #3958.

### _How was this patch tested?_
- [x] Add some test cases that check the changes thoroughly, including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request

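The "timestamp as string" idea can be sketched as follows (a minimal illustration under stated assumptions, not the Kyuubi code; the class and method names are hypothetical): each epoch-microsecond cell is rendered with the session time zone on the engine side, so the client receives an already-correct string instead of a raw timestamp it might format with the wrong zone.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampAsStringDemo {
    // Render an epoch-microseconds value as a string using the session time zone,
    // before the rowset is serialized for transmission.
    static String cellToString(long epochMicros, String sessionZone) {
        Instant instant = Instant.ofEpochSecond(
                Math.floorDiv(epochMicros, 1_000_000L),
                Math.floorMod(epochMicros, 1_000_000L) * 1_000L);
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
                .withZone(ZoneId.of(sessionZone))
                .format(instant);
    }

    public static void main(String[] args) {
        System.out.println(cellToString(1670404535000000L, "GMT+8"));
        // 2022-12-07 17:15:35.000000
    }
}
```

The trade-off named in the description follows directly: strings are larger on the wire than 8-byte timestamps, hence the default of false.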
Closes #4326 from cfmcgrady/arrow-string-ts.

Closes #4326

38c7fc9 [Fu Chen] fix style
d864db0 [Fu Chen] address comment
b714b3e [Fu Chen] revert externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala
6c4eb50 [Fu Chen] minor
289b600 [Fu Chen] timstampAsString = false by default
78b7cab [Fu Chen] fix
f560135 [Fu Chen] debug info
b8e4b28 [Fu Chen] fix ut
87c6f9e [Fu Chen] update docs
86f6cb7 [Fu Chen] arrow based rowset timestamp as string

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 pushed a commit that referenced this pull request Feb 18, 2023
(cherry picked from commit 6bd0016)
Signed-off-by: Cheng Pan <chengpan@apache.org>