
[Bug][Spark]: Spark write timestamp value to parquet as UTC+0 hour. #686

Closed
1 task done
baiyangtx opened this issue Nov 22, 2022 · 7 comments · Fixed by #721
Labels
module:mixed-spark Spark module for Mixed Format priority:blocker security, data-loss, correctness, etc. type:bug Something isn't working
Milestone

Comments

@baiyangtx
Contributor

What happened?

When the table schema contains a timestamp-without-zone field, the value the Spark engine writes to the Parquet file is the UTC+0 value, but it should be the value in the current time zone.

Affects Versions

0.4.0

What engines are you seeing the problem on?

Spark

How to reproduce

select via flink

(screenshot)

select via spark

(screenshot)

table schema

(screenshot)

table files

(screenshot)

This appears to be a Spark-side problem.

Relevant log output

No response

Anything else

For the SQL `insert into test_db.test_table values (7, 'randy', timestamp('2022-07-03 19:11:00'));`

The Flink connector writes the data as:

(screenshot)

The Spark connector writes the data as:

(screenshot)

Code of Conduct

  • I agree to follow this project's Code of Conduct
@baiyangtx baiyangtx added the type:bug Something isn't working label Nov 22, 2022
@baiyangtx
Contributor Author

The Spark connector writes the value to the file as (value - 8h). It seems Spark handles timestamp-without-zone by parsing the value as timestamp-with-zone in the configured Spark time zone, then writing it to the file as a UTC timestamp, so the stored value ends up as (value - 8h).
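The suspected write path can be sketched as follows (a minimal plain-Python illustration, not Spark internals; the GMT+8 session zone is an assumption based on the screenshots):

```python
from datetime import datetime, timezone, timedelta

# Suspected write path: Spark parses the zone-less literal in the
# session time zone (assumed GMT+8 here), then stores the instant
# in the Parquet file as a UTC timestamp.
session_tz = timezone(timedelta(hours=8))

literal = datetime(2022, 7, 3, 19, 11, 0)         # timestamp('2022-07-03 19:11:00')
as_instant = literal.replace(tzinfo=session_tz)   # reinterpreted in the session zone
stored_utc = as_instant.astimezone(timezone.utc)  # what lands in the file

print(stored_utc)  # 2022-07-03 11:11:00+00:00 -> the reported (value - 8h)
```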

@baiyangtx baiyangtx added this to the Release 0.4.0 milestone Nov 22, 2022
@baiyangtx baiyangtx added module:mixed-spark Spark module for Mixed Format priority:blocker security, data-loss, correctness, etc. labels Nov 22, 2022
@hellojinsilei
Contributor

It seems that not only Arctic tables have this problem. In my test with an Iceberg table, when Flink writes the table and Spark reads it, the time read by Spark is also shifted to (value + 8h).
After checking, it is confirmed that Spark automatically converts the UTC timestamp to the local time zone (GMT+8) when reading timestamp data. The timestamps Flink writes into Iceberg are UTC values, so when Spark reads them, each timestamp is shifted to (value + 8h).
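A minimal sketch of this read-side conversion (plain Python, not Spark code; GMT+8 stands in for the local session zone):

```python
from datetime import datetime, timezone, timedelta

# Read path described above: Flink stores the literal value as-is
# (treated as UTC), and Spark converts the UTC timestamp to the
# local session zone (GMT+8) when reading.
local_tz = timezone(timedelta(hours=8))

written_by_flink = datetime(2022, 7, 3, 19, 11, 0, tzinfo=timezone.utc)
read_by_spark = written_by_flink.astimezone(local_tz)

print(read_by_spark)  # 2022-07-04 03:11:00+08:00 -> the reported (value + 8h)
```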

@hellojinsilei
Contributor

Add the following configuration to spark-default.conf:

spark.driver.extraJavaOptions -Duser.timezone=UTC
spark.executor.extraJavaOptions -Duser.timezone=UTC
spark.sql.session.timeZone UTC

Making sure that the same time zone is used for both writing and reading temporarily works around this problem.
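The effect of pinning both sides to one zone can be illustrated like this (a plain-Python sketch of the round trip, assuming both writer and reader interpret values in UTC):

```python
from datetime import datetime, timezone

# Why pinning both sides to UTC works around the bug: when the writer
# and the reader use the same session time zone, the wall-clock value
# survives the round trip unchanged.
tz = timezone.utc

literal = datetime(2022, 7, 3, 19, 11, 0)
stored = literal.replace(tzinfo=tz)  # written in the pinned zone
read_back = stored.astimezone(tz)    # read back in the same zone

assert read_back.replace(tzinfo=None) == literal  # no +/- 8h shift
```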

@zhoujinsong
Contributor

@hellojinsilei
Can we solve this problem by setting the time zone to GMT+8 in Flink?

@hellojinsilei
Contributor

@hellojinsilei Can we solve this problem by setting the time zone to GMT+8 in Flink?

Flink cannot write the timestamp type with a time zone.

@lklhdu Can we do that?

@lklhdu
Contributor

lklhdu commented Nov 22, 2022

@hellojinsilei Can we solve this problem by setting the time zone to GMT+8 in Flink?

Flink cannot write the timestamp type with a time zone.

@lklhdu Can we do that?

In my understanding, TIMESTAMP in Flink is a type without a time zone attribute; if we need the field to carry a time zone, we should use TIMESTAMP WITH LOCAL TIME ZONE.
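As a rough analogy in plain Python (not Flink code): TIMESTAMP behaves like a naive wall-clock datetime, while TIMESTAMP WITH LOCAL TIME ZONE behaves like an instant that gets rendered in the session zone:

```python
from datetime import datetime, timezone, timedelta

# TIMESTAMP ~ wall-clock value, no zone attached (naive datetime):
# it means "19:11" everywhere and never shifts.
wall_clock = datetime(2022, 7, 3, 19, 11, 0)

# TIMESTAMP WITH LOCAL TIME ZONE ~ a fixed instant (aware datetime):
# its display shifts with the zone it is rendered in.
instant = datetime(2022, 7, 3, 19, 11, 0, tzinfo=timezone.utc)

gmt8 = timezone(timedelta(hours=8))
print(wall_clock)                # 2022-07-03 19:11:00
print(instant.astimezone(gmt8))  # 2022-07-04 03:11:00+08:00
```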

@baiyangtx
Contributor Author

baiyangtx commented Nov 22, 2022

@zhoujinsong The table was created in the terminal via Spark SQL. I think the timestamp field should be timestamp with-zone, because the Spark engine doesn't support the timestamp without-zone field type.

For a Hive-adapted table, Spark's timestamp field type should default to timestamp without-zone; for a non-Hive-adapted table, it should default to timestamp with-zone.
