[SPARK-42442][SQL] Use spark.sql.timestampType for data source inference #40022
What changes were proposed in this pull request?
With the configuration spark.sql.timestampType, TIMESTAMP in Spark is a user-specified alias for one of the TIMESTAMP_LTZ and TIMESTAMP_NTZ variations. This is already quite complicated for Spark users. There is another option, spark.sql.sources.timestampNTZTypeInference.enabled, for schema inference. I intended to introduce it in #40005, but having two flags seems like too much. After some thought, I decided to merge spark.sql.sources.timestampNTZTypeInference.enabled into spark.sql.timestampType and let spark.sql.timestampType control the schema inference behavior as well. We can have follow-ups that add a data source option "inferTimestampNTZType" for CSV/JSON/partition columns, as the JDBC data source already does.
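As a rough illustration (not code from this PR), the merged behavior can be exercised with a sketch like the one below. The file path /tmp/events.csv is hypothetical; the expectation is that with spark.sql.timestampType set to TIMESTAMP_NTZ, schema inference maps timestamp-like values to TIMESTAMP_NTZ instead of TIMESTAMP_LTZ.

```scala
import org.apache.spark.sql.SparkSession

object TimestampTypeInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("timestampType inference sketch")
      // After this PR, this single flag should also control schema
      // inference in data sources, not just the meaning of TIMESTAMP.
      .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/events.csv") // hypothetical file containing a timestamp column

    // Expected to print the timestamp column as timestamp_ntz rather than
    // timestamp (the TIMESTAMP_LTZ default).
    df.printSchema()

    spark.stop()
  }
}
```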
Why are the changes needed?
Make the new feature simpler.
Does this PR introduce any user-facing change?
No, the feature is not released yet.
How was this patch tested?
Existing unit tests.
I also verified that the flag INFER_TIMESTAMP_NTZ_IN_DATA_SOURCES was fully removed.