[SPARK] Spark parquet read timestamp without timezone #2282
Conversation
Thanks so much for writing this up! I'll take a look soon
```java
    return TimestampType$.MODULE$;
  }
  throw new UnsupportedOperationException(
      "Spark does not support timestamp without time zone fields");
```
Could we do the flag check here too, to verify whether the "handle timestamp without timezone" flag is enabled? We may be using this off the read path (like in the migrate/snapshot code), and it would be good to catch it here as well and make sure users know what is happening.
I think this might involve a bigger refactor, including changing the method signature to accept the flag, e.g. .primitive(..., handleTimestampWithoutTimezoneFlag). I'm not sure if that would break other things. I added the logic as comments for now, showing how to implement it once we settle on the approach.
I think that's fair, I can always fix this in the migrate code directly
```java
private static List<Row> read(String table, boolean vectorized, String select0, String... selectN) {
  Dataset<Row> dataset = spark.read().format("iceberg")
      .option("vectorization-enabled", String.valueOf(vectorized))
      .option("read-timestamp-without-zone", "true")
```
Can we add in a test for the error message and exception when this flag is not true?
ah nvm, I see the test above
```java
// is adjusted so that the corresponding time in the reader timezone is displayed.
// When set to false (default), we throw an exception at runtime
// "Spark does not support timestamp without time zone fields" if reading timestamp without time zone fields
this.readTimestampWithoutZone = options.get("read-timestamp-without-zone").map(Boolean::parseBoolean).orElse(false);
```
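The pattern in this snippet, reading an optional boolean flag with a default, can be sketched in isolation. A minimal self-contained version, using a plain `Map` in place of Spark's options object (class name and behavior here are illustrative, not the actual Iceberg code):

```java
import java.util.Map;
import java.util.Optional;

// Minimal sketch of the option-parsing pattern above. A plain Map stands in
// for Spark's DataSourceOptions; the class name is illustrative.
class ReadOptions {
  private final Map<String, String> options;

  ReadOptions(Map<String, String> options) {
    this.options = options;
  }

  // Mirrors options.get(key).map(Boolean::parseBoolean).orElse(defaultValue)
  boolean getBoolean(String key, boolean defaultValue) {
    return Optional.ofNullable(options.get(key))
        .map(Boolean::parseBoolean)
        .orElse(defaultValue);
  }
}
```

Note that `Boolean.parseBoolean` treats any value other than (case-insensitive) "true" as false, so a typo in the option value silently disables the flag.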
Maybe I'm being silly, but I would like "spark" to be in this flag, like "spark handle timestamp without zone", and have it cover both reading and writing.
addressed it
```java
private StructType lazyType() {
  if (type == null) {
    Preconditions.checkArgument(readTimestampWithoutZone || !hasTimestampWithoutZone(lazySchema()),
        "Spark does not support timestamp without time zone fields");
```
I would like this to be a bit more elaborate. Something like
"Cannot handle timestamp without timezone fields in Spark. Spark does not natively support this type but if you would like to handle all timestamps as timestamp with timezone set 'flag name' to true. This will not change the underlying values stored but will change their displayed values in Spark. For more information see ... some website reference?"
I have been looking for a website to use as a reference in the error message, and I'm having a hard time finding a specific page that talks about Spark not supporting timestamp without timezone fields. Could I please get some pointers on where to look? I saw some relevant information on this site:
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps - That is probably enough info
addressed it
```java
// is adjusted so that the corresponding time in the reader timezone is displayed.
// When set to false (default), we throw an exception at runtime
// "Spark does not support timestamp without time zone fields" if reading timestamp without time zone fields
this.readTimestampWithoutZone = options.getBoolean("read-timestamp-without-zone", false);
```
We should probably have the flag name as a constant somewhere, maybe in the Spark Util class?
addressed it
```java
public StructType readSchema() {
  if (readSchema == null) {
    Preconditions.checkArgument(readTimestampWithoutZone || !hasTimestampWithoutZone(expectedSchema),
        "Spark does not support timestamp without time zone fields");
```
Same comment as above on the error message, since it's gonna be in a few places we should probably have the error message as a constant too.
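As a sketch of what the shared constants could look like, assuming a utility class along the lines discussed above (class name, flag name, and wording here are illustrative, not the actual Iceberg API):

```java
// Sketch: keep the flag name and error message as constants in one place so
// the read path, write path, and migrate code all reference the same strings.
final class SparkUtilSketch {
  // Flag name taken from the option discussed later in this thread.
  static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE =
      "spark-handle-timestamp-without-timezone";

  // Error message along the lines Russell suggested; wording is illustrative.
  static final String TIMESTAMP_WITHOUT_TIMEZONE_ERROR = String.format(
      "Cannot handle timestamp without timezone fields in Spark. "
          + "Spark does not natively support this type but if you would like to handle all "
          + "timestamps as timestamp with timezone set '%s' to true. "
          + "This will not change the underlying values stored but will change their "
          + "displayed values in Spark.",
      HANDLE_TIMESTAMP_WITHOUT_TIMEZONE);

  private SparkUtilSketch() {
  }
}
```

With the constants in place, every `Preconditions.checkArgument(...)` call and every test expecting the message can reference them instead of repeating string literals.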
addressed it
```java
  return new ReaderFactory(readUsingBatch ? batchSize : 0);
}

private static boolean hasTimestampWithoutZone(Schema schema) {
```
Another candidate for the public utility class
addressed it
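For illustration, the schema walk behind such a `hasTimestampWithoutZone` utility can be sketched with a toy field tree (these simplified types are stand-ins for Iceberg's `Schema`/`Types` classes, which use a visitor rather than this ad-hoc recursion):

```java
import java.util.Arrays;
import java.util.List;

// Toy model of hasTimestampWithoutZone: recursively scan nested fields for a
// timestamp type that is not adjusted to UTC ("timestamp" vs "timestamptz").
class SchemaField {
  final String name;
  final String type;            // e.g. "timestamp", "timestamptz", "string", "struct"
  final List<SchemaField> children;

  SchemaField(String name, String type, SchemaField... children) {
    this.name = name;
    this.type = type;
    this.children = Arrays.asList(children);
  }

  static boolean hasTimestampWithoutZone(SchemaField field) {
    if ("timestamp".equals(field.type)) {  // timestamp without zone
      return true;
    }
    // Recurse into structs/lists/maps so deeply nested fields are caught too.
    return field.children.stream().anyMatch(SchemaField::hasTimestampWithoutZone);
  }
}
```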
```java
@Test
public void testUnpartitionedTimestampWithoutZoneError() {
  exception.expect(IllegalArgumentException.class);
  exception.expectMessage("Spark does not support timestamp without time zone fields");
```
We have some helpers for checking exceptions are thrown, see:

```java
/**
 * A convenience method to avoid a large number of @Test(expected=...) tests
 * @param message A String message to describe this assertion
 * @param expected An Exception class that the Runnable should throw
 * @param callable A Callable that is expected to throw the exception
 */
public static void assertThrows(String message,
                                Class<? extends Exception> expected,
                                Callable callable) {
  assertThrows(message, expected, null, callable);
}
```
addressed it
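To make the suggestion concrete, here is a minimal stand-in for such a helper plus a usage example. This is a sketch of the pattern, not Iceberg's actual `AssertHelpers` implementation; the message-substring parameter is an assumption based on how the overload above is called:

```java
import java.util.concurrent.Callable;

// Sketch of an assertThrows-style test helper: run the callable, fail if no
// exception is thrown, the type is wrong, or the message lacks expected text.
class AssertHelpersSketch {
  static void assertThrows(String message,
                           Class<? extends Exception> expected,
                           String containedInMessage,
                           Callable callable) {
    try {
      callable.call();
      throw new AssertionError(message + ": no exception thrown");
    } catch (Exception e) {
      if (!expected.isInstance(e)) {
        throw new AssertionError(message + ": wrong exception type " + e.getClass());
      }
      if (containedInMessage != null && !e.getMessage().contains(containedInMessage)) {
        throw new AssertionError(message + ": message did not contain expected text");
      }
    }
  }
}
```

A test like `testUnpartitionedTimestampWithoutZoneError` could then assert both the exception type and the "timestamp without time zone" message in one call, instead of `exception.expect(...)` rule setup.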
Left a few comments! Very excited for this to finally get fixed :) Any chance you want to do the "write" pathway as well? I don't think it would be that many more changes to the code.

Also, nit: you'll probably want to fetch apache master and rebase this branch onto it to drop the "merge 1 commit" from the commits in the PR.

Yea, I can attempt to do the writes as well. Should I do it in the same PR?

You could do another if you like, but I think it's also reasonable to do it here. Up to you.

I can make the changes you mentioned and also open a new PR to comply with the project's commit history standards. I think I missed a few things that you mentioned, like not prefixing my commits with the project that the commits relate to. Should I open a new PR after making the changes and close this one? EDIT: just did what you mentioned, and was able to rebase.
Okay, I'll just do it in this one |
Force-pushed 61c0c23 to 4444d58 (…ts-as-tstz; conflicts in spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java and spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java)
@shardulm94 Hey! Could you please take a look over this too :)

@shardulm94 ping :)

@bkahloon are you still interested in this? Just checking to see whether you were going to implement the write support as well?

@RussellSpitzer yes, I'm still interested. I'll try to get it done within the next two weeks.

@bkahloon ping :)

Hi @RussellSpitzer, sorry, I had been busy with some other obligations. I will work on the write support according to your suggestions.

@bkahloon Hi! Thanks for working on this PR. My team is blocked by the absence of this feature in Spark. We would gladly help with write support. We will have to implement it in a fork anyway, but we would rather join the existing effort. So could you share the status? How can we help to speed up the progress?

Hi @daniil-dubin, thank you for following up with me. If you'd like to take over the write support from me, that would help me out. I can point you in the direction that Russell had pointed me, if you'd like.

Yes, I could do that. Please share what Russell told you.

Hi @daniil-dubin, sorry for the delay. According to @RussellSpitzer's comments, you will need to add an additional case statement handling the timestamp-without-timezone write operation. For example, for the ORC Spark writer you can add an additional statement for timestamp without timezone. Russell also mentioned that you should probably add the warning flag like we did in this PR. If you see this piece of code
Hi, I am working on write support together with @daniil-dubin. We can set the flag when reading through the DataFrame API:

```scala
spark
  .read
  .format("iceberg")
  .option("spark-handle-timestamp-without-timezone", "true")
  .load("some.table.name")
```

but we do not have the ability to set this flag when using Spark SQL:

```scala
spark.sql("select * from some.table.name") // <- will fail in current implementation
```

Our proposal is to move this flag to a Spark session property. @bkahloon @RussellSpitzer, what do you think about it?
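The resolution order implied by that proposal, per-read option first, then the session property, then a default of false, could be sketched as follows. Plain `Map`s stand in for Spark's read options and session conf here; the method and class names are illustrative:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of flag resolution: an explicit read option wins over the
// session-level property, and the default is false when neither is set.
class FlagResolution {
  static final String FLAG = "spark-handle-timestamp-without-timezone";

  static boolean handleTimestampWithoutZone(Map<String, String> readOptions,
                                            Map<String, String> sessionConf) {
    return Optional.ofNullable(readOptions.get(FLAG))
        .or(() -> Optional.ofNullable(sessionConf.get(FLAG)))  // fall back to session conf
        .map(Boolean::parseBoolean)
        .orElse(false);
  }
}
```

This keeps `spark.sql(...)` working (via the session property) while still letting an individual read override the session-wide setting.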
Sounds good to me, I have no problem with this being a session parameter. Let me know when the changes are up |
@RussellSpitzer thank you for the response. The changes are ready, but I can't push to the current git branch, so I created another PR to my forked version of Iceberg. Short description about changes: @RussellSpitzer @bkahloon I would really appreciate a review.

@sshkvar Sorry, I was on vacation; feel free to just make a pull request against OSS Iceberg.

Added a review on the other pull request. Ping me when there is a new copy against apache/iceberg @sshkvar

@sshkvar Any update? Let me know when you have a PR up, or if you don't have time to handle this just let me know and I'll pick it up.

@RussellSpitzer sorry for the delay, I was fully busy with my project tasks. I will try to create a PR to Iceberg and address the comments this week.

@RussellSpitzer I have opened a new PR: #2757

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Porting over @shardulm94's code. Addresses the previous issue of bringing over LinkedIn's fork of the Iceberg code.
Original PR from Shardul: https://github.com/linkedin/iceberg/pull/48/files