Support microsecond precision during copy unload #492

Open
wants to merge 17 commits into master
Conversation

@arthurli1126 commented Mar 8, 2023

The Snowflake connector uses the legacy SimpleDateFormat for date/time parsing during copy unload, and SimpleDateFormat internally only supports millisecond precision. For instance, for the string 2023-03-01 07:54:56.191173 it treats the fractional part as 191173 milliseconds, so it adds 191000 / 1000 / 60 = 3 min 11 s to the time and puts the remaining 173 ms into the milliseconds field, producing 2023-03-01 07:58:07.173000.
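
To illustrate, here is a minimal standalone Scala sketch (illustration only, not connector code) showing how SimpleDateFormat reads a six-digit fraction as milliseconds while java.time keeps the full microsecond precision:

import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.TimeZone

object MicrosecondParseDemo {
  def main(args: Array[String]): Unit = {
    val input = "2023-03-01 07:54:56.191173"

    // Legacy parser: ".SSSSSS" is read as 191173 *milliseconds*,
    // so roughly 3 min 11 s roll over into the minutes/seconds fields.
    val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS")
    legacy.setTimeZone(TimeZone.getTimeZone("UTC"))
    val printer = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
    printer.setTimeZone(TimeZone.getTimeZone("UTC"))
    println(printer.format(legacy.parse(input)))   // 2023-03-01 07:58:07.173

    // java.time parser: the fraction is kept as 191173 microseconds.
    val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
    println(LocalDateTime.parse(input, modern))    // 2023-03-01T07:54:56.191173
  }
}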

@@ -193,4 +193,88 @@ class ConversionsSuite extends FunSuite {

assert(expect == result.toString())
}

test("Data with micro-seconds and nano-seconds precision should be correctly converted"){
@sfc-gh-mrui (Contributor) commented Mar 10, 2023
This is a unit test; it proves that the parser can handle timestamps with micro/nano-second precision.
Could you please add an integration test (AKA end-to-end test)?

@sfc-gh-mrui (Contributor) left a comment
https://github.com/snowflakedb/spark-snowflake/blob/master/src/it/scala/net/snowflake/spark/snowflake/SnowflakeResultSetRDDSuite.scala#L711 is the test case for the new Arrow format read (non-COPY-UNLOAD).
It is disabled for COPY UNLOAD. It would be great if you can make it work.

// COPY UNLOAD can't be run because it only supports millisecond(0.001s).
if (!params.useCopyUnload) {
val result = sparkSession.sql("select * from test_table_timestamp")
Contributor commented:
This test case failed.

@arthurli1126 (Author) commented:
@Mingli-Rui Yeah, sorry, I haven't finished the change yet; I should have mentioned this. I will push another commit this week.

Contributor commented:

It would be great to add new test cases instead of changing the existing one, e.g. test("testTimestamp copy unload").

  1. With the new test, you can set the options below in the new test cases only:
    thisConnectorOptionsNoTable += ("timestamp_ntz_output_format" -> "YYYY-MM-DD HH24:MI:SS.FF6")
    thisConnectorOptionsNoTable += ("timestamp_ltz_output_format" -> "TZHTZM YYYY-MM-DD HH24:MI:SS.FF6")
    thisConnectorOptionsNoTable += ("timestamp_tz_output_format" -> "TZHTZM YYYY-MM-DD HH24:MI:SS.FF6")
  2. With the new test case, you can verify that the new internal parameter (suggested in another comment) can disable/enable the fix; a rough sketch follows this list.
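
A rough sketch of what such a dedicated test could look like (the test name, thisConnectorOptionsNoTable, sparkSession, and test_table_timestamp follow the suite's existing conventions; the body here is an assumption, not the PR's actual code):

test("testTimestamp copy unload") {
  // Force microsecond output formats for this test case only.
  thisConnectorOptionsNoTable += ("timestamp_ntz_output_format" -> "YYYY-MM-DD HH24:MI:SS.FF6")
  thisConnectorOptionsNoTable += ("timestamp_ltz_output_format" -> "TZHTZM YYYY-MM-DD HH24:MI:SS.FF6")
  thisConnectorOptionsNoTable += ("timestamp_tz_output_format" -> "TZHTZM YYYY-MM-DD HH24:MI:SS.FF6")

  // Read the table back through COPY UNLOAD and check that the
  // sub-millisecond digits survive the round trip.
  val df = sparkSession.read
    .format("net.snowflake.spark.snowflake")
    .options(thisConnectorOptionsNoTable)
    .option("dbtable", "test_table_timestamp")
    .option("use_copy_unload", "true")
    .load()

  val ts = df.collect().head.getTimestamp(0)
  assert(ts.getNanos % 1000000 != 0, "microsecond digits were lost")
}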

@arthurli1126 (Author) commented:
@sfc-gh-mrui The test cases have been fixed, please review. Thanks!

@@ -193,7 +196,23 @@ private[snowflake] object Conversions {
* Parse a string exported from a Snowflake TIMESTAMP column
*/
private def parseTimestamp(s: String, isInternalRow: Boolean): Any = {
// Need to handle the nanoseconds field separately
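
For illustration, one way to handle the fractional field separately could look like the sketch below (hedged; parseTimestampMicros and its details are illustrative, not the actual change in this PR, and the sketch ignores time-zone offsets):

import java.sql.Timestamp

// Split off the fractional part so the legacy parser never sees more than
// whole seconds, then set the fraction back as nanoseconds.
def parseTimestampMicros(s: String): Timestamp = {
  val dot = s.indexOf('.')
  if (dot < 0) {
    Timestamp.valueOf(s)
  } else {
    val wholeSeconds = s.substring(0, dot)   // e.g. "2023-03-01 07:54:56"
    val fraction = s.substring(dot + 1)      // e.g. "191173"
    // Right-pad (or truncate) the fraction to 9 digits, i.e. nanoseconds.
    val nanos = (fraction + "000000000").take(9).toInt
    val ts = Timestamp.valueOf(wholeSeconds)
    ts.setNanos(nanos)
    ts
  }
}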
Contributor commented:
Could you please add an internal parameter to enable/disable the change? It should be enabled by default.
If some users are broken by it, they can disable the fix as a workaround; a hypothetical sketch follows below.
For example, https://github.com/snowflakedb/spark-snowflake/blob/master/src/main/scala/net/snowflake/spark/snowflake/Parameters.scala#LL163C14-L163C14
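
For context, a hypothetical sketch of such an internal switch (the parameter name and helper below are assumptions modeled on the boolean parameters in Parameters.scala, not the connector's actual API):

object MicrosecondUnloadParam {
  // Internal parameter: enabled by default so users get the fix, but it can
  // be set to "false" as a workaround if a workload regresses.
  val PARAM_INTERNAL_SUPPORT_MICROSECONDS_DURING_UNLOAD: String =
    "internal_support_microseconds_during_unload"

  def isEnabled(parameters: Map[String, String]): Boolean =
    parameters
      .getOrElse(PARAM_INTERNAL_SUPPORT_MICROSECONDS_DURING_UNLOAD, "true")
      .toBoolean
}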

@arthurli1126 (Author) commented:

Hmm, do you mean a parameter such as internal_support_micro_second_during_unload? IMO we should always try to support microsecond-level precision, since direct JDBC supports it by default. Having two code paths and two sets of timestamp patterns would make this part of the code confusing.

Contributor commented:

I agree users need this fix; that is why we can set the internal parameter to true by default.
It is internal policy to introduce a parameter that can disable a fix where possible.
The internal parameter can be removed later, and the code cleaned up accordingly.

@arthurli1126 (Author) commented:

Thanks for the suggestion. Will push a new commit to add this parameter.

@sfc-gh-mrui (Contributor) commented:

@arthurli1126 Let me know when you are ready for review. So far, the test case below failed.

23/03/27 19:12:32 INFO DAGScheduler: Job 109 finished: collect at DataTypesIntegrationSuite.scala:244, took 0.006617 s
[info] - filter on timestamp column *** FAILED ***

@arthurli1126 (Author) commented:

@sfc-gh-mrui Yeah, the tests pass locally and with Instacart's Snowflake account (on AWS). I realized the test only fails with Azure and GCP copy unload, so I'm wondering whether Snowflake does anything differently when unloading to GCP/Azure. However, I'm not able to verify it because my company only has an AWS deployment. Is this something you can help with, or can you provide a temporary GCP test account?

@arthurli1126 (Author) commented:

Looks like I can sign up for a Snowflake trial account on GCP. Let me also try that.

arthurli1126 and others added 11 commits June 23, 2023 13:45
…nowflakedb#502)

* SNOW-763124 Support uploading files with down-scoped token for Snowflake Gcs accounts

* Update JDBC and SC version
…nowflakedb#504)

* SNOW-770051 Fix a potential wrong result issue for crossing schema join/union

* Revise error message

* Simplify canUseSameConnection()

* Use toSet.size == 1 for duplication check.
* SNOW-824420 Update SC version to 2.12.0

* Update spark version in build_image.sh
…:arthurli1126/spark-snowflake into support-micro-seconds-during-copy-unload
@arthurli1126 (Author) commented:

Sorry about the long break; I finally got a chance to look into this again. For the workflows in different environments, do they need approval from your side to run? @sfc-gh-mrui

@arthurli1126 (Author) commented:

Hi @sfc-gh-mrui, I was wondering if there's a way to trigger the test again? Thanks 🙏
