
[BUG] spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored #144

Closed
revans2 opened this issue Jun 10, 2020 · 5 comments · Fixed by #435
Assignees: revans2
Labels: bug (Something isn't working), P0 (Must have for release), SQL (part of the SQL/Dataframe plugin)

Comments

revans2 (Collaborator) commented Jun 10, 2020

Describe the bug
When writing Parquet, the config spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored. We should at least check it and verify which behavior we support.
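For context, a minimal sketch of how a user would exercise this config; the session setup and output path are just illustrative:

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession

// Illustrative session; in practice this would run with the RAPIDS plugin enabled.
val spark = SparkSession.builder().appName("rebase-write-example").getOrCreate()
import spark.implicits._

// Ask Spark to rebase ancient dates/timestamps to the hybrid Julian+Gregorian calendar on write.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")

// Dates before the Gregorian cutover (1582-10-15) are where the two calendars diverge.
Seq(Date.valueOf("1500-01-01")).toDF("d")
  .write.mode("overwrite").parquet("/tmp/ancient_dates")
```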

revans2 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and SQL (part of the SQL/Dataframe plugin) labels on Jun 10, 2020
sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jun 29, 2020
sameerz added the P0 (Must have for release) label on Jul 22, 2020
sameerz (Collaborator) commented Jul 22, 2020

We need to understand what spark.sql.legacy.parquet.datetimeRebaseModeInWrite does, and then we will reprioritize.

revans2 (Collaborator, Author) commented Jul 23, 2020

The following is from the docs for that config.

When LEGACY, Spark will rebase dates/timestamps from Proleptic Gregorian calendar to the legacy hybrid
(Julian + Gregorian) calendar when writing Parquet files. When CORRECTED, Spark will not do rebase and
write the dates/timestamps as it is. When EXCEPTION, which is the default, Spark will fail the writing if it sees
ancient dates/timestamps that are ambiguous between the two calendars.

There is also a corresponding spark.sql.legacy.parquet.datetimeRebaseModeInRead config. The main difference with this one is that Spark tries to be smart when reading the data:

This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Because we have issues with reading/writing dates/times around this period anyway, I'm not sure how critical it is that we fix it.

The function that decides what to do for reads is here:

https://github.com/apache/spark/blob/a8e3de36e7d543f1c7923886628ac3178f45f512/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L90-L108
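Paraphrasing that function (simplified; see the link above for the authoritative version): Spark checks the Parquet footer key/value metadata for the Spark version that wrote the file and for a marker that LEGACY-mode writers add, and only falls back to the session config when the writer is unknown. Roughly:

```scala
object RebaseMode extends Enumeration { val LEGACY, CORRECTED, EXCEPTION = Value }

// Simplified paraphrase of DataSourceUtils.datetimeRebaseMode; not a verbatim copy.
def datetimeRebaseMode(
    lookupFileMeta: String => String,  // looks up a key in the Parquet footer metadata
    modeByConfig: String): RebaseMode.Value = {
  Option(lookupFileMeta("org.apache.spark.version")).map { version =>
    // Spark 2.4 and earlier always wrote hybrid-calendar values; Spark 3.x marks
    // LEGACY-mode writes with an extra metadata key. Both cases need a rebase on read.
    if (version < "3.0.0" || lookupFileMeta("org.apache.spark.legacyDateTime") != null) {
      RebaseMode.LEGACY
    } else {
      RebaseMode.CORRECTED
    }
  }.getOrElse(RebaseMode.withName(modeByConfig))  // writer unknown: honor the session config
}
```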

revans2 (Collaborator, Author) commented Jul 23, 2020

So the next step is to decide if this should be a feature of cudf or if it should be something we only do in Spark. The reading side feels like it should be a part of cudf: you really want cudf to be able to read data from any data source correctly. On the write side we need to understand what cudf currently does for timestamps in this range, and then we can adjust appropriately. We are also going to want to be sure that we insert the proper metadata so that the Spark CPU can read the data correctly too.

revans2 (Collaborator, Author) commented Jul 23, 2020

A little more information: when I look through the Parquet reader/writer code in cudf, it looks like there is no special-case processing for DATE or TIMESTAMP. This should correspond to the "CORRECTED" mode for Spark. For writes we can probably look at this config and the types in the schema to decide whether we can support the write or not. We could also add support for throwing an exception when the config is set to "EXCEPTION".
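A sketch of what that write-side check could look like; the object and method names here are hypothetical, not the actual plugin API:

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DateType, StructType, TimestampType}

// Hypothetical helper: decide whether the GPU Parquet writer can honor the
// rebase config for a given output schema.
object GpuParquetWriteRebaseCheck {
  private val RebaseConf = "spark.sql.legacy.parquet.datetimeRebaseModeInWrite"

  // Returns a reason to fall back to the CPU, or None if the GPU write is fine.
  def reasonToFallBack(schema: StructType, conf: SQLConf): Option[String] = {
    // Only top-level fields are checked here, for brevity.
    val hasDateTime = schema.exists(f => f.dataType == DateType || f.dataType == TimestampType)
    val mode = conf.getConfString(RebaseConf, "EXCEPTION")
    if (hasDateTime && mode == "LEGACY") {
      // cudf writes proleptic Gregorian values as-is, which matches CORRECTED;
      // LEGACY rebasing is not something it does today.
      Some(s"$RebaseConf=LEGACY is not supported on the GPU")
    } else {
      None // CORRECTED matches cudf's behavior; EXCEPTION additionally needs a range scan.
    }
  }
}
```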

The reader mode is going to be a little more difficult. We don't support reading Parquet metadata with the current cudf Java API, and arguably cudf would want to be able to read these types of dates/timestamps correctly on any platform. So we might want to open a discussion with cudf on the right way to handle this.

@sameerz is this still a P1? If so, I can put in returning metadata from the Parquet reader and start to look at what it would take to rebase the timestamps/dates to match Spark.
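As a point of reference, reading the footer key/value metadata on the JVM side is straightforward with parquet-mr; a minimal sketch, assuming the parquet-hadoop dependency is on the classpath and the path is just an example:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Read the footer key/value metadata of a single Parquet file with parquet-mr
// (the CPU-side library Spark already uses).
def readFooterMetadata(path: String): java.util.Map[String, String] = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try reader.getFooter.getFileMetaData.getKeyValueMetaData
  finally reader.close()
}
```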

revans2 (Collaborator, Author) commented Jul 24, 2020

For now I am going to propose that we do the following.

When writing a Parquet file, we fall back to the CPU if spark.sql.legacy.parquet.datetimeRebaseModeInWrite is not "CORRECTED" or "EXCEPTION" and we have any date/timestamp columns.

When it is "EXCEPTION", we scan all of the output for dates and timestamps that are out of the supported range and throw an exception if we see any of them. We then file a follow-on issue to try to support "LEGACY" mode for writes.

For spark.sql.legacy.parquet.datetimeRebaseModeInRead it is a much harder problem. For now I would say that we implement logic similar to what Spark does: look at the metadata for each file and decide what mode we are going to operate in for that file. If the file's mode is one we support, we just read it. If it is a mode that we do not support ("LEGACY"), we treat it as "EXCEPTION" and scan the data we read for any dates that we cannot support. If we hit such dates we throw an exception; if not, we can read the data without any issues. We might also want to add a config that allows us to fall back to the CPU just for date/timestamp columns, so users might be able to get some GPU acceleration until we can fully support the new formats.

We also file a follow-on issue to go back and look at supporting the rebase logic for reads, which I think is more important than the rebase logic for writes.
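A sketch of the out-of-range scan for the "EXCEPTION" cases above, operating on raw day/microsecond values. The helper and the boundary handling are illustrative only; Spark's real check also takes the session time zone into account for timestamps, while this just uses UTC midnight of the cutover date:

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.util.concurrent.TimeUnit

// Illustrative helper: values on or after the Gregorian cutover (1582-10-15) are the same
// in both calendars, so only older values are ambiguous and need to trigger an exception.
object RebaseCheck {
  private val cutoverDay: Long = LocalDate.of(1582, 10, 15).toEpochDay
  private val cutoverMicros: Long =
    TimeUnit.SECONDS.toMicros(
      LocalDateTime.of(1582, 10, 15, 0, 0).toEpochSecond(ZoneOffset.UTC))

  def checkDates(daysSinceEpoch: Array[Int]): Unit =
    if (daysSinceEpoch.exists(_ < cutoverDay)) {
      throw new IllegalStateException(
        "Found dates before 1582-10-15; rebase mode EXCEPTION does not allow writing them")
    }

  def checkTimestamps(microsSinceEpoch: Array[Long]): Unit =
    if (microsSinceEpoch.exists(_ < cutoverMicros)) {
      throw new IllegalStateException(
        "Found timestamps before 1582-10-15; rebase mode EXCEPTION does not allow writing them")
    }
}
```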

revans2 self-assigned this on Jul 24, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: Jason Lowe <jlowe@nvidia.com>