[Audit][FEA][SPARK-36831] Support ANSI interval types from CSV source #4146
Comments
@revans2 Please help review the solution.

Interval types are described in https://spark.apache.org/docs/latest/sql-ref-datatypes.html. Currently the plugin does not support writing CSV, so let's focus on reading interval types from a CSV source. Spark's interval-reading code is in:

There are two forms, a legacy form and a normal form, switched by SQLConf.LEGACY_FROM_DAYTIME_STRING.

Legacy form: SQLConf.LEGACY_FROM_DAYTIME_STRING, see https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3042; parsed by parseDayTimeLegacy, see:

Normal form:
Invalid values will be read as null from CSV. Proposed solution for the normal form: use cuDF ColumnView.extractRe to extract the day, hour, ..., second fields by specifying capture groups in the regexp, and then compute the micros. The GPU code is like:
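The GPU snippet itself was not captured in this thread. As a rough CPU-side illustration of the same idea (extract fields with regexp capture groups, then combine them into microseconds), here is a minimal sketch. The pattern and class names are hypothetical, not Spark's actual regexp (which lives in IntervalUtils); it assumes the normal day-time form `[+-]d hh:mm:ss.ffffff`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DayTimeIntervalParse {
    // Hypothetical pattern for the normal day-time form "[+-]d hh:mm:ss.ffffff".
    private static final Pattern DT = Pattern.compile(
        "([+-])?(\\d+) (\\d{1,2}):(\\d{1,2}):(\\d{1,2})(?:\\.(\\d{1,6}))?");

    /** Returns microseconds, or null for invalid input (mirrors CSV null-on-error). */
    public static Long toMicros(String s) {
        Matcher m = DT.matcher(s);
        if (!m.matches()) {
            return null; // invalid values become null, as when reading CSV
        }
        long sign = "-".equals(m.group(1)) ? -1L : 1L;
        long days = Long.parseLong(m.group(2));
        long hours = Long.parseLong(m.group(3));
        long mins = Long.parseLong(m.group(4));
        long secs = Long.parseLong(m.group(5));
        // Right-pad the fraction to 6 digits so ".1" means 100000 micros.
        String frac = m.group(6) == null ? "0" : (m.group(6) + "000000").substring(0, 6);
        long micros = ((days * 86400L + hours * 3600L + mins * 60L + secs) * 1_000_000L)
            + Long.parseLong(frac);
        return sign * micros;
    }

    public static void main(String[] args) {
        System.out.println(toMicros("1 02:03:04.123456")); // 93784123456
        System.out.println(toMicros("not an interval"));   // null
    }
}
```

On the GPU, the same arithmetic would operate column-wise on the group columns returned by extractRe instead of row by row.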
The CPU code is like:
Benchmark row count: 10,000,000
I really dislike regular-expression use in casts, but it is a good first step. It would be nice to file a follow-on issue to write a custom kernel to do this for us. Also, I assume you know that your code to do the conversion is leaking a lot of column views; I assume you did that just for readability of the code. Second, have you tested this with CSV? The patch that added in support for writing/reading CSV, https://issues.apache.org/jira/browse/SPARK-36831, did not add in anything that calls
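On the leaked column views: cuDF's Java column types implement AutoCloseable, so the usual fix is try-with-resources around every intermediate view. A minimal sketch of the pattern, using a hypothetical stand-in class rather than the real cuDF bindings:

```java
public class CloseDemo {
    // Stand-in for a cuDF ColumnView/ColumnVector, which implement AutoCloseable.
    static class FakeColumn implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        FakeColumn days = new FakeColumn();
        FakeColumn hours = new FakeColumn();
        // try-with-resources guarantees every intermediate view is closed,
        // even if a later step throws, avoiding the leaks noted above.
        try (days; hours) {
            // ... combine the extracted group columns into micros ...
        }
        System.out.println(days.closed && hours.closed); // true
    }
}
```

Each intermediate view from a chained computation (extract, multiply, add) needs its own resource slot, which is what makes the readable chained style leak-prone.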
Filed an issue: rapidsai/cudf#10356. CSV is a text format, so the day-time interval is stored in string form, e.g.:
Spark uses a similar method to parse an interval string into a day-time interval: IntervalUtils.castStringToDTInterval. I know about the leaking in the example code; thanks for the kind reminder.
The Spark Accelerator already supports reading
The cuDF issue is closed, but without a fix.
Make sure the plugin can read ANSI interval types from the CSV source.