
[Audit][FEA][SPARK-36831] Support ANSI interval types from CSV source #4146

Open · Tracked by #2063
gerashegalov opened this issue Nov 18, 2021 · 5 comments
Labels: audit_3.3.0 (Audit related tasks for 3.3.0), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

@gerashegalov (Collaborator)
Make sure the plugin can read ANSI interval types from CSV source

@gerashegalov gerashegalov added the audit_3.3.0 Audit related tasks for 3.3.0 label Nov 18, 2021
@res-life res-life self-assigned this Feb 17, 2022
@res-life (Collaborator)

@revans2 Please help review the solution:

The interval types are documented at https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Currently the plugin does not support writing CSV, so this discussion covers only reading interval types from CSV.

Spark's interval-parsing code is in IntervalUtils.fromDayTimeString.

There are a legacy form and a normal form, switched by SQLConf.LEGACY_FROM_DAYTIME_STRING.

legacy form
By default, legacy from-daytime-string parsing is disabled, so we may not support this form.

SQLConf.LEGACY_FROM_DAYTIME_STRING, see: https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3042

parseDayTimeLegacy, see:
https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L463-L475
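For reference, a minimal sketch of flipping that switch (assuming a SparkSession named spark; the config key is taken from the SQLConf definition linked above):

    // Legacy parsing is disabled by default; enabling it is just a config flip,
    // but we do not plan to support the legacy form in the plugin.
    spark.conf.set("spark.sql.legacy.fromDayTimeString.enabled", "true")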

normal form
By default Spark uses this form; some examples are in the table below. We will support only this form.

| type | example | valid | comment |
|------|---------|-------|---------|
| INTERVAL DAY | `INTERVAL '100' DAY` | true | |
| INTERVAL DAY | `100 DAY` | true | |
| INTERVAL DAY TO HOUR | `INTERVAL '100 10' DAY TO HOUR` | true | |
| INTERVAL DAY TO HOUR | `100 10` | true | |
| INTERVAL DAY TO SECOND | `INTERVAL '100 10:30:40.999999' DAY TO SECOND` | true | |
| INTERVAL DAY TO SECOND | `100 10:30:40.999999` | true | |
| INTERVAL DAY TO SECOND | `INTERVAL '100 10:30:40.    999999' DAY TO SECOND` | false | has extra spaces |
| INTERVAL DAY TO SECOND | `INTERVAL '100     10:30:40.999999' DAY TO SECOND` | false | has extra spaces |
| INTERVAL DAY TO SECOND | `100     10:30:40.999999` | false | has extra spaces |
| INTERVAL DAY TO SECOND | `100 10:30:     40.999999` | false | has extra spaces |
| INTERVAL DAY TO SECOND | `INTERVAL '-100 10:30:40.999999' DAY TO SECOND` | true | |
| INTERVAL DAY TO SECOND | `INTERVAL -'-100 10:30:40.999999' DAY TO SECOND` | true | two negative signs make a positive |
| INTERVAL DAY TO SECOND | `-100 10:30:40.999999` | true | |
| INTERVAL DAY TO SECOND | `INTERVAL '100 26:30:40.999999' DAY TO SECOND` | false | hour 26 is > 23 |

Invalid values become null when reading CSV.
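For concreteness, a minimal sketch of reading a day-time interval column from CSV on the CPU (the file path and column name are illustrative, not from this issue); in the default PERMISSIVE mode a malformed row comes back as null:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Hypothetical input file with one interval string per line, e.g.:
    //   INTERVAL '100 10:30:40.999999' DAY TO SECOND
    val schema = StructType(Seq(StructField("i",
      DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.SECOND))))
    val df = spark.read.schema(schema).csv("/tmp/intervals.csv")
    df.show(false) // invalid interval strings appear as null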

proposed solution for the normal form

Use cuDF ColumnView.extractRe to extract the day, hour, ..., second fields by specifying groups in the regexp, and then compute the microseconds.

The GPU code looks like:

    // needs: import ai.rapids.cudf.{DType, Scalar}
    // needs: import java.util.concurrent.TimeUnit
    val intervalCV = builder.buildAndPutOnDevice() // test data built by a benchmark harness
    val start = System.nanoTime()
    val p = "^INTERVAL\\s+([+|-])?'([+|-])?(\\d{1,9}) (\\d{1,2}):(\\d{1,2}):" +
        "(\\d{1,2})(\\.\\d{1,9})?'\\s+DAY\\s+TO\\s+SECOND$"

    // e.g.: INTERVAL -'-100 10:30:40.999999' DAY TO SECOND
    // extractRe returns one output column per regexp capture group:
    // column 0: outer sign (-)
    // column 1: inner sign (-)
    // column 2: day (100)
    // column 3: hour (10)
    // column 4: minute (30)
    // column 5: second (40)
    // column 6: fractional seconds (.999999)
    // NOTE: this sketch ignores the two sign columns, treats the
    // fractional-seconds group simplistically (the captured text keeps the
    // leading dot and would need scaling), and does not close the
    // intermediate ColumnVectors; it is kept this way for readability.

    withResource(intervalCV.extractRe(p)) { table =>
      println("row count: " + table.getRowCount)
      val micros = table.getColumn(2).castTo(DType.INT64)
          .mul(Scalar.fromLong(86400L * 1000000L))                                            // days to micros
          .add(table.getColumn(3).castTo(DType.INT64).mul(Scalar.fromLong(3600L * 1000000L))) // hours to micros
          .add(table.getColumn(4).castTo(DType.INT64).mul(Scalar.fromLong(60L * 1000000L)))   // minutes to micros
          .add(table.getColumn(5).castTo(DType.INT64).mul(Scalar.fromLong(1000000L)))         // seconds to micros
          .add(table.getColumn(6).castTo(DType.INT64))                                        // fractional part
    }
    val timeS = (System.nanoTime() - start).toDouble / TimeUnit.SECONDS.toNanos(1)
    println(s"GPU used time $timeS S")

The CPU code looks like:

    // needs: import org.apache.spark.sql.catalyst.util.IntervalUtils
    // intervals: Array[String] holding the same test rows as the GPU run
    var i = 0
    while (i < rowCount) {
      IntervalUtils.fromDayTimeString(intervals(i))
      i += 1
    }

row count: 10,000,000
GPU used time: 0.86659265 s
CPU used time: 16.818680952 s
GPU speedup: about 19x

Is this acceptable, or is there a more efficient approach?

@revans2 (Collaborator) commented Feb 24, 2022

I really dislike regular-expression use in casts, but it is a good first step. It would be nice to file a follow-on issue to write a custom kernel to do this for us.

Also, I assume you know that your conversion code is leaking a lot of column views; I assume you did that just for readability of the code.

Second, have you tested this with CSV? The patch that added support for writing/reading CSV (https://issues.apache.org/jira/browse/SPARK-36831) did not add anything that calls fromDayTimeString or similar. The Parquet code just stored these as an int or a long. Please check whether CSV does the same, because I suspect it does, and then we would not need to do as much special processing.

@res-life (Collaborator) commented Feb 25, 2022

Filed an issue: rapidsai/cudf#10356

CSV is a text format, so the day-time interval is stored in string form, e.g.:

INTERVAL '100 10:30:40.999999' DAY TO SECOND 
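A hedged round-trip sketch illustrating the stored form (output path hypothetical; assumes Spark 3.3, where SPARK-36831 added ANSI interval support to the CSV data source):

    // Writing a day-time interval column to CSV stores the ANSI literal string.
    spark.sql("SELECT INTERVAL '100 10:30:40.999999' DAY TO SECOND AS i")
      .write.csv("/tmp/interval_out")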

Spark uses a similar method to parse an interval string into a day-time interval: IntervalUtils.castStringToDTInterval.
The IntervalUtils.fromDayTimeString call in the example code also ends up invoking IntervalUtils.castStringToDTInterval.
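For reference, a hedged sketch of calling that internal parser directly (assuming the Spark 3.2 signature, which takes a UTF8String plus start and end fields and returns the interval as Long microseconds):

    import org.apache.spark.sql.catalyst.util.IntervalUtils
    import org.apache.spark.sql.types.DayTimeIntervalType
    import org.apache.spark.unsafe.types.UTF8String

    // Returns the Long microseconds that back a DayTimeIntervalType value.
    val micros: Long = IntervalUtils.castStringToDTInterval(
      UTF8String.fromString("INTERVAL '100 10:30:40.999999' DAY TO SECOND"),
      DayTimeIntervalType.DAY,    // startField
      DayTimeIntervalType.SECOND) // endField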

I know about the leaks in the example code; thanks for the kind reminder.

@res-life (Collaborator) commented Apr 8, 2022

The plugin already supports reading day-time intervals from CSV via a temporary solution, but it is still waiting on cuDF kernel support.

@res-life (Collaborator)

The cuDF issue was closed without a fix.
For details see rapidsai/cudf#10356 (comment).

@revans2 mentioned this issue Oct 27, 2022