
[SPARK-35749][SPARK-35773][SQL] Parse unit list interval literals as tightest year-month/day-time interval types #32949

Closed
wants to merge 7 commits into master from sarutak:day-time-multi-units

Conversation

sarutak
Member

@sarutak sarutak commented Jun 17, 2021

What changes were proposed in this pull request?

This PR allows the parser to parse unit list interval literals like '3' day '10' hours '3' seconds or '8' years '3' months as YearMonthIntervalType or DayTimeIntervalType.

Why are the changes needed?

For ANSI compliance.

Does this PR introduce any user-facing change?

Yes. I noted the following things in the sql-migration-guide.md (see the sketch after this list).

  • Unit list interval literals are parsed as YearMonthIntervalType or DayTimeIntervalType instead of CalendarIntervalType.
  • WEEK, MILLISECOND, MICROSECOND and NANOSECOND are not valid units for unit list interval literals.
  • Units of year-month and day-time cannot be mixed like 1 YEAR 2 MINUTES.
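
A minimal sketch of the resulting types (a hypothetical spark-shell session on a build that includes this PR; the printed type names follow Spark 3.2's ANSI interval types):

    spark.sql("SELECT INTERVAL '3' DAY '10' HOURS '3' SECONDS AS dt").printSchema()
    // root
    //  |-- dt: interval day to second (nullable = false)

    spark.sql("SELECT INTERVAL '8' YEARS '3' MONTHS AS ym").printSchema()
    // root
    //  |-- ym: interval year to month (nullable = false)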

How was this patch tested?

New tests and modified tests.

@@ -2423,10 +2423,10 @@ object DatePart {
224
> SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001');
1.000001
-> SELECT _FUNC_('days', interval 1 year 10 months 5 days);
+> SELECT _FUNC_('days', interval 5 days 3 hours 7 minutes);
Member Author

days cannot be listed with year or months, so I changed this part.

@@ -1,134 +1,3 @@
CREATE TEMPORARY VIEW t AS select '2011-05-06 07:08:09.1234567' as c, interval 10 year 20 month 30 day 40 hour 50 minute 6.7890 second as i;
Member Author

The view t is no longer created if spark.sql.legacy.interval.enabled is false, so the queries that refer to t always fail.
I therefore moved this part to inputs/legacy_interval/extract.sql.

@@ -86,15 +86,15 @@ Input [9]: [ws_item_sk#1, ws_ext_sales_price#2, ws_sold_date_sk#3, i_item_sk#6,
Output [2]: [d_date_sk#13, d_date#14]
Batched: true
Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
Member Author

@sarutak sarutak Jun 17, 2021

Now that INTERVAL 30 days is DayTimeIntervalType, the result of cast('1999-02-22' AS DATE) + INTERVAL 30 days is TimestampType, so filter pushdown seems not to work because d_date here needs to be cast to TimestampType.
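
A quick way to probe this (a hypothetical spark-shell check against the TPC-DS date_dim table; note the result type changed again later in this PR, see the ResolveBinaryArithmetic discussion below):

    // If the addition resolves to TIMESTAMP, d_date must be cast and the
    // LessThanOrEqual(d_date, ...) entry disappears from PushedFilters.
    spark.sql(
      """SELECT d_date_sk FROM date_dim
        |WHERE d_date >= CAST('1999-02-22' AS DATE)
        |  AND d_date <= CAST('1999-02-22' AS DATE) + INTERVAL 30 days""".stripMargin
    ).explain()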

@SparkQA

SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44449/

1 months -1 days 1 seconds
org.apache.spark.sql.catalyst.parser.ParseException

year-month and day-time intervals cannot be mixed(line 1, pos 8)
Member Author

ANSI interval literals don't allow mixing day-time units and year-month units, right?
If so, should we remove these queries, which throw this exception?

-- !query output
{"a":1,"b":1 days} 1,1 days 2 years 8 months,1 hours 10 minutes {"a":2 years 8 months,"b":1 hours 10 minutes}
{"a":1,"b":1 days} 1,1 days 32,4200000000 {"a":null,"b":null}
Member Author

@sarutak sarutak Jun 17, 2021

This query doesn't work as expected if spark.sql.legacy.interval.enabled is false because the result of named_struct('a', interval 32 month, 'b', interval 70 minute) is a string in the 32,4200000000 format, but struct<a:interval,b:interval> here expects the 2 years 8 months,1 hours 10 minutes format.

Should we move the corresponding query somewhere else, like sql-tests/inputs/legacy_interval_ansi/interval.sql?

Contributor

because the result of named_struct('a', interval 32 month, 'b', interval 70 minute) is 32,4200000000

Is it correct? I thought it should be interval '32' month.

@SparkQA

SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44449/

@SparkQA

SparkQA commented Jun 17, 2021

Test build #139921 has finished for PR 32949 at commit d17721e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak sarutak changed the title [WIP][SPARK-35749][SPARK-35773][SQL] Parse unit list interval literals as tightest year-month/day-time interval types [SPARK-35749][SPARK-35773][SQL] Parse unit list interval literals as tightest year-month/day-time interval types Jun 18, 2021
@sarutak sarutak marked this pull request as ready for review June 18, 2021 15:11
@MaxGekk
Member

MaxGekk commented Jun 23, 2021

@sarutak Could you resolve the conflicts, please?

@sarutak sarutak force-pushed the day-time-multi-units branch from d17721e to a284267 on June 23, 2021 17:09
@SparkQA

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44743/

@SparkQA

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44743/

@SparkQA

SparkQA commented Jun 23, 2021

Test build #140215 has finished for PR 32949 at commit a284267.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44747/

@SparkQA

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44747/

@SparkQA

SparkQA commented Jun 23, 2021

Test build #140219 has finished for PR 32949 at commit 5b37384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Jun 24, 2021

@cloud-fan @AngersZhuuuu @Peng-Lei Since this PR can change behavior and impact users' applications, may I ask you to help review this PR, please?

@sarutak
Member Author

sarutak commented Jun 28, 2021

@MaxGekk If you feel this change is complicated, how about the following changes in this PR?

  • Leave the behavior for multi-unit interval literals like 1 year 2 months as it is (still represented as CalendarIntervalType).
  • Leave the behavior for plural-form units like 3 years as it is (still represented as CalendarIntervalType).
  • Change the behavior for single, singular-form units like 3 year to comply with ANSI (represented as YearMonthIntervalType or DayTimeIntervalType).

Comment on lines 2366 to 2396
val start = YearMonthIntervalType.stringToField(fromUnit)
val end = YearMonthIntervalType.stringToField(toUnit)
Literal(calendarInterval.months, YearMonthIntervalType(start, end))
Contributor

@Peng-Lei Peng-Lei Jun 28, 2021

nit:
how about:

        val start = YearMonthIntervalType.stringToField(fromUnit)
        Literal(calendarInterval.months, YearMonthIntervalType(start, YearMonthIntervalType.MONTH))

- In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
- In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` and the unit list interval literals like `INTERVAL '3' DAYS '1' HOUR` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.

- In Spark 3.2, the unit list interval literals support two categories of units. One is year-month units (YEAR and MONTH) and the other is day-time units (DAY, HOUR, MINUTE and SECOND). Units are allowed to be listed within the same category, like `5 YEARS 8 MONTHS` or `1 DAY 2 MINUTE`. In Spark 3.1 and earlier, WEEK, MILLISECOND, MICROSECOND and NANOSECOND are also supported as units, and any units are allowed to be listed in a literal. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
Contributor

Do we have to forbid WEEK, MILLISECOND, MICROSECOND and NANOSECOND? I think the only limitation is people can't mix year-month and day-time fields now, but I don't see why we need to forbid WEEK, MILLISECOND, MICROSECOND and NANOSECOND.

  _.getText.toLowerCase(Locale.ROOT).stripSuffix("s"))
if (units.forall(YearMonthIntervalType.stringToField.contains)) {
  val fields = units.map(YearMonthIntervalType.stringToField)
  val start = if (0 < IntervalUtils.getYears(calendarInterval)) {
Contributor

what does this mean? why can't we always use fields.min?
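
For context, a small sketch of why fields.min/fields.max would give the tightest type (assuming Spark's field codes, YearMonthIntervalType.YEAR = 0 and YearMonthIntervalType.MONTH = 1, which are ordered from coarsest to finest):

    val YEAR: Byte = 0   // YearMonthIntervalType.YEAR
    val MONTH: Byte = 1  // YearMonthIntervalType.MONTH
    val fields = Seq(MONTH, YEAR)  // units in the order they appeared in the literal
    // The tightest type spans the coarsest and finest units that occur:
    val (start, end) = (fields.min, fields.max)
    // start == YEAR, end == MONTH => YearMonthIntervalType(start, end)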

@@ -142,15 +142,15 @@ BroadcastExchange (28)
Output [2]: [d_date_sk#13, d_date#24]
Batched: true
Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
Contributor

do we break filter pushdown in this PR?

Member Author

As I mentioned here, it seems so.
This is the most concerning point of this change.

Contributor

cast('1999-02-22' AS DATE) + INTERVAL 30 days is TimestampType

I checked the corresponding code in ResolveBinaryArithmetic

case (DateType, DayTimeIntervalType(DAY, DAY)) => DateAdd(l, ExtractANSIIntervalDays(r))

DateAdd returns Date, I think we are fine now?
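
A self-contained way to re-check this (a sketch assuming a local build with the quoted rule in place; with the DAY-to-DAY case, the sum should stay DATE):

    import org.apache.spark.sql.SparkSession

    object DateAddCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").appName("date-add-check").getOrCreate()
        // INTERVAL 30 days is DayTimeIntervalType(DAY, DAY), so ResolveBinaryArithmetic
        // should rewrite the + into DateAdd(ExtractANSIIntervalDays(...)), keeping DATE.
        spark.sql("SELECT CAST('1999-02-22' AS DATE) + INTERVAL 30 days AS d").printSchema()
        spark.stop()
      }
    }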

Member Author

Oh, return types have been changed. I'll re-run TPCDS again.

@cloud-fan
Contributor

One general comment about tests: I think most tests do not intend to test mixed year-month and day-time fields. Can we update these tests to create valid intervals, and add some extra tests for the mixed-fields case?

@cloud-fan
Contributor

@sarutak any progress?

@sarutak
Member Author

sarutak commented Jul 6, 2021

@cloud-fan
This is in progress. SPARK-35983 and SPARK-35999 block fixing the issue pointed out here, but I'll open a PR for them today.

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45344/

@SparkQA

SparkQA commented Jul 9, 2021

Test build #140833 has finished for PR 32949 at commit 03d2522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }
}
if (yearMonthFields.nonEmpty) {
  assert(dayTimeFields.isEmpty)
Contributor

Suggested change
-assert(dayTimeFields.isEmpty)
+if (dayTimeFields.nonEmpty) {
+  val literalString = ctx.errorCapturingMultiUnitsInterval.body.getText
+  throw QueryParsingErrors.mixedIntervalUnitsError(literalString, ctx)
+}

Contributor

This is what I mean by moving the check here.

Member Author

Actually, I have already tried this, but I found we can't get the expected error message.
The text returned by ctx.errorCapturingMultiUnitsInterval.body.getText is not split by white spaces.
Here is an example.

spark-sql> select interval 3 year 2 minute;
Error in query: 
Cannot mix year-month and day-time fields: 3year2minute(line 1, pos 7)

We expect 3 year 2 minute but actually get 3year2minute.
Alternatively, we could split the numbers and units in the returned text, then join them with white spaces.
What do you think?

Member Author

@sarutak sarutak Jul 9, 2021

I tried to split the tokens using a regex in the last commit.
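
A minimal sketch of that kind of splitting (a hypothetical regex, not necessarily the one used in the commit):

    val raw = "3year2minute"  // what getText returns, with the user's whitespace lost
    // Insert a space at every digit/letter boundary:
    val spaced = raw.replaceAll("(?<=\\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\\d)", " ")
    assert(spaced == "3 year 2 minute")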

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45365/

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45365/

    literalBuilder.append(" ")
  }
}
throw QueryParsingErrors.mixedIntervalUnitsError(literalBuilder.toString, ctx)
Contributor

how about

val multiUnitsContext = ctx.errorCapturingMultiUnitsInterval.body
val values = multiUnitsContext.intervalValue().asScala.map(_.getText)
val units = multiUnitsContext.unit.asScala.map(_.getText)
val literalStr = values.zip(units).map { case (v, u) => v + " " + u }.mkString(" ")
throw ...

Member Author

It's almost the same as what's done in visitMultiUnitsInterval, isn't it?
As I mentioned here, I wanted to avoid building the literal string twice.

Contributor

OK I think we must duplicate some code: either the validation code or the string building code.

Perf is not an issue here as it's for the error case, and I found the string builder code is simpler and OK to duplicate.

Contributor

If it's too much trouble, let's not put the interval literal string in the error message. The error contains the line number and start position, so people can see the original interval literal string by themselves.

Contributor

BTW, did you try source(ctx.errorCapturingMultiUnitsInterval.body)? In CTAS we use source to get the original SQL text of the query.
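
For reference, a paraphrase of that helper (Spark's ParserUtils.source reads the original SQL text between a rule's start and stop tokens, so the user's whitespace is preserved, unlike getText):

    import org.antlr.v4.runtime.ParserRuleContext
    import org.antlr.v4.runtime.misc.Interval

    def source(ctx: ParserRuleContext): String = {
      val stream = ctx.getStart.getInputStream
      stream.getText(Interval.of(ctx.getStart.getStartIndex, ctx.getStop.getStopIndex))
    }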

Member Author

BTW did you try source(ctx.errorCapturingMultiUnitsInterval.body)? In CTAS we use source to get the original SQL text of the query

Oh, I didn't notice that. I'll try it and, if it seems to work, let's use it. Thank you.

Member Author

source seems to work. Thank you!

checkIntervals("1.001 second",
Literal(IntervalUtils.stringToInterval("1 second 1 millisecond")))
// Hive nanosecond notation.
checkIntervals("13.123456789 seconds", legacyIntervalLiteral("second", "13.123456789"))
Contributor

interval 13.123456789 seconds returns the legacy interval, not a day-time interval?

-- !query schema
struct<(INTERVAL '14 00:00:00.000003' DAY TO SECOND * 1.5):interval day to second>
-- !query output
21 00:00:00.000005000
Contributor

hmm, so we always print 9 digits for seconds even if our precision is only microseconds?

Member Author

If a query runs through SparkSQLDriver, HIVE_STYLE is applied. It's not in the scope of this change.

-- !query output
{"a":1,"b":1 days} 1,1 days 2 years 8 months,1 hours 10 minutes {"a":2 years 8 months,"b":1 hours 10 minutes}
{"a":1,"b":1 days} {"a":1,"b":1 00:00:00.000000000} 1,1 days 1,INTERVAL '1' DAY INTERVAL '32' HOUR,INTERVAL '70' MINUTE {"a":1 08:00:00.000000000,"b":0 01:10:00.000000000}
Contributor

In CSV, do we ignore the interval fields and always print the full format of interval strings? I thought CSV would print 1 for interval 1 day.

Member Author

@sarutak sarutak Jul 9, 2021

The behavior of to_csv for ANSI intervals was changed in #33210.

Contributor

ah I got it now. {"a":1,"b":1 00:00:00.000000000} is the HIVE_STYLE interval string format, which does not respect the interval fields.

@SparkQA

SparkQA commented Jul 9, 2021

Test build #140854 has finished for PR 32949 at commit 94ade45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45374/

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45374/

@SparkQA

SparkQA commented Jul 9, 2021

Test build #140863 has finished for PR 32949 at commit fa40bcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(e2.getMessage.contains(s"Cannot use interval type in the table schema."))
val e3 = intercept[AnalysisException](v2Writer.overwritePartitions())
assert(e3.getMessage.contains(s"Cannot use interval type in the table schema."))
withSQLConf(SQLConf.LEGACY_INTERVAL_ENABLED.key -> "true") {
Contributor

not related to this PR: so DS v2 can't fail correctly when writing out new interval types?


}
withJdbcStatement() { statement =>
  val e = intercept[SQLException] {
    statement.execute(s"SET ${SQLConf.LEGACY_INTERVAL_ENABLED.key} = true")
Contributor

why do we need to turn on the legacy config to detect this error?

@@ -355,6 +356,7 @@ class SparkMetadataOperationSuite extends HiveThriftServer2TestBase {
val ddl = s"CREATE GLOBAL TEMP VIEW $viewName as select interval 1 day as i"

withJdbcStatement(viewName) { statement =>
  statement.execute(s"SET ${SQLConf.LEGACY_INTERVAL_ENABLED.key}=true")
Contributor

We should update this test to check the new interval types. @yaooqinn can you help with this later in a follow-up PR?

Member

OK

testExecuteStatementWithProtocolVersion(
  version,
  "SELECT interval '1' year '2' day",
  SQLConf.LEGACY_INTERVAL_ENABLED.key -> "true") { rs =>
Contributor

ditto

@SparkQA

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45451/

@SparkQA

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45451/

@SparkQA

SparkQA commented Jul 13, 2021

Test build #140939 has finished for PR 32949 at commit 48f080f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in 8e92ef8 Jul 13, 2021
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…tightest year-month/day-time interval types

### What changes were proposed in this pull request?

This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.

### Why are the changes needed?

For ANSI compliance.

### Does this PR introduce _any_ user-facing change?

Yes. I noted the following things in the `sql-migration-guide.md`.

* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.

### How was this patch tested?

New tests and modified tests.

Closes #32949 from sarutak/day-time-multi-units.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>