[SPARK-35749][SPARK-35773][SQL] Parse unit list interval literals as tightest year-month/day-time interval types #32949
Conversation
@@ -2423,10 +2423,10 @@ object DatePart {
 224
 > SELECT _FUNC_('SECONDS', timestamp'2019-10-01 00:00:01.000001');
 1.000001
-> SELECT _FUNC_('days', interval 1 year 10 months 5 days);
+> SELECT _FUNC_('days', interval 5 days 3 hours 7 minutes);
`days` cannot be listed with `year` or `months`, so I changed this part.
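For illustration, a hedged sketch (assuming a Spark 3.2 session named `spark`) of why the docstring example had to change: a literal mixing year-month and day-time units no longer parses, while a pure day-time literal still works.

```scala
// The old docstring example mixed year-month and day-time units and now fails to parse:
// spark.sql("SELECT date_part('days', interval 1 year 10 months 5 days)")  // ParseException

// The new example uses day-time units only and returns the DAY field, i.e. 5.
spark.sql("SELECT date_part('days', interval 5 days 3 hours 7 minutes)").show()
```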
@@ -1,134 +1,3 @@
-CREATE TEMPORARY VIEW t AS select '2011-05-06 07:08:09.1234567' as c, interval 10 year 20 month 30 day 40 hour 50 minute 6.7890 second as i;
The view `t` is no longer created if `spark.sql.legacy.interval.enabled` is `false`, and the queries which refer to `t` always fail. So I moved this part to `inputs/legacy_interval/extract.sql`.
@@ -86,15 +86,15 @@ Input [9]: [ws_item_sk#1, ws_ext_sales_price#2, ws_sold_date_sk#3, i_item_sk#6,
 Output [2]: [d_date_sk#13, d_date#14]
 Batched: true
 Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
Now that `INTERVAL 30 days` is `DayTimeIntervalType`, the result of `cast('1999-02-22' AS DATE) + INTERVAL 30 days` is `TimestampType`, so filter pushdown seems not to work because `d_date` here needs to be cast to `TimestampType`.
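For context, a hedged Scala sketch of the TPC-DS-style predicate behind the plan diff above (assuming a running SparkSession `spark` and a registered `date_dim` table; the exact benchmark query text may differ). If the date-plus-interval expression resolves to a timestamp, the upper-bound comparison forces a cast on `d_date` and only the lower bound can be pushed to the scan.

```scala
// Hypothetical reproduction of the pushdown concern discussed above.
val plan = spark.sql(
  """SELECT d_date_sk
    |FROM date_dim
    |WHERE d_date BETWEEN CAST('1999-02-22' AS DATE)
    |              AND CAST('1999-02-22' AS DATE) + INTERVAL 30 DAYS
    |""".stripMargin)

// Inspect PushedFilters in the physical plan to see which bounds reach the scan.
plan.explain()
```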
Kubernetes integration test starting
1 months -1 days 1 seconds
org.apache.spark.sql.catalyst.parser.ParseException

year-month and day-time intervals cannot be mixed(line 1, pos 8)
ANSI interval literals don't allow mixing day-time units and year-month units, right?
If so, should we remove these queries which throw this exception?
-- !query output
{"a":1,"b":1 days} 1,1 days 2 years 8 months,1 hours 10 minutes {"a":2 years 8 months,"b":1 hours 10 minutes}
{"a":1,"b":1 days} 1,1 days 32,4200000000 {"a":null,"b":null}
This query doesn't work as expected if `spark.sql.legacy.interval.enabled` is `false` because the result of `named_struct('a', interval 32 month, 'b', interval 70 minute)` is a `32,4200000000` format string, but `struct<a:interval,b:interval>` here expects a `2 years 8 months, 1 hours 10 minutes` format string.
Should we move the corresponding query to another place, like `sql-tests/inputs/legacy_interval_ansi/interval.sql`?
> because the result of named_struct('a', interval 32 month, 'b', interval 70 minute) is 32,4200000000

Is that correct? I thought it should be `interval '32' month`.
Kubernetes integration test status success
Test build #139921 has finished for PR 32949 at commit
@sarutak Could you resolve conflicts, please?
Force-pushed from d17721e to a284267.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #140215 has finished for PR 32949 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #140219 has finished for PR 32949 at commit
@cloud-fan @AngersZhuuuu @Peng-Lei Since this PR can change behavior and impact users' apps, may I ask you to help review this PR, please?
@MaxGekk If you feel this change is complicated, how about the following change in this PR?
val start = YearMonthIntervalType.stringToField(fromUnit)
val end = YearMonthIntervalType.stringToField(toUnit)
Literal(calendarInterval.months, YearMonthIntervalType(start, end))
nit:
how about:
val start = YearMonthIntervalType.stringToField(fromUnit)
Literal(calendarInterval.months, YearMonthIntervalType(start, YearMonthIntervalType.MONTH))
docs/sql-migration-guide.md

- In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
- In Spark 3.2, the unit-to-unit interval literals like `INTERVAL '1-1' YEAR TO MONTH` and the unit list interval literals like `INTERVAL '3' DAYS '1' HOUR` are converted to ANSI interval types: `YearMonthIntervalType` or `DayTimeIntervalType`. In Spark 3.1 and earlier, such interval literals are converted to `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
- In Spark 3.2, the unit list interval literals support two categories of units. One is year-month units (YEAR and MONTH) and the other is day-time units (DAY, HOUR, MINUTE and SECOND). Units are allowed to be listed within the same category, like `5 YEARS 8 MONTHS` or `1 DAY 2 MINUTE`. In Spark 3.1 and earlier, WEEK, MILLISECOND, MICROSECOND and NANOSECOND are also supported as units, and any units are allowed to be listed in a literal. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
Do we have to forbid `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`? I think the only limitation is that people can't mix year-month and day-time fields now, but I don't see why we need to forbid `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`.
  _.getText.toLowerCase(Locale.ROOT).stripSuffix("s"))
if (units.forall(YearMonthIntervalType.stringToField.contains)) {
  val fields = units.map(YearMonthIntervalType.stringToField)
  val start = if (0 < IntervalUtils.getYears(calendarInterval)) {
What does this mean? Why can't we always use `fields.min`?
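For reference, a hedged sketch of the `fields.min` idea being suggested here, reusing the `stringToField` mapping shown in the snippet above (the values and variable names are illustrative, not the PR's exact code):

```scala
import org.apache.spark.sql.types.YearMonthIntervalType

// Units that appear in the literal, already lower-cased and de-pluralized.
val units = Seq("year", "month")
val fields = units.map(YearMonthIntervalType.stringToField)

// Tightest type derived directly from the listed units, e.g. INTERVAL YEAR TO MONTH.
val tightest = YearMonthIntervalType(fields.min, fields.max)
```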
@@ -142,15 +142,15 @@ BroadcastExchange (28)
 Output [2]: [d_date_sk#13, d_date#24]
 Batched: true
 Location [not included in comparison]/{warehouse_dir}/date_dim]
-PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), LessThanOrEqual(d_date,1999-03-24), IsNotNull(d_date_sk)]
+PushedFilters: [IsNotNull(d_date), GreaterThanOrEqual(d_date,1999-02-22), IsNotNull(d_date_sk)]
do we break filter pushdown in this PR?
As I mentioned here, it seems so.
This is the most concerning point of this change.
> cast('1999-02-22' AS DATE) + INTERVAL 30 days is TimestampType

I checked the corresponding code in `ResolveBinaryArithmetic`:

case (DateType, DayTimeIntervalType(DAY, DAY)) => DateAdd(l, ExtractANSIIntervalDays(r))

`DateAdd` returns a date, so I think we are fine now?
Oh, return types have been changed. I'll re-run TPCDS again.
One general comment about tests: I think most tests do not want to test mixed fields of year-month and day-time intervals. Can we update these tests to create valid intervals, and add some extra tests for the mixed-fields case?
@sarutak any progress?
@cloud-fan
Kubernetes integration test status success
Test build #140833 has finished for PR 32949 at commit
  }
}
if (yearMonthFields.nonEmpty) {
  assert(dayTimeFields.isEmpty)
-assert(dayTimeFields.isEmpty)
+if (dayTimeFields.nonEmpty) {
+  val literalString = ctx.errorCapturingMultiUnitsInterval.body.getText
+  throw QueryParsingErrors.mixedIntervalUnitsError(literalString, ctx)
+}
This is what I mean by moving the check here.
Actually, I have already tried this, but I found we can't get the expected error message.
The text returned by `ctx.errorCapturingMultiUnitsInterval.body.getText` is not separated by whitespace.
Here is an example:

spark-sql> select interval 3 year 2 minute;
Error in query:
Cannot mix year-month and day-time fields: 3year2minute(line 1, pos 7)

We expect `3 years 2 minute` but actually get `3year2minute`.
Alternatively, we could split numbers and units in the returned text, then join them with whitespace.
What do you think?
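A hedged sketch of the splitting idea mentioned above (the helper name and regex are illustrative, not the PR's code): insert a space wherever a digit meets a letter so the concatenated token text becomes readable again.

```scala
// Re-insert spaces between number/unit tokens, e.g. "3year2minute" -> "3 year 2 minute".
def splitIntervalTokens(raw: String): String =
  raw.split("(?<=\\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\\d)").mkString(" ")

splitIntervalTokens("3year2minute")  // "3 year 2 minute"
```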
I tried to split tokens using a regex in the last commit.
Kubernetes integration test starting
Kubernetes integration test status success
      literalBuilder.append(" ")
    }
  }
  throw QueryParsingErrors.mixedIntervalUnitsError(literalBuilder.toString, ctx)
how about
val multiUnitsContext = ctx.errorCapturingMultiUnitsInterval.body
val values = multiUnitsContext.intervalValue().asScala.map(_.getText)
val units = multiUnitsContext.unit.asScala.map(_.getText)
val literalStr = values.zip(units).map { case (v, u) => v + " " + u }.mkString(" ")
throw ...
It's almost the same as what's done in `visitMultiUnitsInterval`, isn't it?
As I mentioned here, I wanted to avoid building the literal string twice.
It saves the checking logic: 94ade45#diff-26fe6b511f4b8aedf5bdffae0d7c03aba69463c09fa4f881d77520125796e0faL2471-L2477
OK I think we must duplicate some code: either the validation code or the string building code.
Perf is not an issue here as it's for the error case, and I found the string builder code is simpler and OK to duplicate.
If it's too much trouble, let's not put the interval literal string in the error message. The error contains the line number and start position, so people can see the original interval literal string by themselves.
BTW, did you try `source(ctx.errorCapturingMultiUnitsInterval.body)`? In CTAS we use `source` to get the original SQL text of the query.
> BTW did you try source(ctx.errorCapturingMultiUnitsInterval.body)? In CTAS we use source to get the original SQL text of the query

Oh, I didn't notice that. I'll try it, and if it seems to work, let's use it. Thank you.
`source` seems to work. Thank you!
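For context, a hedged sketch of what a `source`-style helper does: it slices the original character stream covered by a parse-tree node, so whitespace and quoting are preserved exactly as the user typed them. This is written against the ANTLR 4 runtime API and may not match Spark's `ParserUtils.source` line for line.

```scala
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.misc.Interval

// Return the verbatim input text spanned by a parse-tree node,
// from the start index of its first token to the stop index of its last token.
def originalText(ctx: ParserRuleContext): String = {
  val stream = ctx.getStart.getInputStream
  stream.getText(Interval.of(ctx.getStart.getStartIndex, ctx.getStop.getStopIndex))
}
```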
checkIntervals("1.001 second",
  Literal(IntervalUtils.stringToInterval("1 second 1 millisecond")))
// Hive nanosecond notation.
checkIntervals("13.123456789 seconds", legacyIntervalLiteral("second", "13.123456789"))
`interval 13.123456789 seconds` returns the legacy interval, not a day-time interval?
-- !query schema
struct<(INTERVAL '14 00:00:00.000003' DAY TO SECOND * 1.5):interval day to second>
-- !query output
21 00:00:00.000005000
hmm, so we always print 9 digits for second even if our precision is only microsecond?
If a query runs through `SparkSQLDriver`, `HIVE_STYLE` is applied. That's not in the scope of this change.
-- !query output
{"a":1,"b":1 days} 1,1 days 2 years 8 months,1 hours 10 minutes {"a":2 years 8 months,"b":1 hours 10 minutes}
{"a":1,"b":1 days} {"a":1,"b":1 00:00:00.000000000} 1,1 days 1,INTERVAL '1' DAY INTERVAL '32' HOUR,INTERVAL '70' MINUTE {"a":1 08:00:00.000000000,"b":0 01:10:00.000000000}
In CSV, do we ignore the interval fields and always print the full format of interval strings? I thought CSV would print `1` for `interval 1 day`.
The behavior of `to_csv` for ANSI intervals was changed in #33210.
Ah, I got it now. `{"a":1,"b":1 00:00:00.000000000}` is the HIVE_STYLE interval string format, which does not respect the interval fields.
Test build #140854 has finished for PR 32949 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140863 has finished for PR 32949 at commit
assert(e2.getMessage.contains(s"Cannot use interval type in the table schema."))
val e3 = intercept[AnalysisException](v2Writer.overwritePartitions())
assert(e3.getMessage.contains(s"Cannot use interval type in the table schema."))
withSQLConf(SQLConf.LEGACY_INTERVAL_ENABLED.key -> "true") {
not related to this PR: so DS v2 can't fail correctly when writing out new interval types?
cc @imback82
}
withJdbcStatement() { statement =>
  val e = intercept[SQLException] {
    statement.execute(s"SET ${SQLConf.LEGACY_INTERVAL_ENABLED.key} = true")
why do we need to turn on the legacy config to detect this error?
@@ -355,6 +356,7 @@ class SparkMetadataOperationSuite extends HiveThriftServer2TestBase {
     val ddl = s"CREATE GLOBAL TEMP VIEW $viewName as select interval 1 day as i"

     withJdbcStatement(viewName) { statement =>
+      statement.execute(s"SET ${SQLConf.LEGACY_INTERVAL_ENABLED.key}=true")
We should update this test to check the new interval types. @yaooqinn can you help with this later in a followup PR?
OK
testExecuteStatementWithProtocolVersion(
  version,
  "SELECT interval '1' year '2' day",
  SQLConf.LEGACY_INTERVAL_ENABLED.key -> "true") { rs =>
ditto
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140939 has finished for PR 32949 at commit
thanks, merging to master/3.2!
…tightest year-month/day-time interval types

### What changes were proposed in this pull request?
This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.

### Why are the changes needed?
For ANSI compliance.

### Does this PR introduce _any_ user-facing change?
Yes. I noted the following things in the `sql-migration-guide.md`.
* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECONDS`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.

### How was this patch tested?
New tests and modified tests.

Closes #32949 from sarutak/day-time-multi-units.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.

Why are the changes needed?

For ANSI compliance.

Does this PR introduce any user-facing change?

Yes. I noted the following things in the `sql-migration-guide.md`:

* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECONDS`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed, like `1 YEAR 2 MINUTES`.

How was this patch tested?

New tests and modified tests.
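For readers, a minimal hedged sketch of the user-facing behavior described above (assuming a local Spark 3.2 SparkSession with default settings; the printed schema names are indicative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Day-time units only: parsed as the tightest DayTimeIntervalType (DAY TO HOUR here).
spark.sql("SELECT INTERVAL '3' DAY '10' HOURS AS i").printSchema()

// Year-month units only: parsed as YearMonthIntervalType.
spark.sql("SELECT INTERVAL '8' YEARS '3' MONTHS AS i").printSchema()

// Mixing the two categories now fails at parse time:
// spark.sql("SELECT INTERVAL 1 YEAR 2 MINUTES")  // ParseException: cannot mix year-month and day-time fields

// The legacy behavior (CalendarIntervalType) can be restored via the config noted in the migration guide.
spark.conf.set("spark.sql.legacy.interval.enabled", "true")
```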