18417: Feat: Make extract SQL expression timezone aware#6
18417: Feat: Make extract SQL expression timezone aware#6martin-augment wants to merge 7 commits intomainfrom
Conversation
WalkthroughIntroduces a new EXTRACT scalar UDF with timezone-aware timestamp handling for both timezone-aware and naive timestamps. Replaces date_part references in query planner and test files. Enhances timezone support in casting and datetime functions. Updates documentation and adds comprehensive timezone tests. Changes
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| } | ||
| arr = Arc::new(builder.finish()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Bug: Double Timezone Adjustment Causes Incorrect Hour Extraction
Double timezone adjustment for hour extraction. The code already adjusts timestamps via adjust_timestamp_array (lines 206-247) which properly converts UTC timestamps to the target timezone. However, lines 282-304 apply an additional manual offset adjustment specifically for hour extraction on timezone-aware timestamps. This creates a double adjustment that produces incorrect results. Additionally, line 289 uses .unwrap() which could panic if the timezone string doesn't match the expected format (e.g., for named timezones instead of offset-based ones like "+04:00"). This special-case code should be removed as the timestamp adjustment is already handled correctly by adjust_timestamp_array.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Cursor AI reviewer is correct that the timestamp value is already adjusted with adjust_timestamp_array(), so there is no need to do it again manually for the hour part. Prevents a double offsetting for the hour part
| } else { | ||
| let sign = if tz_str.starts_with('+') { 1i32 } else { -1i32 }; | ||
| let hours_str = &tz_str[1..3]; | ||
| let hours: i32 = hours_str.parse().unwrap(); |
There was a problem hiding this comment.
Bug: Unsafe unwrap for timezone parsing panics on invalid formats
Use of .unwrap() on parsing hours from timezone string. This will panic if the timezone string format is not exactly as expected (e.g., "+04:00"). Named timezones like "America/New_York" or other valid IANA timezone formats would cause a panic here. Should use proper error handling instead of .unwrap().
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Cursor AI reviewer is correct that named offsets are not supported in this PR. The finding prevents abort of the application due to the panic caused by the failed parsing of the hour.
| "doy" => date_part(array.as_ref(), DatePart::DayOfYear)?, | ||
| "dow" => date_part(array.as_ref(), DatePart::DayOfWeekSunday0)?, | ||
| "isodow" => date_part(array.as_ref(), DatePart::DayOfWeekMonday0)?, | ||
| "epoch" => epoch(array.as_ref())?, |
There was a problem hiding this comment.
For the "epoch" branch, the earlier timezone adjustment shifts the absolute epoch and returns a value offset by the timezone; epoch should be independent of timezone. Consider skipping the adjustment when extracting epoch so it reflects true seconds since Unix epoch.
🤖 React with 👍 or 👎 to let us know if the comment was useful.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that there shouldn't be adjusting of the timestamp when the user needs it as epoch seconds. The finding prevents wrong values for the epoch case.
| array | ||
| }; | ||
|
|
||
| let part_trim = part_normalization(&part); |
There was a problem hiding this comment.
Note that timezone-adjusting the input before computing parts will also affect epoch, yielding an offset result; epoch should not depend on timezone. Consider bypassing the adjustment when the requested part is "epoch".
🤖 React with 👍 or 👎 to let us know if the comment was useful.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct that there shouldn't be adjusting of the timestamp when the user needs it as epoch seconds. The finding prevents wrong values for the epoch case.
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
datafusion/functions/src/datetime/date_part.rs(6 hunks)datafusion/functions/src/datetime/extract.rs(1 hunks)datafusion/functions/src/datetime/mod.rs(3 hunks)datafusion/functions/src/datetime/planner.rs(1 hunks)datafusion/sql/src/expr/mod.rs(5 hunks)datafusion/sqllogictest/test_files/extract_tz.slt(1 hunks)datafusion/sqllogictest/test_files/group_by.slt(1 hunks)datafusion/sqllogictest/test_files/optimizer_group_by_constant.slt(1 hunks)datafusion/sqllogictest/test_files/table_functions.slt(1 hunks)datafusion/sqllogictest/test_files/tpch/plans/q7.slt.part(2 hunks)datafusion/sqllogictest/test_files/tpch/plans/q8.slt.part(2 hunks)datafusion/sqllogictest/test_files/tpch/plans/q9.slt.part(2 hunks)docs/source/user-guide/sql/scalar_functions.md(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
datafusion/functions/src/datetime/date_part.rs (2)
datafusion/functions/src/datetime/to_local_time.rs (11)
as_primitive_array(188-188)None(573-573)adjust_to_local_time(142-142)adjust_to_local_time(153-153)adjust_to_local_time(164-164)adjust_to_local_time(175-175)adjust_to_local_time(194-194)adjust_to_local_time(296-344)adjust_to_local_time(443-443)convert_timestamp(297-312)new(112-116)datafusion/functions/src/datetime/extract.rs (16)
as_primitive_array(375-375)config(230-230)tz_str(209-209)adjust_to_local_time(322-368)adjust_to_local_time(380-380)convert_timestamp(323-338)adjust_timestamp_array(216-216)adjust_timestamp_array(219-219)adjust_timestamp_array(222-222)adjust_timestamp_array(224-224)adjust_timestamp_array(236-236)adjust_timestamp_array(239-239)adjust_timestamp_array(242-242)adjust_timestamp_array(244-244)adjust_timestamp_array(370-386)new(101-134)
datafusion/functions/src/datetime/extract.rs (2)
datafusion/common/src/cast.rs (10)
as_date32_array(47-49)as_date64_array(52-54)as_time32_millisecond_array(215-217)as_time32_second_array(210-212)as_time64_microsecond_array(220-222)as_time64_nanosecond_array(225-227)as_timestamp_microsecond_array(244-248)as_timestamp_millisecond_array(237-241)as_timestamp_nanosecond_array(230-234)as_timestamp_second_array(251-253)datafusion/common/src/utils/mod.rs (1)
take_function_args(923-937)
datafusion/sql/src/expr/mod.rs (2)
datafusion/functions-table/src/generate_series.rs (2)
value(122-124)tz_str(126-128)datafusion/functions/src/datetime/date_part.rs (2)
new(102-136)tz_str(213-213)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (12)
datafusion/functions/src/datetime/date_part.rs (2)
204-252: LGTM! Timezone-aware extraction logic is well-structured.The implementation correctly handles three cases:
- Timezone-aware timestamps (with explicit timezone)
- Naive timestamps (interpreted in session timezone)
- Non-timestamp types (no adjustment)
The logic properly adjusts timestamps before extraction, ensuring date parts are computed in the appropriate timezone.
46-46: Both error macros are actively used and should remain imported.Verification confirms
internal_err!is used on line 153 andinternal_datafusion_err!is used on lines 333 and 339. Both macros are necessary for the file's error handling and should be retained.datafusion/sqllogictest/test_files/tpch/plans/q8.slt.part (2)
63-63: LGTM! Test expectations correctly updated.The logical plan projection now uses
extract(Utf8("YEAR"), orders.o_orderdate)instead ofdate_part(...), which aligns with the new Extract UDF implementation.
100-100: LGTM! Physical plan projection updated consistently.The physical plan correctly reflects the extract function usage:
extract(YEAR, o_orderdate@2).datafusion/functions/src/datetime/planner.rs (1)
32-32: LGTM! Planner correctly routes to new extract UDF.The change from
crate::datetime::date_part()tocrate::datetime::extract()properly integrates the new timezone-aware extract function into the datetime function planner.docs/source/user-guide/sql/scalar_functions.md (1)
2574-2603: LGTM! Comprehensive documentation for the extract function.The documentation clearly describes:
- Syntax:
extract(field FROM source)- Supported fields (year, quarter, month, etc.)
- Both field and source arguments
The documentation is consistent with the existing
date_partfunction documentation and properly positioned in the Time and Date Functions section.datafusion/sqllogictest/test_files/optimizer_group_by_constant.slt (1)
93-93: LGTM! Test output correctly reflects extract function.The projection now shows
NOT extract(Utf8("MONTH"),now())instead ofdate_part, which is consistent with the extract UDF migration.datafusion/sqllogictest/test_files/table_functions.slt (1)
356-357: LGTM! Timestamp output correctly includes timezone indicator.The timestamps now include the
Zsuffix (e.g.,2023-01-01T00:00:00Z), which properly indicates UTC timezone. This is consistent with the timezone-aware extraction implementation.datafusion/sqllogictest/test_files/tpch/plans/q9.slt.part (2)
59-59: LGTM! Logical plan updated to use extract.The projection correctly uses
extract(Utf8("YEAR"), orders.o_orderdate)for YEAR extraction.
85-85: LGTM! Physical plan consistent with logical plan.The physical plan projection correctly uses
extract(YEAR, o_orderdate@4).datafusion/sqllogictest/test_files/tpch/plans/q7.slt.part (2)
65-65: LGTM! Logical plan correctly uses extract function.The projection shows
extract(Utf8("YEAR"), lineitem.l_shipdate)for extracting the year from the shipdate column.
94-94: LGTM! Physical plan matches logical plan semantics.The physical plan projection correctly uses
extract(YEAR, l_shipdate@2).
| fn adjust_to_local_time<T: ArrowTimestampType>(ts: i64, tz: Tz) -> Result<i64> { | ||
| fn convert_timestamp<F>(ts: i64, converter: F) -> Result<DateTime<Utc>> | ||
| where | ||
| F: Fn(i64) -> MappedLocalTime<DateTime<Utc>>, | ||
| { | ||
| match converter(ts) { | ||
| MappedLocalTime::Ambiguous(earliest, latest) => exec_err!( | ||
| "Ambiguous timestamp. Do you mean {:?} or {:?}", | ||
| earliest, | ||
| latest | ||
| ), | ||
| MappedLocalTime::None => exec_err!( | ||
| "The local time does not exist because there is a gap in the local time." | ||
| ), | ||
| MappedLocalTime::Single(date_time) => Ok(date_time), | ||
| } | ||
| } | ||
|
|
||
| let date_time = match T::UNIT { | ||
| Nanosecond => Utc.timestamp_nanos(ts), | ||
| Microsecond => convert_timestamp(ts, |ts| Utc.timestamp_micros(ts))?, | ||
| Millisecond => convert_timestamp(ts, |ts| Utc.timestamp_millis_opt(ts))?, | ||
| Second => convert_timestamp(ts, |ts| Utc.timestamp_opt(ts, 0))?, | ||
| }; | ||
|
|
||
| let offset_seconds: i64 = tz | ||
| .offset_from_utc_datetime(&date_time.naive_utc()) | ||
| .fix() | ||
| .local_minus_utc() as i64; | ||
|
|
||
| let adjusted_date_time = date_time.add( | ||
| TimeDelta::try_seconds(offset_seconds) | ||
| .ok_or_else(|| internal_datafusion_err!("Offset seconds should be less than i64::MAX / 1_000 or greater than -i64::MAX / 1_000"))?, | ||
| ); | ||
|
|
||
| // convert back to i64 | ||
| match T::UNIT { | ||
| Nanosecond => adjusted_date_time.timestamp_nanos_opt().ok_or_else(|| { | ||
| internal_datafusion_err!( | ||
| "Failed to convert DateTime to timestamp in nanosecond. This error may occur if the date is out of range. The supported date ranges are between 1677-09-21T00:12:43.145224192 and 2262-04-11T23:47:16.854775807" | ||
| ) | ||
| }), | ||
| Microsecond => Ok(adjusted_date_time.timestamp_micros()), | ||
| Millisecond => Ok(adjusted_date_time.timestamp_millis()), | ||
| Second => Ok(adjusted_date_time.timestamp()), | ||
| } | ||
| } | ||
|
|
||
| fn adjust_timestamp_array<T: ArrowTimestampType>( | ||
| array: &ArrayRef, | ||
| tz: Tz, | ||
| ) -> Result<ArrayRef> { | ||
| let mut builder = PrimitiveBuilder::<T>::new(); | ||
| let primitive_array = as_primitive_array::<T>(array)?; | ||
| for ts_opt in primitive_array.iter() { | ||
| match ts_opt { | ||
| None => builder.append_null(), | ||
| Some(ts) => { | ||
| let adjusted_ts = adjust_to_local_time::<T>(ts, tz)?; | ||
| builder.append_value(adjusted_ts); | ||
| } | ||
| } | ||
| } | ||
| Ok(Arc::new(builder.finish())) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Refactor to eliminate code duplication.
The helper functions adjust_to_local_time and adjust_timestamp_array are duplicated across three files:
datafusion/functions/src/datetime/date_part.rs(this file)datafusion/functions/src/datetime/to_local_time.rs(lines 295-343, 369-385)datafusion/functions/src/datetime/extract.rs(lines 321-367, 369-385)
This violates the DRY principle and creates a maintenance burden. Consider extracting these functions into a shared utility module (e.g., datafusion/functions/src/datetime/utils.rs or similar) and importing them in all three locations.
// In a new shared module (e.g., datetime/timezone_utils.rs):
pub(crate) fn adjust_to_local_time<T: ArrowTimestampType>(ts: i64, tz: Tz) -> Result<i64> {
// ... existing implementation
}
pub(crate) fn adjust_timestamp_array<T: ArrowTimestampType>(
array: &ArrayRef,
tz: Tz,
) -> Result<ArrayRef> {
// ... existing implementation
}Then import in each file:
use crate::datetime::timezone_utils::{adjust_to_local_time, adjust_timestamp_array};There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that these helper functions could be reused instead of copied in several files. This will prevent technical dept and maintenance cost.
There was a problem hiding this comment.
@martin-augment, thank you for confirming this issue! You're absolutely right that extracting these helpers will prevent technical debt and reduce maintenance burden.
Since this refactoring affects multiple files in the datetime module, would you like to:
- Address it in this PR - I can help identify all the duplication points and suggest the refactoring structure
- Create a follow-up issue - Track this as a separate refactoring task to keep this PR focused on the timezone-aware functionality
If you'd like to go with option 2, I can open a GitHub issue for you with details about the duplicated functions and the suggested approach for consolidating them into a shared utility module.
What would you prefer?
| if is_timezone_aware && part_trim.to_lowercase() == "hour" { | ||
| if let Some(tz_str) = &tz_str_opt { | ||
| let offset_hours = if tz_str.as_ref() == "+00:00" { | ||
| 0 | ||
| } else { | ||
| let sign = if tz_str.starts_with('+') { 1i32 } else { -1i32 }; | ||
| let hours_str = &tz_str[1..3]; | ||
| let hours: i32 = hours_str.parse().unwrap(); | ||
| sign * hours | ||
| }; | ||
| let int_arr = as_int32_array(&arr)?; | ||
| let mut builder = PrimitiveBuilder::<Int32Type>::new(); | ||
| for i in 0..arr.len() { | ||
| if arr.is_null(i) { | ||
| builder.append_null(); | ||
| } else { | ||
| let v = int_arr.value(i); | ||
| builder.append_value(v + offset_hours); | ||
| } | ||
| } | ||
| arr = Arc::new(builder.finish()); | ||
| } |
There was a problem hiding this comment.
Remove the post-adjustment of hours for timezone-aware timestamps
adjust_timestamp_array(...) already converts each timestamp into the correct local time before date_part runs. The extra block that re-adds an offset:
- panics outright for common zone names like
America/New_York(hours_strindexing hits non-digits and theparse().unwrap()aborts), - ignores minute offsets (e.g.
+05:30becomes+05), and - double-adjusts fixed offsets, yielding wrong hour values.
We should drop this block and rely on the earlier conversion for all parts.
- if is_timezone_aware && part_trim.to_lowercase() == "hour" {
- if let Some(tz_str) = &tz_str_opt {
- let offset_hours = if tz_str.as_ref() == "+00:00" {
- 0
- } else {
- let sign = if tz_str.starts_with('+') { 1i32 } else { -1i32 };
- let hours_str = &tz_str[1..3];
- let hours: i32 = hours_str.parse().unwrap();
- sign * hours
- };
- let int_arr = as_int32_array(&arr)?;
- let mut builder = PrimitiveBuilder::<Int32Type>::new();
- for i in 0..arr.len() {
- if arr.is_null(i) {
- builder.append_null();
- } else {
- let v = int_arr.value(i);
- builder.append_value(v + offset_hours);
- }
- }
- arr = Arc::new(builder.finish());
- }
- }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if is_timezone_aware && part_trim.to_lowercase() == "hour" { | |
| if let Some(tz_str) = &tz_str_opt { | |
| let offset_hours = if tz_str.as_ref() == "+00:00" { | |
| 0 | |
| } else { | |
| let sign = if tz_str.starts_with('+') { 1i32 } else { -1i32 }; | |
| let hours_str = &tz_str[1..3]; | |
| let hours: i32 = hours_str.parse().unwrap(); | |
| sign * hours | |
| }; | |
| let int_arr = as_int32_array(&arr)?; | |
| let mut builder = PrimitiveBuilder::<Int32Type>::new(); | |
| for i in 0..arr.len() { | |
| if arr.is_null(i) { | |
| builder.append_null(); | |
| } else { | |
| let v = int_arr.value(i); | |
| builder.append_value(v + offset_hours); | |
| } | |
| } | |
| arr = Arc::new(builder.finish()); | |
| } | |
| // [preceding code context - lines 1-281] | |
| // [code that follows the removed block - originally after line 303] |
🤖 Prompt for AI Agents
In datafusion/functions/src/datetime/extract.rs around lines 282 to 303, remove
the entire post-adjustment block that manually parses tz_str and adds
offset_hours to the hour values; this block panics on named zones, drops minute
offsets, and double-adjusts timestamps because adjust_timestamp_array already
produced local times. Delete that conditional branch (the if let Some(tz_str) {
... } block) and any dependent code so date_part uses the already-converted
values; ensure arr remains the Arc returned from adjust_timestamp_array and no
further manual offset arithmetic is performed.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the timestamp value is already adjusted with adjust_timestamp_array(), so there is no need to do it again manually for the hour part. Prevents a double offsetting for the hour part
There was a problem hiding this comment.
@martin-augment, thank you for confirming! You're absolutely right—since adjust_timestamp_array() already converts the timestamps to local time by applying the timezone offset, the additional manual hour adjustment at lines 282-303 would indeed cause double offsetting. Removing that block ensures the hour (and all other parts) are extracted from the correctly adjusted timestamps without any redundant offset arithmetic.
If you'd like, I can help verify the fix once you've applied it, or assist with any related testing! 🎉
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
|
value:useful; category:bug; feedback:The Codex AI reviewer is correct that there shouldn't be adjusting of the timestamp when the user needs it as epoch seconds. The finding prevents wrong values for the |
value:useful; category:bug; feedback:The Codex AI reviewer is correct that currently special timezones like +05:30 and -07:45 won't work as desired. The finding prevents incorrect date part values due to the ignored minutes in the offset. Named offsets are not supported at all in this PR. They will panic and abort the application. |
18417: To review by AI
Note
Introduce a timezone-aware
EXTRACTUDF, enhanceDATE_PARTand typed timestamp casts with timezone handling, and update planner, tests, and docs accordingly.extractUDF (datafusion/functions/src/datetime/extract.rs) with timezone-aware extraction and epoch/seconds handling.date_partto adjust timestamp values using array timezone or session timezone; add helpersadjust_to_local_timeandadjust_timestamp_array.extractindatetime::functions()and exposemake_udf_function!(extract::ExtractFunc, extract).EXTRACTtoextractUDF inDatetimeFunctionPlanner.sql/expr/mod.rs, detect timezone inTypedStringcasts and setTimestamp(_, Some(tz)); addextract_tz_from_string.sqllogictestfileextract_tz.sltcovering timezone-awareEXTRACTand session TZ.extract(...)(replacingdate_part(...)) and includeZsuffix for timezone-aware ranges.extractfunction and update Time and Date Functions index.Written by Cursor Bugbot for commit 5f2f5d7. This will update automatically on new commits. Configure here.