
20224: fix: Fix scalar broadcast for to_timestamp() #229

Open

martin-augment wants to merge 1 commit into main from pr-20224-2026-02-09-07-41-56

Conversation

@martin-augment
Owner

20224: To review by AI

When to_timestamp() was invoked with a scalar Float64 and an array of
strings, the previous code failed to broadcast the scalar across the
array when producing the return value. As a result, a query like
`SELECT to_timestamp(123.5, t.x) FROM t` would fail with:

    Internal error: UDF to_timestamp returned a different number of rows
    than expected
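
For reference, the scalar branch added in this PR boils down to a seconds-to-nanoseconds conversion (mirroring the Float64 arm in the diff). A standalone sketch of that arithmetic, not the actual DataFusion source:

    // Standalone sketch of the seconds-to-nanoseconds conversion used for
    // scalar floats; simplified, not the actual DataFusion code.
    fn float_secs_to_nanos(secs: f64) -> i64 {
        (secs * 1_000_000_000.0) as i64
    }

    fn main() {
        // 123.5 seconds after the epoch => 1970-01-01T00:02:03.500
        assert_eq!(float_secs_to_nanos(123.5), 123_500_000_000);
    }
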
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the to_timestamp function where it previously failed to correctly process scalar floating-point values. The changes introduce explicit handling for Float16, Float32, and Float64 scalar inputs, ensuring they are properly converted to TimestampNanosecond. This enhancement improves the robustness of the to_timestamp function and is validated by new SQL logic tests.

Highlights

  • Scalar Float Handling in to_timestamp: The to_timestamp function now correctly processes scalar Float16, Float32, and Float64 inputs, ensuring proper conversion to TimestampNanosecond.
  • New SQL Logic Tests: Added new SQL logic tests to validate the to_timestamp function's behavior with various scalar floating-point inputs, confirming the fix.
  • Minor File Cleanup: Removed trailing newlines from several SQL logic test files as a small code-hygiene cleanup.
Changelog
  • datafusion/functions/src/datetime/to_timestamp.rs
    • Refactored the Float16, Float32, and Float64 branches to use match statements, distinguishing between ColumnarValue::Scalar and ColumnarValue::Array inputs.
    • Added logic to convert scalar float values directly to TimestampNanosecond using a map operation.
  • datafusion/sqllogictest/test_files/datetime/arith_date_time.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/datetime/arith_timestamp_duration.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/datetime/timestamps.slt
    • Introduced new test cases for to_timestamp with scalar Float16, Float32, and Float64 values to confirm the fix.
  • datafusion/sqllogictest/test_files/limit.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/limit_single_row_batches.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/spark/collection/size.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/spark/datetime/time_trunc.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/spark/datetime/trunc.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/struct.slt
    • Removed a trailing newline at the end of the file.
  • datafusion/sqllogictest/test_files/truncate.slt
    • Removed a trailing newline at the end of the file.
Activity
  • The pull request was created by martin-augment with the title "20224: fix: Fix scalar broadcast for to_timestamp()".
  • The author explicitly requested review by AI in the description: "To review by AI".

@coderabbitai

coderabbitai bot commented Feb 9, 2026

Walkthrough

This pull request refactors the to_timestamp function in DataFusion to support both scalar and array inputs for floating-point types (Float16, Float32, Float64). Previously, these types were only processed as dedicated array branches. The changes introduce per-argument logic that converts scalar floating-point values to nanoseconds and returns TimestampNanosecond scalars, while preserving array-based conversion for array inputs. The pull request also includes updates to test files: adding new test coverage for scalar float inputs in timestamps tests, removing an invalid modulo operation test, and adjusting whitespace formatting across multiple test files.


No actionable comments were generated in the recent review. 🎉



@claude

claude bot commented Feb 9, 2026

Pull Request Review: Fix scalar broadcast for to_timestamp()

Summary

This PR fixes a bug where to_timestamp() mishandled scalar float inputs (Float16, Float32, Float64) by always converting them to one-element arrays via to_array(1); scalars should instead remain scalars so that normal scalar broadcasting applies and no per-batch allocation is needed.
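
To make the failure mode concrete, here is a minimal sketch (not DataFusion code; names are illustrative) of the row-count check that the forced 1-element array tripped:

    // The engine expects a UDF's output to have one value per input row; a
    // scalar satisfies this implicitly, but a forced 1-element array does not.
    fn check_udf_output(batch_rows: usize, output_rows: usize) -> Result<(), String> {
        if output_rows != batch_rows {
            return Err(format!(
                "UDF to_timestamp returned a different number of rows than expected: \
                 got {output_rows}, expected {batch_rows}"
            ));
        }
        Ok(())
    }

    fn main() {
        // Pre-fix: the scalar was expanded to a 1-row array regardless of batch size.
        assert!(check_udf_output(8192, 1).is_err());
        // Post-fix: the scalar is preserved and broadcast, so the counts match.
        assert!(check_udf_output(8192, 8192).is_ok());
    }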

Code Quality: ✅ Good

Strengths:

  • The fix correctly implements the scalar/array pattern used throughout DataFusion
  • Code follows existing conventions seen in other functions (e.g., advanced_udf.rs examples)
  • Proper use of match expressions to handle both ColumnarValue::Scalar and ColumnarValue::Array cases
  • Consistent pattern applied across all three float types (Float16, Float32, Float64)

Style Observations:

  • The implementation aligns with DataFusion's performance optimization strategy of keeping scalars as scalars
  • Error messages are clear and descriptive

Potential Issues: ⚠️ Minor Concerns

1. Unreachable Error Branch

The _ => exec_err!("Invalid Float16/32/64 value for to_timestamp") branches are likely unreachable:

Location: datafusion/functions/src/datetime/to_timestamp.rs:448, 465, 481

Since the outer match args[0].data_type() already ensures we're dealing with the correct float type, the inner match should only see:

  • ColumnarValue::Scalar(ScalarValue::Float16(_))
  • ColumnarValue::Array(_)

The wildcard _ pattern would only match if there's a type mismatch between data_type() and the actual ColumnarValue variant, which shouldn't happen in normal operation.

Recommendation: Consider whether these error branches should use unreachable!() or internal_err!() instead, to signal that they represent impossible states.
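
A hedged sketch of what that might look like for the Float64 arm, pulled out into a hypothetical helper (`float64_arm` is not a function in the PR; the array path is elided and the timezone is fixed to None for brevity):

    use datafusion_common::{internal_err, Result, ScalarValue};
    use datafusion_expr::ColumnarValue;

    fn float64_arm(arg: &ColumnarValue) -> Result<ColumnarValue> {
        match arg {
            ColumnarValue::Scalar(ScalarValue::Float64(value)) => {
                let nanos = value.map(|v| (v * 1_000_000_000.0) as i64);
                Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(nanos, None)))
            }
            ColumnarValue::Array(_) => {
                unimplemented!("array path unchanged from the PR, omitted in this sketch")
            }
            // The outer match on data_type() already guarantees Float64, so hitting
            // this arm would mean an engine invariant was violated, not a user error.
            _ => internal_err!("Float64 data type paired with a non-Float64 ColumnarValue"),
        }
    }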

2. Missing Test Coverage for Edge Cases

The new tests in timestamps.slt are good but could be expanded:

Current coverage:

  • ✅ Scalar float with array query (the main bug fix)
  • ✅ Float16, Float32, Float64 types

Missing coverage:

  • NULL scalar floats: to_timestamp(NULL::float)
  • NaN values: to_timestamp('NaN'::float)
  • Infinity values: to_timestamp('Infinity'::float)
  • Negative timestamps: to_timestamp(-123.5)

These edge cases could expose surprising results from the float-to-i64 cast (which saturates rather than overflowing).
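
For what it's worth, Rust's float-to-integer as cast is saturating, so a quick standalone check of the edge cases above (independent of DataFusion) behaves as follows; SQL-level tests would still be needed to pin the observable results:

    fn to_nanos(secs: f64) -> i64 {
        (secs * 1_000_000_000.0) as i64
    }

    fn main() {
        assert_eq!(to_nanos(f64::NAN), 0);               // NaN saturates to 0
        assert_eq!(to_nanos(f64::INFINITY), i64::MAX);   // +inf saturates to i64::MAX
        assert_eq!(to_nanos(-123.5), -123_500_000_000);  // negative => pre-epoch timestamp
    }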

3. Floating Point Precision Loss

Location: datafusion/functions/src/datetime/to_timestamp.rs:453, 462, 469

The conversions (v as f64 * 1_000_000_000.0) as i64 can lose precision for:

  • Large timestamps (dates far in future/past)
  • Values originating as Float32, since the Float32 → Float64 widening carries the Float32 rounding error into the multiplication

Example:

// Float32 precision is ~7 decimal digits
let f32_val = 456.789f32;
let result = (f32_val as f64 * 1_000_000_000.0) as i64;
// May not precisely equal 456789000000

The test output shows this: 1970-01-01T00:07:36.789001464 instead of an exact ...789000000

Recommendation: Add a comment documenting this known limitation, or consider whether Float32 should use a different conversion strategy.
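
A small standalone demonstration of the effect (the numbers below are just the IEEE-754 rounding of 456.789 to Float32, matching the test output quoted above):

    fn main() {
        let f32_val = 456.789_f32;
        let widened = f32_val as f64;        // 456.78900146484375: the nearest f32 to 456.789
        let nanos = (widened * 1_000_000_000.0) as i64;
        assert_eq!(nanos, 456_789_001_464);  // hence ...789001464 in the test output
    }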

Performance: ✅ Excellent

This fix is a significant performance improvement:

Before:

  • Scalar floats were always expanded to 1-element arrays
  • Required allocation and copying even for constant values
  • Query optimizer couldn't leverage scalar simplification

After:

  • Scalars remain as scalars throughout execution
  • Zero-copy for constant propagation
  • Enables expression simplification during planning

Benchmark scenario:

SELECT to_timestamp(123.5, format_col) FROM large_table;
  • Before: Creates millions of 1-element arrays
  • After: Single scalar value reused for all rows

Security: ✅ No concerns

  • No unsafe code introduced
  • Float-to-i64 casts are safe (saturating behavior)
  • No untrusted input parsing changes
  • Timezone handling remains unchanged

Test Coverage: ✅ Good (with minor gaps)

Strengths:

  • Tests the actual bug: scalar float with varying format column
  • Covers all three float types
  • Validates correct timestamp output
  • Tests integrate with existing sqllogictest suite

Gaps:

  • No edge case coverage (NULL, NaN, Infinity, negative values)
  • No explicit performance/memory test (though this is typical for DataFusion)
  • No test for the error branches (acceptable if unreachable)

Additional Observations

Whitespace-only changes

Several test files have trailing newline changes:

  • arith_date_time.slt, arith_timestamp_duration.slt, etc.

These are minor cleanup changes and improve consistency, but are unrelated to the main fix.

Recommendations

  1. Consider adding edge case tests for NULL, NaN, Infinity, and negative timestamps
  2. Document floating point precision limitations with a code comment
  3. Review error branch reachability - consider using unreachable!() if they truly can't be hit
  4. Optional: Add an integration test that verifies scalar values remain as scalars (not converted to arrays)

Verdict: ✅ Approve with minor suggestions

This is a solid bug fix that:

  • Correctly implements the scalar broadcast pattern
  • Improves performance significantly
  • Has good test coverage for the main use case
  • Follows DataFusion coding conventions

The minor concerns are non-blocking and could be addressed in follow-up PRs if desired.


Reviewed by Claude Code


gemini-code-assist bot left a comment


Code Review

This pull request fixes an issue with scalar broadcast for to_timestamp() when using float types. The changes correctly handle ColumnarValue::Scalar for Float16, Float32, and Float64 inputs, and new tests are added to cover these cases. The logic for these float types is quite similar, and I've suggested a refactoring to reduce code duplication. Additionally, several test files have had trailing newlines removed, which is a minor formatting improvement.

Comment on lines +433 to +482
Float16 => match &args[0] {
    ColumnarValue::Scalar(ScalarValue::Float16(value)) => {
        let timestamp_nanos =
            value.map(|v| (v.to_f64() * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(
            timestamp_nanos,
            tz,
        )))
    }
    ColumnarValue::Array(arr) => {
        let f16_arr = downcast_arg!(arr, Float16Array);
        let result: TimestampNanosecondArray =
            f16_arr.unary(|x| (x.to_f64() * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Array(Arc::new(result.with_timezone_opt(tz))))
    }
    _ => exec_err!("Invalid Float16 value for to_timestamp"),
},
Float32 => match &args[0] {
    ColumnarValue::Scalar(ScalarValue::Float32(value)) => {
        let timestamp_nanos =
            value.map(|v| (v as f64 * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(
            timestamp_nanos,
            tz,
        )))
    }
    ColumnarValue::Array(arr) => {
        let f32_arr = downcast_arg!(arr, Float32Array);
        let result: TimestampNanosecondArray =
            f32_arr.unary(|x| (x as f64 * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Array(Arc::new(result.with_timezone_opt(tz))))
    }
    _ => exec_err!("Invalid Float32 value for to_timestamp"),
},
Float64 => match &args[0] {
    ColumnarValue::Scalar(ScalarValue::Float64(value)) => {
        let timestamp_nanos = value.map(|v| (v * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(
            timestamp_nanos,
            tz,
        )))
    }
    ColumnarValue::Array(arr) => {
        let f64_arr = downcast_arg!(arr, Float64Array);
        let result: TimestampNanosecondArray =
            f64_arr.unary(|x| (x * 1_000_000_000.0) as i64);
        Ok(ColumnarValue::Array(Arc::new(result.with_timezone_opt(tz))))
    }
    _ => exec_err!("Invalid Float64 value for to_timestamp"),
},


Severity: medium

The logic for handling Float16, Float32, and Float64 is very similar and contains a lot of duplicated code. This can be refactored to improve maintainability. I suggest unifying the logic by casting all float types to Float64 first.

            Float16 | Float32 | Float64 => {
                // Cast to Float64 to unify handling
                let arg = if args[0].data_type() != &DataType::Float64 {
                    args[0].cast_to(&DataType::Float64, None)?
                } else {
                    args[0].clone()
                };

                match &arg {
                    ColumnarValue::Scalar(ScalarValue::Float64(value)) => {
                        let timestamp_nanos = value.map(|v| (v * 1_000_000_000.0) as i64);
                        Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(
                            timestamp_nanos,
                            tz,
                        )))
                    }
                    ColumnarValue::Array(arr) => {
                        let f64_arr = downcast_arg!(arr, Float64Array);
                        let result: TimestampNanosecondArray =
                            f64_arr.unary(|x| (x * 1_000_000_000.0) as i64);
                        Ok(ColumnarValue::Array(Arc::new(result.with_timezone_opt(tz))))
                    }
                    _ => exec_err!("Invalid float value for to_timestamp"),
                }
            },

Owner Author


value:annoying; category:bug; feedback: The suggestion by the Gemini AI reviewer is not optimal. The casting might be cheap, but it is not zero-cost. It would be better to use a macro instead and pass the types as arguments.
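
For illustration only, one possible shape of such a macro. This is a sketch meant to sit inside to_timestamp.rs next to the existing imports and the downcast_arg! macro; the names and call sites are illustrative, not the actual implementation:

    // The float type's ScalarValue variant, Arrow array type, and widening
    // expression are passed as macro arguments, so no runtime cast is needed.
    macro_rules! float_to_timestamp_nanos {
        ($arg:expr, $tz:expr, $scalar_variant:ident, $array_ty:ident, $to_f64:expr) => {
            match $arg {
                ColumnarValue::Scalar(ScalarValue::$scalar_variant(value)) => {
                    let nanos = value.map(|v| ($to_f64(v) * 1_000_000_000.0) as i64);
                    Ok(ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(nanos, $tz)))
                }
                ColumnarValue::Array(arr) => {
                    let arr = downcast_arg!(arr, $array_ty);
                    let result: TimestampNanosecondArray =
                        arr.unary(|x| ($to_f64(x) * 1_000_000_000.0) as i64);
                    Ok(ColumnarValue::Array(Arc::new(result.with_timezone_opt($tz))))
                }
                _ => exec_err!("Invalid float value for to_timestamp"),
            }
        };
    }

    // Possible call sites mirroring the three arms in the diff:
    // Float32 => float_to_timestamp_nanos!(&args[0], tz, Float32, Float32Array, |v: f32| v as f64),
    // Float64 => float_to_timestamp_nanos!(&args[0], tz, Float64, Float64Array, |v: f64| v),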

@augmentcode

augmentcode bot commented Feb 9, 2026

🤖 Augment PR Summary

Summary: Fixes to_timestamp() behavior when the input is a scalar float but the expression is evaluated over multi-row batches.

Changes:

  • Handle Float16/Float32/Float64 scalar inputs by returning a scalar Timestamp(Nanosecond) (instead of forcing a length-1 array), enabling correct scalar broadcasting.
  • Preserve existing array-path behavior for float arrays (vectorized conversion to nanoseconds).
  • Add sqllogictest coverage for scalar float constants/casts used alongside a column argument.
  • Normalize/newline-only adjustments in several .slt files.

Technical Notes: The updated float handling avoids constructing a 1-row array for scalars, allowing DataFusion’s normal scalar-to-batch expansion to produce correctly-sized results.



augmentcode bot left a comment


Review completed. No suggestions at this time.

