Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predicates on to_timestamp do not work as expected with "naive" timestamp strings #765

Closed
Tracked by #3148
alamb opened this issue Jul 21, 2021 · 7 comments · Fixed by apache/arrow-rs#2814
Closed
Tracked by #3148
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Jul 21, 2021

Describe the bug
Given a TimestampNanosecondArray which pretty-prints as follows:

    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+

Queries involving a predicate such as time < to_timestamp('2021-07-20 23:29:30') do not filter any rows (even though they should filter the row with 2021-07-20 23:30:30)

To Reproduce

async fn datafusion_reproducer() {
    let array = TimestampNanosecondArray::from(vec![1626823730000000000, 1626823830000000000]);
    let array: ArrayRef = Arc::new(array);
    println!("array[0]: {:?}", array_value_to_string(&array, 0).unwrap());
    println!("array[1]: {:?}", array_value_to_string(&array, 1).unwrap());

    let batch = RecordBatch::try_from_iter(vec![("time", array)]).unwrap();
    let table = MemTable::try_new(batch.schema(), vec![vec![batch]]).unwrap();
    let table = Arc::new(table);

    // select * from t
    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+
    run_query(table.clone(), "select * from t").await;

    // Using the following predicate should result in a single row,
    // but instead results in both
    //
    // select * from t where time < to_timestamp('2021-07-20 23:29:30')
    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+
    run_query(table.clone(), "select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;



    // explain select * from t where time < to_timestamp('2021-07-20 23:29:30')
    // +---------------+---------------------------------------------------------------+
    // | plan_type     | plan                                                          |
    // +---------------+---------------------------------------------------------------+
    // | logical_plan  | Projection: #t.time                                           |
    // |               |   Filter: #t.time Lt TimestampNanosecond(1626838170000000000) |
    // |               |     TableScan: t projection=Some([0])                         |
    // | physical_plan | ProjectionExec: expr=[time@0 as time]                         |
    // |               |   CoalesceBatchesExec: target_batch_size=4096                 |
    // |               |     FilterExec: time@0 < 1626838170000000000                  |
    // |               |       RepartitionExec: partitioning=RoundRobinBatch(16)       |
    // |               |         MemoryExec: partitions=1, partition_sizes=[1]         |
    // +---------------+---------------------------------------------------------------+

    run_query(table.clone(), "explain select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;
}

#[allow(dead_code)]
async fn run_query(csvdata: Arc<dyn TableProvider>, query: &str)  {


    let mut ctx = ExecutionContext::new();
    ctx.register_table("t", csvdata).unwrap();

    let results = ctx.sql(query)
        .unwrap()
        .collect()
        .await
        .unwrap();

    let pretty = pretty_format_batches(&results).unwrap();
    println!("{}\n{}", query, pretty);
}

Expected behavior
The query should produce a single row with timestamp 2021-07-20 23:28:50

However the actual query returns both rows

Additional context
We saw this in IOx: https://github.com/influxdata/influxdb_iox/issues/2071

@alamb alamb added the bug Something isn't working label Jul 21, 2021
@alamb alamb self-assigned this Jul 21, 2021
@alamb
Copy link
Contributor Author

alamb commented Jul 21, 2021

FYI @mcassels @lvheyang and @velvia

@alamb alamb changed the title Predicates on to_timestamp do not work as expected Predicates on to_timestamp do not work as expected with "naive" timestamp strings Jul 21, 2021
@alamb
Copy link
Contributor Author

alamb commented Jul 21, 2021

The core of the problem is that TimestampNanosecondArray is defined to have type Timestamp(Nanoseconds, None) -- the second argument of None means the following according to the arrow spec reference to schema.fbs:

  /// * If the time zone is null or an empty string, the data is a local date-time
  ///   and does not represent a single moment in time.  Instead it represents a wall clock
  ///   time and care should be taken to avoid interpreting it semantically as an instant.

So that certainly suggests we should not be applying any normalization to timestamps if there is no specific timezone set; Instead, we should return the raw "naive" timestamp (which corresponds to the arrow semantics for Timestamp(_, None) I think)

Now this leaves open the question of "what do we do if the timestamp has an explicit timezone in it"? For example, 2021-07-20 23:28:50-05:00

If the desired output timezone is UTC then it makes sense to convert this to UTC 👍 ; However if the desired output timezone is "None" then what?

I feel this is very similar to the question that @velvia was getting at in #686

I will continue the conversation there.

@alamb
Copy link
Contributor Author

alamb commented Jul 21, 2021

#686 (comment) is the proposal for handling timezones more properly

@waitingkuo
Copy link
Contributor

waitingkuo commented Sep 30, 2022

@alamb i think to_timestamp automatically append your local time zone if there's no timezone in your time string, and convert it back to utc (note that the underline value doesn't change), and then drop the timezone

i'm in utc+8, here's what i have

select to_timestamp('2021-07-20 23:29:30');
+------------------------------------------+
| totimestamp(Utf8("2021-07-20 23:29:30")) |
+------------------------------------------+
| 2021-07-20 15:29:30                      |
+------------------------------------------+
1 row in set. Query took 0.001 seconds.

I assume that you are in negative time zone so that the result timestamp is larger

@waitingkuo
Copy link
Contributor

waitingkuo commented Sep 30, 2022

this is where the offset comes from
https://github.com/apache/arrow-rs/blob/e2bf158946e5d81912bc9166d87b86f0ad442afb/arrow/src/compute/kernels/cast_utils.rs#L135-L155

[   /// Interprets a naive_datetime (with no explicit timezone offset)
    /// using the local timezone and returns the timestamp in UTC (0
    /// offset)
    fn naive_datetime_to_timestamp(naive_datetime: &NaiveDateTime) -> i64 {
        // Note: Use chrono APIs that are different than
        // naive_datetime_to_timestamp to compute the utc offset to
        // try and double check the logic
        let utc_offset_secs = match Local.offset_from_local_datetime(naive_datetime) {
            LocalResult::Single(local_offset) => {
                local_offset.fix().local_minus_utc() as i64
            }
            _ => panic!("Unexpected failure converting to local datetime"),
        };
        let utc_offset_nanos = utc_offset_secs * 1_000_000_000;
        naive_datetime.timestamp_nanos() - utc_offset_nanos
    }](https://github.com/apache/arrow-rs/blob/e2bf158946e5d81912bc9166d87b86f0ad442afb/arrow/src/compute/kernels/cast_utils.rs#L135-L155)

@waitingkuo
Copy link
Contributor

@alamb would you mind helping check whether this is the root cause?

@alamb
Copy link
Contributor Author

alamb commented Oct 2, 2022

@waitingkuo -- it certainly sounds plausible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants