Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joining tables based on time with different time-zones - no warning #4773

Closed
PetoLau opened this issue Oct 21, 2020 · 6 comments
Closed

Joining tables based on time with different time-zones - no warning #4773

PetoLau opened this issue Oct 21, 2020 · 6 comments
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins

Comments

@PetoLau
Copy link

PetoLau commented Oct 21, 2020

Hello,

I caught myself at joining two data.tables based on timestamp, but with different time-zones (unfortunately).

The joined values were automatically shifted based on first data.table's time-zone...but without warning - or anything else.
Is it possible to throw warning if something like that is happening?

Example:

library(data.table)
library(lubridate)

dt_a <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"),
                                     to = ymd_hms("2020-10-20 10:00:00"),
                                     by = "1 hour"),
                   value = 1:5)

dt_b <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"),
                                     to = ymd_hms("2020-10-20 10:00:00"),
                                     by = "1 hour"),
                   value = 1:5)

dt_b[, time := lubridate::force_tz(time, tzone = "CET", roll = F)]

dt_a[dt_b,
     on = .(time),
     value_b := i.value]

Result:

dt_a
time value value_b
1: 2020-10-20 06:00:00 1 3
2: 2020-10-20 07:00:00 2 4
3: 2020-10-20 08:00:00 3 5
4: 2020-10-20 09:00:00 4 NA
5: 2020-10-20 10:00:00 5 NA

Thanks

@jangorecki
Copy link
Member

Thank you for reporting. It could be useful to know how data.frame behaves on that matter as well.

@jangorecki jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Oct 21, 2020
@MichaelChirico
Copy link
Member

FWIW, do recall that timezones are somewhat "artificial" in the sense that they are "only" a printing attribute.

Once the class of the object is POSIXct, data from all timezones is represented as "UTC epoch seconds" (seconds since Jan 1 1970 UTC). The timezone is just used to create the string representation.

So IINM the join will be done correctly (in that observations at identical points in time [i.e., identical displacements from epoch time] will be matched even if they occurred in different time zones).

Coercion / ignoring attributes is a bit more of a general issue for data.table, see #4541, #4415, #3911, #3087... probably a few others

@eddelbuettel
Copy link
Contributor

The trouble here is that the lubridate function actually alters the values. Interspersing OP's code with some 'naked' printing:

 R> dt_b <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"), to = ymd_hms("2020-10-20 10:00:00"), by = "1 hour"), value = 1:5)                                                                  
 R> dt_b[,time]                                                                                                                                                                                                    
 [1] "2020-10-20 06:00:00 UTC" "2020-10-20 07:00:00 UTC" "2020-10-20 08:00:00 UTC" "2020-10-20 09:00:00 UTC" "2020-10-20 10:00:00 UTC"                                                                             
 R> as.numeric(dt_b[, time])                                                                                                                                                                                       
 [1] 1603173600 1603177200 1603180800 1603184400 1603188000                                                                                                                                                        
 R> dt_b[, time := lubridate::force_tz(time, tzone = "CET", roll = F)]                                                                                                                                             
 R> dt_b[,time]                                                                                                                                                                                                    
 [1] "2020-10-20 06:00:00 CEST" "2020-10-20 07:00:00 CEST" "2020-10-20 08:00:00 CEST" "2020-10-20 09:00:00 CEST" "2020-10-20 10:00:00 CEST"                                                                        
 R> as.numeric(dt_b[, time])                                                                                                                                                                                       
 [1] 1603166400 1603170000 1603173600 1603177200 1603180800                                                                                                                                                        
!R> dt_a[dt_b, on = .(time), value_b := i.value]                                                                                                                                                                   
 R> 

So here the actual numeric holding the value was shifted. On purpose. Should that really trigger a warning?

Computing with dates and times is tricky. Doubly so in the presence of timezones. Most 'professional' work I have seen in many years settles on one TZ, often UTC or whereever the head office is.

@PetoLau
Copy link
Author

PetoLau commented Oct 22, 2020

@jangorecki the data.frame's way result is same as data.table's...so this is not really data.table issue...sorry

Interesting problem here, when time is used as string and when as numeric, and nobody knows what is used when...for example during extraction of hour attribute the string is used:

> lubridate::hour(dt_a$time)
[1]  6  7  8  9 10
> lubridate::hour(dt_b$time)
[1]  6  7  8  9 10

@shrektan
Copy link
Member

I can't understand this issue, frankly speaking. As @eddelbuettel commented, lubridate::force_tz() will alter the value, and your example looks expected for me. A more appropriated example, in my opinion, should use lubridate::with_tz():

library(data.table)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:data.table':
#> 
#>     hour, isoweek, mday, minute, month, quarter, second, wday, week,
#>     yday, year
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

dt_a <- data.table(
  time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00", tz = "UTC"),
                    to = ymd_hms("2020-10-20 10:00:00", tz = "UTC"),
                    by = "1 hour"),
  value = 1:5
)
dt_a
#>                   time value
#> 1: 2020-10-20 06:00:00     1
#> 2: 2020-10-20 07:00:00     2
#> 3: 2020-10-20 08:00:00     3
#> 4: 2020-10-20 09:00:00     4
#> 5: 2020-10-20 10:00:00     5

dt_b <- copy(dt_a)
dt_b[, time := lubridate::with_tz(time, tzone = "Asia/Shanghai")]
dt_b
#>                   time value
#> 1: 2020-10-20 14:00:00     1
#> 2: 2020-10-20 15:00:00     2
#> 3: 2020-10-20 16:00:00     3
#> 4: 2020-10-20 17:00:00     4
#> 5: 2020-10-20 18:00:00     5

dt_a[dt_b,
     on = .(time),
     value_b := i.value]
dt_a
#>                   time value value_b
#> 1: 2020-10-20 06:00:00     1       1
#> 2: 2020-10-20 07:00:00     2       2
#> 3: 2020-10-20 08:00:00     3       3
#> 4: 2020-10-20 09:00:00     4       4
#> 5: 2020-10-20 10:00:00     5       5

Created on 2020-10-24 by the reprex package (v0.3.0)

@jangorecki
Copy link
Member

I think we can close this issue. We should not expect posixct to merge to shifted posixct. They should be aligned before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

No branches or pull requests

5 participants