Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug / misfeature in POSIXct creation #1619

Closed
eddelbuettel opened this issue Apr 1, 2016 · 30 comments
Closed

Possible bug / misfeature in POSIXct creation #1619

eddelbuettel opened this issue Apr 1, 2016 · 30 comments

Comments

@eddelbuettel
Copy link
Contributor

I am using data.table to hold/slide/dice some data which I then combine with data help in xts objects. Those work pretty reliably in local time with their timezone.

Now when comparing some time intervals with what I get in data.table, I think I arrived at a "misfeature".
Here is simple example script:

Sys.setenv("TZ"="America/Chicago")
library(data.table)

dt <- data.table(date=as.IDate("2016-03-02") + (0:3)*7,   # four Wednesdays
                 time=as.ITime("09:30:00"))
print(dt)

dt[, pTime  := as.POSIXct(x=time, date=date)]
dt[, tzTime := as.POSIXct(x=time, date=date, tz="America/Chicago")]

dt[, goodTime := as.POSIXct(paste(date, time))]  # tz implicit from R session

print(dt)

## BUT
print(dt[4, pTime])                        # how can I have this be local time?
print(dt[4, tzTime])                       # how can I have this be correct?
##
print(dt[4, goodTime])                     # this is correct but created 'the wrong way'

When I run this, I get the following output:

         date     time
1: 2016-03-02 09:30:00
2: 2016-03-09 09:30:00
3: 2016-03-16 09:30:00
4: 2016-03-23 09:30:00
         date     time               pTime              tzTime            goodTime
1: 2016-03-02 09:30:00 2016-03-02 09:30:00 2016-03-02 09:30:00 2016-03-02 09:30:00
2: 2016-03-09 09:30:00 2016-03-09 09:30:00 2016-03-09 09:30:00 2016-03-09 09:30:00
3: 2016-03-16 09:30:00 2016-03-16 09:30:00 2016-03-16 10:30:00 2016-03-16 09:30:00
4: 2016-03-23 09:30:00 2016-03-23 09:30:00 2016-03-23 10:30:00 2016-03-23 09:30:00
[1] "2016-03-23 09:30:00 UTC"
[1] "2016-03-23 10:30:00 CDT"
[1] "2016-03-23 09:30:00 CDT"

The ptime column looks right, but when I compare to an xts object I am off by hours because ... silently UTC persisted. Not good.

The tzTime column is correct, but only half the time. After DST we are off an hour. Also not good.

The third option works but that is AFAIK not the you document it, and possibly the least performant as we need to reparse from string.

Am I being silly or is this something that needs a nudge and correction?

@arunsrinivasan
Copy link
Member

I'm able to reproduce this only on Linux (tested only on Ubuntu 14.04). On my Mac, and Windows 10 (run virtually), I get

print(dt[4, tzTime])
# [1] "2016-03-23 09:30:00 CDT"

The problem (in Linux) seems to come from this line in data.table:::as.POSIXct.ITime

as.POSIXct(as.POSIXlt(date), tz = tz) + x

Specifically,

as.POSIXct(as.POSIXlt(dt$date), tz = "America/Chicago")
# On Ubuntu
# [1] "2016-03-02 00:00:00 CST" "2016-03-09 00:00:00 CST"
# [3] "2016-03-16 01:00:00 CDT" "2016-03-23 01:00:00 CDT"

# On Mac
# [1] "2016-03-02 CST" "2016-03-09 CST" "2016-03-16 CDT" "2016-03-23 CDT"

This is likely the source of the issue. Not sure how to fix it though. Will have to understand why this difference now.

Let me know if I'm completely off point here. Haven't worked much with dates/times.

@eddelbuettel
Copy link
Contributor Author

Paging @joshuaulrich ...

It looks like we mess up a DST correction for no good reason. I can't recall when I last ran into that. Hm.

@eddelbuettel
Copy link
Contributor Author

That last example also happens in plain base R without IDate:

as.POSIXct(as.POSIXlt(as.Date("2016-03-02") + (0:3)*7), tz = "America/Chicago")

I think @joshuaulrich once concluded to never use dates as Date but always stick with POSIXct...

@MichaelChirico
Copy link
Member

Well, especially if time zones are important for the task at hand

@joshuaulrich
Copy link

@eddelbuettel It was the other way around. If you only have daily data, you should avoid POSIXct and just use Date. I've added checks to xts to ensure that Date index values always have a UTC timezone to avoid a lot of potential pain.

This looks like it may be a much deeper issue. src/main/datetime.c says that Windows and OS X use substitutes for many system date/time functions, while Linux would use the system functions and 64-bit time_t.

@eddelbuettel
Copy link
Contributor Author

My second example does indeed suggest that something may be rotten in the state of Denmark. Somehow I feel I encountered this before but I cannot think of a fix.

Then again I pretty religiously stuck with POSIXct for many years. So maybe

dt[, goodTime := as.POSIXct(paste(date, time))]

is as good as it gets here. Hmpf.

@eddelbuettel
Copy link
Contributor Author

I think I might have it. Looking more closely at help(POSIXlt) we see

‘isdst’ Daylight Saving Time flag.  Positive if in force, zero if
      not, negative if unknown.

and

Where possible the platform limits are detected, and outside the
limits we use our own C code. This uses the offset from GMT in
use either for 1902 (when there was no DST) or that predicted for
one of 2030 to 2037 (chosen so that the likely DST transition days
are Sundays), and uses the alternate (daylight-saving) time zone
only if ‘isdst’ is positive or (if ‘-1’) if DST was predicted to
be in operation in the 2030s on that day.

I am still not quite sure I understand exactly what they try to say here but it seems that we want flag value different from zero for isdst.

And indeed:

R> plt <- as.POSIXlt(as.Date("2016-03-02") + (0:3)*7, TZ = "America/Chicago")
R> as.POSIXct(plt, tz = "America/Chicago")       ## as before
[1] "2016-03-02 00:00:00 CST" "2016-03-09 00:00:00 CST" "2016-03-16 01:00:00 CDT" "2016-03-23 01:00:00 CDT"
R> 
R> plt[3]$isdst <- 1                                              ## now for one of the two 'wrong ones' set the flag 
R> as.POSIXct(plt, tz = "America/Chicago")
[1] "2016-03-02 00:00:00 CST" "2016-03-09 00:00:00 CST" "2016-03-16 00:00:00 CDT" "2016-03-23 01:00:00 CDT"
R> 

Now we get midnight local rather than the DST change.

@eddelbuettel
Copy link
Contributor Author

So when I set this, things work out:

as.POSIXct.ITime <- function (x, tz = "UTC", date = as.Date(Sys.time()), ...) {
    if (missing(date) && any(class(tz) %in% c("Date", "IDate", "POSIXt", "dates"))) {
        date <- tz
        tz <- "UTC"
    }
    d <- as.POSIXlt(date)
    d$isdst <- 1                        # maybe on Linux only?
    as.POSIXct(d, tz = tz) + x
}

New code:

Sys.setenv("TZ"="America/Chicago")
library(data.table)

dt <- data.table(date=as.IDate("2016-03-02") + (0:3)*7,   # four Wednesdays
                 time=as.ITime("09:30:00"))
dt[, pTime  := as.POSIXct(x=time, date=date)]
dt[, tzTime := as.POSIXct(x=time, date=date, tz="America/Chicago")]

dt[, goodTime := as.POSIXct(paste(date, time))]  # tz implicit from R session

as.POSIXct.ITime <- function (x, tz = "UTC", date = as.Date(Sys.time()), ...) {
    if (missing(date) && any(class(tz) %in% c("Date", "IDate", "POSIXt", "dates"))) {
        date <- tz
        tz <- "UTC"
    }
    d <- as.POSIXlt(date)
    d$isdst <- 1                        # maybe on Linux only?
    as.POSIXct(d, tz = tz) + x
}

## data.table conversion, now with dst correction
dt[, newGoodTime := as.POSIXct(x=time, date=date, tz="America/Chicago")] 

print(dt)
## BUT
print(dt[4, pTime])                     # how can I have this be local time?
print(dt[4, tzTime])                    # how can I have this be correct?
##
print(dt[4, goodTime]) 
print(dt[4, newGoodTime])               # better!

which produces

edd@max:~/src/progs/R(master)$ Rscript datatableBug.R
         date     time               pTime              tzTime            goodTime         newGoodTime
1: 2016-03-02 09:30:00 2016-03-02 09:30:00 2016-03-02 09:30:00 2016-03-02 09:30:00 2016-03-02 09:30:00
2: 2016-03-09 09:30:00 2016-03-09 09:30:00 2016-03-09 09:30:00 2016-03-09 09:30:00 2016-03-09 09:30:00
3: 2016-03-16 09:30:00 2016-03-16 09:30:00 2016-03-16 10:30:00 2016-03-16 09:30:00 2016-03-16 09:30:00
4: 2016-03-23 09:30:00 2016-03-23 09:30:00 2016-03-23 10:30:00 2016-03-23 09:30:00 2016-03-23 09:30:00
[1] "2016-03-23 09:30:00 UTC"
[1] "2016-03-23 10:30:00 CDT"
[1] "2016-03-23 09:30:00 CDT"
[1] "2016-03-23 09:30:00 CDT"
edd@max:~/src/progs/R(master)$ 

@arunsrinivasan
Copy link
Member

@eddelbuettel great! Could try this patch for Linux. Thanks! I don't follow the explanation in ?POSIXlt either.

@eddelbuettel
Copy link
Contributor Author

I settled for this now:

toPOSIXct <- function(date, time, tz=Sys.getenv("TZ", unset="America/Chicago")) {
    lt <- as.POSIXlt(date)
    lt$isdst <- -1                        # maybe on Linux only?
    as.POSIXct(lt, tz=tz) + time
}

Can you play on Windows and OS X (where I have access to neither) to see if setting the isdst flag hurts? I would also really like to petition you guys to drop enforced use of UTC and use a user-given one if given. Could be, say tz=getOption("data.table.TZ", default="UTC").

@arunsrinivasan
Copy link
Member

Already tried on OS X. isdst was 0 by default. But changing it to 1 had no effect. Will test on -1 as well. Thanks.
On dropping enforced use of UTC, I agree, and your suggested default makes sense. There's a lot of improvements to be done in ITime/IDate, tentatively planned for next release.

@joshuaulrich
Copy link

I wouldn't change the underlying elements of a POSIXlt object. There's nothing to prevent you from doing odd things. In this case, you're setting an "is daylight saving" flag on an object with a UTC timezone (which has no daylight saving). Plus you would be setting isdst = 1 for all date/times in all timezones. Seems very fragile to rely on how an (arguably) invalid datetime is converted.

@eddelbuettel
Copy link
Contributor Author

@arunsrinivasan Yes yes yes -- I mentioned to Matt that maybe we want complement IDate and ITime with (tentatively named) INano to have int-valued nanoseconds (which eg Boost Date_Time can deal if a compile-time flag is set). Then we could index on three columns, and reconstitute POSIXct --- lossy though, as it only resolves to about 1.1 microsecs.

@joshuaulrich Did you catch my post earlier in this thread demonstrating a bug in conversion that happens with base R at least on Linux. It is about avoiding that base R bug...

@joshuaulrich
Copy link

@eddelbuettel I was the one who said it was a deeper issue, either in base R or something in the system functions. My comment was that it seems fragile to avoid that bug by creating and converting an invalid datetime object. The safest thing to do is convert the date and time to string, concatenate, then call strptime, but there's probably a more efficient solution.

@eddelbuettel
Copy link
Contributor Author

@joshuaulrich Gosh no. Avoid the conversion to/from text at all times.

At the time of construction we have valid IDate and ITime objects. IDate transforms cleanly into Date. What we are learning here is that the asPOSIXlt.Date function may have a wart. I need to mull this over a little (and check with @arunsrinivasan about other platforms) but right now it seems like its a plain R bug. So we'll report it.

@eddelbuettel
Copy link
Contributor Author

I guess the other part is whether enforcing UTC a) matters and b) is a good idea. Because you could argue that converting UTC to local should be different at during summer/winter time.

So there may be a data.table issue here in insisting in UTC.

In short, I am really not sure what the specification would be. But the conversion from having the (IDate, ITime) pair that I could convert cleanly to a POSIXct in order to map into xts is clearly busted on Linux, and apparently only there.

@joshuaulrich
Copy link

as.POSIXlt.Date is fine. It takes a Date and returns a POSIXlt object with a UTC timezone, since dates don't have timezones, and that's why isdst is always zero. This is true on Linux, Windows, and I assume OS X.

The issue is the difference in how that object is coerced to POSIXct with a different timezone on different OSs. And the two OSs that do not use the system functions do what you want, and the OS that uses the system functions gives the undesired result.

@eddelbuettel
Copy link
Contributor Author

I am with on you that for the first part. As for the second part shouldn't as.POSIXlt.Date then set the flag to achieve the desired result? Or allow us to set the flag?

@eddelbuettel
Copy link
Contributor Author

Also from strptime:

‘strptime’ turns character representations into an object of class
 ‘"POSIXlt"’.  The time zone is used to set the ‘isdst’ component
 and to set the ‘"tzone"’ attribute if ‘tz != ""’. If the specified
 time is invalid (for example ‘"2010-02-30 08:00"’) all the
 components of the result are ‘NA’.  (NB: this does means exactly
 what it says - if it is an invalid time, not just a time that does
 not exist in some time zone.)

Maybe having UTC is different from having no TZ set...

@eddelbuettel
Copy link
Contributor Author

@arunsrinivasan What is your thinking whether this is an R bug for not setting isdst on Linux?

Shall we prepare something for R-devel highlighting inconsistent behaviour across OSs which cannot really be the desired goal?

@arunsrinivasan
Copy link
Member

@eddelbuettel yes (I believe it's a bug) and yes.

Here's the summary:

Sys.setenv("TZ"="America/Chicago")
dates = as.Date("2016-03-02") + (0:3)*7 # four Wednesdays

# on OS X and windows 10 -- expected result
as.POSIXct(as.POSIXlt(dates), tz = "America/Chicago")
# [1] "2016-03-02 CST" "2016-03-09 CST" "2016-03-16 CDT" "2016-03-23 CDT"

# On Linux (tested on Ubuntu 14.04) -- not as expected
as.POSIXct(as.POSIXlt(dates), tz = "America/Chicago")
# [1] "2016-03-02 00:00:00 CST" "2016-03-09 00:00:00 CST" "2016-03-16 01:00:00 CDT" "2016-03-23 01:00:00 CDT"

# isdst attribute is identical on OS X / Windows / Ubuntu
lt = as.POSIXlt(dates)
lt$isdst
# [1] 0 0 0 0

# However, setting isdst to -1 on Ubuntu returns expected results
lt$isdst = -1
as.POSIXct(lt, tz = "America/Chicago")
# [1] "2016-03-02 CST" "2016-03-09 CST" "2016-03-16 CDT" "2016-03-23 CDT"

# setting isdst to -1 on OS X / Windows has no effect, i.e., the result is always as expected.

@eddelbuettel
Copy link
Contributor Author

@arunsrinivasan 100% on board. Do you want to bring this to r-devel?

@arunsrinivasan
Copy link
Member

@eddelbuettel great. Yes, I'll post this summary on r-devel.

@arunsrinivasan
Copy link
Member

@crossxwill
Copy link

crossxwill commented Jul 30, 2019

I can confirm that this is indeed a bug in base R.

What day and time is it? Depends on the operating system!

init_date <- as.POSIXct("2017-11-01", tz="America/New_York")
init_date

init_date_lt <- as.POSIXlt(init_date)
init_date_lt$mon <- init_date_lt$mon + 1  ## add 1 month
init_date_lt

as.POSIXct(init_date_lt)  ## the answer depends on the OS

## [1] "2017-11-30 23:00:00 EST"   RHEL 7; R 3.3.3
## [1] "2017-12-01 EST"            WINDOWS 7; R 3.6.1

@MichaelChirico
Copy link
Member

thanks @crossxwill, could you please add this to the bugzilla report /linked above?

@crossxwill
Copy link

crossxwill commented Jul 30, 2019

I don't have a Bugzilla account. R-Core closed it down for n00bs like me.

@eddelbuettel
Copy link
Contributor Author

eddelbuettel commented Jul 30, 2019

You get one by asking on the mailing list. The step requires a human as the previous setup was outsmarted by spam bots -- see https://www.r-project.org/bugs.html

@jangorecki
Copy link
Member

jangorecki commented Apr 4, 2020

Assuming that fixing this bug in base R will not require any extra adjustment in data.table I believe we are safe to close this issue?
Bugzilla link above is best to track the status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants