Add support for native parsing of iso8601 dates/timestamps in fread #4464
Conversation
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #4464    +/-   ##
========================================
  Coverage   99.61%   99.61%
========================================
  Files          73       73
  Lines       14119    14228    +109
========================================
+ Hits        14064    14173    +109
  Misses         55       55
========================================
```

Continue to review full report at Codecov.
This is very related to #1656 -- the inverse of it?
You're right, this doesn't close #1656 per se; it's just very related. #1656 wants to … This PR offers a different way of producing … Would need to think a bit more about how to implement both at the same time...
The ISO part refers to the string formatting: https://en.wikipedia.org/wiki/ISO_8601
IINM the relation to … Same in intent to Python's …
inst/tests/tests.Rraw
Outdated

```diff
@@ -10771,9 +10773,9 @@ test(1743.241, fread("a,b,c\n2,2,f", colClasses = list(character="c", integer="b
 test(1743.242, fread("a,b,c\n2,2,f", colClasses = c("integer", "integer", "factor"), drop="a"), data.table(b=2L, c=factor("f")))

 ## POSIXct
-test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character")), data.table(a=as.POSIXct("2015-06-01 11:00:00"),b=1L,c="ae"))
+test(1743.25, fread("a,b,c\n2015-06-01 11:00:00,1,ae", colClasses=c("POSIXct","integer","character")), data.table(a=as.POSIXct("2015-06-01 11:00:00", tz='UTC'),b=1L,c="ae"))
```
Three more tests broke because I forced tz='UTC' in the implementation. This is making me reconsider that -- should we default to not setting 'tzone'?
Isn't this going to be a breaking change? Maybe we could relax that and keep the old behavior for the coming release, then change it in the next release?
Yes, I agree. I'm not even 100% sure we should change the behavior -- if so, we need to be very explicit that it's a break with base.
Not really sure what to do for testing the …
Here's a benchmark:

…

So in summary we blow … For …
Just realized …

So "out of the box" (4 cores on my machine) this branch is about 10x faster. Single-threaded it is still 3-5x faster.
@statquant suggests we read timestamp columns as … This could in principle be coupled with the …
Any POSIXct fields in a csv should not be automatically turned into nanotime, but if a time field has higher precision than POSIXct offers, then it makes sense to automatically turn it into nanotime -- the same way we do for numeric/integer64. And using an existing argument to control that behavior looks fine to me.
It's the opposite, right? For other fields, we try int -> int64 -> float; … i.e., we can represent nanoseconds exactly in that range -- if we encounter timestamps outside that range, then we bump to double.
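To make the "exactly representable range" concrete: INT64_MAX nanoseconds is about 292 years either side of the 1970 epoch (roughly the year 2262). A minimal sketch in C (the language of fread's parsers; the names here are mine, not data.table's):

```c
#include <stdint.h>
#include <stdbool.h>

/* Nanoseconds since 1970-01-01 in a signed 64-bit integer reach about
 * +/- 292 years (INT64_MAX / 1e9 ~= 9.2e9 seconds).  Outside that we
 * would bump to double, mirroring the int -> int64 -> float bumping
 * described above.  Sketch only; not data.table code. */
static const int64_t NS_PER_SEC = 1000000000LL;

static bool fits_in_ns64(int64_t epoch_sec)
{
    return epoch_sec >= INT64_MIN / NS_PER_SEC &&
           epoch_sec <= INT64_MAX / NS_PER_SEC;
}
```

For timestamps inside that window, int64 nanoseconds are exact; beyond it, double is the only fallback and exactness is lost.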
What I was suggesting was that columns with precision up to a millisecond would be converted to POSIXct, and sub-millisecond to nanotime. Maybe an option could specify the behavior, {"auto", "POSIX", "nanotime"}, where the last two would just convert to POSIXct or nanotime respectively.
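One way the {"auto", "POSIX", "nanotime"} idea could be realised is by inspecting the fractional-seconds digits during type bumping. A hypothetical sketch in C (`classify_precision` is my name, not part of this PR):

```c
#include <stddef.h>

typedef enum { PARSE_POSIXCT, PARSE_NANOTIME } ts_class;

/* Hypothetical "auto" rule: up to 3 digits after the decimal point
 * (millisecond precision) stays POSIXct; finer precision promotes the
 * column to nanotime, analogous to the integer -> integer64 bump. */
static ts_class classify_precision(const char *frac_digits)
{
    size_t n = 0;
    while (frac_digits[n] >= '0' && frac_digits[n] <= '9')
        n++;
    return n <= 3 ? PARSE_POSIXCT : PARSE_NANOTIME;
}
```

Under this rule `"2020-05-01T03:14:18.343Z"` would stay POSIXct, while a `.3430001` fraction would promote the column.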
Thanks, @eddelbuettel. Yes, that's the way I was leaning already, and why I had added to the top comment: "Or, extend the parser to interpret the datetime in local timezone, so that there is no break from previous versions of data.table, nor base R defaults. Sadly, I know that's hard. fasttime doesn't do local time for example, which may be why it was never promoted to R."
Regarding this that I put in the top comment: …

It seems to be test 2124, and it's down to a material difference between the TZ environment variable being unset vs empty, at least on my Linux box. Test 2124 does this:

```r
oldtz = Sys.getenv('TZ')
Sys.setenv(TZ='Asia/Jakarta') # UTC+7
...
Sys.setenv(TZ=oldtz)
```

and observe the following:

```r
$ R --vanilla
> Sys.getenv('TZ', unset=NA)
# [1] NA
> Sys.getenv('TZ')
# [1] ""
> as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 MDT"
> Sys.setenv(TZ="")
> as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 UTC"
> Sys.unsetenv("TZ")
> as.POSIXct("2010-05-15 01:02:03")
# [1] "2010-05-15 01:02:03 MDT"
```

So if TZ was unset before that test, TZ will be set to …
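The unset-vs-empty TZ distinction comes straight from the C library's tzset(), which R calls under the hood. A glibc-flavoured sketch: with glibc, an empty TZ means UTC, while an unset TZ falls back to /etc/localtime (the unset case is left out below since it is machine-dependent):

```c
#define _XOPEN_SOURCE 700   /* exposes the `timezone` global in <time.h> */
#include <stdlib.h>
#include <time.h>

/* UTC offset (seconds east of Greenwich, standard time) that tzset()
 * derives after forcing TZ to `tz`.  glibc maps TZ="" to UTC, which is
 * exactly the empty-vs-unset difference that test 2124 trips over. */
static long offset_for_tz(const char *tz)
{
    setenv("TZ", tz, 1);
    tzset();
    return -timezone;   /* `timezone` is seconds *west* of UTC */
}
```

With glibc, `offset_for_tz("")` gives 0 (UTC) even though `unsetenv("TZ"); tzset();` would give the system zone -- the same asymmetry the R session above shows.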
This gets to a broader point that I also got stuck at when looking into #1162 (…). As it relates to this issue, if we see … In any case, I don't think it's something that we should try to address for now.
I don't know either, but I have e.g. been wrapping Boost date_time for a decade or so (first in …).
More directly for the issue at hand, I am a bit of two minds about time zones. I recognize …

If the system time zone is used, co-authors in different locales can conceivably get different downstream results on their respective machines. Hopefully as innocuous as a shift, but I'm not sure how badly that could spiral out in a complicated analysis (what if the analysis period goes over daylight savings time in one author's locale, but not another's? What about a locale where there was a permanent jump change in the time zone definition?). Obviously it's ideal if everyone is fastidious about setting time zones when they're doing time series stuff, but certainly R-core agrees this type of thing is a big enough threat for it to be a major reason to change …

Lastly, for ISO8601 timestamps like …

That's my two cents on why I went with UTC as a default. I agree with the need to anticipate more potentially breaking code, especially as base R doesn't (currently?) accept …
That's a bummer. When I started this PR I thought it would be out of scope and wanted to stick with Zulu time (…). The next can of worms -- …
This almost never occurs. We'll never support a string like Asia/Bahrain in the data like that. UTC offsets are all we need, as you've implemented already. The real issue is a blank timezone, i.e. 2020-01-01 01:02:03. That is interpreted as local timezone by R by default and, as Dirk referred to too, that's what users expect and what we need to maintain. This is why I'm nearly done with wrapping this up so it can be merged for this release.
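Recognising just the offset suffixes (`Z`, `+07:00`, `+0630`, and their `-` variants -- the forms this PR targets) is a small, separable parsing job. A sketch of that step, not the PR's actual code:

```c
#include <stdbool.h>

/* Parse an ISO8601 timezone suffix -- "Z", "+hh:mm", "+hhmm" or "+hh"
 * (and '-' variants) -- into seconds east of UTC.  Returns false if the
 * string is not a recognised suffix.  Sketch only. */
static bool parse_tz_offset(const char *s, int *out_sec)
{
    if (s[0] == 'Z' && s[1] == '\0') { *out_sec = 0; return true; }
    int sign = (s[0] == '+') ? 1 : (s[0] == '-') ? -1 : 0;
    if (!sign) return false;
    s++;
    if (s[0] < '0' || s[0] > '9' || s[1] < '0' || s[1] > '9') return false;
    int hh = (s[0]-'0')*10 + (s[1]-'0');
    s += 2;
    if (*s == ':') s++;               /* allow "+hh:mm" as well as "+hhmm" */
    int mm = 0;
    if (s[0]) {
        if (s[0] < '0' || s[0] > '9' || s[1] < '0' || s[1] > '9' || s[2])
            return false;
        mm = (s[0]-'0')*10 + (s[1]-'0');
    }
    if (hh > 14 || mm > 59) return false;
    *out_sec = sign * (hh*3600 + mm*60);
    return true;
}
```

A named zone like Asia/Bahrain would simply fail this parser, consistent with the position above that only numeric offsets need supporting.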
I think I missed something, so just noting it here once I caught up. The original proposal was UTC-dominant: it assumed any timestamp-alike on input was UTC time, so e.g. … is read as a time written in UTC, and parsed as UTC. But maybe it is written in local time -- then how should we parse? And in general, if we don't know the timezone, it's hard to adjust ex post anyway. Because of daylight savings, timezone jumps, etc., the UTC offset vs. "locale time" can be discontinuous; instead the user has to rely on …

I'm not even sure a …

I do think it's common to find CSVs without timezone information in them directly (e.g. this benchmark of …).
…IXct afterwards to retain local time as before; all tests pass
Read the NYC data with fread( , tz="UTC") and set the TZ env variable to "UTC". Done. Work away in UTC, but it will feel like whatever timezone the data was written in.
In that case, shouldn't we plan to set …
No, because of backwards compatibility with past data.table versions (colClasses=POSIXct using as.POSIXct), and expectations of matching the base R default. The R local-time default is ok for people working in one time zone, saving their own data, and loading it up again themselves in their timezone. But everyone writing and maintaining production systems, or working across timezones, sets the TZ environment variable and works in UTC. What we could do is default tz= to 'UTC' when the timezone of the R session is UTC. In that case local time is UTC, so reading the unmarked datetimes as UTC is the same as reading them as local. As this PR stands now, as.POSIXct would still be used when TZ is UTC, so it'd be nice to redirect that to the direct parser.
Ok all tests and coverage pass. I'm done and ready to merge. Ok by you, @MichaelChirico? |
OK by me. BTW I would use …
What about …?
I see why …
Very common use case for me -- `fread` string timestamps in ISO8601 format like `2020-05-01T03:14:18.343Z`.

To do so, treat the parsing as a series of subparsing jobs -- `StrtoI32` for year, month, day, hour, minute, timezone hour offset, timezone minute offset; then `parse_double_regular` for seconds. Hence splitting off the "core" of those functions as `_core` functions, and leaving the originals as wrappers that feed the `FieldParseContext` correctly.

Somewhat related in intent to #1656 (smoothing/speeding up date/time I/O), but that issue would make round trips even faster (no conversion to character on output, and "simple" numeric parsing on input).

TODO:

- [ ] support timezone offsets like `+07:00` and `+0630`
- [ ] `class`'d object, not just underlying value
- [ ] `colClasses` API to support type overrides from user
- [ ] `fread("a,b\n2015-01-01,2015-01-01", colClasses="POSIXct")` reads as character because the direct POSIXct parser requires a time to be present (use verbose=TRUE to see the bumps from type A to B to C). Relax the direct POSIXct parser to allow date-only.
- [ ] For those using `colClasses="POSIXct"` or `as.POSIXct()` after the fread call, datetime values are now interpreted as UTC, not local timezone as `as.POSIXct` does by default. The news item gives focus to the tzone attribute of the resultant POSIXct column, but for those usages there will be a silent shift in interpretation; i.e. they might be pleased after upgrade that they had POSIXct before and they still have POSIXct (just much faster) with no code changes, but the datetimes will have shifted. For those using `fasttime` after their `fread` calls, there will be no shift. In short, we need to convey that the interpretation of datetime values will change for those using `colClasses="POSIXct"`. Perhaps a warning for calls to fread which have `colClasses=POSIXct`, or a way to create a warning so those calls can be controlled. Or, extend the parser to interpret the datetime in local timezone, so that there is no break from previous versions of data.table, nor base R defaults. Sadly, I know that's hard. `fasttime` doesn't do local time for example, which may be why it was never promoted to R. Update: now reads UTC-marked datetimes as written by fwrite by default, and unmarked datetimes still get read by as.POSIXct in local time.
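Once `StrtoI32` and `parse_double_regular` have split out the components, what remains is civil-date-to-epoch arithmetic. A self-contained sketch in C using Howard Hinnant's well-known days_from_civil algorithm (the standard technique for this step, not necessarily the PR's exact code):

```c
#include <stdint.h>

/* Days since 1970-01-01 for a proleptic Gregorian civil date
 * (days_from_civil, H. Hinnant).  Valid far beyond any CSV timestamp. */
static int64_t days_from_civil(int y, int m, int d)
{
    y -= m <= 2;
    int64_t era = (y >= 0 ? y : y - 399) / 400;
    int yoe = (int)(y - era * 400);                           /* [0, 399] */
    int doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* [0, 365] */
    int64_t doe = yoe * 365 + yoe/4 - yoe/100 + doy;          /* [0, 146096] */
    return era * 146097 + doe - 719468;
}

/* Epoch seconds for an already-split ISO8601 timestamp: the UTC offset
 * (seconds east of UTC, 0 for a trailing 'Z') is subtracted to
 * normalise to UTC, matching the UTC interpretation of marked
 * datetimes described above. */
static int64_t civil_to_epoch(int y, int mo, int d, int h, int mi, int s,
                              int tz_offset_sec)
{
    return days_from_civil(y, mo, d) * 86400LL
           + h*3600LL + mi*60LL + s - tz_offset_sec;
}
```

For example, `2020-05-01T03:14:18Z` splits into (2020, 5, 1, 3, 14, 18) with offset 0 and lands on epoch second 1588302858; the same wall-clock time marked `+07:00` lands 25200 seconds earlier.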