Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a dateFormat argument to fread #5006

Open
Julien-Lopez opened this issue May 17, 2021 · 6 comments
Open

Add a dateFormat argument to fread #5006

Julien-Lopez opened this issue May 17, 2021 · 6 comments
Labels

Comments

@Julien-Lopez
Copy link

Julien-Lopez commented May 17, 2021

As described in this Stack Overflow thread, it would be nice to be able to specify the format of dates types that are automatically detected by fread.

data.table::fread("date\n10-10-2020", dateFormat="%d-%m-%y")

Since both as.Date and as.POSIXct seem to support formatting, I assume it wouldn't be difficult to implement with one format. Maybe this could be extended to a vector of formats? Also, the empty vector could be used to disable the detection entirely thus falling back to the old behavior if needed by the user.

@MichaelChirico
Copy link
Member

MichaelChirico commented May 23, 2021

caveat: I think there is a typo in your example, the example doesn't look like it's in %d-%m-%y format... I'm assuming anyway that it's right in what I write below

fread() does not and I think will never support native detection of/customization to non-default date formats during import in C. It's just too much of a maintenance burden IMO, with no clearly defined scope (why cater to this format, but not this one? scope creep abounds). In cases where it really is a performance burden (e.g. 10M / 100M + rows), please try and correct the data upstream to be produced in a standard format.

So that leaves supporting a way to provide formats in the fread signature that would be applied at the R level. The first thing that comes to mind would be to extend colClasses, e.g.

data.table::fread("date\n2020-10-10", colClasses = list(date = function(x) as.POSIXct(x, "%d-%m-%y")))

Or perhaps a new argument with dateFormat:

data.table::fread("date\n2020-10-10", colClasses = list(Date = "date"), dateFormat = "%d-%m-%y")

and as.Date() is called as as.Date(x, dateFormat) internally.

This is not bad, but to we also need a timeFormat? or is the argument like formatting_args = c(Date = "%d-%m-%y", POSIXct = "%d-%m-%y %T")?

Ultimately, given that there's no performance benefit, I'm a bit loath to implement this myself. The following looks OK to me:

DT = data.table::fread("date\n2020-10-10")
DT[ , date := as.Date(as.character(date), "%d-%m-%y")]

or

DT = data.table::fread("date\n2020-10-10", colClasses="character")
DT[ , date := as.Date(date, "%d-%m-%y")]

And lapply() versions for handling many columns are ~ the same.

@MichaelChirico
Copy link
Member

MichaelChirico commented May 26, 2021

I wonder if it's possible to just have ... as an fread argument, and send ... to whatever as method is being used by colClasses... that could probably buy us a decent amount of flexibility without being too specific.

@jangorecki
Copy link
Member

Related issue #1656

@Julien-Lopez
Copy link
Author

Apologies for the typo in the example. I fixed it.

Thank you for the answers! Looking at the source code of fread, I was hoping that indeed the transfer of the format argument to as.Date and as.POSIXct would be an easy enough change.

If not, do you think the option datatable.old.fread.datetime.character could be made more permanent at least? Converting a column in a format to date in post-processing is simple enough, but it would be nice not to have to revert auto-detected dates columns back to character.

@MichaelChirico
Copy link
Member

have you tried colClasses='character' for those columns? should accomplish the same

@MPagel
Copy link

MPagel commented Jul 15, 2022

disclaimer:

  • Not as fast as using IDate or C-compiled code for "ISO8601" or other already-recognized datetime format
  • Uses lubridate package, so if looking for a pure data.table solution, you would have to modify the appropriate format references and function call(s) to use POSIX conventions.

potential usage:

# set options for fread
# Alternate date formats HAVE to be set prior to the fread call.
# no dateFormat parameter is possible in the fread call itself.
options(datatable.fread.dateFormat=c("dmY T"))

# read in file using `MorphDate` class (defined below)
fread("myFile.csv",
    na.strings=c("NA","N/A","Not Recorded","NR","NULL"),
    colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp","date_with_timestamp")))

Code

library('lubridate')
library('data.table')

### Set up options for data.table, including an "invented" one for the new class we will define ###
options(datatable.fread.dateFormat=c("dmY T"),
        datatable.na.strings=c("N/A","NA","Not Recorded"))

### Create a custom Class with `as` creation function ###
setClass("MorphDate", prototype=NA_integer_)
as.MorphDate.character <- function(from) {
  # from[which(from=="Not Recorded")] <- NA # handle with na.strings option of fread
  DateObj <- lubridate::parse_date_time(from, orders=getOption("datatable.fread.dateFormat","Ymd T"), truncated=1)
  doi <- as.integer(DateObj)
  if (any((doi %% 86400L) > 0L, na.rm = TRUE)) {
    attr(DateObj,"class") <- c("MorphDate","POSIXct","POSIXt")     # could use IDateTime class, but that class uses a two column data.table of IDate and ITime and would require an explicit call to a conversion function.
  } else if (all(doi < 86400L,  na.rm = TRUE)) {
    DateObj <- as.ITime(doi)                                       # I don't know if the call to ITime is strictly required.  setting the class may be enough.
    attr(DateObj,"class") <- c("MorphDate","ITime")
  } else {
    DateObj <- as.Date(doi %/% 86400L, origin='1970-01-01')  # or as.IDate if that is available.
    attr(DateObj,"class") <- c("MorphDate","IDate","Date")
  }
  DateObj
}

### Define conversion functions to/from character string
MorphDate <- as.MorphDate.character
as.character.MorphDate <- function(from) format(from)
setAs("character","MorphDate",as.MorphDate.character)
setAs("MorphDate","character",as.character.MorphDate)
print.MorphDate <- function(x, ...) print(format(x, ...))
setOldClass("MorphDate",S4Class="MorphDate")
### END Class Def ###

### read in file via fread, using new "MorphDate" class
fread("myFile.csv",
    colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp_column","date_with_timestamp_column")))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants