Using dplyr functions on grouped data frame with variable of class difftime generates error #390

Closed
Henrik-P opened this issue Apr 15, 2014 · 8 comments

@Henrik-P

Using dplyr data manipulation functions on a grouped data frame which contains a variable of class difftime generates the error:

Error in eval(expr, envir, enclos) : 
  column 'the-name-of-the-difftime column' has unsupported type

I illustrate this using some toy data with a grouping variable (grp), a column with some values (val), two date columns (date1, date2), and a variable of class difftime (the difference between date1 and date2):

df <- data.frame(
  grp =   c(1, 1,  2, 2),
  val =   c(1, 3,  4, 6),
  date1 = c(rep(Sys.Date() - 10, 2), rep(Sys.Date() - 20, 2)),
  date2 = Sys.Date() + 1:2)

df$diffdate <- difftime(df$date2, df$date1, units = "days")
df

I tried to add the mean of val within each group to the original data set. The desired output can be created using ddply:

library(plyr)
df_dd <- ddply(.data = df, .variables = .(grp), mutate,
               mean_val = mean(val))

df_dd
#   grp val      date1      date2 diffdate mean_val
#1   1   1 2014-04-04 2014-04-15  11 days        2
#2   1   3 2014-04-04 2014-04-16  12 days        2
#3   2   4 2014-03-25 2014-04-15  21 days        5
#4   2   6 2014-03-25 2014-04-16  22 days        5

str(df_dd)
# ...
# $ diffdate:Class 'difftime'

When I try to create the same output with dplyr, an error is generated:

detach("package:plyr", unload = TRUE)
library(dplyr)

df %.%
  group_by(grp) %.%
  mutate(
    mean_val = mean(val)
  )
# Error in eval(expr, envir, enclos) : 
#   column 'diffdate' has unsupported type

Just to check, the same error is generated when the difftime variable is itself subject to the calculation, e.g.

df %.%
  group_by(grp) %.%
  mutate(
    mean_diff = mean(diffdate)
    )

...or when using (toy examples of) summarise, filter, select or arrange:

df %.%
  group_by(grp) %.%
  summarise(
    mean_val = mean(val)
  )

df %.%
  group_by(grp) %.%
  filter(
    sum(val) > 5
  )

df %.%
  group_by(grp) %.%
  select(-val)

df %.%
  group_by(grp) %.%
  arrange(-val)

The difftime variable does not cause any problem when mutate is used on an ungrouped data frame:

df2 <- mutate(df, diffdate = difftime(date2, date1, units = "days"))
df2
str(df2)

mutate(df2, mean_val = mean(val), mean_diff = mean(diffdate))

...or on an ungrouped 'tbl_df':

tbl <- tbl_df(df)
mutate(tbl, mean_val = mean(val), mean_diff = mean(diffdate))

Nor does the difftime variable cause any problem when various dplyr data manipulation functions are applied to a grouped data.table version of df:

library(data.table)
dt <- data.table(df)
dt2 <- dt %.%
  group_by(grp) %.%
  mutate(
    mean_val = mean(val)
  )
dt2
# Source: local data table [4 x 6]
# Groups: grp

#   grp val      date1      date2 diffdate mean_val
#1   1   1 2014-04-05 2014-04-16  11 days        2
#2   1   3 2014-04-05 2014-04-17  12 days        2
#3   2   4 2014-03-26 2014-04-16  21 days        5
#4   2   6 2014-03-26 2014-04-17  22 days        5

str(dt2)

dt %.%
  group_by(grp) %.%
  summarise(
    mean_val = mean(val)
  )

dt %.%
  group_by(grp) %.%
  filter(
    sum(val) > 5
  )

dt %.%
  group_by(grp) %.%
  select(-val)

dt %.%
  group_by(grp) %.%
  arrange(-val)

My current quick and dirty workaround is to convert the difftime variable to numeric:

df$diffdate <- as.numeric(difftime(df$date2, df$date1, units = "days"))
df %.%
  group_by(grp) %.%
  mutate(
    mean_val = mean(val)
  )

However, there are quite a few methods for the difftime class (see Details in ?difftime). Thus, it would be nice if dplyr could handle grouped data frames containing a variable of class difftime.
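
One cost of the numeric workaround is that the units attribute is dropped along with the class. A minimal base-R sketch of a round trip that remembers the units and restores the difftime class afterwards is given here. The variable name diff_units is made up for illustration, and fixed dates replace Sys.Date() so that the values are reproducible; the grouped dplyr step itself is left as a placeholder comment.

```r
# Same toy data as above, but with fixed dates (an assumption made here so
# that the result does not depend on the day the code is run).
df <- data.frame(
  grp   = c(1, 1, 2, 2),
  val   = c(1, 3, 4, 6),
  date1 = as.Date(c("2014-04-04", "2014-04-04", "2014-03-25", "2014-03-25")),
  date2 = as.Date(c("2014-04-15", "2014-04-16", "2014-04-15", "2014-04-16")))
df$diffdate <- difftime(df$date2, df$date1, units = "days")

# Remember the units, then strip the class before any grouped operation.
diff_units <- units(df$diffdate)
df$diffdate <- as.numeric(df$diffdate, units = diff_units)

# ... grouped dplyr operations would go here ...

# Restore the difftime class (and its units) afterwards.
df$diffdate <- as.difftime(df$diffdate, units = diff_units)
str(df$diffdate)
```

This keeps the difftime methods available after the grouped step, at the price of two extra conversions.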

A search on SO and Google for 'dplyr difftime "Error in eval(expr, envir, enclos)" : column has unsupported type' gave no hits.

Thanks a lot for your great work with a fantastic package.

Best regards,

Henrik

R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
data.table_1.9.2, dplyr_0.1.3, plyr_1.8.1
@romainfrancois
Member

What we do is check for attributes:

> str( df$diffdate )
Class 'difftime'  atomic [1:4] 11 12 21 22
  ..- attr(*, "units")= chr "days"

We don't know how to handle the units attribute. To support this, we would have to special case what to do with the units attribute, the same way we treat the time zone attribute for POSIXct.
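
To make the comparison concrete, the following base-R sketch shows the attributes carried by a difftime vector next to those of a POSIXct vector. It only inspects the objects from the R side and is not dplyr's internal (C++) check; the variable names d and p are made up for illustration.

```r
# Base R only: inspect the attributes, not dplyr's internal check.
d <- as.difftime(c(11, 12, 21, 22), units = "days")
p <- as.POSIXct("2014-04-15 12:00:00", tz = "UTC")

# A difftime is a double vector with a class and a "units" attribute.
typeof(d)
names(attributes(d))
attr(d, "units")

# A POSIXct is also a double vector, with a class and a "tzone" attribute.
typeof(p)
names(attributes(p))
attr(p, "tzone")
```

Both types are plain doubles underneath, so the difference lies only in which extra attribute has to be carried along when rows are split and recombined.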

@barryrowlingson

This just bit me too. Where's the relevant code for the current special case? Surely that could be made more generic based on the class of the column...

Oh, it is "all over the place" and in C++.

The really annoying thing is that dplyr complains even if you don't use that column in your chain.

My workaround is to use transform, e.g.:

transform(mdmh, ahead=as.numeric(ahead)) %.% group_by(etc) %.% etc
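
Since the error is raised even when the difftime column is not used in the chain, the same idea can be applied to every difftime column at once. The sketch below uses a hypothetical helper, difftime_to_numeric(), which is not part of dplyr, and a made-up stand-in for the mdmh data frame (its real structure is not shown in this thread).

```r
# Hypothetical helper (not part of dplyr): convert every difftime column
# of a data frame to plain numeric, leaving all other columns untouched.
difftime_to_numeric <- function(data) {
  is_dt <- vapply(data, inherits, logical(1), what = "difftime")
  data[is_dt] <- lapply(data[is_dt], as.numeric)
  data
}

# Made-up stand-in for mdmh, with one difftime column called ahead.
mdmh <- data.frame(
  id    = 1:3,
  ahead = as.difftime(c(30, 45, 60), units = "mins"))

clean <- difftime_to_numeric(mdmh)
str(clean)
```

The result can then be passed on to group_by() as in the transform call above. Note that the units are lost in the conversion.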

@hadley
Member

hadley commented Jul 28, 2014

@romainfrancois can we add support for difftime objects too please?

@hadley hadley added the bug label Jul 28, 2014
@hadley hadley added this to the 0.3 milestone Jul 28, 2014
@romainfrancois
Member

I've committed a series of fixes and some tests for handling of "difftime". Please @Henrik-P @barryrowlingson test again :)

@Robinlovelace
Contributor

@barryrowlingson @romainfrancois FYI I'm still getting this error when trying to group the UK's National Travel Survey (NTS) by individual id.

What's strange is that when I tried to provide a reproducible example, the error disappeared.

If my df is ntstrips, per_person <- group_by(ntstrips, house, i1), it failed on 1st try:

> per_person <- group_by(ntstrips, house, i1)
Error: column 'dist' has unsupported type

Then do this: ntstrips <- ntstrips[1:nrow(ntstrips),] and it's fixed! Very strange behaviour.

@Henrik-P
Author

@romainfrancois, thanks a lot for your work. All the examples in my original post now run smoothly using dplyr_0.3.

@Robinlovelace
Contributor

Apologies - just updated to 0.3 and seems to fix it.
Robin

@FabianRoger

Hej,

I came across a possibly confusing behaviour with dplyr and difftime. Using the example given above:

df <- data.frame(
  grp =   c(1, 1,  2, 2),
  val =   c(1, 3,  4, 6),
  date1 = c(rep(Sys.Date() - 10, 2), rep(Sys.Date() - 20, 2)),
  date2 = Sys.Date() + 1:2)

df$diffdate <- difftime(df$date2, df$date1, units = "days")
df

When I (wrongly!) try to use filter(df, grp, val, diffdate) instead of the (correct) select(df, grp, val, diffdate), I get the following error:

Error: '&' not defined for "difftime" objects

If I do not include the diffdate column, the data frame is returned unchanged. I realise that this is not a big issue. It just confused me and made me think that the problem was difftime, and not me not knowing how to use dplyr.
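
The same message can be reproduced without dplyr, which suggests where it comes from: filter() combines its conditions with &, so each bare column is treated as a logical condition. A minimal base-R sketch, with the vectors d and val made up to mirror the toy data:

```r
d   <- as.difftime(c(11, 12, 21, 22), units = "days")
val <- c(1, 3, 4, 6)

# Non-zero numbers are coerced to TRUE, so this "condition" keeps every row.
val & TRUE

# The difftime class defines no logical operators, so this call fails;
# tryCatch() captures the error message instead of stopping.
msg <- tryCatch(d & TRUE, error = function(e) conditionMessage(e))
msg
```

This is consistent with the data frame coming back unchanged when only grp and val are passed: all their values are non-zero, so every row passes the filter.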

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018