combining mean with reformatted Date class and by throwing an error #1876

Closed
rossholmberg opened this issue Oct 13, 2016 · 6 comments · Fixed by #3567
Labels
GForce issues relating to optimized grouping calculations (GForce)

@rossholmberg

rossholmberg commented Oct 13, 2016

I've come across an issue using a formatted date column with by. Take the following data:

library( data.table )
data <- data.table( session = c( 1,1,1,1,2,2,2,2,2,2,3,3,3,3 ),
                    date = as.Date( c( "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-03",
                                       "2016-04-30", "2016-04-30", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03",
                                       "2016-08-28", "2016-08-28", "2016-08-28", "2016-08-28" ) )
)

I want to mark each session with a label, formatted %b-%Y, based on the mean date for that session.

I can find the mean date of each session, using the by parameter:

output <- copy( data )[ , Month := mean( date ), by = session ]

I can also reformat a mean date the way I want within data.table:

output <- copy( data )[ , Month := format( mean( date ), "%b-%Y" ) ]

But I can't do both:

output <- copy( data )[ , Month := format( mean( date ), "%b-%Y" ), by = session ]

The above returns an error:
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
  invalid 'trim' argument
In addition: Warning message:
In mean(date) : argument is not numeric or logical: returning NA

Note that I can do what I need in two steps (below), and that works fine, but something does seem to be going wrong with the one-step version above:

output <- copy( data )[ , Month := mean( date ), by = session 
                        ][ , Month := format( Month, "%b-%Y" ) ]

It also works fine with mean.Date:

output <- copy( data )[ , Month := format( mean.Date( date ), "%b-%Y" ), by = session ]
@MichaelChirico
Member

Setting aside the statistical question of what an "average date" means (why not the modal or median date?):

I'm not at a machine, so I can only suggest trying to replicate the issue in one line of base R (using tapply, for example) to be sure it's actually a data.table issue.
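
A minimal sketch of such a base R check, assuming the same data as in the original post but held in a plain data.frame (the name df is only illustrative). If tapply produces the labels without error, the failure is more likely tied to data.table's grouping than to mean or format themselves.

```r
# Same data as the original post, as a plain data.frame (no data.table involved).
df <- data.frame(
    session = c( 1,1,1,1,2,2,2,2,2,2,3,3,3,3 ),
    date = as.Date( c( "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-03",
                       "2016-04-30", "2016-04-30", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03",
                       "2016-08-28", "2016-08-28", "2016-08-28", "2016-08-28" ) )
)

# One-line base R equivalent of the grouped call. Each group is passed to the
# function as a Date vector, so ordinary S3 dispatch applies to mean() and format().
labels <- tapply( df$date, df$session, function( d ) format( mean( d ), "%b-%Y" ) )
print( labels )
```

The month abbreviation produced by %b depends on the current locale.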


@franknarf1
Contributor

The OP's question on SO: http://stackoverflow.com/questions/40011525/data-table-not-accepting-by-and-format-for-date-at-the-same-time

It does seem like a bug. On the other hand, it also seems that what you are attempting is not a good idea. If two sessions both have a majority of their dates in the same month, they will end up with the same label, so you won't be able to use it to tell the sessions apart.
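
To illustrate the labelling concern, here is a small sketch with hypothetical data (not from the original post) in which two distinct sessions fall in the same month. Under that assumption, the month-year label is identical for both.

```r
# Hypothetical data: two distinct sessions whose dates fall in the same month.
df <- data.frame(
    session = c( 1, 1, 2, 2 ),
    date = as.Date( c( "2016-05-01", "2016-05-03", "2016-05-20", "2016-05-22" ) )
)

# Label each session by the month and year of its mean date.
labels <- sapply( split( df$date, df$session ), function( d ) format( mean( d ), "%b-%Y" ) )
print( labels )

# TRUE when at least two sessions share a label, i.e. the label is not a unique key.
print( anyDuplicated( labels ) > 0 )
```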

@rossholmberg
Author

rossholmberg commented Oct 14, 2016

Thanks Michael and Frank.

@MichaelChirico I've done a couple of tests, and other comparable methods seem to work fine. Starting from the same data as above, just with data.frame instead of data.table:

data <- data.frame( session = as.integer( c( 1,1,1,1,2,2,2,2,2,2,3,3,3,3 ) ),
                    date = as.Date( c( "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-03",
                                       "2016-04-30", "2016-04-30", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03",
                                       "2016-08-28", "2016-08-28", "2016-08-28", "2016-08-28" ) )
)

for( d in unique( data$session ) ) {
    data$Month1[ data$session == d ] <- format( mean( data$date[ data$session == d ] ), "%b-%Y" )
}

library( dplyr )
data <- data %>%
    group_by( session ) %>%
    mutate( Month2 = format( mean( date ), "%b-%Y" ) )

identical( data$Month1, data$Month2 )
[1] TRUE

Both of the above methods work fine. I know they're not technically "one line", but the calculation of mean date and format conversion are being done in one step in both cases.

@franknarf1 you're right, my method here could be different (the situation you mention should never happen here, but importantly this labelling system was decided upon by someone else), and that's why I was happy to accept the working solution offered on SO. That solution was simply to avoid using mean and use mean.Date instead. I just thought it was odd that the difference would cause an error, particularly only in data.table.

@MichaelChirico
Member

Well, there's probably your issue! (?)

I imagine data.table's own internal, optimized mean function is being called (search around for GForce a bit). That would mean the proper method dispatch is being skipped. Try adding the verbose=TRUE argument to confirm this.


@franknarf1
Contributor

@MichaelChirico Oh, that might be it, but the details aren't clear to me. GForce only kicks in for the vanilla data[, mean(date), by=session, verbose=TRUE] (which runs fine), while it is not active in data[, format(mean(date), "%b-%Y"), by = session, verbose=TRUE] (the call that errors).
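
The two calls can be compared side by side with a sketch like the following. The table dt and its values are hypothetical, and what the verbose output reports (and whether the second call errors at all) will depend on the installed data.table version, since this issue is marked as fixed by #3567. The second call is wrapped in tryCatch so the sketch runs either way.

```r
library( data.table )

# Small illustrative table (hypothetical values), same column names as in the thread.
dt <- data.table( session = c( 1, 1, 2, 2 ),
                  date = as.Date( c( "2016-01-01", "2016-01-03", "2016-05-01", "2016-05-03" ) ) )

# Plain grouped mean: the verbose output describes how j is optimized.
res_plain <- dt[ , mean( date ), by = session, verbose = TRUE ]

# mean() wrapped in format(): on affected versions this raised the 'trim' error.
# tryCatch returns the error message as a string instead of stopping.
res_wrapped <- tryCatch(
    dt[ , format( mean( date ), "%b-%Y" ), by = session, verbose = TRUE ],
    error = function( e ) conditionMessage( e )
)
print( res_wrapped )
```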

@rossholmberg
Author

rossholmberg commented Oct 16, 2016

Judging by the error, "invalid 'trim' argument", I think format is being called differently in each situation. Could this be the problem?

According to help(format), trim is a parameter in the "Default S3 method", but not in the "S3 method for class 'data.frame'". I believe that the latter (data.frame method) is being called in my second example, but the former (Default S3 method) is being called in both my first example (where there is no input to get confused for a trim parameter) and third (not working) example. I don't understand why the 'data.frame' method wouldn't be called in all cases here, but that difference seems to make sense based on the output.

In other words, I think that the method used by format is different when by is used. The "Default S3 method" is called when a by parameter is used, and the "method for class 'data.frame'" is called when no by parameter is used. Would that make sense?
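
One way to probe this in base R, without data.table, is to see what happens when S3 dispatch is bypassed for mean. This is only a sketch under the assumption that the grouped call somehow ends up treating the Date column like a plain vector; it does not show what data.table does internally. The variable names d, m and err are illustrative.

```r
d <- as.Date( c( "2016-01-01", "2016-01-03" ) )

# With normal S3 dispatch, mean() and format() use their Date methods, and the
# second argument of format() is interpreted as a date format string.
print( format( mean( d ), "%b-%Y" ) )

# is.numeric() is FALSE for Date objects, so calling mean.default directly
# (skipping dispatch) warns and returns NA_real_.
m <- suppressWarnings( mean.default( d ) )
print( is.na( m ) )

# A plain numeric NA carries no Date class, so format() falls through to
# format.default, where the second positional argument is 'trim'.
err <- tryCatch( format( m, "%b-%Y" ), error = function( e ) conditionMessage( e ) )
print( err )
```

If the captured message mentions 'trim', that would be consistent with the default format method receiving "%b-%Y" in the position of its trim argument.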

4 participants