Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

X[Y] Merge Produces Wrong Output When nomatch=0 and roll=T #700

Closed
my-R-help opened this issue Jun 20, 2014 · 11 comments
Closed

X[Y] Merge Produces Wrong Output When nomatch=0 and roll=T #700

my-R-help opened this issue Jun 20, 2014 · 11 comments
Assignees
Milestone

Comments

@my-R-help
Copy link

This unexpected behavior is best illustrated with an example. (Reproducible code is further below.) With nomatch=0 and roll=T we should have three rows in the sapply output, but in fact we have five rows (except for me having three rows).

> CS
   LPERMCO   datadate         me
1:       7 2013-07-26 626550.353
2:      33 2013-07-26   7766.385
> SP
   PERMCO       date       RET
1:      7 2013-06-28 -0.118303
2:      7 2013-07-31  0.141225
3:     33 2013-06-28 -0.031370
4:     33 2013-06-28 -0.025330
5:     33 2013-07-31  0.045967
6:     33 2013-07-31  0.043694
> CS[SP, nomatch = 0, roll = T]
   LPERMCO   datadate         me       RET
1:       7 2013-07-31 626550.353  0.141225
2:      33 2013-06-28   7766.385 -0.031370
3:      33 2013-06-28   7766.385 -0.025330
4:      33 2013-07-31 626550.353  0.045967
5:      33 2013-07-31   7766.385  0.043694
Warning message:
In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate =
c("2013-07-31",  :
  number of rows of result is not a multiple of vector length (arg 3)
> sapply(CS[SP, nomatch = 0, roll = T], length)
 LPERMCO datadate       me      RET
       5        5        3        5

Here's the code to run this example:

CS <-
  data.table(
    structure(list(LPERMCO = c(7L, 33L), datadate = structure(c(15912,
15912), class = "Date"), me = c(626550.35284, 7766.385)), .Names =
c("LPERMCO",
"datadate", "me"), class = "data.frame", row.names = c(NA, -2L
)),
    key = "LPERMCO,datadate")
SP <-
  data.table(
    structure(list(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L), date =
structure(c(15884,
15917, 15884, 15884, 15917, 15917), class = "Date"), RET = c(-0.118303,
0.141225, -0.03137, -0.02533, 0.045967, 0.043694)), .Names = c("PERMCO",
"date", "RET"), class = "data.frame", row.names = c(NA, -6L)),
    key = "PERMCO,date")
sapply(CS[SP, nomatch = 0, roll = T], length)

As a side note (not sure about it), maybe it makes sense to add a little check into data.table to catch related bugs. The check would be run, after each operation on any data.table X like this (together with a message to contact the devs to report a bug):

stopifnot((lgth <- sapply(X, length)) == lgth[1]); rm(lgth)

I don't think it would be prohibitively expensive to run, and it would help to catch related bugs in the future based on real data that people are running. I'm mentioning this since I found this bug only by accident, and if such a check would have been in place it would have been much easier to notice.

@arunsrinivasan
Copy link
Member

@my-R-help, thanks for the report. On your side note, if you just did:

CS[SP, nomatch=0L, roll=TRUE]

It does warn you:

Warning message:
# In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate = c("2013-07-31",  :
#   number of rows of result is not a multiple of vector length (arg 3)

It just disappears when you wrap it with sapply.

@my-R-help
Copy link
Author

What you wrote is true. However, what I did originally was something like this to save the result into a new data.table (which is a common use pattern, I believe), and here no warning appeared:

M <- CS[SP, nomatch = 0, roll = T]

@arunsrinivasan
Copy link
Member

Michael,
Right. Typing M again shows the warning. But that's of course not practical. Yes, this shouldn't happen. Point noted :).

arunsrinivasan added a commit that referenced this issue Jun 20, 2014
@my-R-help
Copy link
Author

The patch still gives different column lengths when run on the example code above.

@arunsrinivasan
Copy link
Member

@my-R-help, I am not sure. This is what I get:

> CS[SP, roll=TRUE, nomatch=0L]
   LPERMCO   datadate         me      RET
1:       7 2013-07-31 626550.353 0.141225
2:      33 2013-07-31   7766.385 0.045967
3:      33 2013-07-31   7766.385 0.043694

sapply(CS[SP, roll=TRUE, nomatch=0L], length)
 LPERMCO datadate       me      RET 
       3        3        3        3 

How did you install, may I ask? The tests pass as well, which is basically your example, with key for i set and not-set.

@my-R-help
Copy link
Author

I used the following command. Did I miss anything? (I'm still getting the bug.)

devtools::install_github("data.table", "Rdatatable")

@arunsrinivasan arunsrinivasan added this to the v1.9.4 milestone Jun 21, 2014
@arunsrinivasan
Copy link
Member

The commit is not available on the master branch yet. It's on branch issue_700.

@my-R-help
Copy link
Author

Thanks, using the following installation method, it works now.

devtools::install_github("data.table", "Rdatatable", ref = "issue_700")

@arunsrinivasan
Copy link
Member

Nice, but the branch was created until we can confirm that the issue is fixed. I'd not recommend installing non-master branches.

@my-R-help
Copy link
Author

Sure, I did it just for testing and reverted later to 1.9.2. How would you recommend to install this branch temporarily?

@arunsrinivasan
Copy link
Member

That's good. I should've mentioned "not recommend staying on non-master branches permanently".
👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants