Fixed duplicate column names in merge() when by.x in names(y) #2631
Also added an equivalent warning message to base for cases when duplicate column names are returned. With the patch above, the only way this can happen is if the user supplies identical values to the suffixes argument (e.g. |
inst/tests/tests.Rraw
Outdated
joined = merge(parents, children, by.x="name", by.y="parent")
test(1877.1, length(names(joined)), length(unique(names(joined))))
test(1877.2, merge(parents, children, by.x="name", by.y="parent", suffixes=c("",""),
warning = "column names 'name', 'sex', 'age' are duplicated in the result"),
Trailing , should be removed.
RE: failed checks: I can't work out how to write the test that checks for the appropriate warning message - would appreciate advice here since there's no documentation for data.table:::test(). I also can't run the tests on my local machine because the package fails to compile (I don't have OpenMP).
Looks good. On |
This test now fails because the output of merge.data.table does not match the output of merge.data.frame because base::merge.data.frame still leads to duplicate column names where by.x is in names(y).
Codecov Report
@@            Coverage Diff            @@
##           master    #2631    +/-   ##
==========================================
+ Coverage   93.13%   93.14%   +<.01%
==========================================
  Files          61       61
  Lines       12120    12130      +10
==========================================
+ Hits        11288    11298      +10
  Misses        832      832
Thanks @mattdowle, I couldn't get the package to build using the RStudio package build system - but I probably have something misconfigured in my environment. I was able to get the tests to work with a bit of trial and error.
I added a little bit to the Contributing wiki to make it a bit easier to understand how the testing regime works. Thanks for the PR.
I love it, although it breaks consistency with |
@sritchie73 I assume the RStudio issue is the same as #2585. I've been using the command line to re-install lately.
Thanks @MarkusBonsch, for what it's worth I've also sent a similar patch to the R-devel mailing list for merge.data.frame - but the two people who responded (neither on the core dev team) were pessimistic about my chances of getting the patch accepted. @MichaelChirico I get a different error:
In the end I managed to get data.table:::test() working almost the same way in my local R session by copying the file preamble, then fixed the remaining issues that came up in Travis after pushing the changes out to my branch.
If this PR makes merge.data.table less consistent with the data.frame method, then please describe that in the manual. You can also link your R-devel patch there so we (anyone) can track it later on if it happens to be merged into R-devel.
Here is the thread on R-devel: http://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html There is now a suggestion of just adding the suffix to the column name in y to keep backwards compatibility (i.e. any by.x column can still be referred to by its original name). If that patch is accepted I can similarly update merge.data.table to have this behaviour as well.
LGTM. You sometimes use paste(, sep="") and other times paste0; as we now depend on R 3.1 we can safely use paste0, but this is a minor thing. We can wait a little bit and see what R-devel will decide.
Thanks @jangorecki - I had used |
Let's wait for R-devel, paste is not that important.
Given Martin Maechler's reply today, it's looking likely to be accepted for R. Wow! Well navigated! I updated the manual page accordingly (including a link to the thread too, as Jan suggested) and will merge.
Thanks! I am following up with Martin to clarify the functionality of the proposed |
When joining two data.tables using merge(), the resulting data.table will contain duplicate column names if by.x != by.y and by.x is also in names(y).
An example:
Output:
This behaviour is also present in base:::merge.data.frame(), but throws an additional warning:
This patch fixes this problem by checking for names shared between by.x and names(y), and adding the appropriate suffixes to those column names.
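A minimal sketch of the scenario described above, assuming hypothetical `parents` and `children` tables modelled on the test case in this PR (the data values are made up for illustration) and a data.table version that already includes this patch. Here by.x ("name") differs from by.y ("parent"), and "name" is also a column of the second table:

```r
library(data.table)

# Hypothetical data: children$parent refers to parents$name,
# and both tables also carry their own name, sex and age columns.
parents  <- data.table(name = c("Sarah", "Max"),
                       sex  = c("F", "M"),
                       age  = c(41, 43))
children <- data.table(parent = c("Sarah", "Max", "Sarah"),
                       name   = c("Oliver", "Sebastian", "Amelia"),
                       sex    = c("M", "M", "F"),
                       age    = c(5, 8, 3))

# by.x != by.y, and by.x ("name") is also in names(children)
joined <- merge(parents, children, by.x = "name", by.y = "parent")

# Inspect the resulting column names
print(names(joined))

# anyDuplicated() returns 0 when every column name is unique
print(anyDuplicated(names(joined)))
```

On a version without the patch, the same call is the case that produced two columns called `name`, which makes `joined$name` ambiguous.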