Fixed duplicate column names in `merge()` when by.x in names(y) #2631

sritchie73 · 2018-02-17T01:12:26Z

When joining two data.tables using merge() the resulting data.table will contain duplicate column names if by.x != by.y and by.x is also in names(y).

An example:

# Create example data.frames
parents <- data.table(name=c("Sarah", "Max", "Qin", "Lex"), 
                      sex=c("F", "M", "F", "M"), 
                      age=c(41, 43, 36, 51))
children <- data.table(parent=c("Sarah", "Max", "Qin"), 
                       name=c("Oliver", "Sebastian", "Kai-lee"),
                       sex=c("M", "M", "F"), 
                       age=c(5,8,7))

# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")

Output:

   name sex.x age.x      name sex.y age.y
1   Max     M    43 Sebastian     M     8
2   Qin     F    36   Kai-lee     F     7
3 Sarah     F    41    Oliver     M     5

This behaviour is also present in base:::merge.data.frame(), but throws an additional warning:

Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
  column name ‘name’ is duplicated in the result

This patch fixes this problem by checking for names shared between by.x and names(y), and adding the appropriate suffixes to those column names.

sritchie73 · 2018-02-17T01:33:36Z

Also added an equivalent warning message to base for cases when duplicate column names are returned. With the patch above, the only way this can happen is if the user supplies identical values to the suffixes argument (e.g. suffixes=c("", "")).

HughParsonage · 2018-02-17T01:42:27Z

inst/tests/tests.Rraw

+joined = merge(parents, children, by.x="name", by.y="parent")                       
+test(1877.1, length(names(joined)), length(unique(names(joined)))) 
+test(1877.2, merge(parents, children, by.x="name", by.y="parent", suffixes=c("",""),
+     warning = "column names 'name', 'sex', 'age' are duplicated in the result"), 


Trailing , should be removed.

sritchie73 · 2018-02-17T04:05:18Z

RE: failed checks: I can't work out how to write the test that checks for the appropriate warning message - would appreciate advice here since there's no documentation for data.table:::test().

I also can't run the tests on my local machine because the package fails to compile (I don't have OpenMP).

mattdowle · 2018-02-17T17:32:12Z

Looks good. On test(), search tests.Rraw for warning= to see other examples in other tests to follow. You can also pass output= and/or error= to search for patterns in those messages.
It should compile ok without OpenMP as it should detect if OpenMP is available and switch out. How are you compiling and what is the message it fails with at what point? This is my dev cycle: cc.R and Pasha added a Makefile (both linked in item 8 here).

This test now fails because the output of merge.data.table does not match the output of merge.data.frame because base::merge.data.frame still leads to duplicate column names where by.x is in names(y).

codecov-io · 2018-02-18T02:52:39Z

Codecov Report

Merging #2631 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2631      +/-   ##
==========================================
+ Coverage   93.13%   93.14%   +<.01%     
==========================================
  Files          61       61              
  Lines       12120    12130      +10     
==========================================
+ Hits        11288    11298      +10     
  Misses        832      832

Impacted Files	Coverage Δ
R/merge.R	`94.73% <100%> (+0.79%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2386ab...8ad0242. Read the comment docs.

sritchie73 · 2018-02-18T03:06:41Z

Thanks @mattdowle, I couldn't get the package to build using RStudio package build system - but I probably have something misconfigured in my environment. I was able to get the tests to work with a bit of trial and error.

MichaelChirico · 2018-02-18T04:12:43Z

I added a little bit to the Contributing wiki to make it a bit easier to understand how the testing regime works. Thanks for the PR.

MarkusBonsch · 2018-02-18T06:19:06Z

I love it, although it breaks consistency with merge.data.frame. There are quite some other issues with merge.data.table at the moment (#2593). Once these are all fixed (I am working on it), we have to update the documentation to clearly reflect, where we have differences to merge.data.frame.

MichaelChirico · 2018-02-18T06:53:54Z

@sritchie73 I assume the Rstudio issue is the same as #2585. I've been using the command line to re-install lately

sritchie73 · 2018-02-18T09:48:01Z

Thanks @MarkusBonsch , for what its worth I've also sent a similar patch to the R-devel mailing list for merge.data.frame - but the two people who responded (neither core dev team) were pessimistic about my chances of getting the patch accepted.

@MichaelChirico I get a different error:

==> R CMD INSTALL --no-multiarch --with-keep.source data.table

* installing to library ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library’
* installing *source* package ‘data.table’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG   -I/usr/local/include  -fopenmp -fPIC  -Wall -g -O2  -c assign.c -o assign.o
clang: error: unsupported option '-fopenmp'
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/data.table’

Exited with status 1.

In the end I managed to get data.table:::test() working almost the same way in my local R session, by copying the file preamble, then fixed the remaining issues that came up in Travis after pushing the changes out to my branch.

jangorecki

If this PR makes merge.data.table to be less consistent to data.frame method then please describe that in manual. You can link your R-devel patch there also so we (anyone) can track it later on if it happens to be merged to R-devel.

sritchie73 · 2018-02-18T20:21:16Z

Here is the thread on R-devel: http://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html

There is now a suggestion of just adding the suffix to the column name in y to keep backwards compatibility (i.e. any by.x column can still be referred to by its original name). If that patch is accepted I can similarly update merge.data.table to have this behaviour also.

jangorecki

LGTM. You sometimes use paste(, sep="") and another time paste0, as we now depends on R 3.1 we can safely use paste0, but this is minor thing. We can wait a little bit and see what R-devel will decide.

sritchie73 · 2018-02-18T23:13:39Z

Thanks @jangorecki - I had used paste rather than paste0 in the first instance to match the code style for the existing suffix handling which immediately precedes my patch. Would you prefer I change it to paste0, or keep it consistent with the code preceding it?

jangorecki · 2018-02-18T23:29:09Z

Lets wait for R-devel, paste is not that important

mattdowle · 2018-02-23T00:42:21Z

Given Martin Maechler's reply today, it's looking likely to be accepted for R. Wow! Well navigated! I updated manual page accordingly (including link to thread too as Jan suggested) and will merge.
Will invite you to be project member too.

sritchie73 · 2018-02-23T01:34:27Z

Thanks!

I am following up with Martin to clarify the functionality of the proposed no.dups argument so we can mirror the behaviour of merge.data.frame.

sritchie73 added 4 commits February 17, 2018 11:40

Fixed duplicate column names in merge when by.x in names(y)

05465b0

added test

79653e7

added NEWS entry

091251d

merge throws warning when duplicate names in result

b6aaa98

HughParsonage reviewed Feb 17, 2018

View reviewed changes

syntax typo

f9c1380

sritchie73 added 5 commits February 18, 2018 12:54

warning test

560b390

fixing merge conflicts with master

1c12786

warning test output comparison no longer generates warning

f81eaa5

integer vs double failed check fixed

2ec4526

Removed test 1543.3

2006b14

This test now fails because the output of merge.data.table does not match the output of merge.data.frame because base::merge.data.frame still leads to duplicate column names where by.x is in names(y).

jangorecki requested changes Feb 18, 2018

View reviewed changes

Updated merge documentation

12fb8ac

jangorecki approved these changes Feb 18, 2018

View reviewed changes

mattdowle added 3 commits February 22, 2018 13:21

Merge branch 'master' into merge-key-suffix

81cb55d

Minor wording: 'function' => 'data.table method'

b3d287a

Added link and status to merge.Rd

3f79e7d

mattdowle added 2 commits February 22, 2018 16:46

Fixed my bracket typo

a483996

Moved new paragraph to Details section.

8ad0242

mattdowle merged commit b392bfe into Rdatatable:master Feb 23, 2018

sritchie73 mentioned this pull request Mar 3, 2018

no.dups argument to merge.data.table #2653

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed duplicate column names in `merge()` when by.x in names(y) #2631

Fixed duplicate column names in `merge()` when by.x in names(y) #2631

sritchie73 commented Feb 17, 2018

sritchie73 commented Feb 17, 2018

HughParsonage Feb 17, 2018

sritchie73 commented Feb 17, 2018

mattdowle commented Feb 17, 2018 •

edited

Loading

codecov-io commented Feb 18, 2018 •

edited

Loading

sritchie73 commented Feb 18, 2018

MichaelChirico commented Feb 18, 2018

MarkusBonsch commented Feb 18, 2018

MichaelChirico commented Feb 18, 2018

sritchie73 commented Feb 18, 2018

jangorecki left a comment

sritchie73 commented Feb 18, 2018 •

edited

Loading

jangorecki left a comment •

edited

Loading

sritchie73 commented Feb 18, 2018

jangorecki commented Feb 18, 2018

mattdowle commented Feb 23, 2018

sritchie73 commented Feb 23, 2018

Fixed duplicate column names in merge() when by.x in names(y) #2631

Fixed duplicate column names in merge() when by.x in names(y) #2631

Conversation

sritchie73 commented Feb 17, 2018

sritchie73 commented Feb 17, 2018

HughParsonage Feb 17, 2018

Choose a reason for hiding this comment

sritchie73 commented Feb 17, 2018

mattdowle commented Feb 17, 2018 • edited Loading

codecov-io commented Feb 18, 2018 • edited Loading

Codecov Report

sritchie73 commented Feb 18, 2018

MichaelChirico commented Feb 18, 2018

MarkusBonsch commented Feb 18, 2018

MichaelChirico commented Feb 18, 2018

sritchie73 commented Feb 18, 2018

jangorecki left a comment

Choose a reason for hiding this comment

sritchie73 commented Feb 18, 2018 • edited Loading

jangorecki left a comment • edited Loading

Choose a reason for hiding this comment

sritchie73 commented Feb 18, 2018

jangorecki commented Feb 18, 2018

mattdowle commented Feb 23, 2018

sritchie73 commented Feb 23, 2018

Fixed duplicate column names in `merge()` when by.x in names(y) #2631

Fixed duplicate column names in `merge()` when by.x in names(y) #2631

mattdowle commented Feb 17, 2018 •

edited

Loading

codecov-io commented Feb 18, 2018 •

edited

Loading

sritchie73 commented Feb 18, 2018 •

edited

Loading

jangorecki left a comment •

edited

Loading