Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing bug in 'on' with complex variable names #3094

Merged
merged 14 commits into from
Oct 7, 2018
Merged

Conversation

MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Oct 4, 2018

Closes #3092
Closes #2931

Changes
The logic, how the 'on' string gets parsed and variable names + operators are extracted has been rewritten in order to:

  • fix bug where complex variable names with operators in the on clause caused an error, e.g. colA>=1
  • allow white spaces around the operators in an on clause
    Many thanks to @telenz for figuring out the algorithm.

Tests

  • No existing tests changed.
  • 8 new tests added

I am assuming that the documented standard behaviour of 'on' is thoroughly tested already. Therefore, new tests only comprise situations with complex variable names including operators like colA>=1 and 'on' clauses with white spaces around the operator.

Performance
No significant additional overhead. Tests in tests.Rraw run as fast as before. The following example with a rather complex 'on' clause shows no speed impact of back-tick 'ed variable names:

library(data.table)
library(microbenchmark)
DT <- data.table(`a1>=` = 1:3, `<=b` = 1:3, `==c` = 1:3, `d==` = 1:3)
DT2 <- data.table(a = 1:3, b = 1:3, c = 1:3, d = 1:3)
i  <- data.table(`a1>=i` = 1:3, `<=bi` = 1:3, `==ci` = 1:3, `d==i` = 1:3)
i2  <- data.table(ai = 1:3, bi = 1:3, ci = 1:3, di = 1:3)
DT[i, on = c("`a1>=` == `a1>=i`", "`<=b` == `<=bi`", "`==c` == `==ci`", "`d==` == `d==i`")]
test <- microbenchmark(backTick = DT[i, on = c("`a1>=` == `a1>=i`", "`<=b` == `<=bi`", "`==c` == `==ci`", "`d==` == `d==i`")], 
                       noBackTick = DT2[i2, on = c("a==ai", "b==bi", "c==ci", "d==di")],
                       times = 100)
# Unit: milliseconds
#        expr        min         lq             mean       median           uq      max            neval
#      backTick 1.698332 1.776262 1.964937 1.840638 1.911018 5.127662   100
#  noBackTick 1.641280 1.728642 1.881203 1.798994 1.873094 5.633216   100

@MarkusBonsch MarkusBonsch requested a review from mattdowle October 4, 2018 07:52
@codecov
Copy link

codecov bot commented Oct 4, 2018

Codecov Report

Merging #3094 into master will increase coverage by 0.04%.
The diff coverage is 98.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3094      +/-   ##
==========================================
+ Coverage   90.92%   90.97%   +0.04%     
==========================================
  Files          61       61              
  Lines       11817    11844      +27     
==========================================
+ Hits        10745    10775      +30     
+ Misses       1072     1069       -3
Impacted Files Coverage Δ
R/data.table.R 92.21% <98.36%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6ac844...d4c20aa. Read the comment docs.

@jangorecki
Copy link
Member

wow, this is great example of high quality PR.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I would extract parse_on into top level of R file, so the [ will gets a little smaller and more read-able. Unless you strongly agree with me, no need to proceed with this as it is minor syntactic sugar stuff.

@MarkusBonsch
Copy link
Contributor Author

I tend to strongly agree with you @jangorecki :). Therefore, I have factored it out in the latest commit.
Thanks for the review and the suggestion!

@mattdowle mattdowle added this to the 1.12.0 milestone Oct 5, 2018
Copy link
Member

@mattdowle mattdowle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just needs coverage of the two stop()s and one branch to achieve 100% diff coverage please.

@MarkusBonsch
Copy link
Contributor Author

Thanks @mattdowle for the review. I have adapted the tests so that all stops and branches are hit. Still, diff coverage is only 98.3 %. Looking into it suggests, that a curly bracket after an if statement is missed even though the if statement is hit in general. This seems to be a bug in codecov, IMHO. So I think, we have full coverage here.

@mattdowle
Copy link
Member

I filed the covr issue here : r-lib/covr#339

@mattdowle mattdowle merged commit afb99aa into master Oct 7, 2018
@mattdowle mattdowle deleted the backTickFix branch October 7, 2018 07:08
@jangorecki
Copy link
Member

we need to replace trimws as R 3.1 raises

Error in .parse_on(substitute(on), isnull_inames) : 
  could not find function "trimws"
Calls: all.equal ... all.equal.data.table -> [ -> [.data.table -> .parse_on

@MichaelChirico
Copy link
Member

trimws is a very simple helper function... we should add it to utils if session R version < 3.2

@mattdowle
Copy link
Member

To link it for completeness, trimws was backported here: #3114

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants