Strange behavior when subsetting with column names quoted with backticks in I of data.table #2931

Closed
mt1022 opened this issue Jun 12, 2018 · 2 comments · Fixed by #3094

mt1022 commented Jun 12, 2018

Subsetting in i on a column whose name must be quoted with backticks (for example, a name containing operators) throws an error. See the following reproducible example:

library(data.table)

DT <- data.table(id = letters[1:3], `counts(a>=0)` = 1:3)

DT[`counts(a>=0)` == 2] 
#> Error in `[.data.table`(DT, `counts(a>=0)` == 2): Column(s) [counts(a] not found in x

This has been posted on Stack Overflow. The issue seems to exist in v1.11.+ (as found by David Arenburg and PKumar) but not in previous releases. I tried the development version and got the same error. David Arenburg investigated and found that it might be caused by .prepareFastSubset (see the comments in the linked SO post for details), which I think will be very helpful for fixing this issue.

Some workarounds, such as DT[as.numeric(`counts(a>=0)`) == 2] and DT[(`counts(a>=0)`) == 2], suggested by Aurelien callens and Marius, work fine.
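For reference, here is a minimal sketch of the two workarounds applied to the example table. It assumes data.table is installed. The comment about why they help is a presumption based on the discussion of .prepareFastSubset, not something confirmed in the source code.

```r
library(data.table)

DT <- data.table(id = letters[1:3], `counts(a>=0)` = 1:3)

# Workaround 1: wrap the column in parentheses.
# Presumably the left-hand side is then a call rather than a bare
# column symbol, so the optimized subsetting path is not taken.
res_paren <- DT[(`counts(a>=0)`) == 2]

# Workaround 2: wrap the column in a function call.
res_cast <- DT[as.numeric(`counts(a>=0)`) == 2]

# Both should select the same single row (id "b").
print(res_paren)
print(res_cast)
```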

Here is the session info:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.5

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4    yaml_2.1.17

jangorecki added this to the 1.11.6 milestone Jun 12, 2018

MarkusBonsch (Contributor) commented

As @DavidArenburg correctly pointed out, this is most likely a regression introduced by my subsetting optimization. There, I introduced the function .prepareFastSubset, which inspects i (or, more correctly, isub, the substituted i) to determine whether it supports fast subsetting. This check seems to fail for the edge case shown here. I am currently on vacation and have little time to investigate, but will try my best.
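To make "the substituted i" concrete, here is a base R sketch of what such an unevaluated expression looks like for the failing call and for the parenthesized workaround. The variable names isub and isub_paren are used only for illustration, and this is not the actual .prepareFastSubset code.

```r
# The unevaluated i expression from the failing call
isub <- quote(`counts(a>=0)` == 2)

# The expression is a call to `==`; its left operand is a bare symbol
print(isub[[1]])            # the operator
print(is.name(isub[[2]]))   # TRUE: a plain column symbol

# Converting the symbol to a string drops the backticks, so any later
# text-based processing sees operators inside the column name
col_name <- as.character(isub[[2]])
print(col_name)

# With the parenthesized workaround, the left operand is a call to `(`
isub_paren <- quote((`counts(a>=0)`) == 2)
print(is.call(isub_paren[[2]]))  # TRUE: no longer a bare symbol
```

The difference between a bare symbol and a call on the left-hand side is one plausible reason the workarounds avoid the error.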

MarkusBonsch (Contributor) commented

I did some further investigation: the problem seems to be that column names containing mathematical operators cannot be used in the on clause of joins: DT[DT, on = "counts(a>=0)==counts(a>=0)"] fails. I will see if I can change that, but it could be difficult.
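A base R sketch of why a string-valued on clause is ambiguous for such names: splitting at the first operator gives a truncated column name matching the one in the reported error message. The operator pattern below is an assumption made for illustration. It is not data.table's actual parsing code.

```r
on_clause <- "counts(a>=0)==counts(a>=0)"

# Hypothetical operator pattern, for illustration only
ops <- "==|<=|>=|<|>"

# Locate the first operator in the string
m <- regexpr(ops, on_clause)
first_op <- regmatches(on_clause, m)

# Everything before that operator would be read as the left-hand column
lhs <- substr(on_clause, 1L, as.integer(m) - 1L)

print(first_op)
print(lhs)
```

The first operator found is the ">=" inside the column name, not the intended "==", so the left-hand side is cut off at "counts(a".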
