-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove deep copy of indices from shallow() #4440
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4440 +/- ##
=======================================
Coverage 99.61% 99.61%
=======================================
Files 73 73
Lines 14112 14118 +6
=======================================
+ Hits 14057 14063 +6
Misses 55 55
Continue to review full report at Codecov.
|
This is great, I was recently worrying about using too much of |
// We copy all attributes that refer to column names so that calling setnames on either | ||
// the original or the shallow copy doesn't break anything. | ||
SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++; | ||
setAttrib(newdt, sym_index, shallow_duplicate(index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shallow_duplicate
was AFAIK designed to be used on VECSXP
, LISTSXP
, etc. You call it here on INTSXP
. It would be good to figure out proper/better? way to shallow copy index. PS. I do exactly the same in #4386.
Hope @mattdowle will have good idea how to handle that. Maybe we need to change index structure, and wrap it into extra list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at https://github.com/wch/r-source/blob/trunk/src/main/duplicate.c that doesn't seem to be an issue. It starts out exactly like duplicate
in that it makes a copy of the INTSXP
and then a shallow copy of the attributes. And the flow looks the same for R 3.1 as far as I can tell.
Initially, I had done this more manually by allocating a zero length INTSXP
and making a shallow copy of the attributes but this seemed more succinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made an issue that could help to address this overhead: #4467
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This INTSXP
being copied here is an empty one though (the dummy object upon which the indexes are attached as attributes). So I'm not sure what's improper here in the PR as it stands now. @jangorecki Do you mean that it's possible that shallow_duplicate()
might not work like this in future in R on this case of being passed INTSXP
? If so, that's a good spot but on balance it feels right to use it here for this. And CI will spot if and when R-devel makes any change to shallow_duplicate()
; e.g. via the new tests in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Matt. Good spot. Sorry for confusion. Indeed copy of integer()
(a placeholder for indices) is not a problem. The issue would be to copying of indices itself, which is not happening here. So the related items I was linking here are not really matching use case here.
We need tests that uses |
Yeah that makes sense. So comparing |
|
||
// We copy all attributes that refer to column names so that calling setnames on either | ||
// the original or the shallow copy doesn't break anything. | ||
SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protect AFAIK is not needed. dt
would need to be garbage collected before we will assign index as an attribute of newdt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably right but I wasn't quite sure so decided to follow the pattern of just a few lines later and take the cautious approach. Looking through the rest of the file, there are a few instances where the call to getAttrib
is not PROTECT
ed and many where it is. I also don't think that dt
could be garbage collected.
Anyway, I wouldn't mind taking it out but I don't think it hurts and then it begs the question of what to do further down in the function or in the rest of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I spotted in the code there were protects sometimes where it was not needed. It is not really an issue, but it spreads the pattern of overprotecting, as you see on yourself. I used to do it as well. Still, I might be wrong, but this will be verified when preparing release for CRAN and running strict memory tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the PROTECT statements and also went ahead and removed the one on what is now line 174 for consistency. I considered just going through the entire file while I was at it, but that seemed out of scope for the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thoughts and logic as you both. But rchk
revealed that getAttrib
can sometimes allocate inside it, perhaps more so now with ALTREP. Search for "A common source of true errors is a failure to protect the result of getAttrib when retrieving an attribute that may be automatically generated/converted (e.g. names, dimnames)" in https://github.com/kalibera/rchk/blob/master/doc/USAGE.md. Also search https://github.com/kalibera/rchk/blob/master/doc/INTERNALS.md for getAttrib
.
I suspect in data.table's usage of the R API, we will never see getAttrib
allocate, but R API is more general. So to pass rchk
(as per steps in CRAN_Release.cmd, and as required by CRAN under additional checks) we have to protect getAttrib
calls.
The variance in why some getAttrib
calls are not protected in data.table, may be that rchk
knows some cases of getAttrib
do not need to be protected depending on what the 2nd argument is.
In the past I've gone through and protected any getAttrib
calls that rchk
spots until it passes.
Removing over-zealous protection, as long as rchk
still passes, is worthwhile for speed and simpler code I agree. Perhaps rchk
could be added to GLCI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting, thanks. I don't mind reverting those changes at all. I just have two thoughts:
- From reading the docs you linked, the seemingly inconsistent calls to
PROTECT
may be related to garbage collection only being triggered by allocating calls as not all instances ofgetAttrib(*, R_NamesSymbol)
are currently protected. - As far as I can tell, the only instances where it does allocate are for rownames of a data.frame and with
R_NamesSymbol
when the first argument is a pairlist or a language object (butrchk
can't check the first argument)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not fully following there. But I did a grep and yes I see that you're right and not all
getAttrib(*, R_NamesSymbol)
are protected, for example inrbindlist.c
. It could be that those will be picked up when I rerunrchk
before release (perhaps those lines were added or changed in dev since 1.12.8), or more likely, from prior discussions with Thomas I suspect thatrchk
looks afterwards to the usage of the unprotectedgetAttrib
. If there is no possible GC between the getAttrib and its last usage, or if the result is protected by dint of being passed as the value tosetAttrib
, thenrchk
is clever enough (I guess) to not raise the unprotectedgetAttrib
. It does an extreme amount of tracking to spot unbalanced protection, for example, all statically (by looking at the source code without using runtime tests) and I've learnt not to underestimate how advanced it is. - That's one case. But isn't it possible too that column names are also some kind of ALTREP object of 1,000,000 columns, where even column names are not materialized either as well as the column data. Even
STRING_ELT
can allocate if the vector it is passed is an ALTREP.
If you need to benchmark this PR on a bigger machine, please prepare best-case and worst-case cases where I can just use |
It's not so much about needing a bigger machine, it's more about finding the time to come up with meaningful benchmarks. Just for shallow <- data.table:::shallow
set.seed(1L)
n <- 1e8
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)
system.time({
dt2 <- shallow(dt1)
})
# before
# user system elapsed
# 0.11 0.03 0.15
# now
# user system elapsed
# 0 0 0 And for some functions: library('data.table')
shallow <- data.table:::shallow
set.seed(1L)
n <- 5e6
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)
dt2 <- data.table(a=sample(1e2, 1e2))
setindex(dt2, a)
gc()
system.time({
for (i in seq_len(1e2))
fintersect(dt2, dt1)
})
# before
# user system elapsed
# 72.35 9.45 54.41
# now
# user system elapsed
# 69.65 8.80 49.03
system.time({
for (i in seq_len(1e2))
duplicated(dt1)
})
# before
# user system elapsed
# 16.25 4.11 14.28
# now
# user system elapsed
# 17.13 4.53 14.76
system.time({
for (i in seq_len(1e2))
frankv(dt1, 'a')
})
# before
# user system elapsed
# 65.17 7.88 50.44
# now
# user system elapsed
# 64.72 6.45 48.38
system.time({
for (i in seq_len(1e2))
all.equal(dt1, dt1)
})
# before
# user system elapsed
# 19.09 7.73 28.00
# now
# user system elapsed
# 15.23 5.46 21.61 |
GitHub Action + atime test to observe the performance regression fixed by PR Rdatatable#4440
Using my GitHub Action to observe the performance regression fixed by PR Rdatatable#4440
…nce the last commit of #4440 failed to check out
Close #4311.
This
still needs a news itemmight need some benchmarking to confirm that the other occurrences ofshallow()
had the same impact as they did on joins. But they should. Those would beAnd indirectly of course everything that calls these. But the overhead is only really noticeable when there are many calls as in the original report. I don't know when I'll be able to finish this so I'm putting it up as is for now.
In the code there was a reference to
example(merge)
breaking if the sorted attribute wasn't copied. That seems to have been due to a pattern of calling setnames internally in the past. But that pattern hasn't been in use for about five years and nothing currently breaks if I don't make that copy. I do it anyway and have added a test for it as I think it's expected to work, even ifshallow()
isn't exported.