Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deep copy of indices from shallow() #4440

Merged
merged 12 commits into from
Jun 26, 2020
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,9 @@ unit = "s")

10. Starting from 4.0.0, data.table is using R's `rbind` and `cbind` methods, as described in v1.12.6 news entry. Support for R 3.x.x is resolved when processing `NAMESPACE` file, at install time, or at the time of building package binaries. As a result, users on R 3.x.x, if installing from binaries, must use binaries built by R 3.x.x, and users on R 4.x.x, if installing from binaries, must use binaries built by R 4.x.x. Users will see `package ‘data.table’ was built under R version...` warning when this happen. Thanks to @vinhdizzo for reporting in [#4528](https://github.com/Rdatatable/data.table/issues/4528).

11. Internal function `shallow()` no longer makes a deep copy of secondary indices. This eliminates a relatively small time and memory overhead when indices are present that added up significantly when performing many operations, such as joins, in a loop or when joining in `j` by group, [#4311](https://github.com/Rdatatable/data.table/issues/4311). Many thanks to @renkun-ken for the report, and @tlapak for the investigation and PR.


# data.table [v1.12.8](https://github.com/Rdatatable/data.table/milestone/15?closed=1) (09 Dec 2019)

## NEW FEATURES
Expand Down
33 changes: 24 additions & 9 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -12871,30 +12871,32 @@ ids <- sample(letters[1:3], 10, replace=TRUE)
scores <- rnorm(10)
dt <- data.table(id=ids, score=scores)
dt.s4 <- new("Data.Table", data.table(id=ids, score=scores))
test(1914.1, isS4(dt.s4))
test(1914.2, inherits(dt.s4, 'data.table'))
test(1914.01, isS4(dt.s4))
test(1914.02, inherits(dt.s4, 'data.table'))
# Test possible regression. shallow() needs to preserve the S4 bit to support S4 classes that contain data.table
test(1914.03, isS4(shallow(dt.s4)))
## pull out data from S4 as.list, and compare to list from dt
dt.s4.list <- dt.s4@.Data
names(dt.s4.list) <- names(dt.s4)
test(1914.3, dt.s4.list, as.list(dt)) # Underlying data not identical
test(1914.04, dt.s4.list, as.list(dt)) # Underlying data not identical
# simple S4 conversion-isms work
df = data.frame(a=sample(letters, 10), b=1:10)
dt = as.data.table(df)
test(1914.4, identical(as(df, 'data.table'), dt))
test(1914.5, identical(as(dt, 'data.frame'), df))
test(1914.05, identical(as(df, 'data.table'), dt))
test(1914.06, identical(as(dt, 'data.frame'), df))
# data.table can be used in an S4 slot
dt <- data.table(a=sample(letters[1:3], 10, replace=TRUE), score=rnorm(10))
dt.comp <- new("S4Composition", data=dt)
test(1914.6, dt.comp@data, dt)
test(1914.07, dt.comp@data, dt)
# S4 methods dispatch properly on data.table slots"
dt <- data.table(a=sample(letters[1:3], 10, replace=TRUE), score=rnorm(10))
dt.comp <- new("S4Composition", data=dt)
setGeneric("dtGet", function(x, what) standardGeneric("dtGet"))
setMethod("dtGet", c(x="S4Composition", what="missing"), function(x, what){x@data})
setMethod("dtGet", c(x="S4Composition", what="ANY"), function(x, what) {x@data[[what]]})
test(1914.7, dtGet(dt.comp), dt) # actually
test(1914.8, identical(dtGet(dt.comp, 1), dt[[1]]))
test(1914.9, identical(dtGet(dt.comp, 'b'), dt$b))
test(1914.08, dtGet(dt.comp), dt) # actually
test(1914.09, identical(dtGet(dt.comp, 1), dt[[1]]))
test(1914.10, identical(dtGet(dt.comp, 'b'), dt$b))
removeClass("Data.Table") # so that test 1914.2 passes on the second run of cc() in dev
removeClass("S4Composition")
# END port of old testthat tests
Expand Down Expand Up @@ -16999,3 +17001,16 @@ B = data.table(x=1:2)
X = A == B
A[, y := 3:4]
test(2148, colnames(X), c('x'))

# shallow() shouldn't take a deep copy of indices, #4311
dt <- data.table(a = c(3, 1))
setindex(dt, a)
dt2 <- shallow(dt)
test(2149.1, address(attr(attr(dt, 'index'), '__a')), address(attr(attr(dt2, 'index'), '__a')))
# Testing possible future regression. shallow() needs to copy the names of indices and keys.
setnames(dt2, 'a', 'A')
test(2149.2, indices(dt), 'a')
setkey(dt, a)
dt2 <- shallow(dt)
setnames(dt2, 'a', 'A')
test(2149.3, key(dt), 'a')
20 changes: 16 additions & 4 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -152,13 +152,25 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
R_len_t i,l;
int protecti=0;
SEXP newdt = PROTECT(allocVector(VECSXP, n)); protecti++; // to do, use growVector here?
//copyMostAttrib(dt, newdt); // including class
DUPLICATE_ATTRIB(newdt, dt);
SET_ATTRIB(newdt, shallow_duplicate(ATTRIB(dt)));
SET_OBJECT(newdt, OBJECT(dt));
IS_S4_OBJECT(dt) ? SET_S4_OBJECT(newdt) : UNSET_S4_OBJECT(newdt); // To support S4 objects that incude data.table
//SHALLOW_DUPLICATE_ATTRIB(newdt, dt); // SHALLOW_DUPLICATE_ATTRIB would be a bit neater but is only available from R 3.3.0

// TO DO: keepattr() would be faster, but can't because shallow isn't merely a shallow copy. It
// also increases truelength. Perhaps make that distinction, then, and split out, but marked
// so that the next change knows to duplicate.
// Does copyMostAttrib duplicate each attrib or does it point? It seems to point, hence DUPLICATE_ATTRIB
// for now otherwise example(merge.data.table) fails (since attr(d4,"sorted") gets written by setnames).
// keepattr() also merely points to the entire attrbutes list and thus doesn't allow replacing
// some of its elements.

// We copy all attributes that refer to column names so that calling setnames on either
// the original or the shallow copy doesn't break anything.
SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protect AFAIK is not needed. dt would need to be garbage collected before we will assign index as an attribute of newdt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right but I wasn't quite sure so decided to follow the pattern of just a few lines later and take the cautious approach. Looking through the rest of the file, there are a few instances where the call to getAttrib is not PROTECTed and many where it is. I also don't think that dt could be garbage collected.

Anyway, I wouldn't mind taking it out but I don't think it hurts and then it begs the question of what to do further down in the function or in the rest of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I spotted in the code there were protects sometimes where it was not needed. It is not really an issue, but it spreads the pattern of overprotecting, as you see on yourself. I used to do it as well. Still, I might be wrong, but this will be verified when preparing release for CRAN and running strict memory tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the PROTECT statements and also went ahead and removed the one on what is now line 174 for consistency. I considered just going through the entire file while I was at it, but that seemed out of scope for the PR.

Copy link
Member

@mattdowle mattdowle Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thoughts and logic as you both. But rchk revealed that getAttrib can sometimes allocate inside it, perhaps more so now with ALTREP. Search for "A common source of true errors is a failure to protect the result of getAttrib when retrieving an attribute that may be automatically generated/converted (e.g. names, dimnames)" in https://github.com/kalibera/rchk/blob/master/doc/USAGE.md. Also search https://github.com/kalibera/rchk/blob/master/doc/INTERNALS.md for getAttrib.
I suspect in data.table's usage of the R API, we will never see getAttrib allocate, but R API is more general. So to pass rchk (as per steps in CRAN_Release.cmd, and as required by CRAN under additional checks) we have to protect getAttrib calls.
The variance in why some getAttrib calls are not protected in data.table, may be that rchk knows some cases of getAttrib do not need to be protected depending on what the 2nd argument is.
In the past I've gone through and protected any getAttrib calls that rchk spots until it passes.
Removing over-zealous protection, as long as rchk still passes, is worthwhile for speed and simpler code I agree. Perhaps rchk could be added to GLCI.

Copy link
Contributor Author

@tlapak tlapak Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting, thanks. I don't mind reverting those changes at all. I just have two thoughts:

  1. From reading the docs you linked, the seemingly inconsistent calls to PROTECT may be related to garbage collection only being triggered by allocating calls as not all instances of getAttrib(*, R_NamesSymbol) are currently protected.
  2. As far as I can tell, the only instances where it does allocate are for rownames of a data.frame and with R_NamesSymbol when the first argument is a pairlist or a language object (but rchk can't check the first argument)

Copy link
Member

@mattdowle mattdowle Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not fully following there. But I did a grep and yes I see that you're right and not all getAttrib(*, R_NamesSymbol) are protected, for example in rbindlist.c. It could be that those will be picked up when I rerun rchk before release (perhaps those lines were added or changed in dev since 1.12.8), or more likely, from prior discussions with Thomas I suspect that rchk looks afterwards to the usage of the unprotected getAttrib. If there is no possible GC between the getAttrib and its last usage, or if the result is protected by dint of being passed as the value to setAttrib, then rchk is clever enough (I guess) to not raise the unprotected getAttrib. It does an extreme amount of tracking to spot unbalanced protection, for example, all statically (by looking at the source code without using runtime tests) and I've learnt not to underestimate how advanced it is.
  2. That's one case. But isn't it possible too that column names are also some kind of ALTREP object of 1,000,000 columns, where even column names are not materialized either as well as the column data. Even STRING_ELT can allocate if the vector it is passed is an ALTREP.

setAttrib(newdt, sym_index, shallow_duplicate(index));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shallow_duplicate was AFAIK designed to be used on VECSXP, LISTSXP, etc. You call it here on INTSXP. It would be good to figure out proper/better? way to shallow copy index. PS. I do exactly the same in #4386.
Hope @mattdowle will have good idea how to handle that. Maybe we need to change index structure, and wrap it into extra list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at https://github.com/wch/r-source/blob/trunk/src/main/duplicate.c that doesn't seem to be an issue. It starts out exactly like duplicate in that it makes a copy of the INTSXP and then a shallow copy of the attributes. And the flow looks the same for R 3.1 as far as I can tell.
Initially, I had done this more manually by allocating a zero length INTSXP and making a shallow copy of the attributes but this seemed more succinct.

Copy link
Member

@jangorecki jangorecki Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue that could help to address this overhead: #4467

Copy link
Member

@mattdowle mattdowle Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This INTSXP being copied here is an empty one though (the dummy object upon which the indexes are attached as attributes). So I'm not sure what's improper here in the PR as it stands now. @jangorecki Do you mean that it's possible that shallow_duplicate() might not work like this in future in R on this case of being passed INTSXP? If so, that's a good spot but on balance it feels right to use it here for this. And CI will spot if and when R-devel makes any change to shallow_duplicate(); e.g. via the new tests in this PR.

Copy link
Member

@jangorecki jangorecki Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt. Good spot. Sorry for confusion. Indeed copy of integer() (a placeholder for indices) is not a problem. The issue would be to copying of indices itself, which is not happening here. So the related items I was linking here are not really matching use case here.


SEXP sorted = PROTECT(getAttrib(dt, sym_sorted)); protecti++;
setAttrib(newdt, sym_sorted, duplicate(sorted));
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved

SEXP names = PROTECT(getAttrib(dt, R_NamesSymbol)); protecti++;
SEXP newnames = PROTECT(allocVector(STRSXP, n)); protecti++;
if (isNull(cols)) {
Expand Down