Remove deep copy of indices from shallow() #4440

tlapak · 2020-05-10T20:44:43Z

This ~~still needs a news item~~ might need some benchmarking to confirm that the other occurrences of shallow() had the same impact as they did on joins. But they should. Those would be:

- most setops
- frankv
- dcast
- foverlaps
- duplicated.data.table

And indirectly of course everything that calls these. But the overhead is only really noticeable when there are many calls as in the original report. I don't know when I'll be able to finish this so I'm putting it up as is for now.

In the code there was a reference to example(merge) breaking if the sorted attribute wasn't copied. That seems to have been due to a pattern of calling setnames internally in the past. But that pattern hasn't been in use for about five years and nothing currently breaks if I don't make that copy. I make the copy anyway and have added a test for it, as I think it's expected to work even though shallow() isn't exported.
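The expectation described above can be sketched in R roughly as follows. This is only an illustration, not the test added in the PR: it relies on the unexported `data.table:::shallow`, and the new column name `b` is made up for the example.

```r
library(data.table)

# shallow() is not exported, so it is reached via ':::' (as in the benchmarks in this thread).
shallow <- data.table:::shallow

dt <- data.table(a = c(3L, 1L, 2L))
setindex(dt, a)

# Take a shallow copy and rename the column on the copy only.
cp <- shallow(dt)
setnames(cp, "a", "b")  # "b" is a hypothetical new name

# The original is expected to keep its own column name and index name.
names(dt)
indices(dt)
names(cp)
```

The point of the sketch is that attributes referring to column names belong to each table separately, so renaming on one side should leave the other side untouched.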

codecov · 2020-05-10T20:59:38Z

Codecov Report

Merging #4440 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4440   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          73       73           
  Lines       14112    14118    +6     
=======================================
+ Hits        14057    14063    +6     
  Misses         55       55

Impacted Files	Coverage Δ
src/assign.c	`99.84% <100.00%> (+<0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad7b67c...769f02c.

jangorecki · 2020-05-10T21:32:16Z

This is great, I was recently worrying about using too much of shallow and being affected by this issue.

jangorecki · 2020-05-10T21:38:50Z

src/assign.c

+  // We copy all attributes that refer to column names so that calling setnames on either
+  // the original or the shallow copy doesn't break anything.
+  SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++;
+  setAttrib(newdt, sym_index, shallow_duplicate(index));


shallow_duplicate was, AFAIK, designed to be used on VECSXP, LISTSXP, etc. You call it here on an INTSXP. It would be good to figure out a proper (or better) way to shallow-copy the index. PS: I do exactly the same in #4386.
Hopefully @mattdowle will have a good idea for how to handle that. Maybe we need to change the index structure and wrap it in an extra list?

Looking at https://github.com/wch/r-source/blob/trunk/src/main/duplicate.c that doesn't seem to be an issue. It starts out exactly like duplicate in that it makes a copy of the INTSXP and then a shallow copy of the attributes. And the flow looks the same for R 3.1 as far as I can tell.
Initially, I had done this more manually by allocating a zero length INTSXP and making a shallow copy of the attributes but this seemed more succinct.

I made an issue that could help to address this overhead: #4467

This INTSXP being copied here is an empty one though (the dummy object upon which the indexes are attached as attributes). So I'm not sure what's improper here in the PR as it stands now. @jangorecki Do you mean that it's possible that shallow_duplicate() might not work like this in future in R on this case of being passed INTSXP? If so, that's a good spot but on balance it feels right to use it here for this. And CI will spot if and when R-devel makes any change to shallow_duplicate(); e.g. via the new tests in this PR.

Thanks Matt. Good spot. Sorry for the confusion. Indeed, copying integer() (a placeholder for indices) is not a problem. The issue would be copying the indices themselves, which is not happening here. So the related items I linked don't really match the use case here.

jangorecki · 2020-05-10T21:45:22Z

We need tests that use the address function; those tests need to fail on current master and pass on this PR.
Don't worry about the codecov/project job, it is a false positive from codecov due to #4424.

tlapak · 2020-05-10T22:27:59Z

Yeah that makes sense. So comparing address(attr(attr(dt, 'index'), '__a')) to the shallow copy should do it I think.
Oh yeah, I wasn't too worried, only a bit mystified by nqrecreateindices.c showing up in the report. But I reckon this is due to me working off of the branch before the last merge.
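A minimal sketch of the address comparison suggested above, assuming a data.table build that includes this PR. The table and variable names are illustrative, and this is not the regression test that was actually committed.

```r
library(data.table)

shallow <- data.table:::shallow  # unexported helper, accessed via ':::'

set.seed(1L)
dt <- data.table(a = sample(10L))
setindex(dt, a)
cp <- shallow(dt)

# Address of the index vector stored as attribute '__a' on the 'index' attribute.
addr_orig <- address(attr(attr(dt, "index"), "__a"))
addr_copy <- address(attr(attr(cp, "index"), "__a"))

# With the deep copy removed, both tables are expected to share the same vector.
same_index <- identical(addr_orig, addr_copy)
print(same_index)
```

On a build that still deep-copies the index, the two addresses would be expected to differ, which is what makes a check like this usable as a before/after test.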

jangorecki · 2020-05-12T20:43:54Z

src/assign.c

+
+  // We copy all attributes that refer to column names so that calling setnames on either
+  // the original or the shallow copy doesn't break anything.
+  SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++;


AFAIK the PROTECT is not needed: dt would have to be garbage collected before we assign index as an attribute of newdt.

You are probably right but I wasn't quite sure so decided to follow the pattern of just a few lines later and take the cautious approach. Looking through the rest of the file, there are a few instances where the call to getAttrib is not PROTECTed and many where it is. I also don't think that dt could be garbage collected.

Anyway, I wouldn't mind taking it out, but I don't think it hurts, and removing it raises the question of what to do further down in the function or in the rest of the code.

Agreed. I spotted protects in the code in places where they were not needed. It is not really an issue, but it spreads the pattern of overprotecting, as you can see for yourself. I used to do it as well. Still, I might be wrong, but this will be verified when preparing the release for CRAN and running strict memory tests.

I've removed the PROTECT statements and also went ahead and removed the one on what is now line 174 for consistency. I considered just going through the entire file while I was at it, but that seemed out of scope for the PR.

I had the same thoughts and logic as you both. But rchk revealed that getAttrib can sometimes allocate inside it, perhaps more so now with ALTREP. Search for "A common source of true errors is a failure to protect the result of getAttrib when retrieving an attribute that may be automatically generated/converted (e.g. names, dimnames)" in https://github.com/kalibera/rchk/blob/master/doc/USAGE.md. Also search https://github.com/kalibera/rchk/blob/master/doc/INTERNALS.md for getAttrib.
I suspect in data.table's usage of the R API, we will never see getAttrib allocate, but R API is more general. So to pass rchk (as per steps in CRAN_Release.cmd, and as required by CRAN under additional checks) we have to protect getAttrib calls.
The reason some getAttrib calls in data.table are not protected may be that rchk knows some cases of getAttrib do not need protection, depending on what the 2nd argument is.
In the past I've gone through and protected any getAttrib calls that rchk spots until it passes.
Removing over-zealous protection, as long as rchk still passes, is worthwhile for speed and simpler code I agree. Perhaps rchk could be added to GLCI.

Very interesting, thanks. I don't mind reverting those changes at all. I just have two thoughts:

From reading the docs you linked, the seemingly inconsistent calls to PROTECT may be related to garbage collection only being triggered by allocating calls as not all instances of getAttrib(*, R_NamesSymbol) are currently protected.

As far as I can tell, the only instances where it does allocate are for rownames of a data.frame and with R_NamesSymbol when the first argument is a pairlist or a language object (but rchk can't check the first argument)

I'm not fully following there. But I did a grep and yes, I see that you're right: not all getAttrib(*, R_NamesSymbol) calls are protected, for example in rbindlist.c. It could be that those will be picked up when I rerun rchk before release (perhaps those lines were added or changed in dev since 1.12.8). More likely, from prior discussions with Thomas, I suspect that rchk looks at the subsequent usage of the unprotected getAttrib result. If there is no possible GC between the getAttrib and its last usage, or if the result is protected by dint of being passed as the value to setAttrib, then rchk is clever enough (I guess) not to flag the unprotected getAttrib. It does an extreme amount of tracking, to spot unbalanced protection for example, all statically (by looking at the source code without using runtime tests), and I've learnt not to underestimate how advanced it is.

That's one case. But isn't it also possible that the column names are some kind of ALTREP object, say for 1,000,000 columns, where neither the column names nor the column data are materialized? Even STRING_ELT can allocate if the vector it is passed is an ALTREP.

jangorecki · 2020-05-13T09:30:53Z

If you need to benchmark this PR on a bigger machine, please prepare best-case and worst-case scenarios where I can just use an N variable to scale up. @tlapak

tlapak · 2020-05-22T18:32:36Z

It's not so much about needing a bigger machine, it's more about finding the time to come up with meaningful benchmarks. Just for shallow() it's clear enough so I'll add a small example below. But other operations take so much more time that they drown out the index copying in most situations and the effect only really becomes apparent in loops, similar to the original report. The results aren't massive, but they're there.

library(data.table)
shallow <- data.table:::shallow
set.seed(1L)
n <- 1e8
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)

system.time({
    dt2 <- shallow(dt1)
})
# before
# user  system elapsed 
# 0.11    0.03    0.15 

# now
# user  system elapsed 
#    0       0       0

And for some functions:

library('data.table')

shallow <- data.table:::shallow
set.seed(1L)
n <- 5e6
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)

dt2 <- data.table(a=sample(1e2, 1e2))
setindex(dt2, a)

gc()
system.time({
  for (i in seq_len(1e2))
    fintersect(dt2, dt1)
})
# before
#   user  system elapsed 
#  72.35    9.45   54.41 

# now
#   user  system elapsed 
#  69.65    8.80   49.03 

system.time({
  for (i in seq_len(1e2))
    duplicated(dt1)
})
# before
#   user  system elapsed 
#  16.25    4.11   14.28 

# now
#   user  system elapsed 
#  17.13    4.53   14.76 

system.time({
  for (i in seq_len(1e2))
    frankv(dt1, 'a')
})
# before
#   user  system elapsed 
#  65.17    7.88   50.44 

# now
#   user  system elapsed 
#  64.72    6.45   48.38 

system.time({
  for (i in seq_len(1e2))
    all.equal(dt1, dt1)
})
# before
#   user  system elapsed 
#  19.09    7.73   28.00 

# now
#   user  system elapsed 
#  15.23    5.46   21.61

tlapak added 2 commits May 4, 2020 15:58

Only copy attributes that refer to column names

0f0e712

Adding regression tests

bc155fa

tlapak changed the title ~~Remove deep copy from shallow()~~ Remove deep copy of indices from shallow() May 10, 2020

jangorecki reviewed May 10, 2020

Compatibility with R 3.1.0 and additional unit test

be699cf

MichaelChirico reviewed May 11, 2020

Added news item

2d4dac1

tlapak marked this pull request as ready for review May 12, 2020 20:17

jangorecki reviewed May 12, 2020

jangorecki mentioned this pull request May 21, 2020

separate index attributes from index #4467

Open

tlapak added 4 commits May 22, 2020 20:35

Make sure to set S4 bit in shallow()

332f6b7

Remove superfluous PROTECTs from shallow()

1d3cea1

Add S4 test

1334aa7

Merge branch 'master' into tuning_shallow

801a822

jangorecki added this to the 1.12.9 milestone Jun 21, 2020

jangorecki mentioned this pull request Jun 21, 2020

Major performance drop of keyed := when index is present #4311

Closed

mattdowle and others added 4 commits June 25, 2020 14:57

Merge branch 'master' into tuning_shallow

40f9165

Revert removal of PROTECTs

959b676

Tweak news item

e87c5f0

Further tweak to news item

769f02c

mattdowle merged commit 9d3b920 into Rdatatable:master Jun 26, 2020

jangorecki mentioned this pull request Jun 26, 2020

join operation almost 2 times slower #3928

Open

DorisAmoakohene mentioned this pull request Oct 18, 2023

closed commit ids data.table issue#4440 DorisAmoakohene/data.table_test#7

Open

Anirban166 mentioned this pull request Mar 19, 2024

GitHub Action + atime test to observe the performance regression introduced by PR #4491 and fixed by PR #5463 Anirban166/data.table#2

Merged

Anirban166 added a commit to Anirban166/data.table that referenced this pull request Apr 6, 2024

Merge pull request #3 from Anirban166/after-4440-got-merged

bad5608

GitHub Action + atime test to observe the performance regression fixed by PR Rdatatable#4440

Anirban166 mentioned this pull request Apr 9, 2024

Using my GitHub Action to observe the performance regression fixed by PR #4440 Anirban166/data.table#8

Open

Anirban166 added a commit to Anirban166/data.table that referenced this pull request Apr 9, 2024

Merge pull request Rdatatable#7 from Anirban166/after-4440

a4c9c10

Using my GitHub Action to observe the performance regression fixed by PR Rdatatable#4440

Anirban166 added a commit that referenced this pull request Apr 12, 2024

Reverted changes to the 'Fixed' commit SHA for the first test case since the last commit of #4440 failed to check out

a3d5cf9
