Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deep copy of indices from shallow() #4440

Merged
merged 12 commits into from
Jun 26, 2020

Conversation

tlapak
Copy link
Contributor

@tlapak tlapak commented May 10, 2020

Close #4311.

This still needs a news item might need some benchmarking to confirm that the other occurrences of shallow() had the same impact as they did on joins. But they should. Those would be

  • most setops
  • frankv
  • dcast
  • foverlaps
  • duplicated.data.table

And indirectly of course everything that calls these. But the overhead is only really noticeable when there are many calls as in the original report. I don't know when I'll be able to finish this so I'm putting it up as is for now.

In the code there was a reference to example(merge) breaking if the sorted attribute wasn't copied. That seems to have been due to a pattern of calling setnames internally in the past. But that pattern hasn't been in use for about five years and nothing currently breaks if I don't make that copy. I do it anyway and have added a test for it as I think it's expected to work, even if shallow() isn't exported.

@codecov
Copy link

codecov bot commented May 10, 2020

Codecov Report

Merging #4440 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4440   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          73       73           
  Lines       14112    14118    +6     
=======================================
+ Hits        14057    14063    +6     
  Misses         55       55           
Impacted Files Coverage Δ
src/assign.c 99.84% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad7b67c...769f02c. Read the comment docs.

@tlapak tlapak changed the title Remove deep copy from shallow() Remove deep copy of indices from shallow() May 10, 2020
@jangorecki
Copy link
Member

This is great, I was recently worrying about using too much of shallow and being affected by this issue.

src/assign.c Outdated Show resolved Hide resolved
// We copy all attributes that refer to column names so that calling setnames on either
// the original or the shallow copy doesn't break anything.
SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++;
setAttrib(newdt, sym_index, shallow_duplicate(index));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shallow_duplicate was AFAIK designed to be used on VECSXP, LISTSXP, etc. You call it here on INTSXP. It would be good to figure out proper/better? way to shallow copy index. PS. I do exactly the same in #4386.
Hope @mattdowle will have good idea how to handle that. Maybe we need to change index structure, and wrap it into extra list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at https://github.com/wch/r-source/blob/trunk/src/main/duplicate.c that doesn't seem to be an issue. It starts out exactly like duplicate in that it makes a copy of the INTSXP and then a shallow copy of the attributes. And the flow looks the same for R 3.1 as far as I can tell.
Initially, I had done this more manually by allocating a zero length INTSXP and making a shallow copy of the attributes but this seemed more succinct.

Copy link
Member

@jangorecki jangorecki Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue that could help to address this overhead: #4467

Copy link
Member

@mattdowle mattdowle Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This INTSXP being copied here is an empty one though (the dummy object upon which the indexes are attached as attributes). So I'm not sure what's improper here in the PR as it stands now. @jangorecki Do you mean that it's possible that shallow_duplicate() might not work like this in future in R on this case of being passed INTSXP? If so, that's a good spot but on balance it feels right to use it here for this. And CI will spot if and when R-devel makes any change to shallow_duplicate(); e.g. via the new tests in this PR.

Copy link
Member

@jangorecki jangorecki Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt. Good spot. Sorry for confusion. Indeed copy of integer() (a placeholder for indices) is not a problem. The issue would be to copying of indices itself, which is not happening here. So the related items I was linking here are not really matching use case here.

@jangorecki
Copy link
Member

jangorecki commented May 10, 2020

We need tests that uses address function, those tests need to fail on current master, and pass on this PR.
Don't worry about codecov/project job, it is false positive from codecov due to #4424.

@tlapak
Copy link
Contributor Author

tlapak commented May 10, 2020

Yeah that makes sense. So comparing address(attr(attr(dt, 'index'), '__a')) to the shallow copy should do it I think.
Oh yeah, I wasn't too worried, only a bit mystified by nqrecreateindices.c showing up in the report. But I reckon this is due to me working off of the branch before the last merge.

@tlapak tlapak marked this pull request as ready for review May 12, 2020 20:17

// We copy all attributes that refer to column names so that calling setnames on either
// the original or the shallow copy doesn't break anything.
SEXP index = PROTECT(getAttrib(dt, sym_index)); protecti++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protect AFAIK is not needed. dt would need to be garbage collected before we will assign index as an attribute of newdt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right but I wasn't quite sure so decided to follow the pattern of just a few lines later and take the cautious approach. Looking through the rest of the file, there are a few instances where the call to getAttrib is not PROTECTed and many where it is. I also don't think that dt could be garbage collected.

Anyway, I wouldn't mind taking it out but I don't think it hurts and then it begs the question of what to do further down in the function or in the rest of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I spotted in the code there were protects sometimes where it was not needed. It is not really an issue, but it spreads the pattern of overprotecting, as you see on yourself. I used to do it as well. Still, I might be wrong, but this will be verified when preparing release for CRAN and running strict memory tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the PROTECT statements and also went ahead and removed the one on what is now line 174 for consistency. I considered just going through the entire file while I was at it, but that seemed out of scope for the PR.

Copy link
Member

@mattdowle mattdowle Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thoughts and logic as you both. But rchk revealed that getAttrib can sometimes allocate inside it, perhaps more so now with ALTREP. Search for "A common source of true errors is a failure to protect the result of getAttrib when retrieving an attribute that may be automatically generated/converted (e.g. names, dimnames)" in https://github.com/kalibera/rchk/blob/master/doc/USAGE.md. Also search https://github.com/kalibera/rchk/blob/master/doc/INTERNALS.md for getAttrib.
I suspect in data.table's usage of the R API, we will never see getAttrib allocate, but R API is more general. So to pass rchk (as per steps in CRAN_Release.cmd, and as required by CRAN under additional checks) we have to protect getAttrib calls.
The variance in why some getAttrib calls are not protected in data.table, may be that rchk knows some cases of getAttrib do not need to be protected depending on what the 2nd argument is.
In the past I've gone through and protected any getAttrib calls that rchk spots until it passes.
Removing over-zealous protection, as long as rchk still passes, is worthwhile for speed and simpler code I agree. Perhaps rchk could be added to GLCI.

Copy link
Contributor Author

@tlapak tlapak Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting, thanks. I don't mind reverting those changes at all. I just have two thoughts:

  1. From reading the docs you linked, the seemingly inconsistent calls to PROTECT may be related to garbage collection only being triggered by allocating calls as not all instances of getAttrib(*, R_NamesSymbol) are currently protected.
  2. As far as I can tell, the only instances where it does allocate are for rownames of a data.frame and with R_NamesSymbol when the first argument is a pairlist or a language object (but rchk can't check the first argument)

Copy link
Member

@mattdowle mattdowle Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not fully following there. But I did a grep and yes I see that you're right and not all getAttrib(*, R_NamesSymbol) are protected, for example in rbindlist.c. It could be that those will be picked up when I rerun rchk before release (perhaps those lines were added or changed in dev since 1.12.8), or more likely, from prior discussions with Thomas I suspect that rchk looks afterwards to the usage of the unprotected getAttrib. If there is no possible GC between the getAttrib and its last usage, or if the result is protected by dint of being passed as the value to setAttrib, then rchk is clever enough (I guess) to not raise the unprotected getAttrib. It does an extreme amount of tracking to spot unbalanced protection, for example, all statically (by looking at the source code without using runtime tests) and I've learnt not to underestimate how advanced it is.
  2. That's one case. But isn't it possible too that column names are also some kind of ALTREP object of 1,000,000 columns, where even column names are not materialized either as well as the column data. Even STRING_ELT can allocate if the vector it is passed is an ALTREP.

@jangorecki
Copy link
Member

jangorecki commented May 13, 2020

If you need to benchmark this PR on a bigger machine, please prepare best-case and worst-case cases where I can just use N variable to scale up. @tlapak

@tlapak
Copy link
Contributor Author

tlapak commented May 22, 2020

It's not so much about needing a bigger machine, it's more about finding the time to come up with meaningful benchmarks. Just for shallow() it's clear enough so I'll add a small example below. But other operations take so much more time that they drown out the index copying in most situations and the effect only really becomes apparent in loops, similar to the original report. The results aren't massive, but they're there.

shallow <- data.table:::shallow
set.seed(1L)
n <- 1e8
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)

system.time({
    dt2 <- shallow(dt1)
})
# before
# user  system elapsed 
# 0.11    0.03    0.15 

# now
# user  system elapsed 
#    0       0       0 

And for some functions:

library('data.table')

shallow <- data.table:::shallow
set.seed(1L)
n <- 5e6
dt1 <- data.table(a=sample(n, n))
setindex(dt1, a)

dt2 <- data.table(a=sample(1e2, 1e2))
setindex(dt2, a)

gc()
system.time({
  for (i in seq_len(1e2))
    fintersect(dt2, dt1)
})
# before
#   user  system elapsed 
#  72.35    9.45   54.41 

# now
#   user  system elapsed 
#  69.65    8.80   49.03 

system.time({
  for (i in seq_len(1e2))
    duplicated(dt1)
})
# before
#   user  system elapsed 
#  16.25    4.11   14.28 

# now
#   user  system elapsed 
#  17.13    4.53   14.76 

system.time({
  for (i in seq_len(1e2))
    frankv(dt1, 'a')
})
# before
#   user  system elapsed 
#  65.17    7.88   50.44 

# now
#   user  system elapsed 
#  64.72    6.45   48.38 

system.time({
  for (i in seq_len(1e2))
    all.equal(dt1, dt1)
})
# before
#   user  system elapsed 
#  19.09    7.73   28.00 

# now
#   user  system elapsed 
#  15.23    5.46   21.61 

Anirban166 added a commit to Anirban166/data.table that referenced this pull request Apr 6, 2024
GitHub Action + atime test to observe the performance regression fixed by PR Rdatatable#4440
Anirban166 added a commit to Anirban166/data.table that referenced this pull request Apr 9, 2024
Using my GitHub Action to observe the performance regression fixed by PR Rdatatable#4440
Anirban166 added a commit that referenced this pull request Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Major performance drop of keyed := when index is present
4 participants