Move some setDT validation checks to C #5427
Conversation
There is already a PR related to making setDT more lightweight, #4477. AFAIU it won't address the problem described here.
Some of the functionality discussed in the comments here is already implemented in src/utils.c in https://github.com/Rdatatable/data.table/pull/4370/files
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master    #5427      +/-   ##
==========================================
- Coverage   97.51%   97.49%   -0.03%
==========================================
  Files          80       80
  Lines       14979    14880      -99
==========================================
- Hits        14607    14507     -100
- Misses        372      373       +1
```
```r
if (is.data.table(x)) {
  # fix for #1078 and #1128, see .resetclass() for explanation.
  setattr(x, 'class', .resetclass(x, 'data.table'))
  if (!missing(key)) setkeyv(x, key) # fix for #1169
  if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
  if (selfrefok(x) > 0L) return(invisible(x)) else setalloccol(x)
} else if (is.data.frame(x)) {
  # check no matrix-like columns, #3760. Allowing a single list(matrix) is unambiguous and depended on by some revdeps, #3581
  # for performance, only warn on the first such column, #5426
  for (jj in seq_along(x)) {
```
It might be preferable to combine the logic for the `for` loops instead of just emitting the warning, but I didn't see an easy way to do so -- the `setdt_nrows` parallel also does the other checks in the same loop. Unless we think looping over columns twice is fine to exchange for the clarity of sharing this logic. But this loop is pretty straightforward.
I'd say the loop overhead is likely to be negligible compared to even this simple logic, so I don't think it's worth much instinctively.
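To make that concrete, here is a rough sketch (not data.table's actual code) comparing one fused pass over a wide list of columns with two separate passes doing the same trivial per-column work:

```r
# Rough illustration: the overhead of iterating a wide list twice versus
# once is dwarfed by even trivial per-column work.
L <- replicate(1e5, 1, simplify = FALSE)
system.time({  # one fused pass doing both checks
  for (jj in seq_along(L)) { is.matrix(L[[jj]]); length(L[[jj]]) }
})
system.time({  # two simple passes, one check each
  for (jj in seq_along(L)) is.matrix(L[[jj]])
  for (jj in seq_along(L)) length(L[[jj]])
})
```

On a typical machine both variants should take comparable time, supporting the point that a separate loop for the matrix-column warning costs little.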
This is a good performance improvement.
```c
  }
  len_xi = INTEGER(dim_xi)[0];
} else {
  len_xi = LENGTH(xi);
```
With #5981 in mind, I do have some wariness about potentially introducing issues by skipping dispatch.
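To illustrate the dispatch concern with a toy R sketch (the class `mylen` is hypothetical, not from the codebase): `length()` at the R level is an internal generic and can dispatch to a user method, whereas the C-level `LENGTH()` macro reads the vector header directly, so C code can observe a different length than R code for such objects.

```r
# Hypothetical class showing how skipping dispatch can change behavior.
length.mylen <- function(x) 999L                 # user-defined length method
x <- structure(list(1, 2, 3), class = "mylen")
length(x)           # dispatches to length.mylen: 999
length(unclass(x))  # what non-dispatching length logic sees: 3
```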
Some comments but nothing that would require changes.
Here is a performance analysis from #6094. From the left-most panel, we see that the proposed changes would reduce the time complexity to what appears to be constant (grey "Fast" curve, independent of the number of columns N). Is that expected? I thought it should still be linear in the number of columns, even after moving to C code, because there is still a for loop over the columns, right? Maybe we are seeing a constant trend because the C code is so fast that we cannot see the asymptotic trend?
(Aside: @Anirban166, I think it would be much better to give a clearer title to the graphs. "Regression fixed in #5463" is not very useful, as I have to remember which issue is which, there's no linking in the chart title, etc.; better to summarize exactly what's being benchmarked, e.g. "memrecycle performance".)
Something seems off; I tried locally and don't see that, e.g.

```r
l=replicate(1e5, 1, simplify=FALSE)
system.time(setDT(l))
#  user  system elapsed
# 0.054   0.000   0.054
l=replicate(1e6, 1, simplify=FALSE)
system.time(setDT(l))
#  user  system elapsed
# 0.544   0.012   0.556
l=replicate(1e7, 1, simplify=FALSE)
system.time(setDT(l))
#  user  system elapsed
# 6.898   0.084   6.982
```

This still represents a big improvement over
The first one takes time, and the subsequent runs are a no-op:

```r
> l=replicate(1e5, 1, simplify=FALSE)
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed
   0.05    0.02    0.06
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed
      0       0       0
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed
      0       0       0
```
Oh, right, so probably the benchmarked operation should be
There is some inconsistency between system.time and bench::mark, because bench::mark does a memory measurement first (which makes L a data.table, so subsequent timings are a no-op). bench says it is really fast (microseconds), and that is what we are using in atime, whereas system.time is slow/correct, I think.

```r
> l=replicate(1e6, 1, simplify=FALSE)
> bench::mark(
+   Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+   iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         79.9µs 79.9µs    12516.    38.2MB        0     1     0     79.9µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> bench::mark(
+   Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+   iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         83.6µs 83.6µs    11962.        0B        0     1     0     83.6µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> l=replicate(1e6, 1, simplify=FALSE)
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed
   0.66    0.00    0.66
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed
      0       0       0
```
Here is a solution:

```r
(expr.list <- atime::atime_versions_exprs(
  pkg.path = "~/R/data.table",
  pkg.edit.fun = pkg.edit.fun,
  expr = {
    ##data.table:::setDT(data.table::copy(L))
    data.table:::setattr(L,"class",NULL)
    data.table:::setDT(L)
  },
  sha.vec=sha.vec))
library(data.table)
atime.result <- atime::atime(
  expr.list=expr.list,
  N=10^seq(1,6,by=0.5),
  setup = {
    L <- replicate(N, 1, simplify = FALSE)
    setDT(L)
  })
plot(atime.result)
```
lgtm
performance increase is great
thanks! let's merge #6094 first to get some "official" atime reporting here as a "production" proof
Yup, we are working towards that!
Agreed here, and how about "memrecycle regression fixed in #5463"? (Toby and I discussed this a while ago: we think it's useful to keep the issue number for quick reference, and that using 'performance' every time would be redundant.)
Moved tests to the more appropriate directory and added a test case based on the performance improvement to be brought by #5427
Generated via commit 1e758e6
Download link for the artifact containing the test results: ↓ atime-results.zip
Time taken to finish the standard R installation steps: 12 minutes and 41 seconds
Time taken to run
Hi @MichaelChirico, the performance test looks good for this PR, and I added a NEWS item, so please feel free to merge.
love it, thanks @Anirban166 and @tdhock for the work on CB!!
Closes #5426

It looks like we are mostly bottlenecked by base R here -- setting the names to `V1:Vncol` requires building a huge character vector, which appears to be choking the global string cache (our best friend :) ).

To see evidence of this, here are timings for (1) a plain `setDT()` as cited in the original issue; (2) a re-run of `setDT()`, where the names are already in the string cache; and (3) a run where the input is already named (so the branch running `paste0(...)` is skipped):

This PR represents a definite improvement (especially for repeated runs, which TBH probably aren't all that practically relevant), but still hits that base bottleneck:
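The string-cache cost described above can be seen in isolation; this sketch times just the default-names step for a wide input (the specific `n` is illustrative):

```r
# Building the default V1..Vn names alone is expensive for very wide
# inputs, since each newly created string is interned in R's global
# string cache.
n <- 1e6
system.time(nm <- paste0("V", seq_len(n)))
head(nm)  # "V1" "V2" "V3" "V4" "V5" "V6"
```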