-
Notifications
You must be signed in to change notification settings - Fork 986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tables(mb=type_size) faster lower bound MB by default #5524
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5524 +/- ##
==========================================
- Coverage 98.32% 98.32% -0.01%
==========================================
Files 80 80
Lines 14787 14798 +11
==========================================
+ Hits 14540 14550 +10
- Misses 247 248 +1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
…) call that 1538 in #5520 bumped up against; perhaps combined with object.size. Anyway, both now improved.
@@ -19,9 +19,9 @@ tables(mb=TRUE, order.col="NAME", width=80, | |||
\details{ | |||
Usually \code{tables()} is executed at the prompt, where \code{parent.frame()} returns \code{.GlobalEnv}. \code{tables()} may also be useful inside functions where \code{parent.frame()} is the local scope of the function; in such a scenario, simply set it to \code{.GlobalEnv} to get the same behaviour as at prompt. | |||
|
|||
Note that on older versions of \R, \code{object.size} may be slow, so setting \code{mb=FALSE} may speed up execution of \code{tables} significantly. | |||
`mb = utils::object.size` provides a higher and more accurate estimate of size, but may take longer. Its default `units="b"` is appropriate. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should document the differences (esp. that char/list columns are skipped, and column attributes are skipped (except factor levels).
it may also make sense to leave mb=object.size as the default and set mb=type_size in our suite. my thinking is that in practice tables may have a large part of memory in string columns; hiding that by default may be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if people do use tables()
they use it mainly to get a list of their data.table objects, number of rows and columns, column names, etc. Like a database has ways just to see the tables that exist. I think they rarely use the MB column. So the slowness of getting a better MB figure definitely gets in the way of the more common desire just to list the tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
under the mb
argument further up in .Rd
I did document the differences and esp. :
\item{mb}{ a function which accepts a \code{data.table} and returns its size in bytes. By default, \code{type_size} (same as \code{TRUE}) provides a fast lower bound by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). A column \code{"MB"} is included in the output unless \code{FALSE} or \code{NULL}. }
char/list columns are not completely skipped: the size of their pointers is included, but what those pointers point to is excluded. That's what I thought I conveyed quite well in mb argument there, no? The character string part applies to both character columns and the levels of factor columns.
It is indeed missing that column attributes are skipped though. That could be added to the .Rd
. It does say 'fast lower bound', so perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does when the user can just type data.table:::type_size
to see for themselves. That's a point: I didn't export type_size. Partly because I wasn't sure for the best name and this was an aside when I was focused on the memtest PR. It's really only intended to be used for DT and it is focused on just a set of columns, so maybe not exporting it at least for now is best after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they rarely use the MB column.
can't claim to speak for a typical user (or what is 'common' usage of tables) but personally that column is my main use case
e.g. when I'm doing exploratory stuff in a sandbox session and need to clear out some temp objects, tables() is a quickly one-liner and then I can run rm(x, tmp2, DT2) or w/e to axe the biggest memory hogs
That's what I thought I conveyed quite well in mb argument there, no?
yes, conveyed well on re-read.
perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does
yes this works for me, over-documenting would also be a mistake in this case -- fortunes(250): Use the Source, Luke!
not exporting it at least for now
also agree w keeping it private, especially for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes even if I'm right about what's most common, there will be users like you who always use MB. As a utility function used for its printed output (i.e. it's not code where we're avoiding options()
affecting behavior), the mb=
argument could be getOption("datatable.tables.mb", type_size)
then? Then you could set that option to object.size
. Also a better name than type_size
could be fast_lower_bound
? That way tables(mb=fast_lower_bound,...
reads in your face in the argument list and .Rd. Maybe the "MB"
column could be named differently depending on the mb=
function used; e.g. "MB_fast_lower_bound"
and "MB_object.size"
? Although, those are a bit long? A new argument could be mb_name = paste0("MB_", substitute(mb))
(or whatever is needed to get the function name there) perhaps with an option()
for that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just rename type_size
as lower_bound
to be a bit shorter so that mb=lower_bound
and "MB_lower_bound"
leaves no room for misleading anyone. It can be left to inference that the reason for the lower_bound
is to be fast. So we don't need the word fast
in there.
…ing commas to NCOL, NROW and MB columns, and more
tables()
:# object.size() is slow hence optional; TODO revisit
mb=TRUE
still works and uses the defaulttype_size
functionmb=
, such asobject.size