tables(mb=type_size) faster lower bound MB by default #5524

Merged 2 commits on Nov 12, 2022
2 changes: 2 additions & 0 deletions NEWS.md
@@ -294,6 +294,8 @@

41. New function `%notin%` provides a convenient alternative to `!(x %in% y)`, [#4152](https://github.com/Rdatatable/data.table/issues/4152). Thanks to Jan Gorecki for suggesting and Michael Czekanski for the PR. `%notin%` uses half the memory because it computes the result directly as opposed to `!` which allocates a new vector to hold the negated result. If `x` is long enough to occupy more than half the remaining free memory, this can make the difference between the operation working, or failing with an out-of-memory error.

+ 42. `tables()` is faster by default by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). `mb=` now accepts any function which accepts a `data.table` and returns a higher and better estimate of its size in bytes, albeit more slowly; e.g. `mb = utils::object.size`.

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
26 changes: 22 additions & 4 deletions R/tables.R
@@ -1,7 +1,24 @@
# globals to pass NOTE from R CMD check, see http://stackoverflow.com/questions/9439256
MB = NCOL = NROW = NULL

- tables = function(mb=TRUE, order.col="NAME", width=80,
+ type_size = function(DT) {
+   # for speed and ram efficiency, a lower bound by not descending into character string lengths or list items
+   # if a more accurate and higher estimate is needed then user can pass object.size or alternative to mb=
+   # in case number of columns is very large (e.g. 1e6 columns) then we use a for() to avoid allocation of sapply()
+   ans = 0L
+   lookup = c("raw"=1L, "integer"=4L, "double"=8L, "complex"=16L)
+   for (i in seq_along(DT)) {
+     col = DT[[i]]
+     tt = lookup[storage.mode(col)]
+     if (is.na(tt)) tt = .Machine$sizeof.pointer
+     tt = tt*nrow(DT)
+     if (is.factor(col)) tt = tt + length(levels(col))*.Machine$sizeof.pointer
+     ans = ans + tt
+   }
+   ans + ncol(DT)*.Machine$sizeof.pointer # column name pointers
+ }
+
+ tables = function(mb=type_size, order.col="NAME", width=80,
env=parent.frame(), silent=FALSE, index=FALSE)
{
# Prints name, size and colnames of all data.tables in the calling environment by default
@@ -13,14 +30,15 @@ tables = function(mb=TRUE, order.col="NAME", width=80,
if (!silent) catf("No objects of class data.table exist in %s\n", if (identical(env, .GlobalEnv)) ".GlobalEnv" else format(env))
return(invisible(data.table(NULL)))
}
+ if (isTRUE(mb)) mb=type_size # can still use TRUE, although TRUE will now be the lower faster type_size method
DT_names = all_obj[is_DT]
info = rbindlist(lapply(DT_names, function(dt_n){
DT = get(dt_n, envir=env) # doesn't copy
data.table( # data.table excludes any NULL items (MB and INDICES optional) unlike list()
NAME = dt_n,
NROW = nrow(DT),
NCOL = ncol(DT),
- MB = if (mb) round(as.numeric(object.size(DT))/1024^2), # object.size() is slow hence optional; TODO revisit
+ MB = if (is.function(mb)) round(as.numeric(mb(DT))/1024^2),
COLS = list(names(DT)),
KEY = list(key(DT)),
INDICES = if (index) list(indices(DT)))
@@ -38,9 +56,9 @@ tables = function(mb=TRUE, order.col="NAME", width=80,
tt = copy(info)
tt[ , NROW := pretty_format(NROW, width=4L)]
tt[ , NCOL := pretty_format(NCOL, width=4L)]
- if (mb) tt[ , MB := pretty_format(MB, width=2L)]
+ if (is.function(mb)) tt[ , MB := pretty_format(MB, width=2L)]
print(tt, class=FALSE, nrows=Inf)
- if (mb) catf("Total: %sMB\n", prettyNum(sum(info$MB), big.mark=","))
+ if (is.function(mb)) catf("Total: %sMB\n", prettyNum(sum(info$MB), big.mark=","))
}
invisible(info)
}
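
To make the arithmetic in `type_size` concrete, here is a minimal sketch that copies the same per-column logic into a standalone function and applies it to a base `data.frame`, so it runs without data.table. The name `type_size_sketch` is hypothetical, and the accumulator starts as a double rather than `0L`, which is a deliberate difference from the diff. The byte total in the comment assumes 8-byte pointers (a typical 64-bit build).

```r
# Standalone sketch of the lower-bound logic; not the package's own function.
type_size_sketch = function(DT) {
  ans = 0                          # double accumulator (the diff uses 0L)
  lookup = c("raw"=1L, "integer"=4L, "double"=8L, "complex"=16L)
  ptr = .Machine$sizeof.pointer    # 8 on a typical 64-bit build
  for (i in seq_along(DT)) {
    col = DT[[i]]
    tt = lookup[storage.mode(col)]
    if (is.na(tt)) tt = ptr        # character, list, logical, ...: one pointer-sized slot per row
    tt = as.numeric(tt) * nrow(DT)
    if (is.factor(col)) tt = tt + length(levels(col)) * ptr
    ans = ans + tt
  }
  unname(ans + ncol(DT) * ptr)     # plus one pointer per column name
}

df = data.frame(
  id  = 1:1000,                          # integer: 4 bytes per row
  val = as.numeric(1:1000),              # double: 8 bytes per row
  grp = factor(rep(c("a", "b"), 500)),   # integer codes plus 2 level pointers
  txt = rep(c("alpha", "beta"), 500),    # character: pointers only, strings not counted
  stringsAsFactors = FALSE
)

lower = type_size_sketch(df)                  # 24048 when pointers are 8 bytes
full  = as.numeric(utils::object.size(df))    # also counts headers, attributes and strings
round(c(lower = lower, full = full) / 1024^2, 3)   # both in MB, as tables() reports them
```

Both figures are in bytes before the final division; `tables()` performs the same division by `1024^2` on whatever the `mb=` function returns.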
8 changes: 4 additions & 4 deletions man/tables.Rd
@@ -5,11 +5,11 @@
Convenience function for concisely summarizing some metadata of all \code{data.table}s in memory (or an optionally specified environment).
}
\usage{
- tables(mb=TRUE, order.col="NAME", width=80,
+ tables(mb=type_size, order.col="NAME", width=80,
env=parent.frame(), silent=FALSE, index=FALSE)
}
\arguments{
- \item{mb}{ \code{logical}; \code{TRUE} adds the rough size of each \code{data.table} in megabytes to the output under column \code{MB}. }
+ \item{mb}{ a function which accepts a \code{data.table} and returns its size in bytes. By default, \code{type_size} (same as \code{TRUE}) provides a fast lower bound by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). A column \code{"MB"} is included in the output unless \code{FALSE} or \code{NULL}. }
\item{order.col}{ Column name (\code{character}) by which to sort the output. }
\item{width}{ \code{integer}; number of characters beyond which the output for each of the columns \code{COLS}, \code{KEY}, and \code{INDICES} are truncated. }
\item{env}{ An \code{environment}, typically the \code{.GlobalEnv} by default, see Details. }
@@ -19,9 +19,9 @@ tables(mb=TRUE, order.col="NAME", width=80,
\details{
Usually \code{tables()} is executed at the prompt, where \code{parent.frame()} returns \code{.GlobalEnv}. \code{tables()} may also be useful inside functions where \code{parent.frame()} is the local scope of the function; in such a scenario, simply set it to \code{.GlobalEnv} to get the same behaviour as at prompt.

- Note that on older versions of \R, \code{object.size} may be slow, so setting \code{mb=FALSE} may speed up execution of \code{tables} significantly.
+ `mb = utils::object.size` provides a higher and more accurate estimate of size, but may take longer. Its default `units="b"` is appropriate.

Member

we should document the differences (esp. that char/list columns are skipped, and that column attributes are skipped, except factor levels).

it may also make sense to leave mb=object.size as the default and set mb=type_size in our suite. my thinking is that in practice tables may have a large part of memory in string columns; hiding that by default may be confusing.

Member Author

I think if people do use tables() they use it mainly to get a list of their data.table objects, number of rows and columns, column names, etc. Like a database has ways just to see the tables that exist. I think they rarely use the MB column. So the slowness of getting a better MB figure definitely gets in the way of the more common desire just to list the tables.

Member Author
@mattdowle, Nov 13, 2022

under the mb argument further up in .Rd I did document the differences and esp. :

\item{mb}{ a function which accepts a \code{data.table} and returns its size in bytes. By default, \code{type_size} (same as \code{TRUE}) provides a fast lower bound by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). A column \code{"MB"} is included in the output unless \code{FALSE} or \code{NULL}. }

char/list columns are not completely skipped: the size of their pointers is included, but what those pointers point to is excluded. That's what I thought I conveyed quite well in mb argument there, no? The character string part applies to both character columns and the levels of factor columns.

It is indeed missing that column attributes are skipped though. That could be added to the .Rd. It does say 'fast lower bound', so perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does when the user can just type data.table:::type_size to see for themselves. That's a point: I didn't export type_size. Partly because I wasn't sure of the best name and this was an aside when I was focused on the memtest PR. It's really only intended to be used for DT and it is focused on just a set of columns, so maybe not exporting it, at least for now, is best after all.
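
The distinction drawn in this comment (pointer sizes are counted, what they point to is not) can be seen in base R without data.table. The following is a rough sketch rather than code from the PR; the variable names are made up for illustration, and it only compares a pointer-only count against `utils::object.size` on plain character vectors.

```r
ptr = .Machine$sizeof.pointer

# 10,000 distinct strings: each element is a pointer to its own string.
ids = sprintf("id_%06d", 1:10000)

ptr_only     = length(ids) * ptr                    # what a pointer-only count sees
with_strings = as.numeric(utils::object.size(ids))  # also counts the strings pointed to

# Same number of elements, different string lengths: the pointer-only
# count cannot tell these apart, while object.size can.
short = rep("a", 10000)
long  = rep(strrep("x", 1000), 10000)
c(pointer_only = length(short) * ptr, object_size = as.numeric(utils::object.size(short)))
c(pointer_only = length(long)  * ptr, object_size = as.numeric(utils::object.size(long)))
```

This is the trade-off under discussion: the pointer-only figure is cheap to compute because it never visits the strings, but for string-heavy tables it can sit well below the `object.size` figure.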

Member
@MichaelChirico, Nov 13, 2022

> I think they rarely use the MB column.

can't claim to speak for a typical user (or what is 'common' usage of tables) but personally that column is my main use case

e.g. when I'm doing exploratory stuff in a sandbox session and need to clear out some temp objects, tables() is a quick one-liner and then I can run rm(x, tmp2, DT2) or w/e to axe the biggest memory hogs

> That's what I thought I conveyed quite well in mb argument there, no?

yes, conveyed well on re-read.

> perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does

yes this works for me, over-documenting would also be a mistake in this case -- fortunes(250): Use the Source, Luke!

> not exporting it at least for now

also agree w keeping it private, especially for now

Member Author
@mattdowle, Nov 13, 2022

yes even if I'm right about what's most common, there will be users like you who always use MB. As a utility function used for its printed output (i.e. it's not code where we're avoiding options() affecting behavior), the mb= argument could be getOption("datatable.tables.mb", type_size) then? Then you could set that option to object.size. Also a better name than type_size could be fast_lower_bound? That way tables(mb=fast_lower_bound,... reads in your face in the argument list and .Rd. Maybe the "MB" column could be named differently depending on the mb= function used; e.g. "MB_fast_lower_bound" and "MB_object.size" ? Although, those are a bit long? A new argument could be mb_name = paste0("MB_", substitute(mb)) (or whatever is needed to get the function name there) perhaps with an option() for that too.
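
As a rough illustration of the option-driven default and the function-derived column name floated in this comment, here is a hypothetical sketch. Nothing in it is part of the PR: `size_report` and `lower_bound` are made-up stand-ins, and `datatable.tables.mb` is only the option name proposed in the comment. It also shows one wrinkle with `substitute(mb)`: when the argument is left at its default, the captured expression is the `getOption(...)` call rather than a function name, so the sketch falls back to a fixed label in that case.

```r
# Toy stand-in for a fast size function: 8 bytes per element of a double vector.
lower_bound = function(x) length(x) * 8

# Hypothetical helper: the default size function comes from an option,
# and the result is labelled after the function the caller supplied.
size_report = function(x, mb = getOption("datatable.tables.mb", lower_bound)) {
  label = if (missing(mb)) "default" else deparse(substitute(mb))[1L]
  stats::setNames(list(as.numeric(mb(x)) / 1024^2), paste0("MB_", label))
}

x = as.numeric(1:1000)

names(size_report(x))                            # "MB_default"
names(size_report(x, mb = utils::object.size))   # "MB_utils::object.size"

# Switching the session-wide default through the option, then restoring it.
old = options(datatable.tables.mb = utils::object.size)
via_option = size_report(x)
options(old)
```

With this shape, a user who always wants the slower, fuller figure would set the option once per session instead of passing `mb=` on every call.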

Member Author
@mattdowle, Nov 13, 2022

Maybe just rename type_size as lower_bound to be a bit shorter so that mb=lower_bound and "MB_lower_bound" leaves no room for misleading anyone. It can be left to inference that the reason for the lower_bound is to be fast. So we don't need the word fast in there.

- Setting \code{silent=TRUE} prints nothing; the metadata are returned as a \code{data.table}, invisibly, whether silent is \code{TRUE} or \code{FALSE}.
+ Setting \code{silent=TRUE} prints nothing; the metadata is returned as a \code{data.table} invisibly whether \code{silent} is \code{TRUE} or \code{FALSE}.
}
\value{
A \code{data.table} containing the information printed.