Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tables(mb=type_size) faster lower bound MB by default #5524

Merged
merged 2 commits into from
Nov 12, 2022

Conversation

mattdowle
Copy link
Member

  • Long standing comment in tables(): # object.size() is slow hence optional; TODO revisit
  • mb=TRUE still works and uses the default type_size function
  • alternative functions can be passed to mb=, such as object.size
  • motivated by test ID 1538 in Move ram tests #5520 (comment) but needed doing anyway

@codecov
Copy link

codecov bot commented Nov 12, 2022

Codecov Report

Merging #5524 (0c301cf) into master (c344cee) will decrease coverage by 0.00%.
The diff coverage is 92.85%.

❗ Current head 0c301cf differs from pull request most recent head 3b124bd. Consider uploading reports for the commit 3b124bd to get more accurate results

@@            Coverage Diff             @@
##           master    #5524      +/-   ##
==========================================
- Coverage   98.32%   98.32%   -0.01%     
==========================================
  Files          80       80              
  Lines       14787    14798      +11     
==========================================
+ Hits        14540    14550      +10     
- Misses        247      248       +1     
Impacted Files Coverage Δ
R/tables.R 97.61% <92.85%> (-2.39%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mattdowle mattdowle merged commit 1148ab9 into master Nov 12, 2022
@mattdowle mattdowle deleted the tables_object_size branch November 12, 2022 03:08
mattdowle added a commit that referenced this pull request Nov 12, 2022
…) call that 1538 in #5520 bumped up against; perhaps combined with object.size. Anyway, both now improved.
mattdowle added a commit that referenced this pull request Nov 12, 2022
@@ -19,9 +19,9 @@ tables(mb=TRUE, order.col="NAME", width=80,
\details{
Usually \code{tables()} is executed at the prompt, where \code{parent.frame()} returns \code{.GlobalEnv}. \code{tables()} may also be useful inside functions where \code{parent.frame()} is the local scope of the function; in such a scenario, simply set it to \code{.GlobalEnv} to get the same behaviour as at prompt.

Note that on older versions of \R, \code{object.size} may be slow, so setting \code{mb=FALSE} may speed up execution of \code{tables} significantly.
`mb = utils::object.size` provides a higher and more accurate estimate of size, but may take longer. Its default `units="b"` is appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should document the differences (esp. that char/list columns are skipped, and column attributes are skipped (except factor levels).

it may also make sense to leave mb=object.size as the default and set mb=type_size in our suite. my thinking is that in practice tables may have a large part of memory in string columns; hiding that by default may be confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if people do use tables() they use it mainly to get a list of their data.table objects, number of rows and columns, column names, etc. Like a database has ways just to see the tables that exist. I think they rarely use the MB column. So the slowness of getting a better MB figure definitely gets in the way of the more common desire just to list the tables.

Copy link
Member Author

@mattdowle mattdowle Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under the mb argument further up in .Rd I did document the differences and esp. :

\item{mb}{ a function which accepts a \code{data.table} and returns its size in bytes. By default, \code{type_size} (same as \code{TRUE}) provides a fast lower bound by excluding the size of character strings in R's global cache (which may be shared) and excluding the size of list column items (which also may be shared). A column \code{"MB"} is included in the output unless \code{FALSE} or \code{NULL}. }

char/list columns are not completely skipped: the size of their pointers is included, but what those pointers point to is excluded. That's what I thought I conveyed quite well in mb argument there, no? The character string part applies to both character columns and the levels of factor columns.

It is indeed missing that column attributes are skipped though. That could be added to the .Rd. It does say 'fast lower bound', so perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does when the user can just type data.table:::type_size to see for themselves. That's a point: I didn't export type_size. Partly because I wasn't sure for the best name and this was an aside when I was focused on the memtest PR. It's really only intended to be used for DT and it is focused on just a set of columns, so maybe not exporting it at least for now is best after all.

Copy link
Member

@MichaelChirico MichaelChirico Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they rarely use the MB column.

can't claim to speak for a typical user (or what is 'common' usage of tables) but personally that column is my main use case

e.g. when I'm doing exploratory stuff in a sandbox session and need to clear out some temp objects, tables() is a quickly one-liner and then I can run rm(x, tmp2, DT2) or w/e to axe the biggest memory hogs

That's what I thought I conveyed quite well in mb argument there, no?

yes, conveyed well on re-read.

perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does

yes this works for me, over-documenting would also be a mistake in this case -- fortunes(250): Use the Source, Luke!

not exporting it at least for now

also agree w keeping it private, especially for now

Copy link
Member Author

@mattdowle mattdowle Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes even if I'm right about what's most common, there will be users like you who always use MB. As a utility function used for its printed output (i.e. it's not code where we're avoiding options() affecting behavior), the mb= argument could be getOption("datatable.tables.mb", type_size) then? Then you could set that option to object.size. Also a better name than type_size could be fast_lower_bound? That way tables(mb=fast_lower_bound,... reads in your face in the argument list and .Rd. Maybe the "MB" column could be named differently depending on the mb= function used; e.g. "MB_fast_lower_bound" and "MB_object.size" ? Although, those are a bit long? A new argument could be mb_name = paste0("MB_", substitute(mb)) (or whatever is needed to get the function name there) perhaps with an option() for that too.

Copy link
Member Author

@mattdowle mattdowle Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just rename type_size as lower_bound to be a bit shorter so that mb=lower_bound and "MB_lower_bound" leaves no room for misleading anyone. It can be left to inference that the reason for the lower_bound is to be fast. So we don't need the word fast in there.

@mattdowle mattdowle mentioned this pull request Nov 13, 2022
mattdowle added a commit that referenced this pull request Nov 14, 2022
…ing commas to NCOL, NROW and MB columns, and more
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants