Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tables(mb=type_size) faster lower bound MB by default #5524
tables(mb=type_size) faster lower bound MB by default #5524
Changes from all commits
2101eae
3b124bd
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should document the differences (esp. that char/list columns are skipped, and column attributes are skipped (except factor levels).
it may also make sense to leave mb=object.size as the default and set mb=type_size in our suite. my thinking is that in practice tables may have a large part of memory in string columns; hiding that by default may be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if people do use
tables()
they use it mainly to get a list of their data.table objects, number of rows and columns, column names, etc. Like a database has ways just to see the tables that exist. I think they rarely use the MB column. So the slowness of getting a better MB figure definitely gets in the way of the more common desire just to list the tables.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
under the
mb
argument further up in.Rd
I did document the differences and esp. :char/list columns are not completely skipped: the size of their pointers is included, but what those pointers point to is excluded. That's what I thought I conveyed quite well in mb argument there, no? The character string part applies to both character columns and the levels of factor columns.
It is indeed missing that column attributes are skipped though. That could be added to the
.Rd
. It does say 'fast lower bound', so perhaps words like "for example, by excluding ..." could be added, rather than trying to document exactly what the function does when the user can just typedata.table:::type_size
to see for themselves. That's a point: I didn't export type_size. Partly because I wasn't sure for the best name and this was an aside when I was focused on the memtest PR. It's really only intended to be used for DT and it is focused on just a set of columns, so maybe not exporting it at least for now is best after all.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't claim to speak for a typical user (or what is 'common' usage of tables) but personally that column is my main use case
e.g. when I'm doing exploratory stuff in a sandbox session and need to clear out some temp objects, tables() is a quickly one-liner and then I can run rm(x, tmp2, DT2) or w/e to axe the biggest memory hogs
yes, conveyed well on re-read.
yes this works for me, over-documenting would also be a mistake in this case -- fortunes(250): Use the Source, Luke!
also agree w keeping it private, especially for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes even if I'm right about what's most common, there will be users like you who always use MB. As a utility function used for its printed output (i.e. it's not code where we're avoiding
options()
affecting behavior), themb=
argument could begetOption("datatable.tables.mb", type_size)
then? Then you could set that option toobject.size
. Also a better name thantype_size
could befast_lower_bound
? That waytables(mb=fast_lower_bound,...
reads in your face in the argument list and .Rd. Maybe the"MB"
column could be named differently depending on themb=
function used; e.g."MB_fast_lower_bound"
and"MB_object.size"
? Although, those are a bit long? A new argument could bemb_name = paste0("MB_", substitute(mb))
(or whatever is needed to get the function name there) perhaps with anoption()
for that too.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just rename
type_size
aslower_bound
to be a bit shorter so thatmb=lower_bound
and"MB_lower_bound"
leaves no room for misleading anyone. It can be left to inference that the reason for thelower_bound
is to be fast. So we don't need the wordfast
in there.