Modification by reference to its extreme #4345
Comments
Example code is always helpful. Take `DT = data.table(int = 1:2, b = c(1, 2))`: DT is then a collection of pointers, two of them, the first an integer pointer and the second a double pointer. There are two approaches I can imagine (maybe more, but I am not really that proficient in C).
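To make the pointer picture concrete, here is a minimal sketch using `data.table::address()`, which reports where a vector lives in memory (the actual addresses will differ on every run, so none are shown):

```r
library(data.table)

DT <- data.table(int = 1:2, b = c(1, 2))

# Each column is its own vector; DT holds one pointer per column.
address(DT$int)  # address of the integer column vector
address(DT$b)    # address of the double column vector

# := plonks a new column pointer into DT without copying DT itself
# or any of the other columns; address(DT$int) is unchanged after this.
DT[, c := int + b]
address(DT$int)
```

This is only an illustration of the "collection of pointers" mental model, not of data.table's internal C representation.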
This would not be that much different for in-place grouping: the first row of each group would become a row in the result. If the groups we have would be […]

Note that the above is valid for column-oriented data storage (like an R data.frame); it is very different for row-oriented storage, like most RDBMSs. There you will find it very easy to add/remove rows by reference, but difficult to add/remove columns.

So, with a better understanding of in-place operations, we can see that there is still some room for improvement in memory performance (not very big, really), but the extra cost in computation time makes it less practical, thus lower priority. I think the answer to your question is that you would get better memory performance but worse time performance. We try to balance the two: currently we provide in-place mechanisms where there is virtually no extra cost in computation time. There are specific cases, like one-hot encoding or a cartesian product, where it would definitely make sense to sacrifice time for better memory, but those are not (yet) in data.table.
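A small sketch of the column-oriented point above: whole-column operations are cheap by reference, while adding rows forces every column vector to be reallocated.

```r
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# Column-oriented storage: adding or removing a whole column by
# reference is cheap; the existing columns are not touched.
DT[, c := a * 2L]   # add a column by reference
DT[, c := NULL]     # remove it by reference

# Adding rows, by contrast, requires allocating a new, longer vector
# for every column (the row-oriented case is the mirror image).
DT <- rbindlist(list(DT, data.table(a = 6L, b = "f")))
```

This is illustrative only; it shows the asymmetry being described, not a recommendation for how to grow a table.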
Thank you so much for the detailed, professional answer. It might take some time for me to digest the details, but I think I get the main ideas. data.table deserves even greater popularity than it has so far, and I am grateful to be part of it.
While I had doubts about modification by reference at the very first, I think this design has its merits in saving both memory and time. I have come up with a design and would like to know whether it could be implemented in data.table:
Dump and release memory whenever possible: knowing which columns are used, drop the unused ones. Never make a copy in a one-step pipeline. Currently, we cannot subset rows in place, and aggregation also makes copies. I am considering: if we make just one copy at the very start, and then dump it bit by bit whenever possible in the pipeline (`[][]` or `%>%`), will we get better performance, in both time and space? Any thoughts on this design? Thanks.
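As a sketch of the pipeline in question (existing data.table usage, not the proposed eager-dump mechanism): each `[]` step below materialises an intermediate result that only the next step needs, which is the memory the suggestion would like to release early.

```r
library(data.table)

DT <- data.table(g = rep(c("x", "y"), each = 3), v = 1:6)

# Chained [][]: DT[v > 1] is an intermediate copy that becomes
# garbage as soon as the grouping step has consumed it.
res <- DT[v > 1][, .(total = sum(v)), by = g]

# In-place operations currently exist only for column work:
DT[, w := v * 2]   # added by reference, no copy of DT
rm(DT); gc()       # explicit release once DT is no longer needed
```

Today the intermediate is reclaimed only when R's garbage collector runs; the proposal amounts to dropping each piece as soon as the pipeline no longer needs it.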