
Modification by reference to its extrem #4345

Closed

hope-data-science opened this issue Apr 3, 2020 · 2 comments

@hope-data-science
While I had doubts about modification by reference in the very first place, I think this design has its merits in saving both memory and time. I have come up with a design and would like to know if it could be implemented in data.table:

Dump and release memory whenever possible: knowing which columns are used, drop the unused ones, and never make a copy within a one-step pipeline. Currently we cannot subset rows in place, and aggregation also makes copies. I am wondering: if we make just one copy at the very start and then drop the data bit by bit whenever possible along the pipeline ([][] or %>%), will we get better performance in both time and space?

Do you have any thoughts on this design? Thanks.

@jangorecki
Member

jangorecki commented Apr 3, 2020

Example code is always helpful.
For the discussion below, it is useful to understand the internal structure of a data.frame.
Technically it is not possible to purely subset rows in place (subsetting rows by reference is the same as deleting rows by reference). Let me explain why.
Note that I will simplify a bit by stripping the SEXP type out of the C types (SEXP is just an extra layer on top of C types that carries some extra info like length, names, and many other internal specifics).
A data.frame is a list, and a list is a collection of C pointers. The collection has an order (the same as the order of columns in the data.frame). Those pointers refer to the memory addresses where the data for the individual columns are stored.
So if we have

DT = data.table(int = 1:2, b = c(1, 2))

then DT is a collection of 2 pointers: the first is an integer pointer, the second a double pointer.
If we access DT[["int"]], we are directly accessing the C integer array.
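
To make the pointer picture concrete, here is a small, hedged illustration; data.table exports an address() helper that reports where an object's pointer points (the actual addresses below are made up and will differ on every run):

library(data.table)

DT = data.table(int = 1:2, b = c(1, 2))

typeof(DT[["int"]])   # "integer" -- backed by a C int array
typeof(DT[["b"]])     # "double"  -- backed by a C double array

address(DT)           # e.g. "0x55d1a2b3c4d0" -- the collection of pointers
address(DT[["int"]])  # e.g. "0x55d1a2b3f000" -- the integer column's data
address(DT[["b"]])    # e.g. "0x55d1a2b40100" -- the double column's data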
If we want to delete a column, we just drop the pointer to DT[["int"]] from our collection of pointers.
If we want to add a column, we just add a pointer DT[["new"]] to the collection of pointers.
Those are the easy cases of in-place operations.
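
In data.table those pointer-only operations are what := does. A minimal demonstration (again, the addresses are illustrative, not real):

library(data.table)

DT = data.table(int = 1:2, b = c(1, 2))
address(DT[["b"]])  # note b's address before the change

DT[, int := NULL]   # delete a column: only its pointer is dropped
DT[, new := 3:4]    # add a column: a new pointer is appended

address(DT[["b"]])  # unchanged -- b's data was never touched or copied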
But to subset rows, we would have to touch the C integer array itself, not just its pointer. And AFAIK, C does not give any interface for modifying an object in place like that.

There are two approaches that I can imagine (maybe more, but I am not really that proficient in C).

  • Establish which rows are not part of the subset; call them z. For each C data pointer, move the data that must be kept into the locations occupied by the z rows: each gap (a z row) in the C data array is filled with the next non-z row, so the more gaps there are, the more cycles this takes. If we subset every second row, there is a lot of work to do; if we subset the last 1e6 rows, almost no work needs to be done. At the end we just mark the new true length of those data arrays. This approach (sketched after this list) is likely to be slower than making a copy, but memory efficient, because it does not need to allocate new memory into which the subset would be copied.
  • Find the z rows and create a mask over the data: add a marker to the data.table saying that the data are not to be taken as-is but must be filtered with the !z mask on each subsequent operation. The data stay exactly as they were before the subset, but the algorithms can spot that some rows have to be omitted when operating on them. This approach is much more complex, as it requires changes in tons of places.
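
Here is a minimal pure-R sketch of the first approach's logic. A real implementation would do this at the C level with no allocation; plain R cannot avoid copy-on-modify, so this only illustrates the gap-filling order of operations (compact() is a hypothetical helper, not a data.table function):

compact <- function(x, keep) {  # keep: logical mask, FALSE marks the z rows
  j <- 0L
  for (i in seq_along(x)) {
    if (keep[i]) {
      j <- j + 1L
      x[j] <- x[i]              # move the kept value into the next free slot
    }
  }
  length(x) <- j                # mark the new true length
  x
}

compact(c(10, 20, 30, 40, 50), c(TRUE, FALSE, TRUE, FALSE, TRUE))
# [1] 10 30 50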

This would not be much different for in-place grouping. The first row of each group would become a row of the result. If the groups were c(1,2,3,1,2,3,1,2,3), that would be easy; if they were c(1,1,1,1,2,2,2,3,3,3), we would need to move rows of one group over rows of another group. Again, these memory-conservative approaches are likely to be slower.
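
As a small, hedged illustration of why the layout of the groups matters (positions computed in plain R; the actual moves would happen at the C level):

g1 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
g2 <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)

which(!duplicated(g1))  # 1 2 3 -- first rows already at the front, easy
which(!duplicated(g2))  # 1 5 8 -- rows 5 and 8 must be moved over rows
                        #          that belong to other groups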

Note that the above is valid for column-oriented data storage (like R's data.frame); it is very different for row-oriented storage, like most RDBMSs. There you will find it very easy to add/remove rows by reference, but difficult to add/remove columns.

So now, having a better understanding of in-place operations, we can see that there is still some room, though not really a big one, for improvement in memory performance, but the extra computation time makes it less practical and thus lower priority. So I think the answer to your question is that you would get better memory performance, but worse time performance.

We try to balance this: today we provide in-place mechanisms where there is virtually no extra cost in computation time. There are specific cases, like one-hot encoding or the cartesian product, where it would definitely make sense to sacrifice time for better memory, but those are not (yet) in data.table.

@hope-data-science
Author

Thank you so much for the detailed, professional answer. It might take me some time to digest the details, but I think I get the main ideas. data.table deserves even greater popularity than it has so far, and I am grateful to be part of it.
