Delete rows by reference #635
Just deleting by reference is not that hard. The benefit would mainly be memory efficiency rather than speed.
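As background (an illustration added here, not taken from the thread): the only way to drop rows today is a copy-based subset, which briefly holds both the old and the new table in memory.

```r
library(data.table)

DT = data.table(id = 1:1e7, val = rnorm(1e7))   # roughly 120 MB
keep = DT$id %% 10L != 0L                        # drop every 10th row
DT = DT[keep]   # allocates a second, almost-as-large table before the old one can be freed
# A true delete-by-reference would shrink DT in place, never needing that second allocation.
```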
How about adding both …
More advanced example: …
Is …
Re right name: doesn't …
I think a syntax for selecting rows to keep (which just deletes their complement) would be convenient.
I don't know that there's a sensible way to extend this logic to work inside …
Just as new columns cannot be created by …
If anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function that copies the kept rows of each column into a new table and then deletes each source column by reference, based on a SO answer by vc273.

```r
library(data.table)
library(magrittr)  # for %>%

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

# Example dt
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, col := 1:1e6, with = F] }
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of rows

delete <- function(DT, keep.idxs){
  cols <- copy(names(DT))
  DT_subset <- DT[[1]][keep.idxs] %>% as.data.table
  setnames(DT_subset, ".", cols[1])
  for (col in cols){
    DT_subset[, (col) := DT[[col]][keep.idxs]]
    set(DT, NULL, col, NULL)
  }
  return(DT_subset)
}

str(delete(DT, keep.idxs))
str(DT)
```
@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

```r
> dim(d1)
[1] 0 0
```
To complement @andrewrech's answer, here is the code wrapped up as a function, together with an example of its usage:
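(The original code block is not preserved in this copy of the thread. The sketch below is only a reconstruction of such a helper in the spirit of @andrewrech's column-by-column approach; the name `delete_rows` and its arguments are assumptions, not the original code.)

```r
library(data.table)

# Sketch: build the result column by column from the kept rows, deleting each
# source column by reference as soon as it has been copied, so at most one
# extra column lives in memory at a time.
delete_rows <- function(dat, del.idxs) {
  keep.idxs <- setdiff(seq_len(nrow(dat)), del.idxs)
  cols <- copy(names(dat))
  res <- data.table(dat[[1L]][keep.idxs])
  setnames(res, cols[1L])
  set(dat, NULL, cols[1L], NULL)
  for (col in cols[-1L]) {
    set(res, NULL, col, dat[[col]][keep.idxs])
    set(dat, NULL, col, NULL)
  }
  res
}

dat <- data.table(a = 1:10, b = letters[1:10])
dat <- delete_rows(dat, c(2L, 5L))   # drop rows 2 and 5
```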
Here "dat" is a data.table. Removing 14k rows from a 1.4M-row table takes about 0.25 s on my laptop.
This is my very first GitHub post, btw.
What kind of work needs to be done in order to add this functionality to data.table? I would be glad to help, but I'm not totally sure where to start! The delete function could be added using @jarno-p's answer and later modified to be more efficient and to work with …
I think the open question is the best API.
The functional approach of @jarno-p would be a change from this, where row deletion would become functional and require …
Although I am far from qualified to comment, shouldn't the syntax, from the user's perspective, be more like:
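(The snippet that originally followed is missing here. The block below is only a guess at the shape of the proposal; .SR does not exist in data.table, so the proposed line is shown as a comment next to the copy-based idiom that works today.)

```r
library(data.table)
DT = data.table(colA = 1:6, colB = letters[1:6])

# Proposed (hypothetical): delete, by reference, the rows selected by i
# DT[colA > 3, .SR := NULL]

# Current equivalent, which allocates a new table:
DT <- DT[!(colA > 3)]
```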
Here "i" is a DT expression that selects rows. .SR would be similar to .SD, except that it is always defined within DT and it holds references to all the rows selected by i. Such an approach may, however, add overhead to expressions that do not intend to delete rows. An alternative is to change the behavior of .SD so that it is also defined when no by expression is used; without "by", .SD would then refer to the whole rows (currently .SD excludes grouping columns).
An approach to bypass …
You would need to find out the environment of …
There is an interesting question on SO: "Subsetting a large vector uses unnecessarily large amounts of memory". It is not directly related to …
I provided one design idea to address this issue in #4345 (comment).
Proof of concept based on #4345 (comment):

```r
setsubset = function(x, i) {
  stopifnot(is.data.table(x), is.integer(i))
  if (!length(i)) return(x)
  if (anyNA(i) || anyDuplicated(i) || any(i < 1L) || any(i > nrow(x)) || is.unsorted(i))
    stop("i must be non-NA, no dups, in range of 1:nrow(x) and sorted")
  drop = setdiff(1:nrow(x), i)
  last_ii = drop[1L] - 1L   # last row already in its final position
  do_i = i[i > last_ii]     # kept rows that still need to be moved forward
  for (ii in do_i) {
    last_ii = last_ii + 1L
    set(x, last_ii, names(x), as.list(x[ii]))   # overwrite row last_ii with row ii, in place
  }
  ## we need to set true length here but this needs C
  invisible(x)
}
```
```r
x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])
```
A working example, using the internal `data.table:::setsubset`:

```r
library(data.table)
setsubset = data.table:::setsubset

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
x
#        a     b
#    <int> <int>
# 1:     1     8
# 2:     2     7
# 3:     3     6
# 4:     4     5
# 5:     5     4
# 6:     6     3
# 7:     7     2
# 8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#        a     b
#    <int> <int>
# 1:     1     8
# 2:     2     7
# 3:     6     3
# 4:     7     2
all.equal(x, X[i])
# [1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
# [1] TRUE

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
x
#        a     b
#    <int> <int>
# 1:     1     8
# 2:     2     7
# 3:     3     6
# 4:     4     5
# 5:     5     4
# 6:     6     3
# 7:     7     2
# 8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#        a     b
#    <int> <int>
# 1:     3     6
# 2:     5     4
# 3:     7     2
all.equal(x, X[i])
# [1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
# [1] TRUE
```
Cool! I'm wondering if something along these lines would also work:

```r
x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
# set truelength, drop .keep
```

Or, if the original row order must be preserved:

```r
x[, .old_index := .I]
x[i, .keep := TRUE]
setorder(x, .keep, .old_index, na.last=TRUE)
# set truelength, drop .keep, .old_index
```

It looks like memory addresses survive these operations (?):

```r
x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
mem = c(address(x), sapply(x, address))
x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
x[, .keep := NULL]
all.equal(x[seq_along(i)], X[i])
all.equal(c(address(x), sapply(x, address)), mem)
```

If the current approach is necessary, I think you could swap

```r
drop = setdiff(seq_len(nrow(x)), i)
last_ii = drop[1L]-1L
```

for something like

```r
first_drop = match(FALSE, seq_along(i) == i, nomatch = tail(i, 1L)+1L)
last_ii = first_drop - 1L
```

Why? Speed (it avoids setdiff, which scales with nrow(x)), and it handles the edge case where all rows are kept.
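A quick check of that edge case (a toy example added here, not from the original comment):

```r
library(data.table)

x = data.table(a = 1:4, b = 4:1)
i = 1:4                                   # keep every row
drop = setdiff(seq_len(nrow(x)), i)
drop[1L] - 1L                             # NA: the setdiff-based start index breaks down
match(FALSE, seq_along(i) == i, nomatch = tail(i, 1L) + 1L) - 1L
# 4: the match-based start index stays valid, so the copy loop has nothing to move
```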
Wonderful idea!
In case of updates, you will see comments here or, if one exists, in a linked PR. BTW, I have the impression that many readers misunderstand the benefits of this function. It will most likely be slower than making a copy, given how it would have to be implemented now. It would not have to allocate memory for an extra copy of the data.table, but that extra allocation is released right after the assignment anyway, so this really only matters when you run into an out-of-memory error that could have been avoided if your data were half the size.
Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link
Since deleting one column is DT[, colname := NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need a LHS to indicate "all" columns, leading to the kind of syntax sketched below.
We can also add an attribute "read only" or "protect" to a data.table, and if the user had protected the data.table in that way, .:= would not work on it.
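(A hedged illustration of both ideas, added here since the original example is missing: the .:= operator and the protect()/is_protected() helpers below are hypothetical, not part of data.table; only the copy-based not-join and the setattr() call are real, working code.)

```r
library(data.table)

DT = data.table(grp = c("a", "a", "b", "c"), val = 1:4)
setkey(DT, grp)

# Proposed (hypothetical): delete keyed rows by reference, with a LHS meaning
# "all columns", e.g. something along the lines of
#   DT["b", .:= NULL]

# Today the equivalent needs a copy-based subset of the complement (a not-join):
DT <- DT[!"b"]

# The "protect" idea: mark a table read-only via an attribute, so that a
# by-reference delete would refuse to touch it (illustrative helpers only).
protect <- function(x) setattr(x, "protected", TRUE)
is_protected <- function(x) isTRUE(attr(x, "protected"))

protect(DT)
is_protected(DT)   # TRUE -- a hypothetical .:= delete would stop here
```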