[Request] assigning with := should retain keys better #2372

MarkusBonsch · 2017-09-20T06:34:53Z

Currently, keys and secondary indices are dropped, if any of the key columns is altered via a := assignment.
This is not necessary, however, since the key is still valid for all key columns before the assigned one.
Given this, we can better retain keys and secondary indices and improve data.table performance.
An example:

DT <- data.table(x1 = sample(1:10, 100, replace = TRUE),
                 x2 = sample(1:10, 100, replace = TRUE),
                 x3 = sample(1:10, 100, replace = TRUE),
                 y  = rnorm(100))

setkey(DT, x1, x2, x3)

## DT has a key on x1, x2, x3
print(key(DT))
# [1] "x1" "x2" "x3"

## assigning to x3 drops the key completely:
DT[2, x3 := 1000]
print(key(DT))
# NULL

## even though the sorting of x1 and x2 is of course not affected:
print(length(data.table:::forderv(DT, by = c("x1", "x2"))) == 0)
# TRUE

The text was updated successfully, but these errors were encountered:

MarkusBonsch · 2017-09-22T12:02:35Z

I am working on a solution. Just need to memCheck for memory leaks before creating a pull request

mattdowle · 2017-09-28T22:35:22Z

This was closed by your PR #2389 - thanks! Awesome! The text at the top of the PR "closes [#2372]" had brackets around the number so it didn't auto close this issue. I edited that text after the merge but it still didn't link up. So I guess it needs to be right at the time of the merge.

MarkusBonsch · 2017-10-04T07:06:01Z

I have discovered a minor fault in my implementation:

I am assuming that secondary indices behave exactly like the key.
This is, however, wrong:

the key is just a piece of information on the existing ordering.
The secondary key is a recipe for reordering.

If there is a key on columns x and y, it is a fully valid key on x as well. If there is an index on columns x and y, it is also a fully valid index on x, but it may cause spurious reordering in y if used as an index for x. The only secondary index that is fully equivalent to a key is a integer(0) index.
Although, there is currently no really clear policy on ordering based on indices (issue #2400), this can cause unexpected behaviour and needs to be fixed.

Therefore, I need to change my implementation as follows:

if any column in a secondary index has been modified, drop it
except if it is an integer(0) index: then, the index is still valid for all columns until the assigned one (exclusive)

I will fix asap and create a PR.

MarkusBonsch mentioned this issue Sep 27, 2017

Improved key handling when assigning with ':=' #2389

Merged

mattdowle added this to the v1.10.6 milestone Sep 28, 2017

mattdowle closed this as completed Sep 28, 2017

MarkusBonsch reopened this Oct 4, 2017

This was referenced Oct 4, 2017

[BUG] .shallow messes up keys and indices #2336

Closed

Fixed issue with potential spurious reordering introduced in #2389 #2401

Merged

mattdowle closed this as completed in #2401 Oct 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Request] assigning with := should retain keys better #2372

[Request] assigning with := should retain keys better #2372

MarkusBonsch commented Sep 20, 2017

MarkusBonsch commented Sep 22, 2017

mattdowle commented Sep 28, 2017

MarkusBonsch commented Oct 4, 2017

[Request] assigning with := should retain keys better #2372

[Request] assigning with := should retain keys better #2372

Comments

MarkusBonsch commented Sep 20, 2017

MarkusBonsch commented Sep 22, 2017

mattdowle commented Sep 28, 2017

MarkusBonsch commented Oct 4, 2017