Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Request] assigning with := should retain keys better #2372

Closed
MarkusBonsch opened this issue Sep 20, 2017 · 3 comments · Fixed by #2389 or #2401
Closed

[Request] assigning with := should retain keys better #2372

MarkusBonsch opened this issue Sep 20, 2017 · 3 comments · Fixed by #2389 or #2401
Milestone

Comments

@MarkusBonsch
Copy link
Contributor

Currently, keys and secondary indices are dropped, if any of the key columns is altered via a := assignment.
This is not necessary, however, since the key is still valid for all key columns before the assigned one.
Given this, we can better retain keys and secondary indices and improve data.table performance.
An example:

DT <- data.table(x1 = sample(1:10, 100, replace = TRUE),
                 x2 = sample(1:10, 100, replace = TRUE),
                 x3 = sample(1:10, 100, replace = TRUE),
                 y  = rnorm(100))

setkey(DT, x1, x2, x3)

## DT has a key on x1, x2, x3
print(key(DT))
# [1] "x1" "x2" "x3"

## assigning to x3 drops the key completely:
DT[2, x3 := 1000]
print(key(DT))
# NULL

## even though the sorting of x1 and x2 is of course not affected:
print(length(data.table:::forderv(DT, by = c("x1", "x2"))) == 0)
# TRUE


@MarkusBonsch
Copy link
Contributor Author

I am working on a solution. Just need to memCheck for memory leaks before creating a pull request

@mattdowle
Copy link
Member

This was closed by your PR #2389 - thanks! Awesome! The text at the top of the PR "closes [#2372]" had brackets around the number so it didn't auto close this issue. I edited that text after the merge but it still didn't link up. So I guess it needs to be right at the time of the merge.

@MarkusBonsch
Copy link
Contributor Author

I have discovered a minor fault in my implementation:

I am assuming that secondary indices behave exactly like the key.
This is, however, wrong:

  • the key is just a piece of information on the existing ordering.
  • The secondary key is a recipe for reordering.

If there is a key on columns x and y, it is a fully valid key on x as well. If there is an index on columns x and y, it is also a fully valid index on x, but it may cause spurious reordering in y if used as an index for x. The only secondary index that is fully equivalent to a key is a integer(0) index.
Although, there is currently no really clear policy on ordering based on indices (issue #2400), this can cause unexpected behaviour and needs to be fixed.

Therefore, I need to change my implementation as follows:

  • if any column in a secondary index has been modified, drop it
  • except if it is an integer(0) index: then, the index is still valid for all columns until the assigned one (exclusive)

I will fix asap and create a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants