Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vec_get() #141

Open
DavisVaughan opened this issue Nov 26, 2018 · 27 comments
Open

vec_get() #141

DavisVaughan opened this issue Nov 26, 2018 · 27 comments

Comments

@DavisVaughan
Copy link
Member

DavisVaughan commented Nov 26, 2018

As a complement to vec_slice() @romainfrancois and I have been discussing the utility of a function that would extract 1 observation, but would be more of an analog to [[ than [.

Possible implementation:

vec_rip <- function (x, i) {
  
  if (is.logical(i)) {
    i <- which(i)
  }
  
  stopifnot(is.integer(i) || is.character(i))
  
  # this is new 
  stopifnot(length(i) == 1L)
  
  if (is.null(x)) {
    NULL
  }
  else if (is.data.frame(x)) {
    out <- lapply(x, `[`, i)
    vec_restore(out, x)
  }
  else if (is_vector(x)) {
    d <- vec_dims(x)
    if (d == 1) {
      # this is all that changed?
      if (is.object(x)) {
        # x[i]
        x[[i]]
      }
      else {
        .subset2(x, i)
        # x[[i, drop = FALSE]]
      }
    }
    else if (d == 2) {
      x[i, , drop = FALSE]
    }
    else {
      miss_args <- rep(list(missing_arg()), d - 1)
      eval_bare(expr(x[i, !!!miss_args, drop = FALSE]))
    }
  }
  else {
    stop("`x` must be a vector", call. = FALSE)
  }
}

It would function somewhat like:

vec_rip(list(x = 1, y = 2), 1L)
#> [1] 1
vec_slice(list(x = 1, y = 2), 1L)
#> $x
#> [1] 1

# Not sure about what these should do:
vec_rip(mtcars, 1L)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4
vec_rip(matrix(1:12, nrow = 3), 1L)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10

vec_slice(starwars$films, 1L)
#> [[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

vec_rip(starwars$films, 1L)
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# vec_rip() can only extract 1 observation at a time
vec_slice(starwars$films, 1:2)
#> [[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"      
#> 
#> [[2]]
#> [1] "Attack of the Clones"    "The Phantom Menace"     
#> [3] "Revenge of the Sith"     "Return of the Jedi"     
#> [5] "The Empire Strikes Back" "A New Hope"

vec_rip(starwars$films, 1:2)
#> Error in vec_rip(starwars$films, 1:2): length(i) == 1L is not TRUE

We are currently undecided on what it "should" do for data frames and matrices. A couple ideas:

1) Return the 1 row observation as is (so the same as vec_slice(), this is shown above). This doesn't feel right.
2) Extract the 1 row observation, then coerce it to some lower level type. For data.frames, a list and for matrices, a vector.

If 2) is chosen, one question that came up is "what should it do for list columns"? Two possibilities:

data("starwars", package = "dplyr")
starwars <- starwars[,c("species", "films")]
one_ob <- vctrs::vec_slice(starwars, 1L)

# don't drop a dimension
as.list(one_ob)
#> $species
#> [1] "Human"
#> 
#> $films
#> $films[[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# drop 1 dimension
lapply(one_ob, .subset2, 1)
#> $species
#> [1] "Human"
#> 
#> $films
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# but with better support for this
# so it doesn't unravel (i.e. better support 
# for vctr rcrd types)
.subset2(as.POSIXlt(Sys.time()), 1)
#> [1] 26.99161
@DavisVaughan DavisVaughan changed the title vctrs_rip() vec_rip() Nov 26, 2018
@romainfrancois
Copy link
Contributor

This would be useful in purrr::map() situations when you iterate on some custom vctr or rcrd type

@romainfrancois
Copy link
Contributor

Perhaps related to tidyverse/dplyr#3789

@hadley
Copy link
Member

hadley commented Nov 26, 2018

To get this included in vctrs, you need to answer the question: What are the invariants that this function would satisfy?

@jennybc
Copy link
Member

jennybc commented Nov 26, 2018

A word I would like to insert into this discussion is "atom". As in, atomic vectors have atoms that are of the same type. I think maybe vec_rip() gets one atom? I don't have great words around it, but part of what makes list-columns hard for people is that it often feels like you're dealing with one extra layer of indexing. I see the appeal of working this out.

@romainfrancois
Copy link
Contributor

I've been staring at this for some time now, and yes this comes from iterating on list columns (from purrr, or my toy zap).

Not sure about this, here's a draft:

- atom_type(vctrs<T>) = vctrs<T>
- atom_type(list_of<T>) = <T>

- vec_size(atom_type(vctrs<T>)) == 1
- vec_size(atom_type(list_of<T>)) == <??>

- type( !!!atom_type<T>) = <T>

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Nov 26, 2018

There was a bit of a misunderstanding between @romainfrancois and I regarding list column behavior. I've marked out the relevant info above.

At this point, I think the implementation is just:

vec_rip <- function(x, i) {
  stopifnot(length(i) == 1L)
  out <- vec_slice(x, i)
  
  if (rlang::is_bare_list(x)) {
    out <- out[[1]]
  }
  
  out
}

I think this all makes sense in the context of atoms if you think of them as "a basic building block." Notice that for "a list of elements", the building block is a single element and not a list of length 1 (which aligns with what vec_rip() is trying to achieve).

The basic building block of a vector of doubles is a single double.
The basic building block of a list of elements is a single element.
The basic building block of a data frame is a single row.
The basic building block of a matrix is ? (a row or a single entry?)
The basic building block of an array is ?

I'm a little unsure of what the "atom" of a matrix should be. 1 row would make sense, but every element in the matrix is of the same type so 1 single element might also make sense. Same for arrays.

The only other caveat is that a single rcrd should be treated as 1 object, and not as a list of elements.

@romainfrancois
Copy link
Contributor

Could be vec_atom(). The way i think about this is this function should return something sensible for when you iterate on « observations » of the vctr type.

@lionel-
Copy link
Member

lionel- commented Nov 28, 2018

At this point, I think the implementation is just:

Not quite because extracting an observation from an atomic vector should zap the names and all vector-related attributes.

I think instead of atom, we should think about observations. Then this function would better be called vec_obs() and would be what a purrr::map_obs() mapper would use to subset its input.

@lionel-
Copy link
Member

lionel- commented Nov 28, 2018

Not quite because extracting an observation from an atomic vector should zap the names and all vector-related attributes.

Hmm... Is that true? What would an observation from a factor be in that case? A level?

factor(c("a", "b"))[[1]]
#> [1] a
#> Levels: a b

# Does that make sense?
vec_obs(factor(c("a", "b")), 1)
#> [1] "a"

@lionel-
Copy link
Member

lionel- commented Nov 28, 2018

Taking an observation from a data frame is a type-preserving operation. Taking an observation from a list or list_of is not? This problem is harder than I thought. Maybe, somehow, because data frames are atomic (scalars), when taken as a whole?

Slight connection to why flatten(list(mtcars, mtcars)) is (or should be) a no-op, the list is already flat because mtcars is a scalar.

@jennybc's intuition that the notion atomicity is important was spot on.

Still it's not clear what parts of the type should be dropped/preserved when taking an observation of an atomic object. Should we trust the designers of S that vector names are dropped under [[? Has an observation of a matrix lost its row name?

@lionel-
Copy link
Member

lionel- commented Dec 4, 2018

If a data frame can be seen as an atomic vector, when subsetting observations, this would imply that there should be a notion of missing row. Is a missing row a row full of missings? Do we need a predicate are_na() (using plural to denote vectorisation) that would return TRUE for rows that have only missing observations? Can it be that for atomic types implemented with recursive data structures, is_na(x) is every(x, is_na)?

Interestingly:

mtcars[1, ][NA, ]
#>    mpg cyl disp hp drat wt qsec vs am gear carb
#> NA  NA  NA   NA NA   NA NA   NA NA NA   NA   NA

@hadley
Copy link
Member

hadley commented Dec 4, 2018

(@lionel- that's just mtcars[NA_integer_, ] or vec_na(mtcars))

@hadley
Copy link
Member

hadley commented Jan 28, 2019

Also worth thinking about this case:

x <- package_version("1.1.1")
identical(x, x[[1]])
#> [1] TRUE

Created on 2019-01-28 by the reprex package (v0.2.1.9000)

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2019

The inverse operation seems interesting too. Would it be too bad if vec_c() wrapped all non-vector objects in a list? The current behavior seems suboptimal either way:

qr <- qr(lm(y ~ x, data.frame(x = 1, y = 1)))
class(qr)
#> [1] "qr"
typeof(qr)
#> [1] "list"
vctrs::vec_c(qr)
#> $qr
#>   (Intercept) x
#> 1           1 1
#> attr(,"assign")
#> [1] 0 1
#> 
#> $qraux
#> [1] 1 1
#> 
#> $pivot
#> [1] 1 2
#> 
#> $tol
#> [1] 1e-07
#> 
#> $rank
#> [1] 1

Created on 2019-01-28 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member

hadley commented Mar 4, 2019

Another thought: maybe we need to more explicitly talk about "container" types, and then vec_rip() would only apply to a container.

@hadley
Copy link
Member

hadley commented Mar 5, 2019

I think this operation should error if you're attempting to index into a vector that is not a recursive, e.g.:

vec_recursive <- function(x) UseMethod("vec_recursive")
vec_recursive.list_of <- function(x) TRUE
vec_recursive.data.frame <- function(x) TRUE
vec_recursive.default <- function(x) {
  if (is_bare_list(x)) {
    TRUE
  } else if (is_vector(x)) {
    FALSE
  } else {
    stop("Non-vector")
  }
}

vec_extract <- function(x, i) {
  stopifnot(length(i) == 1L)
  stopifnot(vec_recursive(x))
  
  vec_slice(x, i)[[1]]
}

This clearly distinguishes it from vec_slice() and I don't think the using [[ to strip attributes is important (or commonly used) enough to be worth mimicking.

Note that vec_extract() would always preserve the invariants of the container class, i.e. x %>% vec_extract(i) %>% vec_size() equals vec_size(x) if x is a data frame, and x %>% vec_extract(i) %>% vec_type() equals x %>% attr("ptype") is x is a list_of.

@lionel-
Copy link
Member

lionel- commented Mar 6, 2019

I think vec_recursive() is not sufficient to determine atomicity. An S3 list can be:

  • A recursive vector type, e.g. list_of and data frames.
  • An atomic vector type, e.g. record vector types like dates.
  • An atom type, e.g. lm objects.

If vec_recursive() returns TRUE, we know we have a vector, but if it returns FALSE we don't know for sure we have an atom.

We might need a more general vec_kind() generic that would return one of:

  • "atom" for non vectors (the default for S3 lists and non-vector base types)
  • "atomic" (the default for S3 and bare atomics)
  • "recursive" (the default for bare lists and data frames)

@hadley
Copy link
Member

hadley commented Mar 6, 2019

Why does that distinction matter for indexing?

@lionel-
Copy link
Member

lionel- commented Mar 6, 2019

I don't know if it matters for extraction (though vec_is_recursive() shouldn't throw an error with non-vectors to make recursive extraction possible). But that distinction is necessary for vec_is() / vec_assert() without arguments. Otherwise, vec_assert(model_fit) won't throw.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jul 4, 2019

I'm running into the case where it would be nice to have the assignment form of this, [[<-. I do think whatever vctrs solution we come up with for that should work with recursive and non-recursive types. For instance, I might be filling list elements, or I might be filling double elements.

vec_assign() is obviously not what I need to be using here. I want some generic function that can [[<- a single element in both lists and double vectors.

library(vctrs)

out <- vec_init(list(), 3)
elt <- 2:3

# cant use vec_assign, of course
vec_assign(out, 1, elt)
#> No common type for `value` <integer> and `x` <list>.

# this is what i want
out[[1]] <- elt
out
#> [[1]]
#> [1] 2 3
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL

# but i also might "init" a dbl
out <- vec_init(double(), 3)
elt <- 2

# can use vec_assign
vec_assign(out, 1, 2)
#> [1]  2 NA NA

# and i can use this
out[[1]] <- 2
out
#> [1]  2 NA NA

Update) For data frames I think this needs to be able to replace a column, not a row.

@romainfrancois
Copy link
Contributor

Looking into dplyr::summarise() for rowwise data frames, I'm going to need something like vec_rip()

@lionel-
Copy link
Member

lionel- commented Aug 14, 2019

It'll probably be called vec_get(), but it's still not clear what the semantics should be.

@DavisVaughan suggested it would return rows as lists in case of a data frame, or as a vector of dimensionality n - 1 in case of an array (for instance a dimless vector if a matrix).

In that case, rowwise extraction in data frames becomes a recursive operation in the sense that you can dig deep into them by calling recursively [[ / vec_get(), in the same way you can dig deep into lists. Arrays become kinda recursive as well, though they can't return a different type.

Despite being recursive, this operation doesn't really connect to algorithms ordinarily used with recursive data structures such as flatten(). However, as Davis noted, vec_get(vec_get(x, i), j) on a 2d structure consistently retrieves the i, j element, which is interesting.

@DavisVaughan
Copy link
Member Author

@romainfrancois, so vec_slice() wouldn't be enough for rowwise data frames? Could you share why not? Would returning the first row as a list like this be helpful? A use case might guide the discussion a bit

@romainfrancois
Copy link
Contributor

Probably it's the whole rowwise() approach that makes it a problem. When we have a list column x, we want x to stand for an element of the list, not a length 1 list with the element

@jennybc
Copy link
Member

jennybc commented Aug 15, 2019

When we have a list column x, we want x to stand for an element of the list, not a length 1 list with the element

This is what I was getting at way, way above:

A word I would like to insert into this discussion is "atom". As in, atomic vectors have atoms that are of the same type. I think maybe vec_rip() gets one atom? I don't have great words around it, but part of what makes list-columns hard for people is that it often feels like you're dealing with one extra layer of indexing.

@DavisVaughan
Copy link
Member Author

What @romainfrancois does here has essentially the same implementation of what I used when attempting to reimplement flatten() with vctrs semantics.
tidyverse/dplyr@ebbd15c#diff-ccca386aac53cf0029fb15ebff8901d5R206

The vec_get() I used:
https://github.com/DavisVaughan/vctrs/blob/84e0800269805a89ee26c995d1d63ab3bcff2e8c/R/get.R#L5

Essentially you vec_slice() unless something is a list (but not a data frame), in which case you [[i]].

(This is different than what Lionel mentioned above about the other semantics I proposed)

@DavisVaughan DavisVaughan changed the title vec_rip() vec_get() Oct 29, 2019
@DavisVaughan
Copy link
Member Author

I think that it would also be nice to have the equivalent of vec_chop() for whatever vec_get() ends up being.

So:

vec_slice(x, i) -> vec_chop(x, indices = NULL)

vec_get(x, pos) -> vec_*(x, positions = NULL)

positions would be NULL or a list of single positions like list(1, "x", 3)

It would be useful for vec_cast.list.data.frame() and vec_cast.list.default()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants