Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework flattening #912

Merged
merged 29 commits into from
Sep 8, 2022
Merged

Rework flattening #912

merged 29 commits into from
Sep 8, 2022

Conversation

hadley
Copy link
Member

@hadley hadley commented Sep 4, 2022

This is a large PR that reworks flattening, removing inconsistencies and switching to vctrs as a backend.

The key idea is to introduce a new family of "combining" functions: list_c(), list_rbind(), and list_cbind(), which replace flatten_lgl(), flatten_int(), flatten_dbl(), flatten_chr() (now list_c()), flatten_dfc() (list_cbind()), and flatten_dfr() (list_rbind()). The new functions are straightforward wrappers around vctrs functions, but somehow feel natural in purrr to me.

This leaves flatten(), which had a rather idiosyncratic interface. It's now been replaced by list_flatten() which now always removes a single layer of list hierarchy (and nothing else). While working on this I realised that this was actually what splice() did, so overall this feels like a major improvement in naming consistency.

With those functions in place we can deprecate map_dfr() and map_dfc() which are actually "flat" map functions because they combine, rather than simplify, the results. They have never actually belonged with map_int() and friends because they don't have the restriction that .f needs to return a length-1 results. This also strongly implies that flat_map() would just be map_c() and is thus not necessary.


Updated to reflect that we no longer believe that _cbind(), _rbind() and _c() are necessary as they're very lightweight wrappers.

@@ -129,8 +129,12 @@ export(lift_lv)
export(lift_vd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level thought:

I really like list_c(), list_cbind(), and list_rbind() as the main high level functions here.

Since flat-map and dfc/dfr operations are somewhat rare, I would actually be very happy with:

x |> map(f) |> list_c()
x |> map(f) |> list_rbind()
x |> map(f) |> list_cbind()

i.e. no need for map_c(), map_cbind(), and map_rbind().

I feel like those may clutter the map API for little benefit, especially considering you need to add map2/pmap variants, which greatly expands the number of options.

I really would love if all map*() functions had the invariant of length(x) == length(map*(x, f)), and we'd get there eventually by removing map_dfr/dfc().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list_c/rbind/cbind() tools just feel semantically separate from map(), since they are something you do after the map or on a generic list.

Like, list_c() and friends will be very useful to call on furrr::future_map() results too, but I'd rather not expand the furrr API with future_map_c(), future_map_cbind(), map2/pmap variants, etc if we don't have to, and it doesn't really seem like we need to

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other reason I think these are separate is because map_c() physically can't be faster than sequential calls to map() + list_c(), because you need all of the results from the map() to be able to compute the output size that we'd get from list_c(), and to me that suggests they shouldn't be merged together

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. I think the length(x) == length(map*(x, f)) invariant is key.

x |> map_vec(f) feels a bit like this because it'll begin by being implemented as x |> map(f) |> list_simplify() but I think that's more of a current implementation detail. Especially if you do supply a ptype we could potentially implement map_vec() much more efficiently.

R/list-combine.R Outdated Show resolved Hide resolved
R/list-combine.R Outdated Show resolved Hide resolved
R/list-combine.R Outdated Show resolved Hide resolved
R/list-combine.R Outdated
Comment on lines 39 to 40
check_is_list(x)
vctrs::vec_unchop(x, ptype = ptype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the vec_unchop() error here bad enough that you needed check_is_list()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left it for now for consistency with the others.

R/list-flatten.R Outdated Show resolved Hide resolved
R/map-df.R Outdated Show resolved Hide resolved
R/map-df.R Outdated Show resolved Hide resolved
R/map-df.R Outdated Show resolved Hide resolved
tests/testthat/test-list-combine.R Show resolved Hide resolved
@hadley hadley marked this pull request as ready for review September 8, 2022 13:15
R/list-combine.R Outdated
#' @param ptype An optional prototype to ensure that the output type is always
#' the same.
#' @param id By default, `names(x)` are lost. Alternatively, supply a string
#' and the names will be saved into a column with name `id`. If `id`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it sound like the column name is always named id. Perhaps:

Alternatively, supply id with a column name into which the names will be saved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use {id} since that's a convention we use elsewhere and should be familiar to many readers.

R/list-combine.R Outdated
Comment on lines 50 to 59
vec_check_list(x)
vctrs::vec_cbind(!!!x, .name_repair = name_repair, .size = size, .call = current_env())
}

#' @export
#' @rdname list_c
list_rbind <- function(x, id = rlang::zap(), ptype = NULL) {
vec_check_list(x)
vctrs::vec_rbind(!!!x, .names_to = id, .ptype = ptype, .call = current_env())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to require each element to be a data frame, or do we embrace the way vec_rbind() and vec_cbind() convert vectors to data frames?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try requiring data frames.

R/list-flatten.R Outdated Show resolved Hide resolved
x <- modify_if(x, vec_is_list, identity, .else = list)
vec_unchop(
x,
ptype = list(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this should be ptype = x (the original x) to preserve the input type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever be something than a list though? x can't be a data frame, and it seems unlikely to be a list_of unless it's a list_of<list>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh it has to be a list, which we check with vec_check_list(). But it could be some kind of vctrs list. I agree that list_of is unlikely until we start supporting more complex type definitions.

I think my concern is that it might make sense to preserve types in the future if we have more vctrs lists, but it will be difficult to change then, so better do it now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assume a list() can always be coerced to x, even if x is a subtype of list? I don't think we can, so this seems risky?

But I tried it anyway, and then I got:

list_flatten(list_of(list(1, 2, 3), list(4), list(4)))
#> Error:
#> ! Can't convert <list> to <list_of<list>>.

So I'm going to leave as is for this PR and we can revisit in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes. I think the proper way would be to initialise the ptype of the input to the full output size, then assign elements into it.

R/map-df.R Outdated Show resolved Hide resolved
R/map-df.R Outdated Show resolved Hide resolved
return(map(.x, .f, ...))
}

# Should this be replaced with a generic way of figuring out atomic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we now use vec_is_list()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, tracking in #920

Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@hadley hadley mentioned this pull request Sep 8, 2022
13 tasks
R/reduce.R Outdated
#' map_dfr(~ tibble(value = .x, step = 1:100), .id = "simulation") %>%
#' map_rbind(~ tibble(value = .x, step = 1:100), .id = "simulation") %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs an update, not map_rbind() anymore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. IMO it reads better now that list_rbind() is a separate step.

R/list-combine.R Outdated Show resolved Hide resolved
Conflicts:
	NEWS.md
	R/map.R
	man/map.Rd
	man/map2.Rd
	man/pmap.Rd
@hadley hadley merged commit 4f78bd3 into main Sep 8, 2022
@hadley hadley deleted the flatten branch September 8, 2022 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants