Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging layers #1172

Closed
dicorynia opened this issue Oct 10, 2019 · 17 comments
Closed

Merging layers #1172

dicorynia opened this issue Oct 10, 2019 · 17 comments

Comments

@dicorynia
Copy link

dicorynia commented Oct 10, 2019

I used list-columns and map + unnest to merge multiple shapefiles into one sf object. But it recently doesn't work any longer (it worked in March 2019). I'm not sure if it's related to sf or tidyr.

Reprex :

  1. get 3 shapefiles :
    library(tidyverse)
    library(sf)
    library(fs)
    library(httr)
    
    # some shapefiles from https://fr.actualitix.com/blog/shapefiles-des-departements-de-france.html
    url <-  c("https://fr.actualitix.com/blog/actgeoshap/01-Ain.zip",
              "https://fr.actualitix.com/blog/actgeoshap/73-savoie.zip",
              "https://fr.actualitix.com/blog/actgeoshap/74-haute-savoie.zip")
    
    dep <- str_extract(url, "\\d{2}.*$")
    
    list(url, dep) %>% 
      pwalk(~ GET(.x, write_disk(.y)))
    
    walk(dep, unzip, junkpaths = TRUE, exdir = "shp")
  1. merge
    res <- dir_ls("shp", glob = "*.shp") %>% 
      tibble(fname = .) %>%
      mutate(data = map(fname, read_sf)) %>%
      unnest(data) %>%
      st_as_sf()
    # Error: No common type for `..1$data$geometry` <sfc_POLYGON> and `..2$data$geometry` <sfc_MULTIPOLYGON>.
    # Call `rlang::last_error()` to see a backtrace  
 
    rlang::last_error()
    # <error>
    #   message: No common type for `..1$data$geometry` <sfc_    # Error in rbind.data.frame(...) : 
    #   numbers of columns of arguments do not match> and `..2$data$geometry` <sfc_MULTIPOLYGON>.
    # class:   `vctrs_error_incompatible_type`
    # backtrace:
    #   1. fs::dir_ls("shp", glob = "*.shp")
    # 16. tibble::tibble(fname = .)
    # 17. dplyr::mutate(., data = map(fname, read_sf))
    # 18. tidyr::unnest(., data)
    # 19. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
    # 1. vctrs:::stop_incompatible(...)
    # 16. vctrs:::stop_vctrs(...)

I don't know the inner workings of vctrs... Should sf "advertise" in some way that POLYGON and MULTIPOLYGON are compatible (are they ?) ?

Any other method to easily merge multiple layers ?
Since the structures of the files are slightly different (column numbers) I can't use rbind :

    res <- dir_ls("shp", glob = "*.shp") %>% 
      map(read_sf) %>% 
      do.call(rbind, .)
    # Error in rbind.data.frame(...) : 
    #   numbers of columns of arguments do not match

Thanks.

@dicorynia dicorynia changed the title Merging files Merging layers Oct 10, 2019
@edzer
Copy link
Member

edzer commented Oct 10, 2019

even if we force them to be all read as MULTIPOLYGON, using type = 6 in st_read(), we get

    res <- dir_ls("shp", glob = "*.shp") %>% 
      tibble(fname = .) %>%
      mutate(data = map(fname, read_sf, type = 6)) %>%
      unnest(data)
# Error: No common type for `..1$data$geometry` <sfc_MULTIPOLYGON> and `..2$data$geometry` <sfc_MULTIPOLYGON>.

Removing the geometry column makes this work.

As this might indeed be a vctrs issue (or my lack of understanding it), do @lionel- or @DavisVaughan have any idea?

@lionel-
Copy link
Contributor

lionel- commented Oct 10, 2019

You need to write vec_ptype2() and vec_cast() methods for your classes, see the S3 vignettes on the vctrs website.

We're planning a drastic simplification of the way a common type is declared (ptype2 methods). In the mean time you'll need to write the double dispatch methods.

You might also need to define vec-proxy and vec-restore methods. The proxy is the raw data that vctrs will concatenate or subset, and then the restore method restores attributes based on the result. You might need to define your proxy to return a data frame if you have vectorised attributes (i.e. metadata in your attributes for each element of the S3 vector).

I've been meaning to look into sf-vctrs compatibility, sorry I didn't have time to get to it yet.

@edzer
Copy link
Member

edzer commented Oct 10, 2019

Thanks - I think I'll wait for your drastic simplification before diving into this!

@dicorynia
Copy link
Author

Thanks for investigating...
I'll wait for an update.

@edzer
Copy link
Member

edzer commented Nov 19, 2019

In #1196, @lionel- seems to have fixed this - thanks Lionel!

@lionel-
Copy link
Contributor

lionel- commented Nov 20, 2019

It looks like it!

I'm now getting:

#> Simple feature collection with 1018 features and 19 fields
#> geometry type:  GEOMETRY
#> dimension:      XY
#> bbox:           xmin: 834044 ymin: 6504749 xmax: 943513 ymax: 6604240
#> epsg (SRID):    NA
#> proj4string:    NA

With the legacy tidyr, I'm seeing these warnings indicating attributes were being lost:

#> Warning messages:
#> 1: In bind_rows_(x, .id) :
#>   Vectorizing 'sfc_POLYGON' elements may not preserve their attributes
#> 2: In bind_rows_(x, .id) :
#>   Vectorizing 'sfc_MULTIPOLYGON' elements may not preserve their attributes
#> 3: In bind_rows_(x, .id) :
#>   Vectorizing 'sfc_POLYGON' elements may not preserve their attributes

The loss of these attributes is my working hypothesis for the different bbox value for the sf data frame produced by legacy tidyr:

#> bbox:           xmin: 834044 ymin: 6445179 xmax: 1027185 ymax: 6604240

With vctrs-powered tidyr we now manipulate data frames (splitting and combining on rows) while correctly preserving attributes.

@dicorynia Note that as a stopgap you can use unnest_legacy() (with the warnings) until we get these fixes on CRAN.

@dicorynia
Copy link
Author

Thank you both of you, on this issue and for your work on sf...

@edzer
Copy link
Member

edzer commented Nov 20, 2019

There are still two issues running this case:

  • the CRS (epsg and proj4string) are lost
  • by merging POLYGON and MULTIPOLYGON we get GEOMETRY rather than the more friendly MULTIPOLYGON (sf:::rbind.sf also results in GEOMETRY)

I think the first we should address, the second may be a somewhat deeper problem, related to usability.

@lionel-
Copy link
Contributor

lionel- commented Nov 20, 2019

Agreed that we should look into geometries later on, probably in several stages.

Regarding the CRS attributes, they are propagated from the first element without any check right? I see in c.sfc:

attributes(ret) = attributes(lst[[1]]) # crs

@lionel-
Copy link
Contributor

lionel- commented Nov 20, 2019

With this in vec_ptype2()

	# Propagate CRS attributes from the LHS
	ret = st_sfc(crs = st_crs(x))

We now have:

#> Simple feature collection with 1018 features and 19 fields
#> geometry type:  GEOMETRY
#> dimension:      XY
#> bbox:           xmin: 834044 ymin: 6504749 xmax: 943513 ymax: 6604240
#> epsg (SRID):    2154
#> proj4string:    +proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs

If it doesn't make sense to always propagate the CRS of the first element, other possible strategies are:

  • Clear the CRS if inconsistent.
  • Throw an error if inconsistent.

Regarding precision, what should be the combination strategy? Maybe take the largest precision?

@rsbivand
Copy link
Member

rsbivand commented Nov 20, 2019

Just a remark - all code relating to crs is in flux because of changes in external software that sf builds on. The assumptions made when the crs object was defined no longer hold with certainty #1187. Experiments are progressing in sp/rgdal, but extension to sf will folllow. So crs handling will need to be marked and possibly revisited before March 2020. Probably clearing crs should clear sf too, because geometries without correct metadata are not much use to anyone, and as of now, we do not know how to test whether crs are identical in WKT2_2018 representation.

@edzer
Copy link
Member

edzer commented Nov 20, 2019

Thanks @rsbivand - this will not affect the work here, since we are only assigning / copying entire CRS objects, possibly checking they are identical, which is handled by other code (the == method for crs).

@rsbivand
Copy link
Member

It was the == method I was worried about, as I fear we'll have to importFromWkt() both, and then do the comparison inside GDAL. I vaguely recall code in pyproj or elsewhere, but still experimental I think.

@edzer
Copy link
Member

edzer commented Nov 20, 2019

@lionel- I think vec_ptype2() should do this:

       if (st_crs(x) != st_crs(y))
           stop("coordinate reference systems not equal: use st_transform() first?")
	# Propagate CRS attributes from the LHS
	ret = st_sfc(crs = st_crs(x))

@rsbivand that == or != does the right thing here is important, indeed, but a different issue.

@lionel-
Copy link
Contributor

lionel- commented Nov 20, 2019

@edzer oops I now see your message, I'll update the error messages to use your wording.

@lionel-
Copy link
Contributor

lionel- commented Nov 20, 2019

by merging POLYGON and MULTIPOLYGON we get GEOMETRY rather than the more friendly MULTIPOLYGON (sf:::rbind.sf also results in GEOMETRY)

Is there a description or implementation of the combination strategy you're looking for? AFAICS GEOMETRY is consistently implemented as the common type for all combinations of incongruent geometries:

class(st_sfc(st_point(), st_multipoint()))
#> [1] "sfc_GEOMETRY" "sfc"

class(st_sfc(st_polygon(), st_multipolygon()))
#> [1] "sfc_GEOMETRY" "sfc"

Also, should the sfc geometry be set to MULTIPOLYGON even though it contains POLYGON geometries? Or should the individual geometries be upcoerced to MULTIPOLYGON? But surely changing the individual types is potentially problematic. Can a multipolygon of size 1 considered equal to the equivalent polygon (as I understand yes, through the notion of spatial equality)?

@edzer
Copy link
Member

edzer commented Nov 21, 2019

It is a more generic usability thing, although I brought it up I would also leave it for here/now.

Users can anticipate this by specifying type=6 to st_read, which causes POLYGON geometries to be read as (single polygon) MULTIPOLYGONs, this then would not lead to GEOMETRY at the end; or alternatively use st_cast(. ,"MULTIPOLYGON") at the end of the pipe.

@edzer edzer closed this as completed Nov 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants