Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I set INTERLEAVED_READING options with PBF files #1213

Closed
agila5 opened this issue Dec 9, 2019 · 34 comments
Closed

How can I set INTERLEAVED_READING options with PBF files #1213

agila5 opened this issue Dec 9, 2019 · 34 comments

Comments

@agila5
Copy link
Contributor

agila5 commented Dec 9, 2019

Hi! I'm working on a project on spatial networks in Greater London and I'm downloading the network data from Geofabrik.

When I read the data I get a warning

london_highways <- sf::st_read(
    dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", 
    layer = "lines", 
    stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62788 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs

Created on 2019-12-09 by the reprex package (v0.3.0)

so I checked on gdal website on the meaning of that error and, after reading your answers in #1157, I tried to that set parameter. The problem is that none of these options work:

london_highways <- sf::st_read(
    dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", 
    layer = "lines", 
    options = c("OGR_INTERLEAVED_READING=YES"), 
    stringsAsFactors = FALSE
)
#> options:        OGR_INTERLEAVED_READING=YES 
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62788 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
london_highways
#> Simple feature collection with 62788 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#> First 10 features:
#>    osm_id                 name     highway waterway aerialway barrier man_made
#> 1      74        Ballards Lane     primary     <NA>      <NA>    <NA>     <NA>
#> 2      75            High Road     primary     <NA>      <NA>    <NA>     <NA>
#> 3      79        East End Road     primary     <NA>      <NA>    <NA>     <NA>
#> 4     482     Cockfosters Road     primary     <NA>      <NA>    <NA>     <NA>
#> 5     488          Western Way residential     <NA>      <NA>    <NA>     <NA>
#> 6     489          The Linkway residential     <NA>      <NA>    <NA>     <NA>
#> 7     490        Sherrards Way residential     <NA>      <NA>    <NA>     <NA>
#> 8     491     Grasvenor Avenue residential     <NA>      <NA>    <NA>     <NA>
#> 9    1200          Hanger Lane       trunk     <NA>      <NA>    <NA>     <NA>
#> 10   1201 Hanger Lane Gyratory  trunk_link     <NA>      <NA>    <NA>     <NA>
#>                          geometry
#> 1  LINESTRING (-0.193124 51.60...
#> 2  LINESTRING (-0.1767883 51.6...
#> 3  LINESTRING (-0.1979855 51.5...
#> 4  LINESTRING (-0.1607372 51.6...
#> 5  LINESTRING (-0.1886468 51.6...
#> 6  LINESTRING (-0.1883204 51.6...
#> 7  LINESTRING (-0.1896293 51.6...
#> 8  LINESTRING (-0.1892364 51.6...
#> 9  LINESTRING (-0.291608 51.51...
#> 10 LINESTRING (-0.2925582 51.5...

london_highways <- sf::st_read(
    dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", 
    layer = "lines", 
    options = c("INTERLEAVED_READING=YES"), 
    stringsAsFactors = FALSE
)
#> options:        INTERLEAVED_READING=YES 
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Simple feature collection with 0 features and 9 fields
#> bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
london_highways
#> Simple feature collection with 0 features and 9 fields
#> bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>  [1] osm_id     name       highway    waterway   aerialway  barrier   
#>  [7] man_made   z_order    other_tags geometry  
#> <0 rows> (or 0-length row.names)

Created on 2019-12-09 by the reprex package (v0.3.0)

How can I set that parameter? Do I have to modify the .ini config file?

This problem is related to ropensci/osmextract#12 so apologies for "crossposting".

The PBF file is approximately 55MB of data and the same problem/warning disapper if I consider a smaller PBF.

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language EN                          
#>  collate  English_United Kingdom.1252 
#>  ctype    English_United Kingdom.1252 
#>  tz       Europe/Berlin               
#>  date     2019-12-09                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
#>  class         7.3-15  2019-01-01 [2] CRAN (R 3.6.1)
#>  classInt      0.4-2   2019-10-17 [1] CRAN (R 3.6.1)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  DBI           1.0.0   2018-05-02 [1] CRAN (R 3.6.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
#>  digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.1)
#>  e1071         1.7-2   2019-06-05 [1] CRAN (R 3.6.0)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
#>  KernSmooth    2.23-15 2015-06-29 [2] CRAN (R 3.6.1)
#>  knitr         1.26    2019-11-12 [1] CRAN (R 3.6.1)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
#>  rlang         0.4.2   2019-11-23 [1] CRAN (R 3.6.1)
#>  rmarkdown     1.17    2019-11-13 [1] CRAN (R 3.6.1)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  sf            0.8-0   2019-09-17 [1] CRAN (R 3.6.1)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.3.0   2019-11-05 [1] CRAN (R 3.6.1)
#>  units         0.6-5   2019-10-08 [1] CRAN (R 3.6.1)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun          0.11    2019-11-12 [1] CRAN (R 3.6.1)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] C:/Users/Utente/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library
@edzer
Copy link
Member

edzer commented Dec 9, 2019

I see the same warning, but why do you think it is an error? Do you have an indication that not all the features are being read?

@agila5
Copy link
Contributor Author

agila5 commented Dec 10, 2019

I never checked that (and I'm not sure how) but this morning I tested that hypothesis with the following expertiment (following the same ideas as in #1157). I think it shows some evidence that some features are missing with the OGR_INTERLEAVED_READING warning.

# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# download some data
download.file("http://download.geofabrik.de/europe/isle-of-man-latest.osm.pbf", "isle-of-man.osm.pbf", mode = "wb")
download.file("http://download.geofabrik.de/europe/andorra-latest.osm.pbf", "andorra.osm.pbf", mode = "wb")
download.file("http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", "greater-london.osm.pbf", mode = "wb")

# isle of man example
isle_of_man <- st_read(
  "isle-of-man.osm.pbf",
  layer = "lines", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\isle-of-man.osm.pbf' using driver `OSM'
#> Simple feature collection with 12943 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -20.05167 ymin: 48.18019 xmax: -2.913484 ymax: 56.7
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(isle_of_man %>% filter(highway == "primary"))
#> [1] 462

isle_of_man_query <- st_read(
  "isle-of-man.osm.pbf",
  layer = "lines", 
  query = "SELECT * FROM lines WHERE (highway = 'primary')", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\isle-of-man.osm.pbf' using driver `OSM'
#> Simple feature collection with 462 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -4.792454 ymin: 54.06365 xmax: -4.31848 ymax: 54.41379
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(isle_of_man_query)
#> [1] 462

# andorra example
andorra <- st_read(
  "andorra.osm.pbf",
  layer = "lines", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\andorra.osm.pbf' using driver `OSM'
#> Simple feature collection with 5439 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0.975577 ymin: 42.32422 xmax: 1.824654 ymax: 42.78834
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(andorra %>% filter(highway == "primary"))
#> [1] 460

andorra_query <- st_read(
  "andorra.osm.pbf",
  layer = "lines", 
  query = "SELECT * FROM lines WHERE (highway = 'primary')", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\andorra.osm.pbf' using driver `OSM'
#> Simple feature collection with 460 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 1.419741 ymin: 42.43526 xmax: 1.737154 ymax: 42.63297
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(andorra_query)
#> [1] 460

# greater london example
greater_london <- st_read(
  "greater-london.osm.pbf",
  layer = "lines", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\greater-london.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62785 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(greater_london %>% filter(highway == "primary"))
#> [1] 2094

greater_london_query <- st_read(
  "greater-london.osm.pbf",
  layer = "lines", 
  query = "SELECT * FROM lines WHERE (highway = 'primary')", 
  stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\greater-london.osm.pbf' using driver `OSM'
#> Simple feature collection with 12600 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.5022575 ymin: 51.28324 xmax: 0.2661656 ymax: 51.68595
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
nrow(greater_london_query)
#> [1] 12600

Created on 2019-12-10 by the reprex package (v0.3.0)

I applied the same idea in all the examples: in the first part of each example I read the data using st_read and then I applied a filter (i.e. highway == "primary") while in the second part of each example I applied the same query before reading in the data (i.e. using the query parameter of st_read). In the first two examples I get the same number of rows using both methods while in the third example I get a OGR_INTERLEAVED_READING warning and less features in the first method wrt the second method.

@Robinlovelace
Copy link
Contributor

Just to confirm: I can reproduce this issue with GDAL 3.0.2 and have found a work-around, convert to another file format with ogr2ogr first:

# Aim: test gdal's interleaved settings

gdal-config --version

wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf

ogr2ogr \
  -f GPKG lnd.gpkg \
  greater-london-latest.osm.pbf \
  "lines"

Then back in R:

greater_london_gpkg = read_sf("lnd.gpkg")
nrow(greater_london_gpkg %>% filter(highway == "primary")) # 12600

@edzer
Copy link
Member

edzer commented Dec 10, 2019

Yes, or all in R:

> gdal_utils("vectortranslate", "greater-london-latest.osm.pbf", "greater-london-latest.gpkg")
> st_layers("greater-london-latest.gpkg")
Driver: GPKG 
Available layers:
        layer_name       geometry_type features fields
1           points               Point   326046     10
2            lines         Line String   352014      9
3 multilinestrings   Multi Line String     4332      4
4    multipolygons       Multi Polygon   544434     25
5  other_relations Geometry Collection    15859      4
> st_layers("greater-london-latest.osm.pbf", do_count = TRUE)
Driver: OSM 
Available layers:
        layer_name       geometry_type features fields
1           points               Point   326046     10
2            lines         Line String    62788      9
3 multilinestrings   Multi Line String        0      4
4    multipolygons       Multi Polygon        0     25
5  other_relations Geometry Collection        0      4
Warning messages:
1: In CPL_get_layers(dsn, options, do_count) :
  GDAL Error 1: Too many features have accumulated in lines layer. Use OGR_INTERLEAVED_READING=YES mode
2: In CPL_get_layers(dsn, options, do_count) :
  GDAL Error 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode

@tim-salabim
Copy link
Member

Hi @edzer not quite related to this issue, but is there a reason that srsinfo is not supported in gdal_utils?
Use-case here

@edzer
Copy link
Member

edzer commented Dec 10, 2019

Yes, it is not part of the GDAL C API for GDAL utilities.

@tim-salabim
Copy link
Member

Ah I see, thanks for clarifying!

@agila5
Copy link
Contributor Author

agila5 commented Dec 10, 2019

I just tested that code and it doesn't work on my laptop since I get the following error:

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

> gdal_utils("vectortranslate", "greater-london-latest.osm.pbf", "greater-london-latest.gpkg")
#> Error in gdal_utils("vectortranslate", "greater-london-latest.osm.pbf",  : 
#> gdal_utils vectortranslate: an error occured
#> In addition: Warning messages:
#> 1: In CPL_gdalvectortranslate(source, destination, options) :
#> GDAL Message 6: Normalized/laundered field name: 'admin_level' to 'admin_le_3'
#> 2: In CPL_gdalvectortranslate(source, destination, options) :
#> GDAL Error 6: Geometry type of `Geometry Collection' not supported in shapefiles.  Type can be #> overridden with a layer creation option of SHPT=POINT/ARC/POLYGON/MULTIPOINT/POINTZ/ARCZ/POLYGONZ/MULTIPOINTZ/MULTIPATCH.
Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language EN                          
#>  collate  English_United Kingdom.1252 
#>  ctype    English_United Kingdom.1252 
#>  tz       Europe/Berlin               
#>  date     2019-12-10                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                       
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)               
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)               
#>  callr         3.4.0   2019-12-09 [1] CRAN (R 3.6.1)               
#>  class         7.3-15  2019-01-01 [2] CRAN (R 3.6.1)               
#>  classInt      0.4-2   2019-10-17 [1] CRAN (R 3.6.1)               
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)               
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)               
#>  DBI           1.0.0   2018-05-02 [1] CRAN (R 3.6.0)               
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)               
#>  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)               
#>  digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.1)               
#>  e1071         1.7-3   2019-11-26 [1] CRAN (R 3.6.1)               
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)               
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)               
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)               
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)               
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)               
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)               
#>  KernSmooth    2.23-15 2015-06-29 [2] CRAN (R 3.6.1)               
#>  knitr         1.26    2019-11-12 [1] CRAN (R 3.6.1)               
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)               
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)               
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)               
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)               
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)               
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.0)               
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)               
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)               
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)               
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)               
#>  rlang         0.4.2   2019-11-23 [1] CRAN (R 3.6.1)               
#>  rmarkdown     1.18    2019-11-27 [1] CRAN (R 3.6.1)               
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)               
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)               
#>  sf          * 0.8-1   2019-12-10 [1] Github (r-spatial/sf@60208b7)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)               
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)               
#>  testthat      2.3.1   2019-12-01 [1] CRAN (R 3.6.1)               
#>  units         0.6-5   2019-10-08 [1] CRAN (R 3.6.1)               
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)               
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)               
#>  xfun          0.11    2019-11-12 [1] CRAN (R 3.6.1)               
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)               
#> 
#> [1] C:/Users/Utente/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library

so I think that it's a problem related to my GDAL version. I followed the instructions here for updating the GDAL distribution on windows using Conda and I completed all the steps but I still get the same "old" GDAL version linked to sf package. Sorry for the stupid question, but how can I update that?

@Robinlovelace
Copy link
Contributor

Switch to Linux ; )

@edzer
Copy link
Member

edzer commented Dec 10, 2019

Support for auto-recognizing target driver from file name extension (i.e. .gpkg means geopackage) was apparently added to GDAL after 2.2.3; you'd have to specify the output format, which you can do. Anyway, this is getting of topic, so closing now.

@edzer edzer closed this as completed Dec 10, 2019
@agila5
Copy link
Contributor Author

agila5 commented Dec 10, 2019

Thanks, and just for future reference, I solved that problem as follows:

gdal_utils(
  util = "vectortranslate", 
  source = "greater-london-latest.osm.pbf", 
  destination = "greater-london-latest.gpkg", 
  options = c("-f", "GPKG")
)

If you want I could create a PR and add this example to gdal_utils help page. I've never used GDAL (so maybe it's an obvious example) but it took me a lot of time to understand that options = "-f GPKG" was not the correct syntax.

@edzer
Copy link
Member

edzer commented Dec 10, 2019

PR welcome!

@Robinlovelace
Copy link
Contributor

I'm not sure the original issue is solved. @edzer do you think the best solution in this situation, building on the reprex above, is to first convert the object and then read it in, e.g.:

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
download.file("http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", "greater-london-latest.osm.pbf", mode = "wb")

gdal_utils(
  util = "vectortranslate", 
  source = "greater-london-latest.osm.pbf", 
  destination = "greater-london-latest.gpkg", 
  options = c("-f", "GPKG")
)

greater_london_gpkg = read_sf("greater-london-latest.gpkg", layer = "lines")
nrow(greater_london_gpkg)
#> [1] 352158

Created on 2019-12-10 by the reprex package (v0.3.0)

I imagine there could be a faster way, e.g. selecting the layer in the translate stage, but not sure.

@Robinlovelace
Copy link
Contributor

Out of interest I did a benchmark. An interesting question is 'at what point is interleaving/translate needed? For context, @agila5 and I are working on a package that aims to make access to large OSM datasets quicker and easier for R users (comments/suggestions on this welcome): https://github.com/ITSLeeds/geofabric

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
setwd("~/atfutures/who-data/")

get_highways1 = function(x, highway_type) {
  x_gpkg = paste0(x, ".gpkg")
  gdal_utils(
    util = "vectortranslate", 
    source = x, 
    destination = x_gpkg, 
    options = c("-f", "GPKG")
  )
  full_res = read_sf(x_gpkg, layer = "lines")
  full_res[full_res$highway == highway_type, ]
}

get_highways2 = function(x, highway_type) {
  q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
  read_sf(x, layer = "lines", query = q)
}

h1 = get_highways1("greater-london-latest.osm.pbf", "primary")
h2 = get_highways2("greater-london-latest.osm.pbf", "primary")

plot(st_geometry(h1))
plot(st_geometry(h2))

bench::mark(iterations = 1, check = F,
            get_highways1("greater-london-latest.osm.pbf", "primary"),
            get_highways2("greater-london-latest.osm.pbf", "primary")
            )
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                                  min median
#>   <bch:expr>                                                <bch> <bch:>
#> 1 get_highways1("greater-london-latest.osm.pbf", "primary") 30.1s  30.1s
#> 2 get_highways2("greater-london-latest.osm.pbf", "primary")  2.7s   2.7s
#> # … with 3 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2019-12-11 by the reprex package (v0.3.0)

@agila5
Copy link
Contributor Author

agila5 commented Dec 11, 2019

I'm not sure the original issue is solved.

IMO the issue is not solved from the geofabric package pov and for pbf much bigger than my original example. For example, the PBF file of the region where I live is here and it's almost 400MB. The corresponding GPKG file is 2.2GB and the conversion time is more than 15 minutes.

> system.time({
+   gdal_utils(
+     util = "vectortranslate", 
+     source = "nord-ovest-latest.osm.pbf", 
+     destination = "nord-ovest-latest.gpkg", 
+     options = c("-f", "GPKG")
+   )
+ })
   user  system elapsed 
 964.39  425.41 1464.15 
> file.size("nord-ovest-latest.gpkg") / 1e9
[1] 2.404442

Still I'm not sure if it's possible to read the PBF file without the conversion e.g. by properly setting the INTERLEAVED_READING option (and check the timings) since I don't know how to do that.

An interesting question is 'at what point is interleaving/translate needed?

I read the GDAL documentation of the PBF format but they did not mention anything about the "maximum size" (and I'm not sure it does make sense to talk about a max size). They explain the problem in the interleaved reading section.

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Dec 12, 2019

Interesting, does that section mean that OGR_INTERLEAVED_READING=YES is outdated on recent versions of GDAL?

Starting with GDAL 2.2, applications should use the GDALDataset::GetNextFeature() API to iterate over features in the order they are produced.

@edzer this issue does not seem resolved and am not sure why it was closed, other than it briefly going off topic.

@edzer
Copy link
Member

edzer commented Dec 12, 2019

It seems OGR_... is, but INTERLEAVED_READING=YES is not outdated:

edzer@gin-edzer:/tmp$ ogrinfo -oo INTERLEAVED_READING=YES *pbf points > x2
edzer@gin-edzer:/tmp$ ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1
Warning 6: driver OSM does not support open option OGR_INTERLEAVED_READING
ERROR 1: Too many features have accumulated in lines layer. Use OGR_INTERLEAVED_READING=YES mode

but that the effect is the same (x1 and x2 are identical). My idea behind closing this was that I felt it was not an sf issue, but if anything, a GDAL issue. What do you think?

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Dec 12, 2019

I think it may be a GDAL issue and was considering raising an issue. If you're up for mentoring us through the process and help us develop a decent reprex (e.g. based on the rich examples above) I would happily submit one. Note, thought, that I've commented on a closed related issue here: OSGeo/gdal#1785

From what I can gather a good starting point for a GDAL reprex could be something like:

wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf
ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1

Although I must confess I'm not 100% sure what the second line does and suspect we may get a better response if you raise the issue on our (and sf's) behalf.

@edzer
Copy link
Member

edzer commented Dec 12, 2019

I think the warning & error message sufficiently illustrate the problem, and that the first command

ogrinfo -oo INTERLEAVED_READING=YES *pbf points > x2

suggests the docs are correct, but that the error message gives the wrong hint.

@Robinlovelace
Copy link
Contributor

OK, shall I open a ticket on the GitHub Issue tracker?

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Dec 12, 2019

And will the solution to that issue resolve this issue sf side (or do we just need to add some incantation like options = c("INTERLEAVED_READING", "YES") (I've not tested this)?

@edzer
Copy link
Member

edzer commented Dec 12, 2019

And will the solution to that issue resolve this issue sf side (or do we just need to add some incantation like options = c("INTERLEAVED_READING", "YES") (I've not tested this)?

What makes you think so?

@Robinlovelace
Copy link
Contributor

Tested it and no it doesn't work:

greater_london <- st_read(
+   "greater-london-latest.osm.pbf",
+   layer = "lines", 
+   stringsAsFactors = FALSE,
+   options = c("INTERLEAVED_READING", "YES")
+ )
options:        INTERLEAVED_READING YES 
Reading layer `lines' from data source `/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/atfutures/creds2/od-data/greater-london-latest.osm.pbf' using driver `OSM'
Simple feature collection with 0 features and 9 fields
bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Message 6: open option 'INTERLEAVED_READING' is not formatted with the key=value format
2: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Message 6: open option 'YES' is not formatted with the key=value format

@Robinlovelace
Copy link
Contributor

This one did though, issue solved I think 🎉

greater_london <- st_read(
+   "greater-london-latest.osm.pbf",
+   layer = "lines", 
+   stringsAsFactors = FALSE,
+   options = "INTERLEAVED_READING=YES")
options:        INTERLEAVED_READING=YES 
Reading layer `lines' from data source `/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/atfutures/creds2/od-data/greater-london-latest.osm.pbf' using driver `OSM'
Simple feature collection with 0 features and 9 fields
bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs

Can you reproduce this @agila5 ?

@agila5
Copy link
Contributor Author

agila5 commented Dec 12, 2019

I think it's the same example as in my first comment and the problem is that with that option it reads 0 features. It doesn't make sense, right?

@edzer
Copy link
Member

edzer commented Dec 12, 2019

I agree, but I get the same when I try

edzer@gin-edzer:/tmp$ ogrinfo -oo INTERLEAVED_READING=YES *pbf lines > l2
edzer@gin-edzer:/tmp$ ogrinfo *pbf lines > l1
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode
edzer@gin-edzer:/tmp$ wc l1 l2
  470818  1744967 19993830 l1
      30       74      831 l2
  470848  1745041 19994661 total

suggesting it is a GDAL issue, not an sf issue.

@Robinlovelace
Copy link
Contributor

Correct, it returns 0 features, sorry for going round in circles. Looks like a GDAL issue that could solve this issue when closed (it may be worth keeping this issue open as an issue that requires changes in GDAL to be closed). Final question: how to I demonstrate that

ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1

yields 0 features?

@edzer
Copy link
Member

edzer commented Dec 12, 2019

Open x1 in a text editor?

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Dec 12, 2019

Please proof-read this issue:

Dear GDAL dev team, I think there is an issue with the osm driver. The error message it produces does not seem to be useful (it says Use OGR_INTERLEAVED_READING=YES when that option is specified) and when using INTERLEAVED_READING=YES, which I think is correctly documented here https://gdal.org/drivers/vector/osm.html#interleaved-reading , the results seem to have no features.

Please could you try to reproduce the (hopefully reproducible) example below?

# Aim: demonstrate issues with the OSM driver

gdalinfo --version
# $ GDAL 2.4.2, released 2019/06/28

# get data
wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf

ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf lines > x2
# $ Warning 6: driver OSM does not support open option OGR_INTERLEAVED_READING
# $ ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode
ogrinfo -oo INTERLEAVED_READING=YES *pbf lines > x3
wc x3
# $  27  65 740 x3 # contains no features

This may be related to OSGeo/gdal#1785.

See here for use case and some additional tests from within R: #1213

@agila5
Copy link
Contributor Author

agila5 commented Dec 12, 2019

(If you post that comment as a github issue I would also cite the other related github issue, i.e. OSGeo/gdal#1785, maybe helps with the context.)

@Robinlovelace
Copy link
Contributor

Thanks @agila5, I've added the link to that closed issue. @edzer good to go from your perspective?

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Dec 14, 2019

I think we have a solution. In OSGeo/gdal#2100 Even Rouault suggests converting only the layer we need to a GeoPackage before reading. I was sceptical about this 2 stage process but it seems it is indeed the quickest way, as benchmarked below (with conversion only needing to run once, hence 2 get3() calls):

get1 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
  x_gpkg = paste0(x, ".gpkg")
  gdal_utils(
    util = "vectortranslate",
    source = x,
    destination = x_gpkg,
    options = c("-f", "GPKG")
  )
  full_res = read_sf(x_gpkg, layer = "lines")
  full_res[full_res$highway == highway_type, ]
}

get2 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
  q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
  read_sf(x, layer = "lines", query = q)
}

get3 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
  q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
  dest_filename = paste0(x, ".gpkg")
  if(!file.exists(dest_filename)) {
    sf::gdal_utils(
      util = "vectortranslate",
      source = "greater-london-latest.osm.pbf",
      destination = dest_filename,
      options = c("-f", "GPKG", "lines")
    )
  }
  st_read(dest_filename, query = q)
}

h1 = get1()
h2 = get2()
h3 = get3()
nrow(h1)
nrow(h2)
nrow(h3)

file.remove("greater-london-latest.osm.pbf.gpkg")

bench::mark(iterations = 1, check = FALSE, get1(), get2(), get3(), get3())

Could you try to reproduce this @agila5 to check it works on Windows? If so I suggest we go with this solution in geofabric.

@agila5
Copy link
Contributor Author

agila5 commented Dec 14, 2019

I get slighlty different results between h1 and h2 and h3 because full_res[full_res$highway == highway_type, ] includes only the rows in which highway is NA. In fact,

> table(h1$highway, useNA = "ifany")
residential        <NA> 
      70406       74234 

Another thing that I noticed is that the geometry column in h2 isn't called geom or geometry

> colnames(h2)
 [1] "osm_id"         "name"           "highway"        "waterway"       "aerialway"     
 [6] "barrier"        "man_made"       "z_order"        "other_tags"     "_ogr_geometry_"

If I correct those "problems", then I get:

> bench::mark(iterations = 10, get1(), get2(), get3())
# A tibble: 3 x 13
  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 get1()      1.61m  1.67m    0.0100   193.4MB   0.0631    10    63      16.6m <tibb~ <df[,~ <bch~
2 get2()     20.96s 21.38s    0.0463    26.5MB   0.0370    10     8       3.6m <tibb~ <df[,~ <bch~
3 get3()      1.45s   1.5s    0.658     26.5MB   0.592     10     9      15.2s <tibb~ <df[,~ <bch~
# ... with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Anyway I will add more tests and I think that now we can move the discussion to geofabrik repo.

@Robinlovelace
Copy link
Contributor

Great, all that is pointing towards the 'h3' solution for me, main thing, it works (touch wood)! Great thread, good idea to move it back to geofabric repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants