Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st_buffer very slow for complex linestrings #2039

Closed
jniedballa opened this issue Nov 12, 2022 · 10 comments
Closed

st_buffer very slow for complex linestrings #2039

jniedballa opened this issue Nov 12, 2022 · 10 comments

Comments

@jniedballa
Copy link

st_buffer() can be very slow with complex linestrings, e.g. from GPS tracklogs. Especially clusters of points (e.g. when a GPS device didn't move but kept recording points, as shown below) can take very long to process:

image

In the track shown above st_buffer() took 1250 seconds, but in QGIS gdal:buffervectors took about 50 seconds.

Is there any way to improve the performance of st_buffer() in such clustered points?

As a reproducible example, below is a function for creating random linestrings, either linear or clustered, with adjustable number of points, and some benchmark results to illustrate the differences between linear and clustered linestrings.

library(sf)
library(ggplot2)


simulate_linestring <- function(n_pts, type, dist = 1) {
  if(type == "cluster") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[(n_pts/2) : n_pts,1] <- coords_tmp[(n_pts/2) : n_pts,1] + 10
  }
  
  if(type == "linear") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[,1] <- coords_tmp[,1] + seq(1, length.out = nrow(coords_tmp), by = 1)
  }
  
  linestr <- st_linestring(x = coords_tmp, dim = "XY")
  
  timing <- system.time(
    linestr_buff <- st_buffer(linestr, dist = dist)
  )
  
  timing_df <- cbind(as.data.frame(t(c(timing, 
                                       n_pts = n_pts))),
                     type = type)
  
  return(list(linestring = linestr,
              buffer = linestr_buff,
              timing = timing_df))
}

Simulate linetrings with varying number of points:

number_points <- c(10, 100, 500, 1000, 2000)
out_lin <- lapply(number_points, 
               simulate_linestring,
               type = "linear")

out_clu <- lapply(number_points, 
                  simulate_linestring,
                  type = "cluster")

names(out_lin) <- names(out_clu) <- number_points

Difference between linear and clustered linestrings (both with their buffer):

par(mfrow = c(1,2))
plot(out_lin$`500`$buffer, col = "grey", main = "linear")
plot(out_lin$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

plot(out_clu$`500`$buffer, col = "grey", main = "cluster")
plot(out_clu$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

image

st_buffer takes much longer to process the clustered points (~300x longer for 2000 vertices):

df_timings_lin <- do.call("rbind", lapply(out_lin, FUN = function(x) x$timing))
df_timings_clu <- do.call("rbind", lapply(out_clu, FUN = function(x) x$timing))

df_timings <- rbind(df_timings_lin, df_timings_clu)

ggplot(df_timings, aes(x = n_pts, y = elapsed, color = type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  xlab ("Number of vertices") + ylab("Time elapsed [s]")

image

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.0 sf_1.0-8     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         rstudioapi_0.14    magrittr_2.0.3     units_0.8-0        munsell_0.5.0      tidyselect_1.2.0  
 [7] colorspace_2.0-3   R6_2.5.1           rlang_1.0.6        fansi_1.0.3        dplyr_1.0.10       tools_4.2.1       
[13] grid_4.2.1         gtable_0.3.1       KernSmooth_2.23-20 utf8_1.2.2         cli_3.4.1          e1071_1.7-12      
[19] DBI_1.1.3          withr_2.5.0        class_7.3-20       assertthat_0.2.1   tibble_3.1.8       lifecycle_1.0.3   
[25] farver_2.1.1       vctrs_0.5.0        glue_1.6.2         labeling_0.4.2     proxy_0.4-27       compiler_4.2.1    
[31] pillar_1.8.1       scales_1.2.1       generics_0.1.3     classInt_0.4-8     pkgconfig_2.0.3 
@edzer
Copy link
Member

edzer commented Nov 12, 2022

Buffering happens in the upstream libraries, GEOS or s2geometry, see for instance here, for performance issues with these libraries this is the wrong place. What you could do in sf is for instance simplifying the line geometry, and buffering that.

@kadyb
Copy link
Contributor

kadyb commented Nov 12, 2022

If you are using a planar coordinate system, you can try geos directly -- it's a bit faster. You can also limit the number of segments (nQuadSegs, quad_segs), but it doesn't make a significant difference. As Edzer wrote, it's probably best to simplify the geometry.

library("sf")
library("geos")
n = 1000
x = simulate_linestring(n, "cluster") # this returns `linestr`

t = bench::mark(
  check = FALSE,
  st_buffer(x, dist = 1, nQuadSegs = 30),
  st_buffer(x, dist = 1, nQuadSegs = 5),
  geos_buffer(x, distance = 1)
)
t[, 1:4]
#>   expression                                  min   median `itr/sec`
#> 1 st_buffer(x, dist = 1, nQuadSegs = 30)    3.77s    3.77s     0.265
#> 2 st_buffer(x, dist = 1, nQuadSegs = 5)     3.62s    3.62s     0.276
#> 3 geos_buffer(x, distance = 1)              2.75s    2.75s     0.364

@jniedballa
Copy link
Author

Thank you both, and apologies for the out-of-scope question. Good to know about geos_buffer, that's helpful. st_simplify unfortunately doesn't seem to make a big difference (in terms of processing time of subsequent calls to st_buffer) on these point/line clusters.

For what it's worth, calling the GDAL buffervectors tool from R via the OSGEO4W shell with something like the code below is fast (even with the original data) and may be useful in some cases:

system("C:/OSGeo4W/OSGeo4W.bat qgis_process-qgis run gdal:buffervectors --distance_units=meters --area_units=m2 --ellipsoid=EPSG:7030 --INPUT=C:/path/to/file/input.shp --GEOMETRY=geometry --DISTANCE=20 --FIELD= --DISSOLVE=false --EXPLODE_COLLECTIONS=false --OPTIONS= --OUTPUT=C:/path/to/file/output.shp")

@kadyb
Copy link
Contributor

kadyb commented Nov 14, 2022

st_simplify unfortunately doesn't seem to make a big difference

By removing 40% of vertices I see ~3x speedup on this example and the results look very similar.

n = 2000
set.seed(123)
x = simulate_linestring(n, "cluster")
system.time(t1 <- st_buffer(x, dist = 1))
#>  user  system elapsed 
#> 23.26    0.81   24.12

x = st_simplify(x, , dTolerance = 1) # 1177 vertices
system.time(t2 <- st_buffer(x, dist = 1))
#> user  system elapsed 
#> 8.64    0.16    8.83

par(mfrow = c(1, 2))
plot(t1, main = "Original data")
plot(t2, main = "Simplified data")

Rplot

@jniedballa
Copy link
Author

Thank you, this is good. Using the attached GPS tracklog, I see no improvement in processing time of st_buffer when simplifying with dTolerance = 0 (despite a 33% reduction in vertices). dTolerance = 1 and 5 (in meters) however bring massive improvements. geos_buffer is also much faster then st_buffer, even without simplification.

linestr <- st_read(dsn = "...",    
                   layer = "gps_tracklog_with_clusters")  # 5594 vertices

linestr_simp0 <- st_simplify(linestr, dTolerance = 0)  # 3780 vertices
linestr_simp1 <- st_simplify(linestr, dTolerance = 1)  # 1623 vertices 
linestr_simp5 <- st_simplify(linestr, dTolerance = 5)  # 433 vertices

system.time(
st_buffer(linestr,  dist = 20)
)
       User      System elapsed
    1201.70        1.00     1204.06 

system.time(
st_buffer(linestr_simp0, dist = 20)
)
       User      System elapsed
    1202.75        0.47     1204.00 

system.time(
 st_buffer(linestr_simp1, dist = 20)
)
       User      System elapsed
       3.97        0.05        4.01 

system.time(
st_buffer(linestr_simp5, dist = 20)
)
       User      System elapsed
       0.50        0.06        0.56 

system.time(
 geos_buffer(linestr, dist = 20)
)
       User      System elapsed
      18.64        0.71       19.36 

The results all look very similar, even the buffer around the highly simplified track is close to the others.

So st_simplify + st_buffer and geos_buffer both are viable solutions for buffering heavily clustered linestrings, and much faster than st_buffer on the original data. Thank you very much for your help, problem solved!

gps_tracklog_with_clusters.zip

@kadyb
Copy link
Contributor

kadyb commented Nov 14, 2022

system.time(sf::st_buffer(linestr,  dist = 20))
#>    User  System  elapsed
#> 1201.70    1.00  1204.06 

system.time(geos::geos_buffer(linestr,  dist = 20))
#>  User  System  elapsed
#> 18.64    0.71    19.36

Some performance differences between {sf} and {geos} are expected but it is very strange why there is such difference (~60 times on this dataset). But it is good that you managed to solve this issue.

@edzer
Copy link
Member

edzer commented Nov 14, 2022

Are the results identical?

@kadyb
Copy link
Contributor

kadyb commented Nov 14, 2022

The results are identical, but on my PC (I use Windows too) the timings are similar. Strange.

linestr = sf::read_sf("gps_tracklog_with_clusters.shp")
system.time(t1 <- sf::st_buffer(linestr, dist = 20))
#>  user  system elapsed 
#> 21.00    0.88   21.89

system.time(t2 <- geos::geos_buffer(linestr,  distance = 20))
#>  user  system elapsed 
#> 15.73    0.56   16.30

identical(sf::st_as_sfc(t1), sf::st_as_sfc(t2))
#> TRUE

sf::sf_extSoftVersion()
#> GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ
#> "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1"

@jniedballa
Copy link
Author

A fresh installation of sf fixed it by updating the external libraries from

> sf::sf_extSoftVersion()   # sf v1.0-8
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.1"        "3.4.3"        "7.2.1"         "true"         "true"        "7.2.1" 

to

> sf::sf_extSoftVersion()    # sf v1.0-9
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1" 

Now the timing is similar to geos_buffer:

system.time(
st_buffer(linestr, dist = 20)
)
       User      System elapsed
      23.70        1.35       25.08

So it was something in the external libraries. Thank you very much!

@dr-jts
Copy link

dr-jts commented Feb 3, 2023

Highly complex linestrings do cause the standard buffer algorithm to be slow, due to the number of buffer line segments that are generated and processed internally.

An approach which can be much faster is to split the line into sections (say, 10 vertices long), buffer the sections, and then union the results. This can be tens of times faster.

Ideally this heuristic improvement could be provided automatically by the buffer code, but it's hard to detect when it should be applied. It could easily be supplied as a separate buffer function, for use at user's discretion (see JTS prototype implementation).

@edzer edzer closed this as completed Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants