st_buffer very slow for complex linestrings #2039

jniedballa · 2022-11-12T19:31:44Z

st_buffer() can be very slow with complex linestrings, e.g. from GPS tracklogs. Especially clusters of points (e.g. when a GPS device didn't move but kept recording points, as shown below) can take very long to process:

In the track shown above st_buffer() took 1250 seconds, but in QGIS gdal:buffervectors took about 50 seconds.

Is there any way to improve the performance of st_buffer() in such clustered points?

As a reproducible example, below is a function for creating random linestrings, either linear or clustered, with adjustable number of points, and some benchmark results to illustrate the differences between linear and clustered linestrings.

library(sf)
library(ggplot2)


simulate_linestring <- function(n_pts, type, dist = 1) {
  if(type == "cluster") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[(n_pts/2) : n_pts,1] <- coords_tmp[(n_pts/2) : n_pts,1] + 10
  }
  
  if(type == "linear") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[,1] <- coords_tmp[,1] + seq(1, length.out = nrow(coords_tmp), by = 1)
  }
  
  linestr <- st_linestring(x = coords_tmp, dim = "XY")
  
  timing <- system.time(
    linestr_buff <- st_buffer(linestr, dist = dist)
  )
  
  timing_df <- cbind(as.data.frame(t(c(timing, 
                                       n_pts = n_pts))),
                     type = type)
  
  return(list(linestring = linestr,
              buffer = linestr_buff,
              timing = timing_df))
}

Simulate linetrings with varying number of points:

number_points <- c(10, 100, 500, 1000, 2000)
out_lin <- lapply(number_points, 
               simulate_linestring,
               type = "linear")

out_clu <- lapply(number_points, 
                  simulate_linestring,
                  type = "cluster")

names(out_lin) <- names(out_clu) <- number_points

Difference between linear and clustered linestrings (both with their buffer):

par(mfrow = c(1,2))
plot(out_lin$`500`$buffer, col = "grey", main = "linear")
plot(out_lin$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

plot(out_clu$`500`$buffer, col = "grey", main = "cluster")
plot(out_clu$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

st_buffer takes much longer to process the clustered points (~300x longer for 2000 vertices):

df_timings_lin <- do.call("rbind", lapply(out_lin, FUN = function(x) x$timing))
df_timings_clu <- do.call("rbind", lapply(out_clu, FUN = function(x) x$timing))

df_timings <- rbind(df_timings_lin, df_timings_clu)

ggplot(df_timings, aes(x = n_pts, y = elapsed, color = type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  xlab ("Number of vertices") + ylab("Time elapsed [s]")

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.0 sf_1.0-8     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         rstudioapi_0.14    magrittr_2.0.3     units_0.8-0        munsell_0.5.0      tidyselect_1.2.0  
 [7] colorspace_2.0-3   R6_2.5.1           rlang_1.0.6        fansi_1.0.3        dplyr_1.0.10       tools_4.2.1       
[13] grid_4.2.1         gtable_0.3.1       KernSmooth_2.23-20 utf8_1.2.2         cli_3.4.1          e1071_1.7-12      
[19] DBI_1.1.3          withr_2.5.0        class_7.3-20       assertthat_0.2.1   tibble_3.1.8       lifecycle_1.0.3   
[25] farver_2.1.1       vctrs_0.5.0        glue_1.6.2         labeling_0.4.2     proxy_0.4-27       compiler_4.2.1    
[31] pillar_1.8.1       scales_1.2.1       generics_0.1.3     classInt_0.4-8     pkgconfig_2.0.3

The text was updated successfully, but these errors were encountered:

edzer · 2022-11-12T20:23:23Z

Buffering happens in the upstream libraries, GEOS or s2geometry, see for instance here, for performance issues with these libraries this is the wrong place. What you could do in sf is for instance simplifying the line geometry, and buffering that.

kadyb · 2022-11-12T21:24:38Z

If you are using a planar coordinate system, you can try geos directly -- it's a bit faster. You can also limit the number of segments (nQuadSegs, quad_segs), but it doesn't make a significant difference. As Edzer wrote, it's probably best to simplify the geometry.

library("sf")
library("geos")
n = 1000
x = simulate_linestring(n, "cluster") # this returns `linestr`

t = bench::mark(
  check = FALSE,
  st_buffer(x, dist = 1, nQuadSegs = 30),
  st_buffer(x, dist = 1, nQuadSegs = 5),
  geos_buffer(x, distance = 1)
)
t[, 1:4]
#>   expression                                  min   median `itr/sec`
#> 1 st_buffer(x, dist = 1, nQuadSegs = 30)    3.77s    3.77s     0.265
#> 2 st_buffer(x, dist = 1, nQuadSegs = 5)     3.62s    3.62s     0.276
#> 3 geos_buffer(x, distance = 1)              2.75s    2.75s     0.364

jniedballa · 2022-11-14T09:02:34Z

Thank you both, and apologies for the out-of-scope question. Good to know about geos_buffer, that's helpful. st_simplify unfortunately doesn't seem to make a big difference (in terms of processing time of subsequent calls to st_buffer) on these point/line clusters.

For what it's worth, calling the GDAL buffervectors tool from R via the OSGEO4W shell with something like the code below is fast (even with the original data) and may be useful in some cases:

system("C:/OSGeo4W/OSGeo4W.bat qgis_process-qgis run gdal:buffervectors --distance_units=meters --area_units=m2 --ellipsoid=EPSG:7030 --INPUT=C:/path/to/file/input.shp --GEOMETRY=geometry --DISTANCE=20 --FIELD= --DISSOLVE=false --EXPLODE_COLLECTIONS=false --OPTIONS= --OUTPUT=C:/path/to/file/output.shp")

kadyb · 2022-11-14T10:35:21Z

st_simplify unfortunately doesn't seem to make a big difference

By removing 40% of vertices I see ~3x speedup on this example and the results look very similar.

n = 2000
set.seed(123)
x = simulate_linestring(n, "cluster")
system.time(t1 <- st_buffer(x, dist = 1))
#>  user  system elapsed 
#> 23.26    0.81   24.12

x = st_simplify(x, , dTolerance = 1) # 1177 vertices
system.time(t2 <- st_buffer(x, dist = 1))
#> user  system elapsed 
#> 8.64    0.16    8.83

par(mfrow = c(1, 2))
plot(t1, main = "Original data")
plot(t2, main = "Simplified data")

jniedballa · 2022-11-14T17:36:22Z

Thank you, this is good. Using the attached GPS tracklog, I see no improvement in processing time of st_buffer when simplifying with dTolerance = 0 (despite a 33% reduction in vertices). dTolerance = 1 and 5 (in meters) however bring massive improvements. geos_buffer is also much faster then st_buffer, even without simplification.

linestr <- st_read(dsn = "...",    
                   layer = "gps_tracklog_with_clusters")  # 5594 vertices

linestr_simp0 <- st_simplify(linestr, dTolerance = 0)  # 3780 vertices
linestr_simp1 <- st_simplify(linestr, dTolerance = 1)  # 1623 vertices 
linestr_simp5 <- st_simplify(linestr, dTolerance = 5)  # 433 vertices

system.time(
st_buffer(linestr,  dist = 20)
)
       User      System elapsed
    1201.70        1.00     1204.06 

system.time(
st_buffer(linestr_simp0, dist = 20)
)
       User      System elapsed
    1202.75        0.47     1204.00 

system.time(
 st_buffer(linestr_simp1, dist = 20)
)
       User      System elapsed
       3.97        0.05        4.01 

system.time(
st_buffer(linestr_simp5, dist = 20)
)
       User      System elapsed
       0.50        0.06        0.56 

system.time(
 geos_buffer(linestr, dist = 20)
)
       User      System elapsed
      18.64        0.71       19.36

The results all look very similar, even the buffer around the highly simplified track is close to the others.

So st_simplify + st_buffer and geos_buffer both are viable solutions for buffering heavily clustered linestrings, and much faster than st_buffer on the original data. Thank you very much for your help, problem solved!

gps_tracklog_with_clusters.zip

kadyb · 2022-11-14T19:58:43Z

system.time(sf::st_buffer(linestr,  dist = 20))
#>    User  System  elapsed
#> 1201.70    1.00  1204.06 

system.time(geos::geos_buffer(linestr,  dist = 20))
#>  User  System  elapsed
#> 18.64    0.71    19.36

Some performance differences between {sf} and {geos} are expected but it is very strange why there is such difference (~60 times on this dataset). But it is good that you managed to solve this issue.

edzer · 2022-11-14T20:58:43Z

Are the results identical?

kadyb · 2022-11-14T21:52:35Z

The results are identical, but on my PC (I use Windows too) the timings are similar. Strange.

linestr = sf::read_sf("gps_tracklog_with_clusters.shp")
system.time(t1 <- sf::st_buffer(linestr, dist = 20))
#>  user  system elapsed 
#> 21.00    0.88   21.89

system.time(t2 <- geos::geos_buffer(linestr,  distance = 20))
#>  user  system elapsed 
#> 15.73    0.56   16.30

identical(sf::st_as_sfc(t1), sf::st_as_sfc(t2))
#> TRUE

sf::sf_extSoftVersion()
#> GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ
#> "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1"

jniedballa · 2022-11-15T06:39:42Z

A fresh installation of sf fixed it by updating the external libraries from

> sf::sf_extSoftVersion()   # sf v1.0-8
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.1"        "3.4.3"        "7.2.1"         "true"         "true"        "7.2.1"

to

> sf::sf_extSoftVersion()    # sf v1.0-9
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1"

Now the timing is similar to geos_buffer:

system.time(
st_buffer(linestr, dist = 20)
)
       User      System elapsed
      23.70        1.35       25.08

So it was something in the external libraries. Thank you very much!

dr-jts · 2023-02-03T18:41:39Z

Highly complex linestrings do cause the standard buffer algorithm to be slow, due to the number of buffer line segments that are generated and processed internally.

An approach which can be much faster is to split the line into sections (say, 10 vertices long), buffer the sections, and then union the results. This can be tens of times faster.

Ideally this heuristic improvement could be provided automatically by the buffer code, but it's hard to detect when it should be applied. It could easily be supplied as a separate buffer function, for use at user's discretion (see JTS prototype implementation).

dbaston mentioned this issue Dec 10, 2022

HotPixelIndex: Use STRtree instead of KdTree libgeos/geos#769

Open

edzer closed this as completed Mar 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

st_buffer very slow for complex linestrings #2039

st_buffer very slow for complex linestrings #2039

jniedballa commented Nov 12, 2022

edzer commented Nov 12, 2022

kadyb commented Nov 12, 2022 •

edited

Loading

jniedballa commented Nov 14, 2022

kadyb commented Nov 14, 2022

jniedballa commented Nov 14, 2022

kadyb commented Nov 14, 2022

edzer commented Nov 14, 2022

kadyb commented Nov 14, 2022

jniedballa commented Nov 15, 2022

dr-jts commented Feb 3, 2023 •

edited

Loading

st_buffer very slow for complex linestrings #2039

st_buffer very slow for complex linestrings #2039

Comments

jniedballa commented Nov 12, 2022

edzer commented Nov 12, 2022

kadyb commented Nov 12, 2022 • edited Loading

jniedballa commented Nov 14, 2022

kadyb commented Nov 14, 2022

jniedballa commented Nov 14, 2022

kadyb commented Nov 14, 2022

edzer commented Nov 14, 2022

kadyb commented Nov 14, 2022

jniedballa commented Nov 15, 2022

dr-jts commented Feb 3, 2023 • edited Loading

kadyb commented Nov 12, 2022 •

edited

Loading

dr-jts commented Feb 3, 2023 •

edited

Loading