Fix + improve performance for join with shared coords=false #179

theroggy · 2022-08-30T08:01:08Z

Code to find junctions with shared_coords=False was quite slow and according to some tests (both mine + stated in unit tests) didn't seem to give consistent results. Because of the slow performance I opted for a reimplementation rather than trying to fix. The new implementation is based on using spatial overlays (intersection) instead of looping over points.

A test on a 2.5 MB geopackage with (multi)polygons showed the following timings to create the topology:

shared_coords=True: ~10 s
shared_coords=False, original implementation (release version): ~160 s
shared_coords=False, new implementation in this PR: ~ 18 s
There are definitely more opportunities for performance improvements, but the "join" phase seemed to be the main bottleneck.

Regarding the resulting junctions, there were some choices to be made:

"real" junctions: the coordinates where two features touch for a distance > 1 point, are (should be) detected consistently now.
"artificial" junctions: the original javascript topojson implementation also adds all start and end points of linestring objects (not for polygons/rings) as junctions. As they state in the comment in the code linked above this is to simplify some code regarding rotating rings versus splitting later on in the cut step. Because this is a kind of obscure way to force the non-rotating behaviour of lines and because it is easy to implement in a more transparent way in cut I opted not to add these "artificial" junctions. Additionally, artificially increasing the number of junctions is bound to reduce performance in the cut step.
- PS: the original python code didn't seem to be very consistent in adding these end points either: when I added this behaviour, more unit tests showed up "red" than without.

I had to change quite some "asserts" in unit tests, for most of them I tried to deduct if the change was logical based on the input data and this seemed to be the case.

Obviously, feedback very welcome!

references #178

In comments, just for future reference

theroggy · 2022-08-30T08:01:22Z

For reference, the comment copy/pasted from javascript topojson implementation.

// Given an extracted (pre-)topology, identifies all of the junctions. These are
// the points at which arcs (lines or rings) will need to be cut so that each
// arc is represented uniquely.
//
// A junction is a point where at least one arc deviates from another arc going
// through the same point. For example, consider the point B. If there is a arc
// through ABC and another arc through CBA, then B is not a junction because in
// both cases the adjacent point pairs are {A,C}. However, if there is an
// additional arc ABD, then {A,D} != {A,C}, and thus B becomes a junction.
//
// For a closed ring ABCA, the first point A’s adjacent points are the second
// and last point {B,C}. For a line, the first and last point are always
// considered junctions, even if the line is closed; this ensures that a closed
// line is never rotated.

mattijn · 2022-08-30T22:50:05Z

Thanks for this PR. Much appreciated that you took a dive into the code! Also great that you tried a new approach using geopandas and pygeos. Currently these two packages are not hard dependencies and only shapely and numpy are.

But since geopandas is using shapely under the hood and the ufuncs of pygeos are getting integrated within shapely 2.0 and the 2.0a1 version of shapely is recently released I tried to reproduce your approach using shapely (with shapely=2.0a1) and numpy alone.

# return tree as well from `select_unique_combs`
idx_combs, tree = select_unique_combs(data["linestrings"])
# collect geoms from tree (this are views or copies?)
geom_combs = np.asarray([
    [tree.geometries.take(idx[0]), tree.geometries.take(idx[1])] 
    for idx in idx_combs])

# find intersection between linestrings 
# use relate pattern to filter out single point intersections
g1s = geom_combs[:,0]
g2s = geom_combs[:,1]
intersect_1dim = shapely.relate_pattern(g1s, g2s, pattern='1********')
segments = so.intersection_all(geom_combs[intersect_1dim], axis=1)
merged_segments = shapely.line_merge(segments)

# the start and end points of the merged_segments are the junctions
coords, index_group_coords = shapely.get_coordinates(merged_segments, return_index=True)
_, marker_groups = np.unique(index_group_coords, return_index=True)
idx_startend = np.append(marker_groups, marker_groups -1)

# junctions can appear multiple times in multiple segments, remove duplicates
idx_startend_group = np.unique(idx_startend)
junctions = geometry.MultiPoint(coords[idx_startend_group])

# result
print(len(junctions.geoms)),
print(junctions.wkt)

2
MULTIPOINT (522 1108, 530 1100)

Its an incomplete attempt since I was not yet able to profile the approaches, but hopefully this will results in similar timings.
Again, much appreciated! (I did not yet look to the other PR)

theroggy · 2022-08-31T07:24:37Z

Interesting.

I do understand that introducing geopandas as a hard dependency indeed needs conscious thought (which I didn't give it ;-))... and isn't obvious.

However, possibly using it more might help simplifying and speeding up topojson (on longer term)... For several data structures in topojson a GeoDataFrame sounds like a very good match: mainly the (temporary) lists of linestings, junctions and coordinates. Using a GeoDataFrame the index is already there and doesn't need to be recreated for each new lookup and a lot of looping in python can be avoided which reduces boiler-plate code and should be faster because it is moved to looping in a c library (eg. geos).

Also, as far as I know, the dependencies of geopandas have been slimmed down recently. Mainly fiona (and thus gdal) is now an optional dependency... so pandas and geopandas itself would be the extra dependencies.

But, it is also clear to me that is definitely not a must-have dependency in topojson... I might just simplify/speed up some things.

theroggy · 2022-08-31T07:46:19Z

For information: this is the file I used to test the performance.

Probably performance testing with different kinds of data would be a good idea:

one with data with many rows with few points/polygon (eg. cadastral parcels)
one with few rows but many points/polygon (eg. (detailed) country borders)

This file is a bit in between I think: few rows, with an "average" number of points.

topo1778-ferraris_05_309_BEFL-topo-1778-ferraris_raw.zip

mattijn · 2022-08-31T08:08:09Z

Fair enough. No problem either. You could not have known it.

In the beginning I used geopandas and shapely more intensively, but slowly the development has moved to hash tables and numpy array broadcasting.
The latter scale nice, but not always, sometimes simple loops/list comprehensions in combination with an strtree works even better.

I even think it is possible to do it without shapely. Single dependency, numpy, would be a nice achievement.

But now with shapely 2 coming, they are back in game again I think.

Thanks for the test file, let me see if I can do some timings. You are right that depending the file the timings can be very different. The geopandas.datasets.get_path('nybb') is the one that I timed against, which often had quite different timing behavior than 'naturalearth_lowres'.

theroggy · 2022-08-31T10:15:34Z

I even think it is possible to do it without shapely. Single dependency, numpy, would be a nice achievement.

As you stated already, you are not planning to do so, but because I think it is an interesting topic...

I'm not sure that only numpy would be a good idea. Optimisations specific to GIS data are quite important if you want to be able to achieve some efficiëncy and scalability and numpy doesn't offer those as far as I know.

Using a spatial index to find relevant features is one thing, but GIS libraries like geos (and thus shapely/pygeos) also (can) use e.g. spatial indexes within a geometry to speed up processing. To give a simple example: if you have a linestring of 15.000 points (eg. the border of flanders), and you need to loop over all points to check if a point is in the list... this will take a lot more time than an approach where the 15.000 points are spatially indexed...
The same (or even more so) for spatial operations like intersection,... like I used in my alternative implementation: these operations are heavily optimised in the underlying libraries (GEOS) which is the only (at least fastest) way to get decent performance for such things.

The optimisations geopandas offers are rather in use cases where there are a lot of features that need to be treated. Because the geometry data in geopandas is already stored in arrays in a way that can be used directly by geos without having to loop over all rows in python code, which is slow.

mattijn · 2022-08-31T11:27:02Z

I agree with the most, spatial indexes/strtree are very powerful and this package make and will make good use of these.

But not all geos functions are as optimized as you would think, see pygeos/pygeos#73. Up until now, with this PR (again, thank you for reaching out!), the main bottleneck of this package could not be optimized using pygeos.

This is also recently discussed at the geos repo: libgeos/geos#640.

theroggy · 2022-08-31T12:23:32Z

I agree with the most, spatial indexes/strtree are very powerful and this package make and will make good use of these.

But not all geos functions are as optimized as you would think, see pygeos/pygeos#73. Up until now, with this PR (again, thank you for reaching out!), the main bottleneck of this package could not be optimized using pygeos.

This is also recently discussed at the geos repo: libgeos/geos#640.

If I gave the impression any of the packages I mentioned was perfect, I stand corrected ;-), so I agree 100% with your comment.

I wonder how the performance of the sharedpaths function compares to intersection, as they basically do the same thing, so I would expect that "under the hood" they would share the same implementation, or do I miss something?

…formance-of-join-with-shared_coords=False

Based on implementation in #184

This reverts commit 6a7e514.

mattijn · 2022-09-03T04:50:26Z

Since these changes are dependent on shapely 2, which is still in alpha phase, how to approach this?

I can see two options, but open to others:

Trying to see if the core of the logic in this PR can be backported to shapely 1 and include that in this PR.
Setup a topojson 2 branch with a minimal required version of shapely 2 and replace all written logic that was necessary to get a bit more speed with shapely 1.

With the first option we could do an official release where everybody benefits.
But the second option is maybe easier to maintain?

theroggy · 2022-09-03T08:19:48Z

I'd been thinking about that as well... There is quite a lot of activity going on on shapely2, and most work on it has definitely been finished... but still it hard (for me) to judge when the first stable release will arrive...

I was also wondering about the shared_coords=True code path... is there a realworld use case for it or was it only introduced for performance reasons?

mattijn · 2022-09-03T09:19:29Z

In my opinion it's only there for performance reasons. So if the 'path-connected' approach is reasonably fast it can be the only one.

But based on this bullet, without reading much of the code, so I might be wrong, the approach currently introduced in GEOS, seems to be a 'coords-connected' one.
They use a union on the linestrings to get the coverage. Which is interesting as well

theroggy · 2022-09-03T10:14:53Z

I'm looking into avoiding the use of new features in shapely 2... and that doesn't seem to be an issue.

In my opinion it's only there for performance reasons. So if the 'path-connected' approach is reasonably fast it can be the only one.

OK, that would cleanup quite some code + the tests.

But based on this bullet, without reading much of the code, so I might be wrong, the approach currently introduced in GEOS, seems to be a 'coords-connected' one. They use a union on the linestrings to get the coverage. Which is interesting as well

Yes, I was thinking along those lines as well to speedup the cut step... I already gave it a quick try to do a union of the linestrings and the junctions, but this didn't seem to work at first sight. The next step was indeed to try to union the lines directly, which is apparently the approach chosen in geos.

Another option I was thinking about would be to calculate both the intersection and the symmetric difference (= "union" in ESRI speak) and save those in join already so the "cut" step has less lines to split...

There are many ways to Rome :-).

mattijn · 2022-09-03T12:23:27Z

There is a constant available SHAPELY_GE_20 that you can use for the parts that require shapely 2 features, like here: https://github.com/mattijn/topojson/pull/184/files#diff-3f40272a908f4690d8fe0530e35685547cdd07d499dcb45ff94bee078bc5abe2R556.

Before I also have considered to do multiple things in once, but on purpose have done the development step by step to make it a bit easier for testing and tracing.
With the current test-set in place, there are lots of corner cases tracked so if there can be an integration of steps, reducing code while improving speed, then its something to consider for sure.

theroggy · 2022-09-03T12:29:32Z

Performance suffered slightly: ~16 s for the test file we've been using vs ~14 s for the shapely 2 version.

But I personally don't think it is worth maintaining 2 versions for that difference... and I'd wait till it is oficially released before making a dependent version....

Possibly the performance degradation is larger for files with many rows because vectorized operations is the main change in shapely2.

A point of interest I want to raise is that this implementation gives another result for the number of arcs in topo.output['arcs'] on that same test file: 12.864 arcs vs. 13.620 arcs (shapely2 + geopandas version) for 4220 input polygons even though all implementations pass all the tests.

This might also have an impact on performance, one way or the other, because more or less junctions,.... can make a difference.

theroggy · 2022-09-03T12:39:00Z

I wonder, do you have some benchmarking files/code that you use to follow up the evolutions/differences in performance for different types of test files as we discussed before?

If so, it might be usefull to add them to the code base as well.

If not, this is an example on how this could look: https://github.com/geofileops/geofileops/tree/master/benchmark/results

mattijn · 2022-09-03T12:43:55Z

So you have reduced the shapely 1 implementation from 160 secs to 16 seconds?! Impressive!

So the current implementation using shapely 2 has created somehow more junctions. I remember I used linemerge of shapely before, but then implemented something different because of issues, maybe around there🤔?

I'm very open for this type of benchmarking! Before I just add some timings or profiling in the PR, but something normalized would be great.

mattijn · 2022-09-03T13:24:44Z

Upon second thought, I would first check the len() of what is returned from select_unique_combs, maybe some logic is missing there that you had before (eg. .query("index_1 != index_2"), here 4a018ff#diff-d9cc69527fda279cb84b6499a9e0e6499442f9f7b895f00f5046e60bd5c817cbR211)

theroggy · 2022-09-04T01:15:32Z

There was an issue in the extraction of (only) linestrings out of the intersection result in combination with the linemerge.

For info: apparently in shapely1 you need to make sure you only pass lines to linemerge, in shapely2 linemerge does this cleanup.

Timing to treat the test file is now 14.5 seconds...

topojson/ops.py

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>

mattijn · 2022-09-04T09:33:19Z

I think this pr is ready for merging. Agreed?

theroggy · 2022-09-04T09:41:01Z

Inspired by the change/improvement of linemerge in shapely2 I restructured some code so it mimics this behaviour. Potato-potato, but it simplifies the code in join, and looks a bit nicer like this I think?

theroggy · 2022-09-04T09:41:15Z

For me it is ready for merge!

mattijn · 2022-09-04T09:46:57Z

Love this! Thanks a lot. Really impressive work. Beautiful code. Superlatives etc.

theroggy added 5 commits August 29, 2022 22:58

Fix + improve performance of join

4a018ff

Avoid junctions for equal linestrings

3aac47c

Update tests: endpoints aren't junctions anymore

dc27ee2

Add start and end point as junctions of linestrings (commented)

7b73874

In comments, just for future reference

Remove code in comments again

8d5247c

theroggy mentioned this pull request Aug 30, 2022

Add support to rotate rings to avoid split #180

Merged

mattijn mentioned this pull request Sep 1, 2022

WIP based on #179 #184

Closed

theroggy added 2 commits September 2, 2022 13:03

Replace loopy code with vectorized

9023e5b

Update dependencies to fix tests

6a7e514

theroggy marked this pull request as draft September 2, 2022 18:13

theroggy added 3 commits September 2, 2022 21:08

Merge remote-tracking branch 'upstream/master' into Fix-+-improve-per…

2bc8cef

…formance-of-join-with-shared_coords=False

Implementation without geopandas dependency

94564fb

Based on implementation in #184

Revert "Update dependencies to fix tests"

4cbcd2b

This reverts commit 6a7e514.

theroggy marked this pull request as ready for review September 2, 2022 21:51

Clearer variable names

ecd2d82

Avoid dependency on shapely 2

78a0a89

theroggy added 2 commits September 4, 2022 01:31

Improve line extract

305d55d

Finalize shapely1 version

653d521

mattijn reviewed Sep 4, 2022

View reviewed changes

topojson/ops.py Outdated Show resolved Hide resolved

Update topojson/ops.py

59eee29

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>

Simplify code by implementing shapely2-like linemerge_ext

a0b70ed

mattijn merged commit c788d3c into mattijn:master Sep 4, 2022

theroggy deleted the Fix-+-improve-performance-for-cut-with-shared_coords=False branch September 4, 2022 09:48

theroggy mentioned this pull request Sep 4, 2022

shared_coords=True vs shared_coords=False #187

Closed

mattijn mentioned this pull request Sep 4, 2022

change default shared_coords from True to False #190

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix + improve performance for join with shared coords=false #179

Fix + improve performance for join with shared coords=false #179

theroggy commented Aug 30, 2022 •

edited

Loading

theroggy commented Aug 30, 2022

mattijn commented Aug 30, 2022

theroggy commented Aug 31, 2022

theroggy commented Aug 31, 2022 •

edited

Loading

mattijn commented Aug 31, 2022

theroggy commented Aug 31, 2022 •

edited

Loading

mattijn commented Aug 31, 2022

theroggy commented Aug 31, 2022 •

edited

Loading

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022 •

edited

Loading

theroggy commented Sep 3, 2022 •

edited

Loading

mattijn commented Sep 3, 2022 •

edited

Loading

mattijn commented Sep 3, 2022

theroggy commented Sep 4, 2022 •

edited

Loading

mattijn commented Sep 4, 2022

theroggy commented Sep 4, 2022 •

edited

Loading

theroggy commented Sep 4, 2022

mattijn commented Sep 4, 2022

Fix + improve performance for join with shared coords=false #179

Fix + improve performance for join with shared coords=false #179

Conversation

theroggy commented Aug 30, 2022 • edited Loading

theroggy commented Aug 30, 2022

mattijn commented Aug 30, 2022

theroggy commented Aug 31, 2022

theroggy commented Aug 31, 2022 • edited Loading

mattijn commented Aug 31, 2022

theroggy commented Aug 31, 2022 • edited Loading

mattijn commented Aug 31, 2022

theroggy commented Aug 31, 2022 • edited Loading

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022

mattijn commented Sep 3, 2022

theroggy commented Sep 3, 2022 • edited Loading

theroggy commented Sep 3, 2022 • edited Loading

mattijn commented Sep 3, 2022 • edited Loading

mattijn commented Sep 3, 2022

theroggy commented Sep 4, 2022 • edited Loading

mattijn commented Sep 4, 2022

theroggy commented Sep 4, 2022 • edited Loading

theroggy commented Sep 4, 2022

mattijn commented Sep 4, 2022

theroggy commented Aug 30, 2022 •

edited

Loading

theroggy commented Aug 31, 2022 •

edited

Loading

theroggy commented Aug 31, 2022 •

edited

Loading

theroggy commented Aug 31, 2022 •

edited

Loading

theroggy commented Sep 3, 2022 •

edited

Loading

theroggy commented Sep 3, 2022 •

edited

Loading

mattijn commented Sep 3, 2022 •

edited

Loading

theroggy commented Sep 4, 2022 •

edited

Loading

theroggy commented Sep 4, 2022 •

edited

Loading