Multiple closest street network segments with get_network_id #124

ghost · 2019-11-29T13:38:04Z

I was looking at the get_network_id function and found an issue.

During the nearest distance calculation

d = INFTY
for h in hits:
    new_d = p.distance(right.geometry.iloc[h])
    if d >= new_d:
        d = new_d
        nid = right[network_id].iloc[h]

information is lost if there are two geometries which have exactly the same distance from each other. I think this should at least be mentioned if that happens.

I stumbled over this problem by trying to improve the speed of the for loop by using pandas.DataFrame.apply and instead of pandas.DataFrame.iterrows. For the ´apply` I implemented two different functions, and that is where I found the issue.

My solution, which seems to be a bit faster for bigger dataframes looks like this:

def get_network_id(left, right, network_id, min_size=100, old_func=True):

    INFTY = 1000000000000
    MIN_SIZE = min_size
  
    left = left.copy()
    right = right.copy()

    buildings_c = left.copy()
    buildings_c["geometry"] = buildings_c.centroid  # make centroids
    
    # this creates the pbox from the previous code
    buildings_c['bounding_box'] = buildings_c.buffer(MIN_SIZE, cap_style=3).bounds.values.tolist()

    # Already implemented version, just as a function
    def f_old(x):
        print("Generating rtree...")
        idx = right.sindex
        d = INFTY
        nid = None
        hits = list(idx.intersection(x['bounding_box']))
        
        for h in hits:
            new_d = x['geometry'].distance(right.geometry.iloc[h])
            if d >= new_d:
                d = new_d
                nid = right[network_id].iloc[h]
        if nid is None:
            return np.nan
        else:
            return nid
    
    # New way, by using dictionaries
    def f_new(x):
        idx = right.sindex
        distances = {h:x['geometry'].distance(right.geometry.iloc[h]) for h in (idx.intersection(x['bounding_box']))}
        if distances:
            nid = right[network_id].iloc[min(distances, key=distances.get)]
            return nid
        else:
            return np.nan


    if old_func:
        series = buildings_c.apply(f_old, axis=1)
    else:
        series = buildings_c.apply(f_new, axis=1)


    return series

I was worried using the function with dictionaries was not working because the results were different than the old way, but when looking at the ´distances´ dict, I saw that it sometimes occurs that building centroids have exactly the same distance to the streets. Which makes sense, but that information should not be dropped.

The text was updated successfully, but these errors were encountered:

martinfleis · 2019-11-30T09:24:36Z

Hi @kvnkrmr,
thanks for the report. You are right about it. Which data did you use? It seems to be a very corner case which will not happen very often on the real world data. Anyway, it should at least raise a warning that this situation happened and which geometries were involved.

How faster is .apply in this case? I did a few benchmarks on different functions earlier and found out only a very minor difference. But I did not test this one. I did not focus on performance yet, a development was and still is focused on functionality primarily (this one especially, it is not a very polished code I must admit :)).

If you'd like to make a PR to tackle the warning and/or faster implementation, I'd be more than happy.

ghost · 2019-12-04T12:12:06Z

Hi @martinfleis,
I used the following data: From Geofabrik, I downloaded saarland-latest-free.shp.zip. From that zip file, I imported

gis_osm_roads_free_1.shp

as the roads and

gis_osm_buildings_a_free_1.shp

as the buildings. For this whole dataset, .apply is 1s faster. Not much, but I am trying to find out how the different functions scale with either more buildings or more roads. I will try to post my timing code and make a PR sometime in this week.

martinfleis · 2019-12-04T15:14:46Z

Thanks! That is a huge dataset. I haven't tried as it shows 700 hours on my machine at the moment 🤣. I assume the difference of 1s is basically equal to nothing with geodataframes as large as these. What was the total time?

Testing all three versions using built-in bubenec dataset (which is tiny), the difference is minimal, sometimes even .apply being a bit slower than existing approach.

I am curious about the benchmarks on the larger data.

One idea regarding the cause of this issue - as it is highly unlikely that two different street segments will be at the exactly the same distance from building centroid, I assume that those affected segments are, in fact, duplicated. Which happens sometimes in OSM. Anyway, we should print a warning with ids explaining what happened.

Looking forward for timings and PR! Thanks!

martinfleis · 2019-12-14T15:22:41Z

@kvnkrmr just checking the state of this. I want to release 0.1.1 with a few bugfixes, so I just want to understand if I should wait for this or leave for the next one. Thanks!

ghost · 2019-12-14T16:28:44Z

@martinfleis Hi, I couldn't find the time for this in the last week, so leave it for the next one. Sorry!

martinfleis · 2020-07-17T21:28:29Z

Just an update on this. GeoPandas will include sjoin_nearest, which will be then used within momepy, so this issue is now on hold to avoid duplication of work across packages.

martinfleis added this to the 0.1.2 milestone Dec 15, 2019

martinfleis modified the milestones: 0.1.2, 0.2.0 Jan 8, 2020

martinfleis removed this from the 0.2.0 milestone Jan 5, 2021

martinfleis added the later Waiting for something else to happen before label Jan 5, 2021

martinfleis mentioned this issue Oct 4, 2021

ENH: refactor get_network_id based on sindex.nearest_all #310

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple closest street network segments with get_network_id #124

Multiple closest street network segments with get_network_id #124

ghost commented Nov 29, 2019

martinfleis commented Nov 30, 2019

ghost commented Dec 4, 2019

martinfleis commented Dec 4, 2019

martinfleis commented Dec 14, 2019

ghost commented Dec 14, 2019

martinfleis commented Jul 17, 2020

Multiple closest street network segments with get_network_id #124

Multiple closest street network segments with get_network_id #124

Comments

ghost commented Nov 29, 2019

martinfleis commented Nov 30, 2019

ghost commented Dec 4, 2019

martinfleis commented Dec 4, 2019

martinfleis commented Dec 14, 2019

ghost commented Dec 14, 2019

martinfleis commented Jul 17, 2020