Observations on adding walk links between stations

created 9 Aug 2009 (andrewbyrd)

As I don’t use OSM data for the moment, I needed to create walking links between nearby stations. Transit path searches give unacceptable results if stations are not connected by streets of some kind.

As of 7 August 2009, the compile_graph link option linked each station to all other stations within a hardcoded 0.05 degrees of latitude/longitude. In my case, this created a huge number of links of immense length, enough to rapidly exhaust my physical memory and grind everything to a halt.

A 0.1 degree ‘square’ bounding box is very big, and it is not really square anyway, since lines of longitude converge toward the poles. A more appropriate goal would be to link each station to all others within a radius of, say, 1 km. Of course, if the links are meant to represent biking instead of walking, this radius should be much bigger.
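
For a sense of scale: a degree of latitude is about 111 km, so a 1 km walking radius corresponds to only about 0.009 degrees, while the old 0.1 degree box spans roughly 11 km north to south.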

The problem is that searching through all pairs of stations and checking all their distances is O(n²). I wasn’t totally sure whether it would be much better to iterate over all the stations, calculating a correct bounding box and running an SQL query over the entire station table for each one. Here are my observations from testing a couple of different algorithms.

1) Brute force

The first approach is to iterate over every pair of stations, calculating the distance between them from their latitudes and longitudes. Pairs closer than the threshold distance are linked; the others are ignored. The naïve approach is nested loops over a station list. This is ugly in that it is O(n²) in the number of stations, but it might work in a reasonable amount of time for all but the largest data sets. It has the additional disadvantage of finding every pair twice.

We can make a slight improvement by putting all the stations in a queue. We pop one off the queue, then calculate the distance to all those that remain, linking bidirectionally to stations closer than the threshold distance. Each pair is found only once, which makes n(n − 1) / 2 comparisons. It’s still O(n²), but it might take half as long.

Here’s some Python code (replacing that in pygs/compiler/tools.py) that should do just that, with the bells and whistles removed:

def link_nearby_stops_bruteforce(g, gtfsdb, radius=1000, obstruction=1):
    # vincenty() and Street are assumed to be in scope already, as they
    # are elsewhere in pygs/compiler/tools.py
    stoplist = gtfsdb.stops()
    while stoplist:
        (stop_id1, stop_name1, lat1, lon1) = stoplist.pop()
        # only compare against stops still in the queue,
        # so each pair is considered exactly once
        for stop_id2, stop_name2, lat2, lon2 in stoplist:
            dd = obstruction * vincenty( lat1, lon1, lat2, lon2 )
            if dd < radius:
                # if reporter : 
                #     reporter.write( '%4i meters from %s to %s' % (dd, stop_id1, stop_id2) )
                g.add_edge( "sta-%s"%stop_id1, "sta-%s"%stop_id2, Street("walk", dd) )
                g.add_edge( "sta-%s"%stop_id2, "sta-%s"%stop_id1, Street("walk", dd) )

In practice, I only actually ran the naïve version (nested loops instead of a queue) on a large data set (Portland TriMet). Tests of both versions on the BART GTFS showed appropriate linking of stations. The Portland run took about 30 minutes to complete, compared to about 15 minutes to load the 5230 trip bundles into the graph. Theoretically the queue version would be at best twice as fast, taking about as long as loading the trip bundles.

2) Bounding box and SQL

The alternate approach is to calculate a bounding box around the origin station, then run an SQL query to retain only close candidates. These are then further filtered by calculating exact distances.

Assuming the earth is spherical, we can use the spherical law of cosines to find the box’s coordinates. Chris Veness gives a good explanation of this at http://www.movable-type.co.uk/scripts/latlong-db.html (LGPL).
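
For reference, here is a minimal sketch of the spherical law of cosines itself; the function name is mine, and the linking code below uses vincenty() for the exact distances instead:

from math import radians, sin, cos, acos

def sphere_distance(lat1, lon1, lat2, lon2, radius=6371000.0):
    # great-circle distance in meters by the spherical law of cosines
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    # clamp to 1.0 to guard against floating-point error on near-identical points
    return radius * acos( min( 1.0,
        sin(phi1)*sin(phi2) + cos(phi1)*cos(phi2)*cos(dlam) ) )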

Here is some code for the bounding box option (to plug into pygs/compiler/tools.py):

def link_nearby_stops(g, gtfsdb, radius=1000, obstruction=1):
    from math import degrees, radians, cos
    EARTH_RADIUS = 6371000.0 # meters. assume spherical geoid.
    stoplist = gtfsdb.stops()
    c = gtfsdb.conn.cursor()
    # indices should already exist, but make sure
    c.execute( "CREATE INDEX IF NOT EXISTS stops_stop_lat ON stops(stop_lat)" )
    c.execute( "CREATE INDEX IF NOT EXISTS stops_stop_lon ON stops(stop_lon)" )

    # degrees of latitude give relatively constant distances, and the
    # angular radius is the same for every stop
    angular_distance_rad = radius / EARTH_RADIUS # include obstruction here?
    angular_distance_deg_lat = degrees(angular_distance_rad)

    for (stop_id1, stop_name1, lat1, lon1) in stoplist:
        min_lat = lat1 - angular_distance_deg_lat
        max_lat = lat1 + angular_distance_deg_lat

        # compensate for degrees of longitude getting smaller with increasing latitude
        angular_distance_deg_lon = degrees( angular_distance_rad / cos( radians(lat1) ) )
        min_lon = lon1 - angular_distance_deg_lon
        max_lon = lon1 + angular_distance_deg_lon

        # coarse filter: index-assisted bounding-box query
        c.execute( "SELECT stop_id, stop_name, stop_lat, stop_lon FROM stops "
                   "WHERE stop_lat BETWEEN ? AND ? AND stop_lon BETWEEN ? AND ?",
                   (min_lat, max_lat, min_lon, max_lon) )

        # fine filter: exact distance check on the few remaining candidates
        for stop_id2, stop_name2, lat2, lon2 in c:
            if stop_id1 == stop_id2: continue
            dd = obstruction * vincenty( lat1, lon1, lat2, lon2 )
            # if link distance is zero, the target could probably be removed
            # from stoplist to avoid redundancy.
            if dd < radius:
                g.add_edge( "sta-%s"%stop_id1, "sta-%s"%stop_id2, Street("walk", dd) )

And the results for the bounding box algorithm: on the same Portland data set, using an indexed SQLite db, roughly the same 170,000 links were created in about one minute. Linking the SF Muni data set (4400 stops) also took less than a minute. Test path searches on the linked graph give results similar to those of the TriMet trip planner website.

I haven’t tried this on an unindexed database, but I assume the indices are what give the huge performance improvement: each bounding-box query becomes roughly logarithmic rather than a full table scan, which brings the whole procedure closer to O(n log n) than O(n²).
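
If in doubt, SQLite itself can report whether the indices are being used. A quick diagnostic sketch on the same cursor (the placeholder bounds are arbitrary):

# ask SQLite for its query plan; output naming one of the stops_stop_*
# indices (rather than a full table scan) confirms an index is in use
c.execute( "EXPLAIN QUERY PLAN SELECT * FROM stops "
           "WHERE stop_lat BETWEEN ? AND ? AND stop_lon BETWEEN ? AND ?",
           (0, 1, 0, 1) )
print( c.fetchall() )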