
Creating database for a new city

Michael Saugstad edited this page Nov 21, 2024 · 50 revisions

This is intended to be a nearly comprehensive guide for downloading a new city's data from OpenStreetMap, transforming it into the format we need for Project Sidewalk, and putting it into a fresh database. It is a work in progress, and is based only on my (Mikey's) experience creating the databases for Newberg, OR; Columbus, OH; and Mexico City thus far.

Prepare the road network using QGIS

Downloading road network data

There are a few options for downloading road network data from OSM. Below are a few I've found, in increasing order of complexity. However, the easier ones have upper limits on how much data can be downloaded at once, so if you're downloading data for a very large area you may need to use a more complicated option.

  1. https://extract.bbbike.org/ is the easiest option. Just make a bounding box that encompasses the entire city (but try to make it as small as possible while still covering the whole city). You can see a city's boundaries using lots of different tools, including Google Maps. I generally use the OSM XML 7z format, though the file can be quite large for bigger cities. An ESRI Shapefile is more compact, but I usually find weird errors in Shapefiles from bbbike. Right after importing the street data into QGIS, I usually export it again as a Shapefile locally to improve performance in QGIS. Note that on the bbbike status page you can get a link back to your map so that you can download it in a different format (really useful if you carefully sculpted around city boundaries).
  2. If the city is too large for bbbike, another option that supports a larger area is the HOT Export Tool. You can make a many-sided polygon to cover the area you're looking for, or you can import a GeoJSON file containing the polygon that you want to query for.
  3. If your area is too large for either of those options, you can download data for the entire planet, or ideally for the region of the world you care about. You'll then need to filter the data using a command line tool, instructions below.
    1. You can download the data for North America here. Should be easy enough to find where to download data for other regions. I'd suggest the .pbf format because I believe that the file size is relatively small, and instructions below assume that format.
    2. Install the Osmosis command line utility here.
    3. Run the command below to filter for the data that you care about. You'll need to figure out bounding box lat/lngs for your city; the lat/lngs below are for LA. We also filter for the streets we care about here; check the street filtering section of this page for more information on that.
      osmosis \
        --read-pbf <your-file>.pbf \
        --tag-filter accept-ways highway=trunk,primary,secondary,tertiary,residential,unclassified,pedestrian,living_street \
        --bounding-box top=34.3389 left=-118.67174 bottom=33.698 right=-118.15395 \
        --write-pbf <output-file>.pbf
      
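To fill in the `--bounding-box` values, you need the extent of your city boundary. Here's a rough sketch (not part of our tooling; the function name and GeoJSON input are made up for illustration) that computes a tight bounding box from a boundary GeoJSON file using only the Python standard library:

```python
import json

def bounding_box(geojson_path):
    """Compute (top, left, bottom, right) over all coordinates in a GeoJSON file."""
    with open(geojson_path) as f:
        gj = json.load(f)

    lngs, lats = [], []

    def walk(coords):
        # Recurse through nested coordinate arrays until we hit [lng, lat] pairs.
        if isinstance(coords[0], (int, float)):
            lngs.append(coords[0])
            lats.append(coords[1])
        else:
            for c in coords:
                walk(c)

    features = gj["features"] if gj.get("type") == "FeatureCollection" else [gj]
    for feat in features:
        walk(feat["geometry"]["coordinates"])

    # Matches osmosis's flag order: top, left, bottom, right.
    return max(lats), min(lngs), min(lats), max(lngs)
```

The four returned values plug directly into `--bounding-box top=... left=... bottom=... right=...`.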

Downloading data on city limits / neighborhoods

  • First, try to find the data available for download on the city's website. This was super easy for DC, as they have a big open data initiative, but other places (especially smaller ones like Newberg, OR) may not have that data readily available for download online. If you can't find it online, it is best to contact the city to ask for this information, because I can't imagine how they wouldn't have those files somewhere.
  • If you need to get in touch with the city, you can try to find the boundary data elsewhere on the internet in the meantime. This is the site I've used for this: https://osm-boundaries.com/. I've found that the boundary data can be out of date (you can cross-reference with what the boundary looks like in Google Maps). If the boundary in this open dataset matches Google Maps exactly, I wouldn't worry so much about getting the exact data from the city government.

Install QGIS

https://www.qgis.org/en/site/forusers/download.html

Importing a shapefile into QGIS

Go to Layer -> Add Layer -> Add Vector Layer, and find the .shp file.

Make sure projection is EPSG:4326, WGS 84

  • Check the CRS (coordinate reference system) by going to the properties for each layer you add and checking the Information tab
  • If the CRS is not EPSG:4326, WGS 84, exit the properties, right click on the layer and choose Export -> Save Features As... Pick the correct CRS there. This will transform the shapefile and save a new version in the correct CRS.
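As a quick sanity check (this doesn't replace checking the CRS in QGIS), coordinates that are already in EPSG:4326 are degree pairs, while projected CRSs produce values in meters or feet that fall far outside the degree range. A hypothetical helper, purely illustrative:

```python
def looks_like_wgs84(coords):
    """Heuristic check: EPSG:4326 coordinates are (lng, lat) pairs within
    +/-180 and +/-90. Projected CRSs (meters/feet) usually produce much
    larger values, so anything outside the degree range is a red flag."""
    return all(-180 <= lng <= 180 and -90 <= lat <= 90 for lng, lat in coords)
```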

Create neighborhoods if we have none, or split up neighborhoods if they are too large

You'll only need to do this if you can't track down a dataset with neighborhoods for the city, or if the neighborhoods you have turn out to be way too large (in the sum of the lengths of their streets). For neighborhoods that are too large, I typically wait until I get to the end of the process, then check the sum of the lengths of the streets in each neighborhood. Then I start from the beginning, splitting up the neighborhoods that are too large on the 2nd attempt. I start over just to prevent us from splitting up neighborhoods that we don't need to.
  • Install the digitizing tools plugin
  • Make sure you can snap vertices together: Project -> Snapping Options -> set units to degrees & tolerance to ~0.001
  • Toggle editing for the polygons you want to split up
  • Click on "Split Features" in the Digitizing toolbar
  • To do the split, left click either on a vertex of the polygon (you know it's on a vertex because a purple box shows up on the cursor) or fully outside the polygon. Then continue to left click to add more vertices to a linestring that indicates where the cut should be.
  • After adding your final vertex (again on a vertex of the polygon or completely outside the polygon), you can right click to complete the cut.
  • You should now be able to see that the polygon has been split in two, both with identical attributes in the attributes table. If the neighborhood was called "Blah", then you'll probably want to rename the two parts into something like "Blah North" and "Blah South".
  • If you want to split the polygon into more parts, you can continue to do so! Just follow the same process again to split one of those smaller polygons into two even smaller ones.

Cleaning up neighborhood geometries

NOTE: This section is new and is a result of some experimenting that I've been doing. It requires a lot of refinement. My goal while creating databases for the next few cities is to refine an algorithmic process for this. I'm really just listing a few tools that can be helpful below.

The problem: The neighborhood boundary data that we download often has small errors in it. Neighborhoods may overlap with each other slightly, or they may have small gaps between them, etc. Something like a small gap between two neighborhoods can be an issue if a road geometry happens to run through that gap; if it runs parallel to the gap, a large portion of the street could be excluded from our database.

Some tools that can help to check for errors in the neighborhoods:

  1. "Check Validity" tool. Try with both the GEOS and QGIS options, each of them can catch different errors in my experience. Might want to use in conjunction with the "Fix geometries" tool.
  2. "Topology checker" plugin. Install the plugin (easy to do from the plugin menu) and then go to Vector -> Topology Checker. I used this one more than anything else. I usually added checks for "must not overlap", "must not have gaps", "must not have duplicates", and "must not have invalid geometries".
  3. "Check Geometries" plugin. Install the plugin and then go to Vector -> Check Geometries. Probably want to check all the "geometry" options. Assuming some of the "topology" options could be useful too.

Some tools to try out, where the order to use them in may be close to correct:

  1. "Remove duplicate vertices" tool. You can leave the tolerance very small (~0.000001); setting any higher can create more gaps.
  2. "Snap geometries to layer" tool. This should help to remove some gaps. Keep the tolerance pretty small, but try running at a few different levels to see what gives the best result. I think that the "prefer aligning nodes, insert extra vertices where required" setting makes sense, but I haven't tested much with other options.
  3. "Fix geometries" tool. I'd probably use this if another tool is complaining about invalid geometries.
  4. "v.clean" tool with the "snap" option should theoretically help to remove gaps. In my first test city, LA, I never got it to work. May have been specific to other errors in the dataset though.
  5. "v.clean" tool with the "rmdupl" option. This one can help to get rid of overlaps between neighborhoods. The first time I used it, it did exactly what I wanted, but the second time it instead made one or more new polygons out of the overlapping areas. The easiest way to deal with those cases is to delete all of the tiny sliver polygons, then use the vertex tool to remove a bunch of vertices around the area, and hopefully the polygons will line up correctly again. There are also tools I didn't try that could help to remove sliver polygons. Needs more testing.
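For intuition, the "Remove duplicate vertices" step above boils down to dropping consecutive vertices that sit within the tolerance of the previous kept vertex. A rough sketch (illustrative only; not what QGIS actually runs internally):

```python
def remove_duplicate_vertices(ring, tol=1e-6):
    """Drop consecutive vertices that are within `tol` (degrees, per axis)
    of the previously kept vertex. `ring` is a list of (lng, lat) tuples."""
    cleaned = [ring[0]]
    for x, y in ring[1:]:
        px, py = cleaned[-1]
        # Keep the vertex only if it differs from the last kept one by more
        # than the tolerance on at least one axis.
        if abs(x - px) > tol or abs(y - py) > tol:
            cleaned.append((x, y))
    return cleaned
```

This is also why a large tolerance can create new gaps: vertices that legitimately differ get collapsed together.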

Create a new unique index for the regions

  • Go to the properties for the region layer -> Fields -> Field Calculator
  • Give the output field the name region_id, set the output type to integer, set the expression to @row_number, and click Ok.

Import the street data

If it is in shapefile format, you should be able to just import the roads.shp file. If it is in the OSM format, you should be able to select just the "lines" (shouldn't need the multilinestrings).

Remove streets outside the city

This isn't strictly necessary, but you will definitely want to do it for large cities (or if you just requested the road network for a very large bounding box). It will drastically improve the running time of the algorithm that splits streets at intersections. You can do this by selecting features by area and deleting them.

Just make sure, every time you highlight some streets, to verify that none of them are actually part of the city's road network. A caveat: if a street intersects a neighborhood polygon only a tiny bit near its endpoint, it is actually best to just delete that street. That tiny piece of a street would end up being a street that someone has to audit, and it is a bit of a confusing experience for the user.

Filter the streets

  • In QGIS, right click on the polylines layer and click on "Filter"
  • Add this line
    "highway" IN ('trunk', 'primary', 'secondary', 'tertiary', 'residential', 'unclassified', 'pedestrian', 'living_street')
    
  • If you also want to include alleys, your filter would look like this
    "highway" IN ('trunk', 'primary', 'secondary', 'tertiary', 'residential', 'unclassified', 'pedestrian', 'living_street')
        OR ("highway" = 'service' AND other_tags LIKE '%"service"=>"alley"%')
    
  • NOTE if you downloaded data in another format, the attribute might be called "type" instead of "highway".
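For intuition, the filter expression above keeps the same set of highway values we pass to osmosis. Here is roughly what that filtering looks like if done by hand over raw OSM XML with the Python standard library (the function name is made up for illustration; QGIS's filter is what we actually use):

```python
import xml.etree.ElementTree as ET

ACCEPTED = {"trunk", "primary", "secondary", "tertiary", "residential",
            "unclassified", "pedestrian", "living_street"}

def accepted_way_ids(osm_xml):
    """Return ids of <way> elements whose highway tag is in ACCEPTED,
    plus service roads explicitly tagged as alleys."""
    keep = []
    for way in ET.fromstring(osm_xml).iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.iter("tag")}
        hw = tags.get("highway")
        if hw in ACCEPTED or (hw == "service" and tags.get("service") == "alley"):
            keep.append(way.get("id"))
    return keep
```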

Check that we aren't missing important streets

  • In QGIS, add a Google Maps layer by right-clicking on XYZ Tiles -> New Connection. Name it "Google Maps". The URL is https://mt0.google.com/vt/lyrs=m&x={x}&y={y}&z={z}. Set Max. Zoom Level to 19.
  • Make sure that the streets are overlaid on Google Maps, and look around for streets that are not in our road network that look like they are streets according to Google.
  • As you find such streets, take a look at those streets in GSV. If the streets are clearly alleyways that we wouldn't want users to audit, you can move on to another street.
  • Try to sample around the city and look for streets that seem like they would be large enough that they should be included. It's also helpful to look for areas where the streets that are included or not look sort of random.
  • TODO what exactly should we do in this situation? It might depend on the scope of the problem and size of the city. We might just manually include streets that we realize should be there. Maybe we go back and fix those streets in OSM and do the data import again later. Or maybe we just include all the "service road" or "alley" streets in cities where the data looks particularly messed up.

Split lines along intersections

  • Processing -> Toolbox -> Vector overlay tools -> Split with lines (use roads for input and split layers)

Filter within city limits and apply region_id to streets

  • Go to Vector -> Geoprocessing tools -> Intersection
  • Input layer should be the newly split roads, overlay layer the neighborhood polygons

Make sure you have single-linestrings not multi-linestrings

Use the Processing -> Toolbox -> Vector geometry -> Multipart to Singleparts
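For intuition, "Multipart to Singleparts" just explodes each multi-part geometry into its component parts; single-part geometries pass through unchanged. In GeoJSON terms, a sketch might look like this (illustrative only):

```python
def to_singleparts(geometry):
    """Explode a MultiLineString GeoJSON geometry into a list of LineStrings.
    Single-part geometries are returned unchanged (wrapped in a list)."""
    if geometry["type"] == "MultiLineString":
        return [{"type": "LineString", "coordinates": part}
                for part in geometry["coordinates"]]
    return [geometry]
```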

Create a new unique index for the streets

  • Go to the properties for the streets layer -> Fields -> Field Calculator
  • Give the output field the name road_id, set output type to integer, set the expression to @row_number, and click Ok.

Create a blank schema in your db:

Make sure you have psql version 16.2. Our schema was created using that version of psql. We may not be able to restore database dumps created using a vastly different version.

Note: if the person making the db is not someone internal, then make sure to initialize the sidewalk_login schema by running pg_restore -U sidewalk -Fc -d sidewalk /opt/sidewalk_init_users-dump in the db container.

From the root SidewalkWebpage directory, run the following (where new_city_name is underscore_separated):

make create-new-schema name=sidewalk_<new_city_name>

Put the data in the database

  • Connect your database to QGIS by going to Layer -> Add Layer -> Add PostGIS Layers... Use "sidewalk" as the username and password.
  • Go to Database -> DB Manager -> PostGIS -> sidewalk -> sidewalk_<new-city-name>
  • Click on Import Layer/File, choose the layer with your roads, put it in the sidewalk_<new-city-name> schema, call it qgis_road, fill in the primary key with road_id, make sure to check "Do not promote to multi-part".
  • Click on Import Layer/File, choose the layer with your polygons, put it in the sidewalk_<new-city-name> schema, call it qgis_region, fill in the primary key region_id, but DO NOT check "Do not promote to multi-part".

Getting the database tables filled correctly

I've written a handy script to do this! Simply run make fill-new-schema. It will prompt you for a few parameters. Some notes on those below:

  • Region data source: try to use a URL or an email address if someone manually created them
  • Region name column: if you don't have region names ready, you could use the region_id column for now and update manually later
  • Regions to include/exclude: I'll be adding scripts to do these after the fact in the future and will link to them from here

Optional: Filter for streets in boundary separate from neighborhoods

We often start off working on just a subset of streets around transit stations, for example. In this case, we'll be given a different set of region boundaries for it, and we want to only reveal those streets.

  • Use the "Extract by location" tool back in QGIS. Extract features from your finished street layer, comparing to features from your given boundary. Check the "intersect" box.
  • Manually examine the streets that were included. Depending on what you're looking for, you might remove some streets that are mostly outside the boundary or all streets that are not fully in the boundary. Sometimes I'll use aerial imagery to help decide.
  • Add this layer of streets to your database as well. Call it qgis_road_filtered.
  • Mark all streets except those in your smaller boundary as deleted:
    UPDATE street_edge SET deleted = TRUE;
    
    UPDATE street_edge
    SET deleted = FALSE
    FROM qgis_road_filtered
    WHERE street_edge.street_edge_id = qgis_road_filtered.road_id;
    
  • Delete the entries in the street_edge_priority table for all deleted streets:
    DELETE FROM street_edge_priority
    USING street_edge
    WHERE street_edge_priority.street_edge_id = street_edge.street_edge_id
        AND deleted = TRUE;
    

Run the site for the first time!

  • If you're Mikey, load the dev env for a different city and sign in to your Project Sidewalk account.
  • Check a few configs in the database. Below are the entries that should be updated to see most of the functionality for the website:
    • double check the following, which should have been set reasonably from the fill-new-schema script: open_status (fully or partially), city_center_lat/lng, and southwest/northeast_boundary_lat/lng
    • excluded_tags: Copy from the most similar city (we're using Pittsburgh's set by default). Unfortunately, you'll need to take into account any changes to the tags since the dump was created, which you can do by checking which evolution file we're at and seeing what new ones have been added. Later on, you'll want to check the Explore page and look at the list for every label type to make sure that no tags are being included incorrectly.
  • Add the English name for the city in the conf/messages file. Add the state and country names if they are not already there. Then enter each name into Google Translate to see if a specific entry should be added to the files for any other languages.
  • There are just a few configs to add in the conf/cityparams.conf file. Mostly just things like the URL, etc. Instructions for the Google Analytics ID are below, and can be skipped for now.
  • In the docker-compose.yml, set the DATABASE_USER to sidewalk_<new-city-name> and the SIDEWALK_CITY_ID to the new one that you created in the cityparams.conf. Run make dev and npm start.
  • Visit the landing page using the same browser where you're signed in in your dev environment, but don't visit the Explore page quite yet! Verify that the map is panned to the correct city, and that the neighborhood names and such look correct. Now is the time to mess with the default_map_zoom in the config table to get the maps looking just right.

Removing streets with no imagery

  • Create a CSV from the following query, naming it street_edge_endpoints.csv. Note that if you are starting with only a subset of the neighborhoods, you probably want to filter for just those regions in this query.
    SELECT street_edge.street_edge_id, region_id, x1, y1, x2, y2, geom
    FROM street_edge
    INNER JOIN street_edge_region ON street_edge.street_edge_id = street_edge_region.street_edge_id
    WHERE street_edge.deleted = FALSE
        AND street_edge.street_edge_id <> (SELECT tutorial_street_edge_id FROM config);
    
  • Run the check_streets_for_imagery.py file (python2 check_streets_for_imagery.py). Depending on the city, this could take a very long time. You can leave it to run in the background. But it will occasionally fail to connect with the Google Maps servers and quit. Just restart the script and it will start from where it left off. When it is completely finished, you will have a CSV named db/scripts/streets_with_no_imagery.csv.
  • Run the following script to mark the streets with no imagery as deleted, remove their entries from the street_edge_priority table, and wipe the region_completion table so that it will repopulate with the correct distances.
    make hide-streets-without-imagery
    
  • Note that the total street distance (which is used to calculate percentage of the city that is complete on the landing page) does not update automatically when the streets are marked as deleted. If you are doing this on your dev environment, you can just restart the web server to invalidate the cache. If this is being done on a live server, you can use the 'Clear Play cache' button on the admin page.
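If you want to inspect the scraper output before running the make target, here's a small sketch for pulling the street IDs out of streets_with_no_imagery.csv (this assumes a street_edge_id column header, which may not match the actual file; adjust to the real header):

```python
import csv

def street_ids_without_imagery(csv_file):
    """Collect street_edge_id values from the no-imagery CSV.
    Assumes a `street_edge_id` column; adjust if the header differs."""
    reader = csv.DictReader(csv_file)
    return [int(row["street_edge_id"]) for row in reader]
```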

If you want to start with a subset of neighborhoods

  • Figure out the region_ids of the regions you want to start with. For the example queries, I'll use regions 1, 2, and 3.
  • Make sure the tutorial street is in one of the regions that is not deleted by modifying its entry in the street_edge_region table.
  • Mark the other regions as deleted (UPDATE region SET deleted = TRUE WHERE region_id NOT IN (1, 2, 3);).
  • Mark the streets outside those regions as deleted:
    UPDATE street_edge
    SET deleted = TRUE
    FROM street_edge_region
    WHERE street_edge.street_edge_id = street_edge_region.street_edge_id
        AND street_edge_region.region_id NOT IN (1, 2, 3);
    
  • Delete the entries in the street_edge_priority table for streets outside those regions:
    DELETE FROM street_edge_priority
    USING street_edge
    WHERE street_edge_priority.street_edge_id = street_edge.street_edge_id
        AND street_edge.deleted = TRUE;
    
  • Truncate the region_completion table (TRUNCATE TABLE region_completion;).

Final changes to config/cityparams.conf file and other configuration changes

  • You'll need to create a new Google Analytics ID for this city (if you're working on our team). Log in to Google Analytics, click on "Admin" in the bottom left, create a new account and call it "Project Sidewalk <new-city-name>". Try to copy the same setup parameters as for the other accounts. Once you create the account you should get a new ID that you can add to the cityparams.conf file. You'll need to make a separate property for the prod and test server. When you create a property, you'll then create a data stream; the data stream is what will give you your Google Analytics ID.
  • You'll need to log in to the Google Cloud Console, click on APIs & Services -> Credentials. Click on the main API key, and add the future URLs for the test and prod servers to the list of websites that can be used with the API key.
  • Once you've made the changes above, you can visit the audit page and test to make sure the entire site works.
  • Try to audit the tutorial street by going to /audit/street/<tutorial-street-id>. This is just to avoid end users having the DC tutorial street pop up for them.
  • Finally you'll then want to fill in any remaining params in the config table.
    • update-offset-hours can be set to any integer, we just use it to spread out work on the server.
    • All of the API page params will need to be filled in at some point. If you have time now, you can do some auditing in an area, manually run clustering, then center all the APIs over the data you just collected so that there is something to show. If in a hurry, add a to-do item to fix these later, once others have collected more data.

Get a server set up with this database

This is internal documentation for setting up the server at UW. We do not have experience setting up a dedicated server elsewhere, so we won't be able to assist with this step, unfortunately!

Start by creating a dump of the database:

pg_dump -Fc -U sidewalk -d sidewalk -n '<new-schema-name>' > <new-dump-name>

Then upload the dump to the makelab1 server:

scp /path/to/dump saugstad@makelab1.cs.washington.edu:/www/sidewalk/new-city-dumps/

Then email CS support so that Matt or Jason can set up a new server for this city. Here is a template email:

Hi all,

We are ready to roll out a new test server in Seattle! Here are the specs:

  • URLs: sidewalk-sea-test.cs.washington.edu and sidewalk-sea.cs.washington.edu
  • We would like a redirect from sidewalk-seattle-test.cs.washington.edu / sidewalk-seattle.cs.washington.edu to the above.
  • SIDEWALK_CITY_ID: seattle-wa
  • DATABASE_USER: sidewalk_seattle

Setting up pano scraper (optional)

This section is only necessary if the goal of setting up this database is to create a dataset for training computer vision models. You can skip it if your goal is just to collect accessibility data through Project Sidewalk. Many of these instructions will need to be adapted for your own situation either way.

  • Make sure the production server is set up before setting up the pano scraper.
  • Log on to the sftp UW server (ask Mikey how to get on it) and create a new directory for the city.
  • In your own file system, create a file called log.csv with the same headers we have for the scraper logs for other cities. Make sure the file does not end in a newline with the command truncate -s -1 log.csv.
  • On the SFTP server with the panos, create a new directory sidewalk_panos/Panoramas/scrapes_dump_<city-name>.
  • Optionally, chmod 775 <new-dir-name> to match the others.
  • In that directory, run put /<local>/<path>/<to>/log.csv to upload that file to the remote server.
  • Log on to the EC2 Instance (ask Mikey how to get on it). Edit the /var/spool/cron/crontabs/ubuntu and add a new entry for the new city. Just copy one of the other commands and change the storage location on the sftp server and the new city's URL.
  • Change the hour that the scraper runs for each city (2nd column in the crontab file) to space them out evenly throughout the day.
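Spacing the cities evenly amounts to dividing the 24-hour day by the number of cities. A tiny sketch (illustrative, not part of our tooling):

```python
def spaced_hours(n_cities):
    """Spread n cron jobs evenly across a 24-hour day, returning the hour
    (0-23) at which each city's scraper should run."""
    return [round(24 * i / n_cities) % 24 for i in range(n_cities)]
```

For example, four cities would run at hours 0, 6, 12, and 18.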

Setting up Uptime Robot to ping servers (optional)
