Incorporate Santa Clara County social distancing protocol business database #23
Comments
Pelias customized to import Santa Clara County addresses courtesy of @impiaaa: https://github.com/codeforsanjose/pelias-project-scc/ |
Scraper and scraped data courtesy of @stgibson: https://github.com/stgibson/social_distance_web_scraping/ |
Scraped data geocoded by Pelias: socialdistance.geojson.zip |
At tonight’s hack night, @impiaaa, Kevin, and I discussed next steps for this project. Having scraped and geocoded the data once, we need to massage it and figure out the logistics of entering it into OSM. Some discussion points from tonight:
Contact information and geocoding
Scraping
Tagging
Mapping
|
At last night’s hack night, @impiaaa and I focused on possible solutions to this problem, as well as the related problem of pinpointing a business within a strip mall or professional center:
At a glance, Mapillary coverage in the South Bay doesn’t look as bad as we had presumed, considering that most of the businesses would be in business districts or along arterial streets rather than in residential areas. But we do need more thorough coverage of office parks. Some areas like Milpitas, Berryessa, and South San José also have very little coverage. Some next steps:
|
This fork of the scraper has continuing work including some tweaks to work better with the geocoder. |
|
|
Tier 3 revision
CDPH moved Santa Clara County to Tier 3 (Orange, Moderate) on October 13. The county public health department issued a revised order that required every business to complete a revised social distancing protocol form within 14 days. The revised form looks very similar to the previous revision from September. The SDP business database was last updated on October 12. The COVID19Prepared.org site took down its link to the business database around that time, leaving this note:
Assuming the SDP site does start updating again soon, it doesn’t make sense to go forward with the current business listing. If for some reason the site doesn’t start updating, we may need to get in touch with TSS to ask for access to the raw dataset. In the meantime, this delay gives us time to take care of other remaining tasks:
Wrangling the “Other” category
We expect the current database to overlap considerably with the revised database, but there will probably be businesses that spell their names, addresses, or “Other” description slightly differently from one form to another. Unless the SDP site starts listing the business type description in plain text, we’ll need to scrape the linked PDFs for that information. Over 5,700 entries may be too many to tag by hand, especially if we need to keep tagging more by hand as the database updates. One possible solution might involve training a Bayesian classifier on the text, labeling the entries with presets. |
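As a rough illustration of the Bayesian classifier idea, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. The training descriptions, the tag labels, and the names `tokenize` and `NaiveBayes` are all hypothetical; a real run would need a hand-labeled sample of actual “Other” descriptions and the project’s own choice of presets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the description and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # descriptions seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled descriptions (not taken from the SDP database).
TRAINING = [
    ("dental office", "amenity=dentist"),
    ("family dentistry", "amenity=dentist"),
    ("auto repair shop", "shop=car_repair"),
    ("automotive repair and smog check", "shop=car_repair"),
    ("hair salon", "shop=hairdresser"),
    ("barber shop and hair styling", "shop=hairdresser"),
]

classifier = NaiveBayes()
classifier.train(TRAINING)
prediction = classifier.classify("Dentist office")
print(prediction)
```

With so few examples the classifier only reacts to exact word overlap, so any prediction would still need a mapper’s review before tagging.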
The SDP site is updating again, with the latest entries from November 5. The site currently lists 21,536 entries across the same categories as before. It no longer includes submissions of the previous revision of the form. |
@impiaaa, Lindsay, and I met tonight to discuss the state of the project:
Tier 2?
There are rumors that Santa Clara County may soon move back to Tier 2 (red), as other nearby counties have, just a few weeks after moving to Tier 3 (orange) and right after the SDP website got back up and running. Tier 2 allows only essential businesses to stay open, so we’re unsure what that means for the SDP database: will they keep collecting SDPs from nonessential businesses in anticipation of an eventual transition back to Tier 3, remove submissions from nonessential businesses, or stop updating or advertising the site? I’m hopeful we won’t completely repeat the database reset from last month, because the county hasn’t issued a new public health order ahead of any tier change like last time, and the revised SDP form seems to be tier-agnostic. (It no longer asks the submitter for any hard numbers around capacity.) Given this uncertainty, we could replace More likely, we’d avoid making any representations about a business’s opening status. After all, OSM probably already has POIs that haven’t opened since the initial stay-at-home order began. It means we wouldn’t be able to facilitate a COVID-19-specific application, but we’d still accomplish the larger goal of jumpstarting OSM’s POI coverage in the area.
Tooling
The main downside to MapRoulette is that it doesn’t prepopulate the point feature and its tags in iD, since this import requires much more manual intervention than a collaborative mapping challenge. We can make sure iD opens up to maximum zoom level 19, which is good enough to easily distinguish standalone businesses, but it would be pretty ambiguous in a strip mall or downtown area. RapiD could be a good alternative to MapRoulette for our use case, as long as there’s a way for the user to not only accept a feature but also change its feature type and change its tags before saving.
Ideally we could get permission to add our own challenge to this tasking manager instance, then periodically upload GeoJSON data from the SDP database. Otherwise, we’d have to reach out to the RapiD team about loading our data. There is a new Esri ArcGIS integration, but it would be rather indirect for us since our dataset isn’t in ArcGIS to begin with.
Proposal process
We aren’t sure about the county’s status come Tuesday and are still deciding between MapRoulette and RapiD, so we need to wait until at least early next week before posting a request for comments about this import proposal on the imports mailing list. The proposal needs a few tweaks:
The request for comments needs to emphasize that this is a labor-intensive organized editing project originating from an external dataset, not a conventional automated import, but we’re going to adhere to some of the import guidelines anyways as a courtesy. We won’t ask participants to use dedicated import accounts, because that overhead would discourage participation while not really making the mapped features easier to identify and roll back.
Time and people
The other day, I did some back-of-the-napkin math to estimate how long this import would take:
To get the average time per task down to a minute, we can encourage mappers to only map the businesses as point features and not areas. As much as possible, we’re trying to avoid making mappers trawl through street-level imagery, but it might occasionally be necessary to choose the right unit in a strip mall or avoid mapping a home office. Focusing each challenge on a single category and providing crisp instructions will go a long way too. To get the necessary level of participation, we’ll recruit mappers among Code for San José volunteers who haven’t been attending the OSM map nights. We’ll also recruit among the broader OSM community. As far as I can tell, this import will be just the third POI import in the U.S., after the nationwide GNIS import and a POI import in Puerto Rico. I’m hopeful that the import’s novelty will attract non-local mappers who wouldn’t be interested in a run-of-the-mill building import. I had originally calculated the required time thinking that we’d try to complete the import before the county leaves Tier 3 and the SDP database gets reset again. But the possibility of going back to Tier 2 so soon changes the calculus: if we don’t map anything time-sensitive like |
The county moved to Tier 1 (purple) today. This poster explains the impact on social distancing protocols:
SDPs prior to October 11 have already been removed from the SDP site. This wording makes it sound unlikely that the SDP site would be taken offline, but it means today is probably the high water mark for the site in terms of new submissions. |
@impiaaa and I found a reliable way to grab the “Other, please specify” business type description from each PDF’s headers:
This could save us the trouble and time of downloading the whole PDF for the “Other, please specify” category. However, we were also looking to have mappers consult the “Facility/Worksite visited by public” checkbox in the PDF to avoid mapping businesses that aren’t open to the public. It is possible to extract this information from the PDF automatically, but to avoid excessive requests and processing time, perhaps we could limit it to certain categories we’re particularly concerned about (like professional services, but not restaurants). |
challenge_geojson.zip as of November 16 |
Some outstanding tasks, in no particular order:
The more I spot-check the SDPs we’ve downloaded, the less confidence I have in the “Facility/Worksite visited by public” checkbox. Even if it’s accurate, there are plenty of cases where “No” is an appropriate response for a non-retail site that nonetheless should be mapped. At most, it would be just one signal alongside the reference zoning polygons, but that makes parsing the downloaded PDFs a lower priority. |
I sent a request for comments to the talk-us-sfbay, imports-us, and imports mailing lists. (It’s probably stuck in the imports list’s moderation queue.) I also mentioned the request for comments in the #imports channel of OSMUS Slack. We can continue to refine the proposal on the wiki in the coming days based on feedback that we receive. I’m hoping we can move forward in about a week’s time, in time to do some armchair mapping over the Thanksgiving weekend. Thanks to @impiaaa and Lindsay for workshopping the request for comments this evening. |
The MapRoulette project is now live with an initial batch of 49 challenges. Challenges with 500 or more tasks are hidden for now until we get a chance to see how smoothly we can get through the smaller challenges. |
Wednesday night, @frhino invited me to present the import at Code for San Francisco’s general hack night meeting. CfSF has been spearheading the Bay Area Brigades’ COVID-19 pandemic dashboard project. This import can complement the dashboard as another area for cross-bay collaboration. Josh graciously offered to pair on the MapRoulette workflow before sharing it with the rest of the brigade. Unluckily, we ran into the Recreation challenge, which turns out to be mostly composed of nondescript offices of recreation organizations. I’ve changed that challenge’s difficulty level to Expert to steer new mappers away from it. |
Last night, @impiaaa, Kevin, Lindsay, and I met to take stock of the import a week into it:
Promotion
With help from some friends and acquaintances, we’ve been spreading the word about the import in various places, including but not limited to:
As time goes on, we’ll have to keep being creative and possibly revisit some of these communication channels to keep up the momentum.
Progress
For the first week of the import, we had enabled only the 49 smaller challenges in case any adverse feedback came through the mailing lists. Measuring progress is a bit tricky because MapRoulette normally excludes both completed and undiscoverable challenges, so it was showing the project 5% completed. Including both completed and undiscoverable challenges, we were at a little over 200 of 17,441 tasks, or 1%. Even several days after weeklyOSM mentioned the import proposal, no feedback came in, so we’re more or less in the clear as far as the import guidelines are concerned. After the meeting, we enabled the remaining challenges except for the Construction challenge. That brings our progress back down to 1%, but it’s more accurate that way, and hopefully people will find the new categories like Restaurant and Retail to be more interesting to map.
Hiccups
The Construction challenge remains undiscoverable, because most of the submissions in that challenge appear to be minor work sites (like reconfiguring interior walls at an office building), not the sort of thing we’d map as construction in OSM. Lindsay got unlucky working on the Grocery Stores and Pharmacy challenges due to poor geocoding or inadequate street-level imagery resolution. We ended up changing the difficulty level of the Grocery Stores challenge to Expert due to the prevalence of these issues. The Pharmacy challenge was already well on its way to completion, so Lindsay finished the job, other than a couple extra-tough cases. @impiaaa and I differ on what to do about businesses in strip malls or office buildings, where it isn’t immediately feasible to determine which corner of the building the business occupies.
We could either mark such businesses as Too Hard for now and wait to survey them in person, or we could place a point randomly within the building, perhaps with a
Time management
MapRoulette currently reports an average time per task of 6 minutes, 14 seconds. That’s far, far above the back-of-the-napkin assumptions in #23 (comment). However, this metric includes situations where a mapper has gotten carried away doing legitimate mapping around the POI, as well as when a mapper forgets to unlock a task after getting distracted by something else. The average has been trending down, so it also probably reflects some initial feeling-around as we got used to the workflow. We’ll keep an eye on the metric, but the most important thing at this point is to bring more contributors into the project. |
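To put the time-per-task metric in perspective, here is a minimal sketch of the kind of estimate involved. The task count (17,441) and the observed average (6 minutes, 14 seconds) come from this thread, and the one-minute figure is the target mentioned earlier. The number of mappers and their weekly hours are purely hypothetical, as are the function names.

```python
def person_hours(task_count, seconds_per_task):
    """Total effort in person-hours if every task takes the average time."""
    return task_count * seconds_per_task / 3600


def weeks_to_finish(task_count, seconds_per_task, mappers, hours_per_mapper_per_week):
    """Calendar weeks needed, assuming a steady pool of equally active mappers."""
    weekly_capacity = mappers * hours_per_mapper_per_week
    return person_hours(task_count, seconds_per_task) / weekly_capacity


TASKS = 17_441                   # task count reported in this thread
OBSERVED_SECONDS = 6 * 60 + 14   # MapRoulette's reported average per task
TARGET_SECONDS = 60              # the one-minute goal

# Hypothetical participation: 10 mappers contributing 2 hours a week each.
for label, seconds in [("observed", OBSERVED_SECONDS), ("target", TARGET_SECONDS)]:
    hours = person_hours(TASKS, seconds)
    weeks = weeks_to_finish(TASKS, seconds, mappers=10, hours_per_mapper_per_week=2)
    print(f"{label}: {hours:.0f} person-hours, about {weeks:.0f} weeks")
```

Because total effort scales linearly with the average time and inversely with the number of mappers, recruiting contributors matters at least as much as shaving seconds off each task.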
On Friday, we figured out why many of the addresses got geocoded way out in Sacramento County or San Benito County (example): the Pelias instance was getting confused by Santa Clara (city) and Santa Clara County sharing the same name. It’s similarly very difficult to search for addresses along El Camino Real in Santa Clara (city) in Nominatim. @impiaaa fixed the issue in Pelias by renaming the county from
On Saturday, @impiaaa rescraped the site and reuploaded all the tasks. We’re up to 23,004 tasks total. |
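Separately from the Pelias fix, misplaced geocodes like these can be flagged after the fact. The following is a minimal sketch that checks each geocoded point against a rough bounding box for the county. The bounding box values are approximate assumptions, and the function names and sample features are hypothetical; a real check would use the county boundary polygon instead.

```python
# Rough bounding box for Santa Clara County (approximate, assumed values).
COUNTY_BBOX = {
    "min_lon": -122.21,
    "min_lat": 36.89,
    "max_lon": -121.20,
    "max_lat": 37.49,
}


def in_county_bbox(lon, lat, bbox=COUNTY_BBOX):
    """Return True if the point falls inside the rough county bounding box."""
    return (
        bbox["min_lon"] <= lon <= bbox["max_lon"]
        and bbox["min_lat"] <= lat <= bbox["max_lat"]
    )


def split_by_bbox(features):
    """Separate GeoJSON point features into plausible and suspect geocodes."""
    plausible, suspect = [], []
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"][:2]
        (plausible if in_county_bbox(lon, lat) else suspect).append(feature)
    return plausible, suspect


# Hypothetical geocoder output: one point in San José, one in Sacramento.
features = [
    {"type": "Feature", "properties": {"name": "Example A"},
     "geometry": {"type": "Point", "coordinates": [-121.8863, 37.3382]}},
    {"type": "Feature", "properties": {"name": "Example B"},
     "geometry": {"type": "Point", "coordinates": [-121.4944, 38.5816]}},
]

plausible, suspect = split_by_bbox(features)
print(len(plausible), "plausible,", len(suspect), "suspect")
```

A bounding box is deliberately crude: it catches results in distant counties but not points that land just across the county line.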
The “visited by public” checkbox sometimes helps, but it’s pretty unreliable because business owners are also unclear on its meaning. We’ve only mapped about 2% of the SDPs so far, but we’ve already encountered plenty of cases that have forced us to consider the privacy of private residences:
I think our decisions so far are roughly in line with the OSM community consensus as expressed by this summary. Protecting privacy is important to us, as is on-the-ground verifiability to some extent. When in doubt, we’ve deferred the task for later review. Depending on the circumstances, we may want to contact some of these businesses to determine their expectations around being listed. |
As of December 21, we reached 4% across all challenges, including 43% of high-priority tasks, 7% of medium-priority tasks, and 2% of low-priority tasks: There have been cases where both the SDP and sign outside the building had the wrong address. On December 24, I added a section to the detailed instructions document explaining how to configure iD to show the Santa Clara County parcel layer as a background layer to more easily associate addresses in SDPs with buildings in OSM:
We finished half the Religious Institutions challenge by December 27 and finished the nursing home challenge on January 2 (thanks Will!). Camille and @sutter-dave joined us on January 7 to help with the POI import and introduce us to Apogee as a possible tool for future imports. As of January 12, we finished two-thirds of the high-priority tasks, enough for the time series chart to show some movement: We finished half of the laundromat/dry cleaning challenge by January 15. Unfortunately, around this time we discovered that a MapRoulette user unfamiliar with OSM editing had begun completing tasks completely incorrectly; their edits had to be reverted and 12 tasks reset in the banks challenge. On January 22, @impiaaa reran the scraper, pulling in lots of new tasks that set our completion rate back to less than 4%. On the bright side, the update brought in improvements to geocoding, due in part to the new addresses we’ve been adding as part of the POI import. Additionally, the priorities have changed so that outlying, typically poorly geocoded tasks no longer stubbornly show up any time you try to get a random task. We refinished the pharmacies challenge on January 25 and got gas stations back up to halfway on January 31. As of February 3, we’re about 6% complete, having fully recovered from the latest update from the SDP website. |
This import is one of the more extensive projects on the MapRoulette platform. The site has been serving us well, but certain things like gathering statistics do take a bit longer, understandable considering the large number of challenges and the sheer size of some of those challenges. Unfortunately, MapRoulette has been experiencing performance problems and the team is considering making some changes that will adversely impact the import. maproulette/maproulette3#1536 would limit the number of challenges per project, and maproulette/maproulette3#1535 would limit the number of tasks per challenge. Taken together, these changes would force us to split the import project into several projects, possibly arbitrarily, making it more difficult for us to gauge our progress, attract and onboard new mappers, ensure equitable coverage throughout the county, and manage synchronization with the SDP database. If these changes go into effect as planned, we may need to consider an alternative platform for the import. We don’t have great options. to-fix is unmaintained, Sophox Editor is offline, the OSMUS Tasking Manager is ill-suited to microtasking, and RapiD only integrates with ArcGIS services (which would make rescrapes impractical). If we stick with MapRoulette, adhering to the new caps would mean splitting apart the larger challenges like Retail and Other into dozens of challenges. How would we split the challenges? If we split them by ZIP code or city, certain areas will inevitably enjoy more attention than others. But anything more arbitrary would prevent us from consolidating tasks into bulk changesets. |
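To make the splitting question concrete, here is a minimal sketch that groups tasks by city and then chunks each group to stay under a per-challenge cap. The cap value, the `city` field, and the `split_challenge` function are hypothetical; the actual limits in the linked tickets are not stated here.

```python
from collections import defaultdict


def split_challenge(tasks, key, cap):
    """Group tasks by a property, then chunk each group to at most `cap` tasks."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task[key]].append(task)

    challenges = {}
    for value, group in sorted(groups.items()):
        for part, start in enumerate(range(0, len(group), cap), start=1):
            challenges[f"{value} (part {part})"] = group[start:start + cap]
    return challenges


# Hypothetical tasks: five in San José and two in Milpitas, with a cap of two.
tasks = [{"id": i, "city": "San José"} for i in range(5)]
tasks += [{"id": i, "city": "Milpitas"} for i in range(5, 7)]

challenges = split_challenge(tasks, key="city", cap=2)
for name, members in challenges.items():
    print(name, len(members))
```

Grouping by city keeps nearby tasks together for bulk changesets, at the cost of the uneven attention between areas described above.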
The caps are still being discussed (see linked tickets above) and having community input like yours is very valuable to us. We do need to strike a balance between performance and flexibility, and are trying to determine what that right balance is. I would hate for y'all to move away from MapRoulette because of this, if you find the platform otherwise useful. I'll have a chat with @1ec5 to learn more about the way you use MapRoulette. |
Thanks so much for reaching out, @mvexel! MapRoulette has been key to this import project – #23 (comment) shows that there’s really no alternative that matches MapRoulette in ease of use when the data source can update dynamically. From the looks of it, any new limit to the number of tasks per challenge or the number of challenges per project would comfortably accommodate this import’s project, so we should be in good shape. |
Lots more happened since the last time I updated this issue:
Timeline
The laundry/dry cleaner challenge returned to 50% on February 7. By February 15, we reached 7% overall:
On February 24, we retagged all the Fry’s locations in the county, including all the locations that had filed SDPs, as
On February 25, @impiaaa overtook me on the leaderboard to claim first place:
On March 1, we completed the maintenance services challenge. On March 3, the county moved back to Tier 2 (red). On March 4, we reran the scraper incrementally. We remained at 7%:
The childcare challenge reached 50% complete on March 17. As seen here on March 15, when we had reached 40% complete and 25% fixed, childcare and kindergarten facilities are much more evenly distributed throughout the county compared to before:
As of today, we’ve completed 10% of the entire import:
Also, @impiaaa and I submitted a joint talk proposal for State of the Map 2021 about this import. 🤞
Address bloopers
Some examples we’ve seen of SDP addresses that threw off the geocoder:
Tips for mappers
Further afield
|
So far, it looks like we’ve gotten improvements to names, addresses, and opening hours from StreetComplete users. StreetComplete doesn’t ask about some things that are often missing from the SDPs we’re importing, such as cuisine (streetcomplete/StreetComplete#103), medical specialty (streetcomplete/StreetComplete#1020), and religious denomination (streetcomplete/StreetComplete#1737). |
|
Task completion milestones:
Other notable events:
|
We have a new 3rd-place mapper! As of June 10: 15% complete overall, including 30% of high-priority tasks
Current progress: Some tidbits:
|
I submitted a poster to the State of the Map 2021 poster competition: |
I’ve uploaded some of the files I used to create this report to the osm-southbay-poi-coverage repository. |
An updated report as of August 5, 2023: OpenStreetMap POI coverage in Santa Clara County August 2023.pdf |
We should incorporate Santa Clara County Social Distancing Protocol data into a community asset map and ultimately into the larger OpenStreetMap database.
Background
Since the COVID-19 pandemic began, most point of interest data in OSM in the South Bay has been at risk of going stale due to temporary or permanent closures or changes in opening hours or services. In #21, we attempted to put together a spreadsheet of open businesses based on business association listings, but this listing is skewed toward certain kinds of businesses, and the copyright situation is unclear (or at least not clear enough to rely on in OSM).
The Santa Clara County Public Health Department has created a listing of businesses and institutions that have submitted social distancing protocols for approval. At the time of writing, the listing includes 29,324 establishments. These are the businesses and institutions most likely to be open during the COVID-19 pandemic.
Unfortunately, the county hasn’t published a structured dataset corresponding to this listing. Moreover, the listing is geared towards checking for compliance and isn’t particularly usable by consumers as a business directory: it allows searching by business name and city or filtering by category, but there’s no way to limit search results by proximity or get directions.
Rationale
The 2020 National Day of Civic Hacking included a call for community asset mapping. We brainstormed several ideas before settling on the social distancing protocol listing as something that would make a government dataset significantly more accessible to the general public while avoiding overlap with projects such as Bay Area Community Resources.
The short-term goal is to process the listings into a mappable format and display the data directly on an asset map. People need to know which nearby businesses they can safely patronize and which brick-and-mortar community services are currently available.
The long-term goal is to add these businesses and institutions to OSM along with some COVID-19-specific tagging. This would help to jumpstart OSM’s local efforts to update POIs post-lockdown. It would also enable projects such as Bay Area Community Resources to use OSM as one source for POI data or at least have more confidence in OSM as its basemap. Both projects would make this data more accessible and usable to the general public than the current listing.
Implementation details
We expect this listing to grow significantly over time, so it’s important to take an automated, repeatable approach.
The social distancing protocol site provides only unstructured, inconsistently formatted addresses, so we’ll need to use a geocoder to convert the addresses to coordinates to make them mappable. An open-source geocoder would be preferable to a proprietary one, because we expect this data to eventually go into OSM. The import in #4 adds addresses but only in San José, whereas the county data is countywide. So we’ll need to use the county master address file. We only need to set up the geocoder on a local machine for one-off batch geocoding tasks, but eventually we may want to set up something on a server for future projects.
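As a rough sketch of the last step in that pipeline, the snippet below turns geocoded listings into a GeoJSON FeatureCollection that a web map or task manager could load. The record field names (`name`, `address`, `category`, `lat`, `lon`) and the sample listings are assumptions for illustration, not the scraper’s actual output format.

```python
import json


def to_feature(record):
    """Build one GeoJSON point feature from a geocoded listing."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as longitude, then latitude.
            "coordinates": [record["lon"], record["lat"]],
        },
        "properties": {
            "name": record["name"],
            "address": record["address"],
            "category": record["category"],
        },
    }


def to_feature_collection(records):
    """Skip listings the geocoder could not place and wrap the rest."""
    features = [
        to_feature(r)
        for r in records
        if r.get("lat") is not None and r.get("lon") is not None
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up listings; the field names are assumptions about the scraper's output.
records = [
    {"name": "Example Cafe", "address": "123 Example St, San Jose, CA",
     "category": "Restaurant", "lat": 37.3382, "lon": -121.8863},
    {"name": "Unplaced Business", "address": "unknown",
     "category": "Other", "lat": None, "lon": None},
]

collection = to_feature_collection(records)
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Dropping ungeocoded listings silently is a simplification; in practice they would be worth logging so the addresses can be fixed and retried.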
The site also links each business to an electronically completed PDF for details about its social distancing protocol. It’s feasible but inconvenient to scrape these PDFs, so we’re going to ignore them for now. Unfortunately, it means we won’t be able to automatically clarify the businesses in the “Other” category.
When it comes time to add the businesses to OSM, we could set up a MapRoulette challenge that asks the mapper to identify the shop inside the building using aerial and street-level imagery. We won’t want to blindly add every result en masse, because we’re concerned that some of the listings may be home-based businesses – identifying signage will be key.
Tasks
To make the asset map:
To get the data into OSM:
Additional notes
This brainstorming document turned up several other datasets worth scraping and getting into Bay Area Community Resources or OSM.