This is the open-source portion of the back-end website-scraping software that powers www.macovidvaccines.com. Technologies used are Node JS and Puppeteer. In production, this code runs every minute via AWS Lambda, posting its results to a JSON file in an AWS S3 bucket.
Our project board is the definitive source for work in progress and work that needs to be done.
- If you do not have NodeJS installed, please go to the NodeJS Downloads Page to download and install it. Choose the latest version corresponding to the major version number defined in `.nvmrc`. If you use `nvm` to manage installed versions of `node` and `npm`, switch to the node version defined in `.nvmrc` via `nvm use`. Run `nvm install` if you get a "not yet installed" error. NOTE: If you use a different major version of `node`, you may experience errors when running commands from this repo with `node` or `npm`. Additionally, please ensure you're using `npm` version `7.0.0` or greater by running `npm --version`.
. -
Download a recent version of Chromium locally: https://download-chromium.appspot.com/
- Create a `.env` file in the root directory of the project with the following:

  ```
  DEVELOPMENT=true
  CHROMEPATH="path/to/chromium/that/you/downloaded" # e.g. /Applications/Chromium.app/Contents/MacOS/Chromium
  # PROPRIETARY_SITE_SCRAPERS_PATH="./../proprietary/site-scrapers" (optional, example)
  # SLACKWEBHOOKBOTCHANNEL=https://hooks.slack.com/services/your/token/here
  # AWSACCESSKEYID=yourAccessKey
  # AWSSECRETACCESSKEY=yourSecretKey
  ```
- Install `prettier` and `eslint`; make sure you run them before making any commits.
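Taken together, the version checks from the list above boil down to a few commands (a sketch for macOS/Linux shells; Windows users will need the equivalents for their shell):

```sh
nvm use          # switch to the node version pinned in .nvmrc
nvm install      # only if `nvm use` reports the version isn't installed yet
npm --version    # confirm npm is 7.0.0 or greater
```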
There are two separate lambdas: scrapers that use Puppeteer browsers, and scrapers that don't use a browser.
- In your terminal, install dependencies with `npm install`.
- To run all browser-based scrapers: `node scraper.js`. To run an individual scraper, specify the base filename from `site-scrapers`, e.g. `node scraper.js MAImmunizations` to run `site-scrapers/MAImmunizations`.
- If you have your own scrapers you want to add, mimic the structure of `./site-scrapers/` inside a folder named `proprietary/site-scrapers` (see the layout sketch after this list). In your `.env` file, set `PROPRIETARY_SITE_SCRAPERS_PATH` to `./../proprietary/site-scrapers`. This naming is recommended since the `.gitignore` lists the folder `proprietary`.
- To run all no-browser scrapers: `node scrapers-no-browser.js`. To run an individual scraper, specify the base filename from `no-browser-site-scrapers`, e.g. `node scrapers-no-browser.js Color` to run `no-browser-site-scrapers/Color`.
- If you have your own scrapers you want to add, mimic the structure of `./no-browser-site-scrapers/` inside a folder named `proprietary/no-browser-site-scrapers`. In your `.env` file, set `PROPRIETARY_NO_BROWSER_SITE_SCRAPERS_PATH` to `./../proprietary/no-browser-site-scrapers`. This naming is recommended since the `.gitignore` lists the folder `proprietary`.
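For reference, the proprietary folder described above would mirror the public layout roughly like this (`MyClinic` and `MyOtherClinic` are placeholder names; match whatever shape, file or folder, the public `./site-scrapers/` entries use):

```
proprietary/
├── site-scrapers/
│   └── MyClinic/
└── no-browser-site-scrapers/
    └── MyOtherClinic/
```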
This project runs on AWS SAM (the AWS Serverless Application Model). A GitHub action runs on pushes to `master` and executes the following commands (with parameters set using GitHub secrets):

```
sam build
sam validate
sam deploy
```
If you run this yourself with the default settings and the proper permissions, you can create a staging environment in AWS that runs this code. Be careful, though: the scraper code runs every minute, which we don't want happening in a staging environment for longer than necessary.
To scrape data from a site, you either need to:

- figure out which API calls are made (Chrome DevTools > Network tab) and make these calls yourself to fetch availability
- interact with the site (clicking buttons, inspecting HTML elements, etc.) using Puppeteer, as sketched below
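As a sketch of both approaches (the endpoint URL and CSS selectors below are invented for illustration; existing scrapers show the real patterns):

```js
const https = require("https");
const puppeteer = require("puppeteer");

// Approach 1: call the site's own API directly. The URL is a placeholder;
// find the real request in Chrome DevTools > Network.
function fetchViaApi() {
    return new Promise((resolve, reject) => {
        https
            .get("https://example.com/api/appointments", (res) => {
                let body = "";
                res.on("data", (chunk) => (body += chunk));
                res.on("end", () => resolve(JSON.parse(body)));
            })
            .on("error", reject);
    });
}

// Approach 2: drive the page with Puppeteer. The selectors are placeholders;
// inspect the real site to find yours.
async function fetchViaBrowser(browser) {
    const page = await browser.newPage();
    await page.goto("https://example.com/schedule");
    await page.click("#find-appointments");
    return page.$$eval(".appointment-slot", (slots) =>
        slots.map((slot) => slot.textContent.trim())
    );
}
```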
Implement `GetAvailableAppointments` in your scraper to return a list of locations with pertinent location and availability data. For example, an object describing a CVS location might look like this:
```
{
    city: 'North Andover',
    name: 'CVS (North Andover)',
    hasAvailability: false,
    availability: {},
    timestamp: 2021-04-07T18:27:37.498Z,
    siteTimestamp: 2021-04-07T17:23:33.046Z,
    signUpLink: 'https://www.cvs.com/immunizations/covid-19-vaccine?icid=cvs-home-hero1-banner-1-link2-coronavirus-vaccine',
    latitude: 42.6757741,
    longitude: -71.12855859999999
}
```
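A bare-bones scraper might look like the sketch below. The export shape, URL, and selector are assumptions for illustration; copy the conventions of an existing scraper in `site-scrapers/` rather than this sketch.

```js
// site-scrapers/MyClinic/index.js -- "MyClinic" is a made-up example.
async function GetAvailableAppointments(browser) {
    const page = await browser.newPage();
    await page.goto("https://example.com/vaccine"); // placeholder URL
    // A boolean is enough to start with; populate `availability` once you
    // have seen what real appointment data looks like (see below).
    const hasAvailability = (await page.$(".appointment-slot")) !== null;
    return [
        {
            name: "My Clinic (Anytown)",
            city: "Anytown",
            hasAvailability,
            availability: {},
            signUpLink: "https://example.com/vaccine",
            timestamp: new Date(),
        },
    ];
}

module.exports = { GetAvailableAppointments };
```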
Some values, such as `latitude` and `longitude`, may be calculated based on the presence of other values such as `street`, `city`, and `zip`. Look at existing scrapers for examples of how to populate the most important values.
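For instance, a derivation like the following could fill in missing coordinates (`getCoordinates` here is a stand-in stub, not a helper from this repo; check existing scrapers for the actual geocoding approach):

```js
// Stub for illustration: a real implementation would call a geocoding service.
async function getCoordinates(address) {
    return { latitude: 42.6757741, longitude: -71.12855859999999 };
}

// Fill in latitude/longitude only when they're missing and an address exists.
async function withCoordinates(location) {
    const { street, city, zip, latitude } = location;
    if (latitude === undefined && street && city && zip) {
        const coords = await getCoordinates(`${street}, ${city}, MA ${zip}`);
        return { ...location, ...coords };
    }
    return location;
}
```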
It's often the case that, at the time you're trying to write a scraper, there is no availability to scrape. This makes it hard to know what the data will look like when appointments do open up. To start, you can just return `hasAvailability = true` or `false` and not report specific available appointments.
Then, for the case where you expect to have availability, you can add code that writes data (HTML, screenshots, `.har` files) to S3 for later inspection, or even sends a Slack notification so that you know to check the site soon! There are examples of these throughout our codebase (search for `slack` or `s3.savePageContent`).
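For instance, a Slack ping might look like the sketch below, posting to the `SLACKWEBHOOKBOTCHANNEL` webhook from your `.env`. The commented `s3.savePageContent` call assumes a `(name, page)` signature; check the real helper in the codebase before copying it.

```js
const https = require("https");

// Post a heads-up to Slack so a human knows to inspect the site soon.
function notifySlack(message) {
    const req = https.request(process.env.SLACKWEBHOOKBOTCHANNEL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
    });
    req.end(JSON.stringify({ text: message }));
}

// Inside a scraper, when you expect availability:
// await s3.savePageContent("MyClinic", page); // assumed signature, verify first
// notifySlack("MyClinic may have appointments; check the site and saved HTML.");
```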
Lastly, it's a good idea to write tests for any new scraper you add (a minimal sketch follows the examples below).
Here are some scrapers to use as examples:
- `EastBostonNHC` (makes API calls and has unit tests)
- `FamilyPracticeGroup` (uses Puppeteer to interact with the site)
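As a starting point, a framework-agnostic test could look like the sketch below; the repo's real tests (e.g. `EastBostonNHC`'s) show the actual conventions and may use a test runner. `MyClinic` is the made-up scraper from the sketch above.

```js
// test/MyClinicTest.js
const assert = require("assert");
const puppeteer = require("puppeteer");
const { GetAvailableAppointments } = require("../site-scrapers/MyClinic");

(async () => {
    // Reuse the Chromium binary configured in .env.
    const browser = await puppeteer.launch({
        executablePath: process.env.CHROMEPATH,
    });
    try {
        const locations = await GetAvailableAppointments(browser);
        assert(Array.isArray(locations), "should return a list of locations");
        for (const location of locations) {
            assert.strictEqual(typeof location.hasAvailability, "boolean");
            assert(location.name, "every location needs a name");
        }
        console.log("ok");
    } finally {
        await browser.close();
    }
})();
```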