This is the open-source portion of the back-end website-scraping software that powers www.macovidvaccines.com. Technologies used are Node JS and Puppeteer. In production, this code runs every minute via AWS Lambda, posting its results to a JSON file in an AWS S3 bucket.
Our project board is the definitive source for work in progress and work that needs to be done.
- If you do not have NodeJS installed, please go to the NodeJS Downloads Page to download and install it. Choose the latest version corresponding to the major version number defined in `.nvmrc`. If you use `nvm` to manage installed versions of `node` and `npm`, switch to the node version defined in `.nvmrc` via `nvm use`. Run `nvm install` if you get a "not yet installed" error. NOTE: If you use a different major version of `node`, you may experience errors when running commands from this repo with `node` or `npm`. Additionally, please ensure you're using `npm` version `7.0.0` or greater by running `npm --version`.
. -
Download a recent version of Chromium locally: https://download-chromium.appspot.com/
- Create a `.env` file in the root directory of the project with the following:

  ```
  DEVELOPMENT=true
  CHROMEPATH="path/to/chromium/that/you/downloaded" # e.g. /Applications/Chromium.app/Contents/MacOS/Chromium
  # PROPRIETARY_SITE_SCRAPERS_PATH="./../proprietary/site-scrapers" (optional, example)
  # SLACKWEBHOOKBOTCHANNEL=https://hooks.slack.com/services/your/token/here
  # AWSACCESSKEYID=yourAccessKey
  # AWSSECRETACCESSKEY=yourSecretKey
  ```
- Install `prettier` and `eslint`; make sure you run them before making any commits.
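Taken together, the version checks from the list above boil down to a few commands (a sketch for macOS/Linux shells; Windows users will need the equivalents for their shell):

```sh
nvm use          # switch to the node version pinned in .nvmrc
nvm install      # only if `nvm use` reports the version isn't installed yet
npm --version    # confirm npm is 7.0.0 or greater
```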
There are two separate lambdas: scrapers that use Puppeteer browsers, and scrapers that don't use a browser.
- In your terminal, install dependencies with `npm install`.
- To run all browser-based scrapers: `node scraper.js`. To run an individual scraper, specify the base filename from `site-scrapers`, e.g. `node scraper.js MAImmunizations` to run `site-scrapers/MAImmunizations`.
- If you have your own scrapers you want to add, mimic the structure of `./site-scrapers/` inside a folder named `proprietary/site-scrapers` (see the layout sketch after this list). In your `.env` file, set `PROPRIETARY_SITE_SCRAPERS_PATH` to `./../proprietary/site-scrapers`. This naming is recommended since the `.gitignore` lists the folder `proprietary`.
- To run all no-browser scrapers: `node scrapers-no-browser.js`. To run an individual scraper, specify the base filename from `no-browser-site-scrapers`, e.g. `node scrapers-no-browser.js Color` to run `no-browser-site-scrapers/Color`.
- If you have your own scrapers you want to add, mimic the structure of `./no-browser-site-scrapers/` inside a folder named `proprietary/no-browser-site-scrapers`. In your `.env` file, set `PROPRIETARY_NO_BROWSER_SITE_SCRAPERS_PATH` to `./../proprietary/no-browser-site-scrapers`. This naming is recommended since the `.gitignore` lists the folder `proprietary`.
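For reference, the proprietary folder described above would mirror the public layout roughly like this (`MyClinic` and `MyOtherClinic` are placeholder names; match whatever shape, file or folder, the public `./site-scrapers/` entries use):

```
proprietary/
├── site-scrapers/
│   └── MyClinic/
└── no-browser-site-scrapers/
    └── MyOtherClinic/
```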
This project runs on AWS SAM (the AWS Serverless Application Model). A GitHub action runs on pushes to `master` and executes the following commands (with parameters set using GitHub secrets):

```
sam build
sam validate
sam deploy
```
If you run this yourself with the default settings and the proper permissions, you can create a staging environment in AWS that runs this code. Be careful, though: the scraper code runs every minute, which we don't want happening in a staging environment for longer than necessary.
To scrape data from a site, you either need to:

- figure out which API calls are made (Chrome DevTools > Network tab) and make these calls yourself to fetch availability
- interact with the site (clicking buttons, inspecting HTML elements, etc.) using Puppeteer, as sketched below
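As a sketch of both approaches (the endpoint URL and CSS selectors below are invented for illustration; existing scrapers show the real patterns):

```js
const https = require("https");
const puppeteer = require("puppeteer");

// Approach 1: call the site's own API directly. The URL is a placeholder;
// find the real request in Chrome DevTools > Network.
function fetchViaApi() {
    return new Promise((resolve, reject) => {
        https
            .get("https://example.com/api/appointments", (res) => {
                let body = "";
                res.on("data", (chunk) => (body += chunk));
                res.on("end", () => resolve(JSON.parse(body)));
            })
            .on("error", reject);
    });
}

// Approach 2: drive the page with Puppeteer. The selectors are placeholders;
// inspect the real site to find yours.
async function fetchViaBrowser(browser) {
    const page = await browser.newPage();
    await page.goto("https://example.com/schedule");
    await page.click("#find-appointments");
    return page.$$eval(".appointment-slot", (slots) =>
        slots.map((slot) => slot.textContent.trim())
    );
}
```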
Implement `GetAvailableAppointments` in your scraper to return a list of locations with pertinent location and availability data. For example, an object describing a CVS location might look like this:
```
{
    city: 'North Andover',
    name: 'CVS (North Andover)',
    hasAvailability: false,
    availability: {},
    timestamp: 2021-04-07T18:27:37.498Z,
    siteTimestamp: 2021-04-07T17:23:33.046Z,
    signUpLink: 'https://www.cvs.com/immunizations/covid-19-vaccine?icid=cvs-home-hero1-banner-1-link2-coronavirus-vaccine',
    latitude: 42.6757741,
    longitude: -71.12855859999999
}
```
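A bare-bones scraper might look like the sketch below. The export shape, URL, and selector are assumptions for illustration; copy the conventions of an existing scraper in `site-scrapers/` rather than this sketch.

```js
// site-scrapers/MyClinic/index.js -- "MyClinic" is a made-up example.
async function GetAvailableAppointments(browser) {
    const page = await browser.newPage();
    await page.goto("https://example.com/vaccine"); // placeholder URL
    // A boolean is enough to start with; populate `availability` once you
    // have seen what real appointment data looks like (see below).
    const hasAvailability = (await page.$(".appointment-slot")) !== null;
    return [
        {
            name: "My Clinic (Anytown)",
            city: "Anytown",
            hasAvailability,
            availability: {},
            signUpLink: "https://example.com/vaccine",
            timestamp: new Date(),
        },
    ];
}

module.exports = { GetAvailableAppointments };
```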
Some values, such as `latitude` and `longitude`, may be calculated based on the presence of other values such as `street`, `city`, and `zip`. Look at existing scrapers for examples of how to populate the most important values.
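For instance, a derivation like the following could fill in missing coordinates (`getCoordinates` here is a stand-in stub, not a helper from this repo; check existing scrapers for the actual geocoding approach):

```js
// Stub for illustration: a real implementation would call a geocoding service.
async function getCoordinates(address) {
    return { latitude: 42.6757741, longitude: -71.12855859999999 };
}

// Fill in latitude/longitude only when they're missing and an address exists.
async function withCoordinates(location) {
    const { street, city, zip, latitude } = location;
    if (latitude === undefined && street && city && zip) {
        const coords = await getCoordinates(`${street}, ${city}, MA ${zip}`);
        return { ...location, ...coords };
    }
    return location;
}
```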
It's often the case that, at the time you're trying to write a scraper, there is no availability to scrape. This makes it hard to know what the data will look like when appointments do open up. To start, you can just return `hasAvailability = true` or `false` and not report specific available appointments.
Then, for the case where you expect to have availability, you can add code that writes data (HTML, screenshots, `.har` files) to S3 for later inspection, or even sends a Slack notification so that you know to check the site soon! There are examples of these throughout our codebase (search for `slack` or `s3.savePageContent`).
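For instance, a Slack ping might look like the sketch below, posting to the `SLACKWEBHOOKBOTCHANNEL` webhook from your `.env`. The commented `s3.savePageContent` call assumes a `(name, page)` signature; check the real helper in the codebase before copying it.

```js
const https = require("https");

// Post a heads-up to Slack so a human knows to inspect the site soon.
function notifySlack(message) {
    const req = https.request(process.env.SLACKWEBHOOKBOTCHANNEL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
    });
    req.end(JSON.stringify({ text: message }));
}

// Inside a scraper, when you expect availability:
// await s3.savePageContent("MyClinic", page); // assumed signature, verify first
// notifySlack("MyClinic may have appointments; check the site and saved HTML.");
```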
Lastly, it's a good idea to write tests for any new scraper you add (a minimal sketch follows the examples below).
Here are some scrapers to use as examples:
- `EastBostonNHC` (makes API calls and has unit tests)
- `FamilyPracticeGroup` (uses Puppeteer to interact with the site)
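As a starting point, a framework-agnostic test could look like the sketch below; the repo's real tests (e.g. `EastBostonNHC`'s) show the actual conventions and may use a test runner. `MyClinic` is the made-up scraper from the sketch above.

```js
// test/MyClinicTest.js
const assert = require("assert");
const puppeteer = require("puppeteer");
const { GetAvailableAppointments } = require("../site-scrapers/MyClinic");

(async () => {
    // Reuse the Chromium binary configured in .env.
    const browser = await puppeteer.launch({
        executablePath: process.env.CHROMEPATH,
    });
    try {
        const locations = await GetAvailableAppointments(browser);
        assert(Array.isArray(locations), "should return a list of locations");
        for (const location of locations) {
            assert.strictEqual(typeof location.hasAvailability, "boolean");
            assert(location.name, "every location needs a name");
        }
        console.log("ok");
    } finally {
        await browser.close();
    }
})();
```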