This is the open-source portion of the back-end website-scraping software that powers www.macovidvaccines.com. It is built with Node.js and Puppeteer. In production, this code runs every 5 minutes via AWS Lambda and posts its results to a JSON file in an AWS S3 bucket.
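For orientation, here is a minimal sketch of what that Lambda entry point might look like. It is not this project's actual code: the handler wiring, the `runAllScrapers` placeholder, and the `data.json` key are hypothetical, while `AWSS3BUCKETNAME`, `AWSACCESSKEYID`, and `AWSSECRETACCESSKEY` are the variables described in the deployment notes below.

```js
// Hypothetical sketch only -- run the scrapers, then publish the combined
// results as a JSON file to S3 (the real handler may be organized differently).
const AWS = require("aws-sdk");

const s3 = new AWS.S3({
    accessKeyId: process.env.AWSACCESSKEYID,
    secretAccessKey: process.env.AWSSECRETACCESSKEY,
});

async function runAllScrapers() {
    // Placeholder: in the real project this would run everything in site-scrapers/.
    return { results: [] };
}

exports.handler = async () => {
    const results = await runAllScrapers();
    await s3
        .putObject({
            Bucket: process.env.AWSS3BUCKETNAME,
            Key: "data.json", // hypothetical object key
            Body: JSON.stringify(results),
            ContentType: "application/json",
        })
        .promise();
};
```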
- Download a recent version of Chromium locally: https://download-chromium.appspot.com/
- Create a `.env` file with the following (the scraper sketch after this list shows one way these variables might be loaded):

  ```
  DEVELOPMENT=true
  CHROMEPATH="path/to/chromium/that/you/downloaded" # e.g. /Applications/Chromium.app/Contents/MacOS/Chromium
  PROPRIETARY_SITE_SCRAPERS_PATH="./../proprietary/site-scrapers" # optional; example value
  ```
- Install `prettier` and `eslint`; make sure you run them before making any commits.
- In your terminal, install dependencies with `npm install`.
- To run all scrapers: `node main.js`. To run an individual scraper, specify the base filename from `site-scrapers`, e.g. `node main.js MAImmunizations` to run `site-scrapers/MAImmunizations.js`. (A sketch of how this dispatch might work follows this list.)
- If you have your own scrapers you want to add, mimic the structure of `./site-scrapers/` inside a folder named `proprietary/site-scrapers` (a hypothetical example module is sketched after this list). In your `.env` file, set `PROPRIETARY_SITE_SCRAPERS_PATH` to `./../proprietary/site-scrapers`. This naming is recommended since the `.gitignore` lists the `proprietary` folder.
- When you're ready to deploy via AWS Lambda, run `npm run predeploy`, which will generate `lambda.zip` for you (a hypothetical example of such a script appears after this list). This file needs to stay under 50 MB for you to upload it manually.
- Your production environment needs the environment variables `AWSS3BUCKETNAME`, `AWSACCESSKEYID`, and `AWSSECRETACCESSKEY` so that it can publish to S3; the Lambda sketch near the top shows one way they might be used. If you are adding your own scrapers, set `PROPRIETARY_SITE_SCRAPERS_PATH` in production as well. If you have any scrapers that need to solve reCAPTCHAs, you will also need a `RECAPTCHATOKEN` from the 2captcha service.
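To make the scraper-selection step concrete, `main.js` presumably dispatches on its first command-line argument along these lines; this is a guess at the mechanism, not the repository's actual code:

```js
// Hypothetical sketch: `node main.js` runs every scraper in site-scrapers/,
// while `node main.js MAImmunizations` runs only MAImmunizations.js.
const fs = require("fs");
const path = require("path");

const dir = path.join(__dirname, "site-scrapers");
const requested = process.argv[2]; // e.g. "MAImmunizations", or undefined for all

const files = fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".js"))
    .filter((f) => !requested || path.basename(f, ".js") === requested);

(async () => {
    for (const file of files) {
        const scrape = require(path.join(dir, file));
        console.log(await scrape()); // assumes each module exports an async function
    }
})();
```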
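If you add your own scrapers under `proprietary/site-scrapers`, copy a real module from `./site-scrapers/` for the exact contract; as a rough sketch, and assuming each module exports one async function and that `CHROMEPATH` comes from your `.env`, such a file might look like:

```js
// Hypothetical proprietary/site-scrapers/ExampleSite.js. The export shape,
// selector, and return value are illustrative; mimic the real modules in
// ./site-scrapers/ for the actual contract.
require("dotenv").config(); // loads CHROMEPATH etc. from .env
const puppeteer = require("puppeteer");

module.exports = async function scrapeExampleSite() {
    const browser = await puppeteer.launch({
        executablePath: process.env.CHROMEPATH, // the Chromium you downloaded
    });
    const page = await browser.newPage();
    await page.goto("https://example.com/vaccine-signup"); // hypothetical URL
    // Pull whatever availability signal the site exposes.
    const hasAvailability = await page.$eval("body", (el) =>
        el.innerText.includes("Appointments available")
    );
    await browser.close();
    return { name: "Example Site", hasAvailability, timestamp: new Date() };
};
```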
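The `npm run predeploy` step implies a `predeploy` script in `package.json`; the exact command in this repository may differ, but a typical zip-based script looks something like:

```json
{
    "scripts": {
        "predeploy": "zip -r lambda.zip . -x '.git/*' -x 'lambda.zip'"
    }
}
```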