A web scraper that downloads assets (minutes, agendas) from committees of various municipalities.
Run the following to install the required dependencies:
pip install -r requirements.txt
To run all the web scrapers (see the list of supported cities below), run the following command:
python scraper.py
Result:
- writes out a timestamped CSV file of all the assets to the assets directory
- downloads all assets to the download directory. Replaces old files with new files.
Note: a few links are broken and do not point to a valid asset (see Issue 5).
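As a rough illustration of the timestamped CSV output described above, here is a minimal sketch. The function name, the file naming scheme, and the column names are assumptions for illustration only and are not taken from scraper.py.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_assets_csv(rows, out_dir="assets", now=None):
    """Write rows (a list of dicts) to a timestamped CSV file and return its path."""
    now = now or datetime.now()
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: assets_YYYYmmdd_HHMMSS.csv
    csv_path = out_path / f"assets_{now:%Y%m%d_%H%M%S}.csv"
    # Hypothetical columns; the real file may use different ones.
    fieldnames = ["municipality", "committee", "date", "asset_type", "url"]
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

Because the timestamp is part of the file name, each run produces a new CSV file instead of overwriting the previous one.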
2948 files / 10 minute scrape time / 13 minute download time
- Issue 1: For City Council, years 2002 - 2004 are not getting picked up by Selenium (the rows are there but the text is empty).
- Issue 2: For the Zoning Administrator committee, years 2019 - 2010 are not getting picked up by Selenium (the rows are there but the text is empty).
- Issue 3: Under the committee "Processing Quality of Life Forum", there are multiple agendas on the same date. They appear as separate entities in the CSV file. However, when these resources are downloaded, they are given the same name and only one of them is saved.
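The name collision in Issue 3 could be avoided by making repeated file names unique before saving. The sketch below is one possible approach and is not what the project currently does. The function name and the example file names are hypothetical.

```python
from collections import Counter


def unique_file_names(names):
    """Return names with a numeric suffix added to repeats, preserving order."""
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        if seen[name] == 1:
            # First occurrence keeps its original name.
            result.append(name)
            continue
        stem, dot, ext = name.rpartition(".")
        if dot:
            result.append(f"{stem}_{seen[name]}.{ext}")
        else:
            # No extension: append the counter to the whole name.
            result.append(f"{name}_{seen[name]}")
    return result


names = ["2024-07-22_agenda.pdf", "2024-07-22_agenda.pdf", "2024-07-22_minutes.pdf"]
print(unique_file_names(names))
```

This sketch does not check whether a generated name (e.g. one ending in _2) already exists in the input, so a real fix would need to handle that case as well.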
1930 files / 2.5 minute scrape time / 6 minute download time
- Issue 8: A stale element exception is raised when clicking the year tab. Some of the years are skipped.
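A common workaround for stale element errors is to look up the element again and retry the click. The helper below is a generic sketch of that idea and is not part of this project. It takes the exception type as a parameter so that it runs without Selenium installed; the name retry_on and its parameters are hypothetical.

```python
import time


def retry_on(exc_types, action, attempts=3, delay=0.5):
    """Call action(); retry if it raises one of exc_types, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_types:
            if attempt == attempts:
                # Out of retries: let the caller see the original error.
                raise
            time.sleep(delay)


# Possible use with Selenium (assumption: the year tab can be located by its link text):
#   from selenium.common.exceptions import StaleElementReferenceException
#   from selenium.webdriver.common.by import By
#   retry_on(StaleElementReferenceException,
#            lambda: driver.find_element(By.LINK_TEXT, year).click())
```

The action should locate the element again on every call, since reusing a reference that is already stale would fail again.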
200 files / 6 minute scrape time / 45 second download time
- Create a new file with the name of the city (e.g. ClaytonScraper).
- Follow the template in example_scraper.py to create a class.
- The class should inherit from CivicScraper.
- Initialize with the correct municipality and url.
- Implement the scrape method where your final result updates the municipality instance variable.
- In scraper.py, add the new scraper to the scrapers list variable.
Now, when you run python scraper.py, the new scraper will be included.
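The steps above can be sketched roughly as follows. The CivicScraper class here is only a stand-in so that the example runs on its own. The constructor arguments, the attribute names, and the URL are assumptions; example_scraper.py is the authoritative template.

```python
class CivicScraper:
    """Stand-in for the project's base class; the real one may differ."""

    def __init__(self, municipality_name, url):
        self.municipality_name = municipality_name
        self.url = url
        self.municipality = None  # filled in by scrape()

    def scrape(self):
        raise NotImplementedError


class ClaytonScraper(CivicScraper):
    def __init__(self):
        # Placeholder URL, not the real city site.
        super().__init__("Clayton", "https://example.com/clayton/meetings")

    def scrape(self):
        # A real scraper would drive Selenium here; this only shows the shape:
        # the final result is stored on the municipality instance variable.
        self.municipality = {"name": self.municipality_name, "committees": []}
        return self.municipality


# In scraper.py, the new scraper is added to the scrapers list.
scrapers = [ClaytonScraper()]

for scraper in scrapers:
    scraper.scrape()
```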
The entity model is written up in records.py. Here are the four categories, which form a containment hierarchy.
- Municipality: the city (e.g. Walnut Creek)
- CommitteeData: the committee (e.g. Arts Commission)
- CommitteeMeeting: a specific meeting of a committee on a particular date that contains an agenda and/or minutes (e.g. Arts Commission Regular Meeting on Jul 22, 2024)
- CommitteeFile: an agenda or minutes for a specific meeting. Contains the file name and url (e.g. Minutes for Arts Commission Regular Meeting on Jul 22, 2024)
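To make the containment hierarchy concrete, here is a sketch using dataclasses. The class names come from the list above, but the field names and types are assumptions; the actual definitions are in records.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommitteeFile:
    name: str  # e.g. "Minutes"
    url: str


@dataclass
class CommitteeMeeting:
    name: str
    date: str  # kept as a string here for simplicity
    files: List[CommitteeFile] = field(default_factory=list)


@dataclass
class CommitteeData:
    name: str
    meetings: List[CommitteeMeeting] = field(default_factory=list)


@dataclass
class Municipality:
    name: str
    committees: List[CommitteeData] = field(default_factory=list)


def all_files(municipality):
    """Walk the hierarchy and return every CommitteeFile in it."""
    return [
        f
        for committee in municipality.committees
        for meeting in committee.meetings
        for f in meeting.files
    ]


minutes = CommitteeFile("Minutes", "https://example.com/minutes.pdf")
meeting = CommitteeMeeting("Regular Meeting", "2024-07-22", [minutes])
city = Municipality("Walnut Creek", [CommitteeData("Arts Commission", [meeting])])
```

Each level holds a list of the level below it, so a single Municipality object carries everything a scraper found.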
- Files are downloaded in parallel, which provides an approximate 2X speedup for the file download time
- Walnut Creek blocks headless browsers, so the Chrome driver must not be run headless
- Support for downloading both HTML and PDF files based on the content type of the response rather than the file extension
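The parallel download and content type handling mentioned above could look roughly like this. It is a sketch under assumptions and not the project's actual code: the function names, the worker count, and the fetch callable (which returns a content type and a body) are hypothetical. In practice fetch might wrap an HTTP client call and read the Content-Type response header.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed mapping; only the two types mentioned above are covered.
EXTENSION_BY_CONTENT_TYPE = {"application/pdf": ".pdf", "text/html": ".html"}


def extension_for(content_type):
    """Map a Content-Type header value to a file extension, or None if unknown."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";")[0].strip().lower()
    return EXTENSION_BY_CONTENT_TYPE.get(media_type)


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL in a thread pool.

    Returns a list of (url, extension, body) tuples in the same order as urls.
    """
    def download_one(url):
        content_type, body = fetch(url)
        return url, extension_for(content_type), body

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, urls))
```

Threads fit this workload because downloads spend most of their time waiting on the network. Choosing the extension from the response content type means a link without a file extension is still saved in the right format.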