Texas Health and Human Services is responsible for compiling data on Texas DFPS "non-compliance" cases at child placing agencies and residential treatment centers for foster children. An operation can be considered out of compliance for a wide range of violations - from expired fire extinguishers to sexual abuse of foster children.
HHSC provides this data publicly on the TX open data portal. Unfortunately, the provided dataset is incomplete. Specifically, it doesn't record data from child placing agency branch facilities (only their main facilities). Thus, this scraper exists to generate a more thorough dataset of TX non-compliance cases.
The operations responsible for caring for and managing foster children in Texas (child placing agencies, etc.) are all assigned an operation ID by DFPS. IDs appear to be assigned sequentially; however, there is no logical pattern that allows us to infer which numbers hold IDs. For example, operations may be assigned IDs 50003, 50004 and 50005, followed by a seemingly arbitrary gap in assignment that resumes 2,000 numbers later at 52003. What we do know is that, as of this writing, all IDs lie between 50,000 and 1,400,000.
This scraper works by attempting to visit the "Operation Details" page for each possible ID you ask it to investigate. When a facility is found and has been cited for non-compliance, its details are scraped from that page. Afterwards, the facility's compliance history page is opened and each non-compliance case is scraped from the table.
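In rough outline, the per-ID flow looks like this (a minimal sketch; the three helper names are hypothetical stand-ins for the scraper's actual modules in `src/`):

```typescript
// Sketch of the per-ID flow described above. The three helpers are
// hypothetical stand-ins for the scraper's actual modules.
declare function fetchOperationDetailsPage(id: number): Promise<string | null>;
declare function scrapeFacilityDetails(html: string): { num_deficiencies_cited: number };
declare function scrapeComplianceHistory(id: number): Promise<object[]>;

async function attemptID(operationID: number): Promise<object[]> {
    // Try to load the "Operation Details" page for this possible ID.
    const html = await fetchOperationDetailsPage(operationID);
    if (html === null) return []; // no facility exists under this ID

    // Scrape the facility-specific columns from the details page.
    const facility = scrapeFacilityDetails(html);

    // Only open the compliance history page if the facility has citations.
    if (facility.num_deficiencies_cited === 0) return [];

    // One row per non-compliance case in the history table.
    const cases = await scrapeComplianceHistory(operationID);
    return cases.map(c => ({ ...facility, ...c }));
}
```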
All non-compliance cases are aggregated into a .csv file named `hhsc-deficency-data`. One row represents one case and includes both case-specific and facility-specific columns. This file is saved to disk and pushed to the Express-News account on data.world. Before the scrape begins, if an existing `hhsc-deficency-data` file exists on data.world, it is pulled from the server, and the newly scraped cases are added to the existing spreadsheet before being pushed back to the server.

Another file, `hhsc-deficency-data-backup`, which is simply a copy of `hhsc-deficency-data` in its state prior to the current scrape, is also pushed to the server and saved to disk to protect existing data in case something goes wrong. If the shell script, `run.sh`, is used to make multiple scrapes at a time, each version will also be saved to disk in the `temp` directory.
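In pseudocode, that persistence cycle looks roughly like this (the function names and signatures are assumptions, not the actual exports of `src/pullFromServer` and `src/pushToServer`):

```typescript
// Sketch of the pull -> backup -> merge -> push cycle described above.
// Names and signatures are assumptions, not the repo's actual exports.
declare function pullFromServer(filename: string): Promise<object[]>;
declare function pushToServer(filename: string, rows: object[]): Promise<void>;
declare function saveToDisk(filename: string, rows: object[]): void;

async function persistScrape(newCases: object[]): Promise<void> {
    // Pull the existing dataset from data.world, if one exists.
    const existing = await pullFromServer('hhsc-deficency-data');

    // Back up the pre-scrape state in case something goes wrong.
    saveToDisk('hhsc-deficency-data-backup', existing);
    await pushToServer('hhsc-deficency-data-backup', existing);

    // Append the newly scraped cases and push the combined file back up.
    const combined = existing.concat(newCases);
    saveToDisk('hhsc-deficency-data', combined);
    await pushToServer('hhsc-deficency-data', combined);
}
```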
Here is a data dictionary describing the format of each row in the `hhsc-deficency-data` csv file:

- `uniq_id` <number>: Generated by the scraper; each row in the data has a unique ID
- `operation_id` <number>: The facility's unique ID
- `operation_number` <string>: Each facility also has a unique number, which often includes dashes
- `operation_type` <string>: Either Child Placing Agency or General Residential Operation
- `operation_name` <string>: The facility's name
- `programs_provided` <string>: Child Placement Agency, Child Care Services Only, Residential Treatment Center or Multiple Services
- `location_address` <string>: The facility's address
- `phone` <string>: The facility's phone number
- `county` <string>: The facility's county
- `website` <string>: The facility's website
- `email` <string>: The facility's email address
- `type_of_issuance` <string>: The facility's license
- `issuance_date` <date>: When the facility was assigned its license
- `open_foster_homes` <number>: Will be populated if this is a CPA
- `open_branch_offices` <number>: Will be populated if this is a CPA
- `num_admin_penalties` <number>: Will be populated if this is a CPA
- `is_main_branch` <boolean>: True if the facility is a GRO or the main branch of a CPA; otherwise false
- `corrective_action` <boolean>: Has corrective action been taken against the facility?
- `adverse_action` <boolean>: Has adverse action been taken against the facility?
- `temporarily_closed` <boolean>: Has the state temporarily closed the facility?
- `num_deficiencies_cited` <number>: Total number of non-compliance cases
- `activity_date` <date>: The date the deficiency was cited
- `standard_number_description` <string>: The code used to classify the deficiency
- `activity_type` <string>: How DFPS learned about this incident
- `standard_risk_level` <string>: How severe the offense was
- `corrected_at_inspection` <boolean>: Was the deficiency corrected at inspection?
- `corrected_date` <date>: When the problem was corrected
- `date_correction_verified` <date>: When the state verified the correction
- `technical_assistance_given` <boolean>: Did the state give assistance with correction?
- `narrative` <string>: A description of the non-compliance case
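To make that shape concrete, here is how a row could be modeled as a TypeScript interface (illustrative only; the scraper itself does not export this type):

```typescript
// One row of hhsc-deficency-data, per the data dictionary above.
// Illustrative only; this interface is not exported by the scraper.
interface DeficiencyRow {
    // Facility-specific columns
    uniq_id: number;
    operation_id: number;
    operation_number: string;
    operation_type: 'Child Placing Agency' | 'General Residential Operation';
    operation_name: string;
    programs_provided: string;
    location_address: string;
    phone: string;
    county: string;
    website: string;
    email: string;
    type_of_issuance: string;
    issuance_date: string;               // date
    open_foster_homes: number | null;    // populated for CPAs
    open_branch_offices: number | null;  // populated for CPAs
    num_admin_penalties: number | null;  // populated for CPAs
    is_main_branch: boolean;
    corrective_action: boolean;
    adverse_action: boolean;
    temporarily_closed: boolean;
    num_deficiencies_cited: number;
    // Case-specific columns
    activity_date: string;               // date
    standard_number_description: string;
    activity_type: string;
    standard_risk_level: string;
    corrected_at_inspection: boolean;
    corrected_date: string;              // date
    date_correction_verified: string;    // date
    technical_assistance_given: boolean;
    narrative: string;
}
```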
- Clone the repo: `git clone https://github.com/sa-express-news/hhsc-scraper.git`
- Install the dependencies: `npm install`
- Configure the data.world API connection (members of the Express-News data.world team only need to complete the third step, configuring `.env`):
    - Create a dataset in data.world
    - Open `src/pushToServer` and `src/pullFromServer` and configure the API URLs to match your paths
    - Grab your API key from data.world and save it to a `.env` file in the repo's root (see the example after this list)
- Run `npm run test` to check that everything is working. Note that a couple of end-to-end tests may fail if a facility has accrued more deficiencies since the test was written; you may need to update or ignore them accordingly. All tests are saved under `spec.ts` in their respective directories
- Run `tsc` to compile the TypeScript.
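A minimal `.env` might look like this (the variable name is an assumption; check how the key is read in `src/pullFromServer` and `src/pushToServer` before relying on it):

```
# .env in the repo root. The variable name is an assumption.
DW_API_TOKEN=your-data-dot-world-api-key
```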
The scraper can be run with the command `npm run scraper`. If no options are passed, this command will grab the `attempted-ids.json` file (details on this below) from the server, figure out the last attempted ID from that list, and then begin scraping sequentially, starting at the next possible ID above the last attempted one. It will attempt to scrape 10,000 sequential possible IDs before shutting down again.
All of the following options are optional (see the example invocations after this list):

- `throttle`: The number of IDs to attempt simultaneously. The default is 10. Keep in mind that the HHSC website is extremely buggy, and attempting too many IDs at once can lead to errors, which will be logged in `logs/errors.log`.
- `scope`: The number of IDs to attempt in this scrape, counting from the starting ID. For example, if `start` is set to 10 and `scope` is set to 1000, the attempted IDs will be 10-1009. This option is ignored if `finish` is passed. The default is 10,000.
- `start`: The ID number to start the scrape at.
- `finish`: The ID number to stop the scrape at.
- `specific`: A comma-separated list of IDs that, if specified, will be the only IDs the scraper attempts.
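For example, a few hypothetical invocations (the exact flag syntax is an assumption; check `package.json` to see how the npm script parses its arguments):

```sh
# Default run: resumes from attempted-ids.json and attempts 10,000 IDs
npm run scraper

# Hypothetical flagged runs; the --key=value syntax is an assumption
npm run scraper -- --start=52003 --scope=1000 --throttle=5
npm run scraper -- --specific=50003,50004,50005
```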
`run.sh` allows you to set the scraper to make many scrapes in a row, saving the results to the `temp` directory under sequential file names to back them up. If you open the file, the `batch` variable can be modified to specify the number of times you want to run the scraper.

To make a full scrape, make sure `attempted-ids.json` is not reporting anything above 50,000 under `last_attempted`, set the `batch` variable to 135 (at the default scope of 10,000 IDs per run, 135 runs covers the full 50,000 to 1,400,000 range) and then let 'er rip. This should take between 24 and 48 hours to complete.
`attempted-ids.json` is a log file that keeps track of what you've been doing with the scraper over time. If the scraper is run without a specific `start` ID, it is also used to determine which ID to pick up from (if this is the first time you have run the scraper, it will start at 50,000). The file is saved in the data.world dataset and locally to the `logs` directory. It has the following properties:

- `last_successful` <number>: The last ID under which a facility was found and successfully scraped
- `last_attempted` <number>: The last attempted ID. Used by the scraper on restart to pick up where it left off
- `total_from_last_scrape` <number>: How many non-compliance cases were found in the last scrape
- `total_in_database` <number>: How many rows are in the current iteration of `hhsc-deficency-data`
- `facility_scraped_deficencies_rejected` <Array<number>>: If an ID was found at a facility page but something went wrong when scraping the corresponding deficiency page, the facility ID is saved to this array and reattempted the next time the scraper starts up
- `facility_timeout_or_alert_page` <Array<number>>: If we found what looked like a facility page at this possible ID but couldn't scrape the facility details, the ID is saved to this array and attempted again at the next scrape
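A hypothetical `attempted-ids.json` (all values here are illustrative):

```json
{
    "last_successful": 51882,
    "last_attempted": 60000,
    "total_from_last_scrape": 312,
    "total_in_database": 14210,
    "facility_scraped_deficencies_rejected": [51212, 53004],
    "facility_timeout_or_alert_page": [50477]
}
```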
The scraper is written in TypeScript. All source code is in the `src` directory, with modules separated into sub-directories therein. Each module has a `spec.ts` file which includes its unit tests, written with Tape.
The facility pages are requested using Request Promise and scraped using Cheerio. The deficiency pages are complex to navigate and filled with JavaScript, so they are scraped with Google's headless Chromium browser coupled with the Puppeteer library. Deficiency page scrapes are only attempted when a facility is found and successfully scraped, minimizing usage of the less efficient headless browser approach.
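In outline, the two approaches look like this (a sketch, not the repo's actual code; the URL handling and selectors are placeholders):

```typescript
import * as rp from 'request-promise';
import * as cheerio from 'cheerio';
import * as puppeteer from 'puppeteer';

// Static facility pages: one HTTP request, parsed with Cheerio.
async function getFacilityName(url: string): Promise<string> {
    const html: string = await rp(url);
    const $ = cheerio.load(html);
    return $('#facilityName').text().trim(); // placeholder selector
}

// JavaScript-heavy deficiency pages: a full headless browser via Puppeteer.
async function getDeficiencyTableText(url: string): Promise<string> {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });
        return await page.$eval('table', el => el.textContent || ''); // placeholder selector
    } finally {
        await browser.close();
    }
}
```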
Winston is used for error handling and logging.
data.world, via their API, is used as a database, allowing us to push updates in JSON format but query them in SQL via the SQL tool on their platform.
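For example, once the csv is in the dataset, you can query it from data.world's SQL tool with something like the following (the table name is an assumption; data.world derives table names from uploaded file names):

```sql
-- Table name is an assumption based on the csv file name
SELECT operation_name, county, narrative
FROM hhsc_deficency_data
WHERE standard_risk_level = 'High';
```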
If you'd like to use this scraper independently of data.world, you can rewrite `src/pullFromServer` and `src/pushToServer` to suit your needs. Both of those modules accept arrays of objects as their input arguments, which you can then handle as needed.
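For instance, a drop-in local replacement for `src/pushToServer` could be as simple as this sketch (the real module's exact signature may differ):

```typescript
import { writeFileSync } from 'fs';

// Hypothetical replacement for src/pushToServer: write the rows to a
// local JSON file instead of pushing them to data.world.
export default function pushToServer(rows: object[]): void {
    writeFileSync('hhsc-deficency-data.json', JSON.stringify(rows, null, 2));
}
```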
Contact lwhyte AT express-news.net for more details.