This is an example of an analysis "script" (agnostic to a particular tool) that you can use to replicate the analysis of this data. As you're walking through it, think about how you would document your analysis to share with colleagues. Are there any important steps that are missed in this script? Are there additional questions you would ask as you "interview" the data?
Each row represents a lead level (in parts per billion) test result for one test, at one location, in an Illinois water system. The data comes from the EPA and is part of the agencies monitoring of lead and copper in drinking water.
Fields:
PWSID
: Water system IDPWSNAME
: Water system nameCOLLECTION_DATE
: Date the test was takenRESULT
: amount of lead (in parts per billion)
Load the water test data from this CSV file and the water system data from this CSV file.
- Are all the dates (
COLLECTION_DATE
) valid? - Are all the results (
RESULT
) values valid numbers? - Are the water system names (
PWSNAME
) and IDS (PWSID
) consistent for all records?
How many records are there in the data set?
How many water systems are there in the data?
What's the earliest test date?
What's the last test date?
What's the lowest lead level? In which system?
What's the highest lead level? In which system?
Group each water systems tests by water system and calendar year. Then, calculate the 90th percentile value for each water system, for each year.
Which water systems had a 90th percentile value above 15 parts per billion? In which years? How many systems exceeded the standard in one year? In two years? In three years ...?
Which water systems had at least one result above the EPA standard of 15 parts per billion? In which years? What percentage of water systems had a test over the threshold?
Which water systems had at least one result above the EPA standard of 40 parts per billion? In which years? What percentage of water systems had a test over the threshold?
For this analysis, consider the Chicago area as the following counties:
- Cook
- DuPage
- Kane
- Lake
- McHenry
- Will
Plot the coordinates of the water system's contact address for systems that have exceeded the EPA standard of a 90th percentile value over 15 parts per billion? Does any part of the state seem to have a disproportionate number of water systems exceeding the standard?