This project utilizes historical data from The Covid Tracking Project by The Atlantic to explore potential correlations between mask-wearing mandates and hospitalization rates due to COVID-19 across several U.S. states. By leveraging SQL for detailed data manipulation and analysis, the project aims to understand if states with mask mandates experienced statistically different hospitalization trends compared to those without. Findings from this analysis could provide valuable insights into the effectiveness of public health policies during the pandemic.
The data for this project was sourced primarily from The COVID Tracking Project at The Atlantic providing detailed records of COVID-19 cases and hospitalizations between Mar-2020 and Mar-2021. Additional data regarding mask-wearing policies were incorporated from a New York Times dataset, which documented state-level mask mandates over time.
The core of the data analysis was conducted using SQL, chosen for its robustness in handling and querying large datasets. Key SQL techniques used include:
- Common Table Expressions (CTEs): Utilized to segment the data weekly and compute hospitalization rates per 1,000,000 people, enhancing the readability and modularity of the SQL queries.
- Aggregate Functions: To derive minimum, maximum, and average values which were crucial in understanding the range and average of hospitalizations over time.
Example of a primary SQL query used:
WITH Hosp1 AS (
SELECT strftime('%W', dt) AS week, strftime('%Y', dt) AS year, min(dt) AS dt,
CAST(hospitalizedcurrently AS float)/ population * 1000000 AS first_hosp
FROM daily
WHERE state = 'CA'
GROUP BY year, week
),
Hosp2 AS (
SELECT strftime('%W', dt) AS week, strftime('%Y', dt) AS year, min(dt) AS dt,
CAST(hospitalizedcurrently AS float)/ population * 1000000 AS second_hosp
FROM daily
WHERE state = 'FL'
GROUP BY year, week
)
SELECT Hosp1.dt AS "Date", Hosp1.first_hosp AS "California Hospitalized", Hosp2.second_hosp AS "Florida Hospitalized"
FROM Hosp1
JOIN Hosp2 ON Hosp1.dt = Hosp2.dt
ORDER BY Hosp1.dt;
Initial explorations and data validations were performed using Spatialite_gui and QGIS. These tools helped in visually confirming geographic and temporal data alignments before deeper SQL-based analysis.
An attempt was made to transition approximately 30% of the project to PostgreSQL to leverage advanced functionalities like PostGIS. Although not fully implemented, this effort highlighted the potential for scaling up the analysis with more sophisticated geographic data handling capabilities but these efforts are temporarily on the back burner.
Interactive Documentation: The project is documented in two HTML files, day01.html and day02.html, each structured like a Jupyter notebook. These documents feature about 30 cells per day, including:
- Markdown Cells: These cells provide detailed explanations, outline the objectives for each step, and contextualize the code and results. They also include reflections and insights, enhancing the readability and educational value of the documentation.
- Code Cells: These include the SQL queries and any other scripts used to perform the analysis. Viewing the code alongside the explanations allows for a deeper understanding of the methods and techniques applied.
- Visualization Cells: Contain tables and charts generated from the analysis results, providing visual evidence of the findings and making complex data more accessible and understandable.
Comprehensive Coverage: The documentation covers the entire analytical process from data cleaning to advanced analysis, offering a thorough walkthrough of each day's work. This approach not only showcases technical skills but also demonstrates a commitment to transparency and thorough documentation in data science projects.
The analysis primarily focused on comparing hospitalization rates due to COVID-19 between states with and without mask mandates over the period from March 2020 to March 2021. Key findings include:
- Significant statistical differences in hospitalization rates were observed between states with various mask mandate policies and those without, as indicated by t-test results. This suggests that mask mandates may have had a discernible impact on public health outcomes, although other contributing factors should also be considered.
- States like California and Florida showed similar trends despite differing mask mandate policies, indicating a complex interaction of multiple variables affecting hospitalization rates.
Data visualization played a crucial role in interpreting the complex relationships within the data. Using Excel, line charts were created to depict hospitalization rates per million people, segmented by week:
- Line Charts: Each state’s weekly hospitalization rates were plotted to visually compare trends over time. These charts were instrumental in highlighting similarities and differences in trends across states, facilitating a straightforward comparison of epidemiological impacts. I used Florida as a baseline of comparison because they banned mask mandates.
Florida vs Utah: The p-value is 0.001643134. This indicates a statistically significant difference between Florida and Utah. The difference is likely due to factors related to mask guidance in these states. Florida banned mandates while Utah had no guidelines. They should be the same but because they are different, this indicates further investigation.
Florida vs California: The p-value is 0.188031763. This p-value is relatively high, suggesting that there is no statistically significant difference between Florida and California. This is strange in light of the fact that California was on mandates until March of 2021 which was after the data in this project. So it suggests that mandates may have had an effect, but again, further investigation is necessary.
Florida vs Nevada: The p-value is 0.002218502. Similar to Florida vs Utah, this indicates a statistically significant difference between Florida and Nevada regarding mask guidance. Nevada, recommended masks, but did not mandate them.
Florida vs Arizona: The p-value is 0.005352116. This suggests a statistically significant difference between Florida and Arizona in terms of mask mandates/recommendations. Arizona also banned mask mandates, but is different in the results. This indicates further investigation is necessary.
Florida vs Oregon: The p-value is 8.02427E-06 (which is 0.00000802427). This very low p-value indicates a statistically significant difference between Florida and Oregon regarding hospitalizations (likely related to the impact of mask mandates/recommendations).
When I looked at the charts, they did not look very different between the states. To confirm my hypothesis I ran a t-test between the data in each chart. Florida was used as a baseline because they had no mask mandates and were also opposed by leadership. The p-values indicate that mask mandates may not be a good indicator of hospitalizations. However, more research needs to be explored before a more comprehensive recommendation can be provided. Perhaps in a future Python project, whereas this project focused more on SQL.
State 1 | State 2 | p-value | Mandated |
---|---|---|---|
Florida | Utah | 0.001643134 | No Guidance |
Florida | California | 0.188031763 | Recommended but not mandated |
Florida | Nevada | 0.002218502 | Some Counties Recommended |
Florida | Arizona | 0.005352116 | Mandates Banned |
Florida | Oregon | 8.02427E-06 | Mandated |
- Excel Power Query: Enabled dynamic comparisons by allowing easy adjustments of the states included in the visual analysis, enhancing the flexibility of the data exploration process.
The visualizations provided not only a clear depiction of the data but also a powerful tool for storytelling, making the data accessible and understandable to a broader audience.
During the project, several challenges were faced, including:
- Data Integration: Combining data from multiple sources, such as The Covid Tracking Project and the New York Times mask mandate dataset, required careful alignment of different data formats and timelines.
- SQL Complexity: The advanced SQL queries used for the analysis, particularly the CTEs and aggregation functions, initially posed difficulties in ensuring accuracy and performance.
- Exploratory Tools Limitations: While tools like Spatialite_gui and QGIS were invaluable for preliminary visual checks, they presented a learning curve and had limitations in handling large datasets efficiently.
From these challenges, several key learnings emerged:
- Enhanced SQL Proficiency: Overcoming the complexities of SQL queries improved my ability to handle large datasets and perform intricate data manipulations more effectively.
- Data Cleaning Skills: The necessity of aligning and cleaning data from different sources honed my data preprocessing skills, a critical aspect of any data analysis workflow.
- Adaptability with Tools: Learning to utilize and transition between different tools, from SQL databases to Excel and GIS software, enhanced my flexibility and problem-solving skills in data analysis.
These experiences have not only enriched my technical capabilities but also my approach to tackling data-related problems in future projects.
This project has provided valuable insights into the correlation between mask mandates and hospitalization rates during the COVID-19 pandemic, highlighting the complex interplay of public health policies and health outcomes. The findings suggest that while mask mandates might not significantly impact hospitalization rates, other factors certainly do, warranting further investigation.
- Geographical Expansion: Consider expanding the analysis to include more states or even international data to see if the trends hold true across different populations and health infrastructures.
- Temporal Extension: Extend the time frame beyond March 2021 to include data on vaccination impacts and newer variants of the virus.
- Methodological Enhancements: Explore the use of more sophisticated statistical techniques or machine learning models to predict trends or outcomes based on a wider set of variables.
This project serves as a foundation for further research and analysis in the field of public health data analytics, and I am eager to explore these future directions to deepen our understanding of pandemic response strategies.
County COVID: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
County data: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-total.html
Boundaries (state, county, urban): https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
If you have any feedback, please reach out to me on LinkedIn.
A1: The main goal was to analyze the correlation between mask mandates and COVID-19 hospitalization rates across different states to understand if there was a statistical impact of public health policies on hospital rates.
A2: SQL was chosen for its robustness in handling and querying large datasets efficiently, which was crucial for managing the extensive data provided by The Covid Tracking Project and integrating it with additional data sources.
A3: Data discrepancies were addressed through careful data cleaning and preprocessing steps, which included aligning date formats, standardizing state identifiers, and resolving missing or inconsistent data entries.
A4: While the findings provide insights into the states analyzed, caution should be exercised when generalizing to other regions without considering local differences in policy implementation, population density, and healthcare infrastructure.
A5: Future improvements could include incorporating more granular data, such as county-level analysis, extending the time frame to include more recent data, and employing advanced analytical techniques like machine learning for predictive modeling.
A6: Contributors are welcome to extend the dataset, enhance the analytical methods, or apply the framework to other public health issues. Suggestions and collaborations can be discussed via [GitHub issues or pull requests on the project repository].
Original project made for Salisbury University and Udemy by