This repository is provided as supplementary material for the paper "XAI Tools in the Public Sector: A Case Study on Predicting Combined Sewer Overflows" by Nicholas Maltbie, Nan Niu, Reese Johnson, and Matthew VanDoren.
These notes for the CSO case study describe how the data is prepared, how the ML models are tuned and trained, and how the final interpretability analysis is performed.
This repository contains instructions for using the code to create models for the dataset, apply those models to a sample dataset, and gather explainability results for our research.
The data in this repository is randomized, as the data used for the research is proprietary to our stakeholder.
This project is designed to run on a Linux system with an available NVIDIA GPU, a minimum of 4 GB of RAM, and disk space for libraries, datasets, and results (fairly small; about 8 GB should cover all libraries and installations).
Here is a description of each item in this project:
- README.md - this description file
- REQUIREMENTS.md - system, hardware, and software requirements to operate this project
- STATUS.md - the badges being applied for as part of the project
- LICENSE.txt - License associated with this project
- INSTALL.md - Installation instructions
- FSE21_XAI_Tools.pdf - Copy of the accepted paper in PDF format.
- run_lstm_hparam.py - a Python script to generate an LSTM model for a given set of hyperparameters
- hparams_search.sh - a script to automate searching through hyperparameters
The following Jupyter notebook files document how to run the project and include example visualizations and information:
- Data_Preparation.ipynb - prepares the data from the original sensors into a synchronized and interpolated form
- Interpretability.ipynb - applies interpretability tools to the various models
- Paper Charts.ipynb - notebook with code to generate various charts used in the paper
- Datasets - a representation of the data used in the project. We are not able to release the proprietary data we used from our stakeholder as part of the case study, but this randomized data helps show how this code operates and how to use it in future projects.
- Datasets-Synchronized - the synchronized and interpolated dataset generated from the sensor output
- Dataset-Analysis - this folder holds the results of tuning models (or a sample of model tuning) from the Datasets-Synchronized data. It is generated by the run_lstm_hparam.py script.
This project requires Python 3.8 (installation guide) and Anaconda (installation guide).
Steps for installing and setting up the project can be found in the INSTALL file.
The project contains copies of all the files generated using the randomized data as part of the project. The files are all derived from the CSV files in the Datasets folder.
The project uses its assets in the following order:
- Data_Preparation.ipynb to prepare and clean the data
- hparams_search.sh to find a set of tuned hyperparameters
- run_lstm_hparam.py to create the final LSTM-based models
- Interpretability.ipynb to complete an analysis using XAI tools
- Paper Charts.ipynb to create the charts based on the results and analysis
A more detailed description of how to use these tools is written next.
To run this project, first follow the setup instructions to set up an environment. Then the Data_Preparation.ipynb notebook can be used to read the raw sensor data from the Datasets folder and write synchronized and interpolated datasets into the Datasets-Synchronized folder.
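The synchronization and interpolation performed by this notebook can be sketched with pandas. Note that the column name, timestamps, and 15-minute grid below are illustrative stand-ins, not the study's actual sensor schema:

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings at irregular timestamps; the real
# column names and sampling rates come from the files in Datasets.
raw = pd.DataFrame(
    {"rain_gauge": [0.0, 0.2, np.nan, 0.5]},
    index=pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:07",
        "2020-01-01 00:16", "2020-01-01 00:30",
    ]),
)

# Synchronize onto a common 15-minute grid, then fill the remaining
# gaps by time-weighted linear interpolation.
synced = raw.resample("15min").mean().interpolate(method="time")
print(synced)
```

Every sensor series resampled onto the same grid this way can then be joined into a single synchronized table.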
Next, hparams_search.sh can be used to search over various hyperparameters. It uses run_lstm_hparam.py to generate an LSTM-based model for each configuration. The final set of hyperparameters we ended up using (2 layers with 24 nodes) can be reproduced with this command:
python run_lstm_hparam.py \
--end_offset 1 --start_offset 0 --seq_len 12 \
--num_units 24 --dropout 0 --num_layers 2 \
--class_weight 2 --learning_rate 0.001 \
--batch_size=1024
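The sweep that hparams_search.sh automates amounts to enumerating a grid of settings and launching one training run per combination. Here is a minimal sketch of that idea; the grid values are illustrative and not the exact ranges searched in the study:

```python
import itertools

def build_commands(grid):
    """Build one run_lstm_hparam.py invocation per grid combination."""
    cmds = []
    for values in itertools.product(*grid.values()):
        args = " ".join(f"--{key} {value}" for key, value in zip(grid, values))
        cmds.append(
            "python run_lstm_hparam.py --end_offset 1 --start_offset 0 "
            "--seq_len 12 --dropout 0 --learning_rate 0.001 "
            f"--batch_size 1024 {args}"
        )
    return cmds

# Illustrative grid; the ranges actually searched live in hparams_search.sh.
grid = {"num_layers": [1, 2], "num_units": [12, 24], "class_weight": [1, 2]}
for cmd in build_commands(grid):
    print(cmd)
```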
To visualize the results of the training, you can use TensorBoard:
tensorboard --logdir Dataset-Analysis/lstm_hparams/logs
Once the model has been generated, results for each run can be found either in TensorBoard under the hparams menu or by looking up the generated results in Dataset-Analysis/lstm_hparams/logs/complete/{model_name}. This includes the results for both the validation subset and the training subset of the data.
Now that we have the functional results of the model, we can move on to the interpretability analysis. To do this, use the Interpretability.ipynb notebook. (This notebook must be run using jupyter notebook and NOT jupyter lab due to limitations of the tools.)
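As a generic illustration of the kind of question such tools answer (which input features drive the model's predictions), here is a model-agnostic permutation-importance sketch. This is not one of the specific XAI tools applied in the notebook, and the toy model below stands in for the trained LSTM's prediction function:

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop when each feature column is shuffled independently.

    A larger drop means the model relies more heavily on that feature.
    """
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        shuffled = X.copy()
        rng.shuffle(shuffled[:, j])  # break the feature/label link
        drops.append(base - np.mean(predict(shuffled) == y))
    return np.array(drops)

# Toy stand-in model that only looks at feature 0, so shuffling
# feature 0 should hurt accuracy far more than shuffling feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
predict = lambda data: (data[:, 0] > 0).astype(int)
importance = permutation_importance(predict, X, y)
print(importance)
```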
The final notebook, Paper Charts.ipynb, has code to generate the various charts used in the paper.