Welcome to this Deep Learning project: chest X-ray images are classified as either normal or showing pneumonia using Convolutional Neural Networks. That is, given a compressed .jpeg image converted from a chest X-ray DICOM image, the algorithm estimates whether the image shows pneumonia or not.
The international Digital Imaging and Communications in Medicine standard (DICOM for short) defines the processes and interfaces to transmit, store, retrieve, print, process, and display medical imaging information between the relevant modality components of a hospital information system.
The Kaggle dataset used here already provides labelled images split into training, validation and testing samples. As mentioned, these images have been converted to the .jpeg image format; in other words, no private, individually identifiable data is included.
Viewing the images shows that both posterior-anterior and anterior-posterior X-ray orientations are present and that mostly images of children were selected. No lateral X-ray images and no images covering all human age groups were found. This could only be analysed properly with the original .dcm DICOM files, where the associated DICOM tags are available. Doing so would raise regulatory data protection issues (e.g. under the Health Insurance Portability and Accountability Act, HIPAA), so it has not been done; it would constitute a HIPAA compliance breach.
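For illustration only: if the original .dcm files were available, the relevant header fields could be checked with a few lines of Python. The use of pydicom and the file path below are assumptions, not part of this project.

```python
import pydicom

# Hypothetical check of the DICOM tags discussed above; the Kaggle
# dataset only ships .jpeg files, so this cannot be run on it.
ds = pydicom.dcmread("path/to/original_image.dcm")

# (0010,1010) Patient's Age, e.g. '004Y' for a four-year-old child
print("Patient age:", ds.get("PatientAge", "tag not present"))

# (0018,5101) View Position, e.g. 'PA' (posterior-anterior) or 'AP'
print("View position:", ds.get("ViewPosition", "tag not present"))
```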
As an introduction to the project's way of working and its implementation, read this report documentation.
More generally, this project serves as the example accompanying the Medium blog post 'AI in Healthcare Not Only Changing Doctors Diagnostic Workflow'.
- Download GraphViz and, on Windows, install it to 'C:/Program Files (x86)/Graphviz2.38/'. Afterwards, add the 'C:/Program Files (x86)/Graphviz2.38/bin/' directory to the PATH environment variable. This path is also used in the 'step 0' chapter of the Python project file, so don't change it.
Pydot and GraphViz are used together to plot the neural network architectures, as sketched below. GraphViz is now open-source software, licensed under the Common Public License.
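The architecture .png files mentioned further below are presumably created with Keras' plot_model utility, which relies on exactly this Pydot/GraphViz pair. A minimal sketch under that assumption; the example model and output file name are illustrative:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import plot_model

# Toy model just to demonstrate the plotting call; the real
# architectures are defined in the project notebook.
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(2, activation='softmax'),
])

# Fails with an error if the GraphViz 'bin' directory is not on the PATH.
plot_model(model, to_file='architecture.png', show_shapes=True)
```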
- Download the chest image dataset. Unzip the folder and place the delivered 'chest_xray' directory in your repository at 'path/to/chest-classifier-project/data'.
Have a look at the new directories and delete all '.DS_Store' files; they are not needed for this algorithm and would cause errors when running this code.
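A minimal cleanup sketch in Python, assuming the dataset sits under 'data/chest_xray' as described above:

```python
import os

# Recursively remove macOS '.DS_Store' artifacts so the data loaders
# do not stumble over them as invalid image files.
for root, _, files in os.walk("data/chest_xray"):
    for name in files:
        if name == ".DS_Store":
            path = os.path.join(root, name)
            os.remove(path)
            print("removed", path)
```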
Using the original chest X-ray image split into the directories train, test and val (and their associated subdirectories) led the neural network to unreliable results: the original distribution does not fit the 80/20 or 70/30 rule of thumb for training and testing data. It was therefore changed to other ratios, e.g. an 80/20 set (a 60/20/20 distribution for training/validation/testing); further information can be found in the chest-class_app.ipynb file. Experimenting with different distributions - especially the amount of validation samples - showed big changes in the prediction metrics. With some model architectures the prediction performance degrades badly, e.g. to a ROC AUC even worse than random prediction, and bias and overfitting appeared in some cases. For the final project, several Python notebook files are therefore stored, each using a different data distribution; the associated ratio is part of the file name.
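The exact re-splitting code lives in the notebooks; the following is only a sketch of how such a 60/20/20 split could be produced, with directory names assumed to follow the Kaggle layout ('NORMAL' and 'PNEUMONIA' class folders):

```python
import os
import random
import shutil

def resplit(src_dir, dst_root, ratios=(0.6, 0.2, 0.2)):
    """Copy class folders from src_dir into train/val/test under dst_root."""
    random.seed(42)  # reproducible split
    for label in ("NORMAL", "PNEUMONIA"):
        files = sorted(os.listdir(os.path.join(src_dir, label)))
        random.shuffle(files)
        n_train = int(ratios[0] * len(files))
        n_val = int(ratios[1] * len(files))
        splits = {
            "train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],
        }
        for split, names in splits.items():
            out = os.path.join(dst_root, split, label)
            os.makedirs(out, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src_dir, label, name), out)

# Example: pool all original images into one directory per class first,
# then re-split them: resplit("data/pooled", "data/chest_xray_602020")
```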
Some of the best-weights training results are stored in the saved_models directory, where the CNN architecture .png files are stored as well.
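Best-weights files of this kind are typically produced with a Keras ModelCheckpoint callback that keeps only the best epoch; a sketch under that assumption, with an illustrative file name:

```python
from keras.callbacks import ModelCheckpoint

# Save the weights only when the validation loss improves; the
# resulting .hdf5 file can later be restored with model.load_weights().
checkpointer = ModelCheckpoint(
    filepath='saved_models/weights.best.hdf5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1,
)
# model.fit(..., validation_data=..., callbacks=[checkpointer])
```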
Regarding the bottleneck features for the transfer learning models, only the .npz files of the ResNet50 models for the different data distributions could be stored in this repository, as a zip file. Download and unpack them into a subdirectory called bottleneck_features. The ones created for the InceptionV3 models are too big for the GitHub repository, and the same issue applies to the best model weights files of the fine-tuned ResNet models for each data distribution.
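For orientation, ResNet50 bottleneck features of this kind can be produced with Keras roughly as follows; the directory and file names are assumptions, not the exact notebook code:

```python
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# Headless ResNet50: the convolutional base acts as a fixed feature extractor.
base = ResNet50(weights='imagenet', include_top=False)

datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
generator = datagen.flow_from_directory(
    'data/chest_xray/train', target_size=(224, 224),
    batch_size=16, class_mode=None, shuffle=False)

# One forward pass over the whole split yields the bottleneck features.
features = base.predict_generator(generator, steps=len(generator), verbose=1)

# Compressed .npz keeps the stored artifact small.
np.savez_compressed('bottleneck_features/resnet50_train.npz', features=features)
```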
As a future to-do: better hyperparameter values (batch and epoch sizes together with the initialisation) should be found, which requires a better environment for machine learning algorithms. Hyperparameter tuning with Scikit-Learn's GridSearchCV, or RandomizedSearchCV as an alternative, has not been done because it is computationally expensive given the large number of parameters of the neural network architectures; it was not feasible with the existing environment (own hardware or the AWS EC2 service). A sketch of how such a search could look follows below.
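Given sufficient compute, the search could be wired up roughly like this; build_model, the bottleneck input shape and the data X_train/y_train are placeholders:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model():
    # Small classifier head on top of precomputed bottleneck features;
    # (7, 7, 2048) is the ResNet50 output shape for 224x224 inputs.
    model = Sequential([
        Flatten(input_shape=(7, 7, 2048)),
        Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

clf = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {'batch_size': [16, 32, 64], 'epochs': [5, 10, 20]}
search = GridSearchCV(clf, param_grid, cv=3)
# search.fit(X_train, y_train)  # 3-fold CV over 9 combinations = 27 trainings,
#                               # which is exactly the cost that ruled this out
```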
- If you are running the project on your local machine (and not using AWS), create and activate a new environment. First, move to the directory 'path/to/chest-classifier-project'.
- Windows
conda create --name chest-class-project python=3.6
activate chest-class-project
pip install -r requirements/requirements.txt
- If you are running the project on your local machine (and not using AWS), create an IPython kernel for the 'chest-class-project' environment.
python -m ipykernel install --user --name chest-class-project --display-name "chest-class-project"
- Open the notebook.
jupyter notebook chest-class_app.ipynb
- If you are running the project on your local machine (and not using AWS), before running code, change the kernel to match the chest-class-project environment by using the drop-down menu (Kernel > Change kernel > chest-class-project).
This project's code is released under the MIT license.