This is my submission for the PREPARE challenge, a competition created by the NIH's National Institute on Aging (NIA) to encourage the analysis of voice data in the study of Alzheimer's disease and related dementias.
The challenge ended on Dec 28th, 2024.
The data consists of voice recordings from English, Spanish, and Chinese speakers, together with basic demographic information about each participant.
All the recordings are labeled as belonging to one of three classes: control cases (`control`), cases with mild cognitive impairment (`mci`), and cases with dementia due to Alzheimer's disease (`adrd`).
Our goal is to build a classifier that assigns these labels to previously unseen recordings.
Unfortunately, the study participants are not all recorded while completing the same task:
- Most English speakers are heard describing the Cookie Theft picture from the Boston Diagnostic Aphasia Examination, but in some cases we have recordings of spontaneous speech (an unstructured phone call with an interviewer) or even short commands given to an electronic home assistant ("When is Thanksgiving? What's the weather today?").
- Most Spanish speakers are recorded reading the beginning of Don Quixote.
- Most Chinese speakers are recorded during the Animal Naming Test.
As a consequence, the nature of the recorded speech is not uniform. This means that it is reasonably safe to use some acoustic voice features (but not pauses, for example), while we cannot use linguistic features.
Warning: all Chinese speakers are labeled as `mci`, which is probably a consequence of dataset construction. I suspect those participants were sampled from a study where mild cognitive impairment was part of the inclusion criteria.
In this repository I'm only showing the code for the best-performing model, an LSTM followed by a fully connected head:
```mermaid
flowchart LR
    A["Voice Recording"] --> B["Extract Mel Spectrogram"]
    B --> C["LSTM"]
    C --> D["Fully Connected Head"]
    F["Demographics"] --> D
    D --> E["Prediction"]
```
I've also tried some time series classifiers from sktime and PyTorch Tabular, a GRU, a 1D CNN, and a 2D CNN on the Mel spectrogram. They all performed reasonably well, but worse than the LSTM approach.
The features fed into the LSTM module are the frames of the Mel spectrogram, extracted using `librosa`.
The Mel spectrogram was precomputed for all recordings to save computation time during training; the extra I/O work is still a lot faster than recomputing the spectrogram on the fly. The raw spectrogram values were log-transformed with `librosa.power_to_db` to improve contrast and limit the dynamic range.
Here is an example spectrogram (this is not data from the challenge; it's a recording of my own voice):
To minimize the effect of sudden noises and accidental variations in volume, I normalized each recording with 2.5%-97.5% quantile scaling.
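As a sketch, this kind of robust scaling can be done with NumPy. The function below is illustrative: the text above does not specify whether the scaling is applied to the waveform or to the log-Mel values, nor whether values outside the quantile range are clipped.

```python
import numpy as np

def quantile_scale(x, lo=2.5, hi=97.5):
    """Rescale an array so that its 2.5% and 97.5% quantiles map to 0 and 1.
    Clipping the remaining outliers (one possible choice) tames sudden loud
    noises and accidental volume changes."""
    q_lo, q_hi = np.percentile(x, [lo, hi])
    scaled = (x - q_lo) / (q_hi - q_lo + 1e-8)
    return np.clip(scaled, 0.0, 1.0)
```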
I used 96 Mel frequencies, a window width of 25ms with 10ms steps, and dropped all frequencies below 10Hz or above 16kHz. The upper frequency bound is imposed for technical reasons: the voice recordings are provided as compressed mp3 files, with all frequencies above 16kHz zeroed out. I chose the lower frequency bound to remove subsonic noise; the lowest-frequency sounds typically found in human voices are around 50-100Hz, so I don't expect to lose any significant part of the signal.
While some literature recommends using about 20 Mel frequencies, other studies using deep learning methods show good results with up to 200. I settled on 96 frequencies after a log-spaced hyperparameter search.
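Putting these parameters together, the precomputation step might look roughly like the sketch below; the function name and file handling are illustrative, and only the `librosa` calls and parameter values reflect what is described above.

```python
import librosa
import numpy as np

def compute_log_mel(path, n_mels=96, fmin=10, fmax=16000):
    """Load a recording and return its log-scaled Mel spectrogram
    (96 Mel bands, 25ms windows, 10ms steps)."""
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=int(0.025 * sr),       # 25ms analysis window
        hop_length=int(0.010 * sr),  # 10ms step
        n_mels=n_mels,
        fmin=fmin,
        fmax=fmax,
    )
    # log transform to improve contrast and limit the dynamic range
    return librosa.power_to_db(mel, ref=np.max)

# Precompute once, then load the saved arrays during training, e.g.:
# np.save("features/some_recording.npy", compute_log_mel("audio/some_recording.mp3"))
```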
Some of the recordings had noticeable background noise, so I tried preprocessing them with DeepFilterNet but did not observe a significant improvement in the overall performance, so I ended up skipping this preprocessing step.
While PyTorch supports batching variable-length sequences (via the `pack_sequence` utility), I did not use it: during training I extracted from each recording a contiguous random 2-second section (200 time steps, given the 10ms steps) and packed these fixed-length crops into a minibatch. This works as a data augmentation strategy, and it does not lose any significant amount of information, because any part of the signal slower than 0.5Hz (the inverse of 2s) was filtered out anyway.
The classification scores are passable. As expected, the control class (the most common) has the best performance.
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| control | 0.649 | 0.924 | 0.763 | 911 |
| mci | 0.670 | 0.355 | 0.464 | 217 |
| adrd | 0.657 | 0.296 | 0.408 | 517 |
The confusion matrix shows that the model struggles to identify ADRD and especially MCI cases. The optimization target for the challenge was log-loss, so all errors are weighted equally. In a real-world scenario, where it is more important to correctly identify the rarer classes (especially MCI, so that management can start early), I would have weighted the loss function to reduce the influence of the control cases.
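In PyTorch this would amount to passing per-class weights to the loss function. The inverse-frequency weights below, derived from the support column of the table just for illustration, are one reasonable choice, not what was actually used for the submission.

```python
import torch
import torch.nn as nn

# class order: control, mci, adrd; counts taken from the support column above
support = torch.tensor([911.0, 217.0, 517.0])
weights = support.sum() / (len(support) * support)  # inverse-frequency weighting

criterion = nn.CrossEntropyLoss(weight=weights)
# loss = criterion(logits, targets)  # logits: (batch, 3), targets: (batch,) int64
```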
To reproduce the results, run:
- `prepare_audio_features.py` to compute the Mel spectrograms from your audio files
- `train.py` to train the model
- `evaluate.py` to make predictions
The challenge was sponsored by the National Institute on Aging (NIA), an institute of the National Institutes of Health (NIH), with support from NASA.
The MHAS (Mexican Health and Aging Study) is partly sponsored by the National Institutes of Health/National Institute on Aging (grant number NIH R01AG018016) in the United States and the Instituto Nacional de Estadística y Geografía (INEGI) in Mexico. Data files and documentation are public use and available at www.MHASweb.org.
The references for DementiaBank are:
- Becker, J. T., Boller, F., Lopez, O. L., Saxton, J., & McGonigle, K. L. (1994). The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6), 585-594. Note: Please also acknowledge this grant support for the Pitt corpus -- NIA AG03705 and AG05133.
- Lanzi, A. M., Saylor, A. K., Fromm, D., Liu, H., MacWhinney, B., & Cohen, M. (2023). DementiaBank: Theoretical rationale, protocol, and illustrative analyses. American Journal of Speech-Language Pathology. https://doi.org/10.1044/2022_AJSLP-22-00281
Reference for the competition and the DrivenData platform: https://arxiv.org/abs/1606.07781