This is my submission for the PREPARE challenge, a competition created by the NIH's National Institute on Aging (NIA) to encourage the analysis of voice data in the study of Alzheimer's disease and related dementias.
The challenge ended on Dec 28th, 2024.
The data consists of voice recordings from English, Spanish, and Chinese speakers, together with basic demographic information about each participant.
All the recordings are labeled as belonging to one of three classes: control cases (`control`), cases with mild cognitive impairment (`mci`), and cases with dementia due to Alzheimer's disease (`adrd`).
Our goal is to build a classifier that assigns these labels to previously unseen recordings.
Unfortunately, the study participants are not all recorded while completing the same task:
- Most English speakers are heard describing the Cookie Theft picture from the Boston Diagnostic Aphasia Examination, but in some cases we have recordings of spontaneous speech (an unstructured phone call with an interviewer) or even short commands given to an electronic home assistant ("When is Thanksgiving? What's the weather today?").
- Most Spanish speakers are recorded reading the beginning of Don Quixote.
- Most Chinese speakers are recorded during the Animal Naming Test.
As a consequence, the nature of the recorded speech is not uniform. This means that it is reasonably safe to use some acoustic voice features (but not pauses, for example), while we cannot use linguistic features.
Warning: all Chinese speakers are labeled as `mci`, which is probably a consequence of dataset construction. I suspect those participants were sampled from a study where mild cognitive impairment was part of the inclusion criteria.
In this repository I'm only showing the code for the best-performing model, an LSTM followed by a fully connected head:
```mermaid
flowchart LR
    A["Voice Recording"] --> B["Extract Mel Spectrogram"]
    B --> C["LSTM"]
    C --> D["Fully Connected Head"]
    F["Demographics"] --> D
    D --> E["Prediction"]
```
I've also tried some time series classifiers from sktime and PyTorch Tabular, a GRU, a 1D CNN, and a 2D CNN on the Mel spectrogram. They all performed reasonably well, but worse than the LSTM approach.
The features fed into the LSTM module are the frames of the Mel spectrogram, extracted using `librosa`.
The Mel spectrogram was precomputed for all recordings to save computation time during training; the extra I/O work is still a lot faster than recomputing the spectrogram on the fly. The raw spectrogram values were log-transformed with `librosa.power_to_db` to improve contrast and limit the dynamic range.
Here is an example spectrogram (this is not data from the challenge; it's a recording of my own voice):
To minimize the effect of sudden noises and accidental variations in volume, I normalized each recording with 2.5%-97.5% quantile scaling.
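As a sketch, this kind of robust scaling can be done with NumPy. The function below is illustrative: the text above does not specify whether the scaling is applied to the waveform or to the log-Mel values, nor whether values outside the quantile range are clipped.

```python
import numpy as np

def quantile_scale(x, lo=2.5, hi=97.5):
    """Rescale an array so that its 2.5% and 97.5% quantiles map to 0 and 1.
    Clipping the remaining outliers (one possible choice) tames sudden loud
    noises and accidental volume changes."""
    q_lo, q_hi = np.percentile(x, [lo, hi])
    scaled = (x - q_lo) / (q_hi - q_lo + 1e-8)
    return np.clip(scaled, 0.0, 1.0)
```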
I used 96 Mel frequencies, a window width of 25ms with 10ms steps, and dropped all frequencies below 10Hz or above 16kHz. The upper frequency bound is imposed for technical reasons: the voice recordings are provided as compressed mp3 files, with all frequencies above 16kHz zeroed out. I chose the lower frequency bound to remove subsonic noise; the lowest-frequency sounds typically found in human voices are around 50-100Hz, so I don't expect to lose any significant part of the signal.
While some literature recommends using about 20 Mel frequencies, other studies using deep learning methods show good results with up to 200. I settled on 96 frequencies after a log-spaced hyperparameter search.
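Putting these parameters together, the precomputation step might look roughly like the sketch below; the function name and file handling are illustrative, and only the `librosa` calls and parameter values reflect what is described above.

```python
import librosa
import numpy as np

def compute_log_mel(path, n_mels=96, fmin=10, fmax=16000):
    """Load a recording and return its log-scaled Mel spectrogram
    (96 Mel bands, 25ms windows, 10ms steps)."""
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=int(0.025 * sr),       # 25ms analysis window
        hop_length=int(0.010 * sr),  # 10ms step
        n_mels=n_mels,
        fmin=fmin,
        fmax=fmax,
    )
    # log transform to improve contrast and limit the dynamic range
    return librosa.power_to_db(mel, ref=np.max)

# Precompute once, then load the saved arrays during training, e.g.:
# np.save("features/some_recording.npy", compute_log_mel("audio/some_recording.mp3"))
```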
Some of the recordings had noticeable background noise, so I tried preprocessing them with DeepFilterNet but did not observe a significant improvement in the overall performance, so I ended up skipping this preprocessing step.
While PyTorch supports batching variable-length sequences (via the `pack_sequence` utility), I did not use it: during training I extracted from each recording a contiguous random 2-second section (200 time steps, given the 10ms steps) and packed these fixed-length crops into a minibatch. This works as a data augmentation strategy, and it does not lose any significant amount of information, because any part of the signal slower than 0.5Hz (the inverse of 2s) was filtered out anyway.
The classification scores are passable. As expected, the control class (the most common) has the best performance.
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| control | 0.649 | 0.924 | 0.763 | 911 |
| mci | 0.670 | 0.355 | 0.464 | 217 |
| adrd | 0.657 | 0.296 | 0.408 | 517 |
The confusion matrix shows that the model struggles to identify ADRD and especially MCI cases. The optimization target for the challenge was log-loss, so all errors are weighted equally. In a real-world scenario, where it is more important to correctly identify the rarer classes (especially MCI, so that management can start early), I would have weighted the loss function to reduce the influence of the control cases.
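In PyTorch this would amount to passing per-class weights to the loss function. The inverse-frequency weights below, derived from the support column of the table just for illustration, are one reasonable choice, not what was actually used for the submission.

```python
import torch
import torch.nn as nn

# class order: control, mci, adrd; counts taken from the support column above
support = torch.tensor([911.0, 217.0, 517.0])
weights = support.sum() / (len(support) * support)  # inverse-frequency weighting

criterion = nn.CrossEntropyLoss(weight=weights)
# loss = criterion(logits, targets)  # logits: (batch, 3), targets: (batch,) int64
```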
To reproduce the results, run:
- `prepare_audio_features.py` to compute the Mel spectrograms from your audio files
- `train.py` to train the model
- `evaluate.py` to make predictions
The challenge was sponsored by the National Institute on Aging (NIA), an institute of the National Institutes of Health (NIH), with support from NASA.
The MHAS (Mexican Health and Aging Study) is partly sponsored by the National Institutes of Health/National Institute on Aging (grant number NIH R01AG018016) in the United States and the Instituto Nacional de Estadística y Geografía (INEGI) in Mexico. Data files and documentation are public use and available at www.MHASweb.org.
The references for DementiaBank are:
- Becker, J. T., Boller, F., Lopez, O. L., Saxton, J., & McGonigle, K. L. (1994). The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6), 585-594. Note: Please also acknowledge this grant support for the Pitt corpus -- NIA AG03705 and AG05133.
- Lanzi, A. M., Saylor, A. K., Fromm, D., Liu, H., MacWhinney, B., & Cohen, M. (2023). DementiaBank: Theoretical rationale, protocol, and illustrative analyses. American Journal of Speech-Language Pathology. https://doi.org/10.1044/2022_AJSLP-22-00281
Reference for the competition and the DrivenData platform: https://arxiv.org/abs/1606.07781