Predicting Age Through Plasma Protein Composition

Scientific Background

Inflammaging is defined as an age-related increase in the levels of pro-inflammatory markers in the blood and tissues, and it is a strong risk factor for many diseases that are highly prevalent in the older population. This process is a "chronic, sterile, low-grade inflammation that contributes to the pathogenesis of age-related diseases". Although there is no specific way to reduce inflammaging, lifestyle changes (e.g. exercise, a healthy diet) can be important for keeping the inflammaging phenotype minimal. A healthy diet may be especially important because of the central role of the gut microbiota in both metaflammation (metabolic inflammation) and inflammaging. To study how inflammaging can be minimized or even prevented, it is crucial to parse out the information obtained from various tissues and plasma. Here I look at plasma proteins, specifically the inflammatory cytokines, to build a model that predicts the age of the subject.

References

Lind, L., Sundström, J., Larsson, A., Lampa, E., Ärnlöv, J., & Ingelsson, E. (2019). Longitudinal effects of aging on plasma proteins levels in older adults - associations with kidney function and hemoglobin levels. PLoS One, 14(2), e0212060.
Tanaka, T., Basisty, N., Fantoni, G., Candia, J., Moore, A. Z., Biancotto, A., Schilling, B., Bandinelli, S., & Ferrucci, L. (2020). Plasma proteomic biomarker signature of age predicts health and life span. eLife, 9, e61073.
Franceschi, C., Garagnani, P., Parini, P., et al. (2018). Inflammaging: a new immune–metabolic viewpoint for age-related diseases. Nature Reviews Endocrinology, 14, 576–590.

Project Summary

Goal:

Older patients are at higher risk for many chronic diseases, and in order to design treatments and therapies it is essential to identify the important genes and protein biomarkers that cause this phenotype. For example, it is well known that certain proteins, such as C-reactive protein and certain growth-differentiation factors, increase with aging (link). Here, I use past plasma protein composition data together with the age of the animals to model how proteins change with aging, and thereby predict the 'age' of the subject. The diagram below (created with BioRender) shows the overall goal of the project: feeding protein composition data into a machine learning model to predict the age associated with that protein pattern.

Metrics:

I will test the data with both classification and regression estimators (supervised learning) to see which type of model is more suitable for this experiment.

The models that I will be testing for classification are:

LogisticRegression
RandomForestClassifier
SVC
XGBClassifier

The models that I will be testing for regression are:

SVR
RandomForestRegressor
LinearRegression
KNeighborsRegressor
XGBRegressor
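
As a rough illustration of how the regression candidates could be compared, here is a minimal sketch using 5-fold cross-validation and mean absolute error. Everything in it is an assumption made for illustration: the data are synthetic (the real dataset is not public), the metric and fold count are my choice, and the XGBoost model is left out so the example only depends on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in for the plasma protein table (assumption):
# 60 subjects x 12 protein features, with age loosely tied to two features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(60, 12))
age = 20 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0.0, 2.0, size=60)

models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is re-fitted on each training fold.
    pipe = make_pipeline(MinMaxScaler(), model)
    neg_mae = cross_val_score(pipe, X, age, cv=cv, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

# Report models from lowest to highest cross-validated error.
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the validation fold into the training fold, which matters with as few observations as this project has.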

Data Used

A compilation of past experimental data on plasma protein composition in various models of neurodegeneration and aging.

Here I have parsed out only the relevant features per project.

Table 1. Features

Next Steps:

Use the LDA analysis method but on a dataset where the ages have not been reduced to two classes
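
A minimal sketch of what this could look like with scikit-learn's LinearDiscriminantAnalysis, assuming three hypothetical age groups and synthetic data. With two classes LDA can produce only one discriminant axis; with k classes it can produce up to k - 1, which is what makes a 2-D projection possible here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): three age groups, 20 samples each, 10 features.
# Each group gets its own randomly drawn center so the groups are separable.
centers = rng.normal(0.0, 3.0, size=(3, 10))
X = np.vstack([rng.normal(c, 1.0, size=(20, 10)) for c in centers])
y = np.repeat(["young", "middle", "old"], 20)  # hypothetical class labels

# n_components may be at most min(n_classes - 1, n_features) = 2 here.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)  # (60, 2)
```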

Deployed:

A working model is currently deployed with the Gradio framework on Hugging Face Spaces (password required*): Spaces Link

NOTE:

The data used will not be made available until the company grants permission.

Instructions to Use Predictor App

Gradio.app

For this project, I used Gradio, an open-source Python library, for easy deployment of my machine learning model.

Requirements: Please install gradio using pip:

pip install gradio

Here is a link for more information!
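
For orientation, here is a minimal sketch of how a fitted regressor can be wrapped in a Gradio interface. It is not the deployed app: the feature names, the stand-in model, and the synthetic training data are all hypothetical, since the real features and data are not published.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature names (assumption); the real feature list is not public.
FEATURES = ["protein_a", "protein_b", "protein_c"]

# Stand-in model fitted on synthetic data, only so the sketch is runnable.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(50, len(FEATURES)))
y_train = 10 + 20 * X_train[:, 0] + rng.normal(0.0, 1.0, size=50)
model = LinearRegression().fit(X_train, y_train)


def predict_age(*values):
    """Turn one value per feature into a single predicted age."""
    row = np.array(values, dtype=float).reshape(1, -1)
    return float(model.predict(row)[0])


def build_demo():
    """Build the Gradio interface: one numeric input per feature."""
    # Imported here so predict_age can be used without gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=predict_age,
        inputs=[gr.Number(label=name) for name in FEATURES],
        outputs=gr.Number(label="Predicted age"),
        title="Plasma protein age predictor (sketch)",
    )
```

Calling `build_demo().launch()` starts the local web app. Passing `auth=("user", "password")` to `launch` adds a login prompt, which is one way to get the password protection mentioned above.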

Additional Information

The Initial Analysis folder (at this link) contains a Jupyter notebook with the first rendition of the unsupervised methodology used for clustering analysis to validate model effects. It helped build the foundation for a protocol to analyze high-dimensional plasma protein data, and was used for the age_predictor analysis as well. Some things to note:
- Instead of standardization, I opted to normalize the data with MinMaxScaler(), since a "negative" protein concentration does not exist in plasma. Normalization keeps the values reflective of the natural setting while keeping the features on a similar scale.
- Currently, I am using a lower-dimensional dataset of 10-20 dimensions with ~50-70 observations, which is why I chose KMeans (with 'k-means++' centroid initialization) for the preliminary analysis. For the actual human plasma protein data of >10,000 dimensions, however, KMeans will not be the optimal model, since Euclidean distance often becomes uninformative at this scale (the "curse of dimensionality") unless dimensionality reduction is used to compensate. In the future, I will look at other clustering algorithms, such as the hierarchical, density-based HDBSCAN.
- So far, I have used principal component analysis for dimensionality reduction (checking the information loss via the cumulative explained variance) and visualized the result by setting n_components = 2. It might also be a good idea to set n_components = 3 and visualize in 3-D using mpl_toolkits.mplot3d.axes3d. As a next step, I will see whether other dimensionality reduction techniques such as UMAP and t-SNE work better, especially for higher-dimensional data.
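
The three notes above can be sketched as one short workflow: MinMaxScaler, then PCA with a cumulative-variance check, then KMeans with 'k-means++' initialization. This is a minimal sketch on synthetic data, not the notebook's code. The group structure, the 90% variance threshold, and clustering on the scaled features rather than on the PCA output are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): two groups of 30 samples x 15 proteins,
# clipped at zero because concentrations cannot be negative.
group_a = rng.normal(50.0, 5.0, size=(30, 15))
group_b = rng.normal(80.0, 5.0, size=(30, 15))
X = np.clip(np.vstack([group_a, group_b]), 0.0, None)

# Normalization: each feature is mapped to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Cumulative explained variance shows how much information a projection keeps.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)  # components for >= 90%

# Two components for a 2-D scatter plot.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# KMeans with k-means++ initialization on the scaled features.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(f"components needed for 90% variance: {n_for_90}")
print(f"variance kept by 2 components: {cumulative[1]:.2%}")
```
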
The relevant_data_example_modeling folder (at this link) contains a .ipynb file of a quick project I did on predicting gliomas using data from the UCI Machine Learning Repository. Instead of protein expression, I wanted to see whether the same concept would apply to gene expression, since expressed genes do not necessarily correlate with increased protein levels due to post-translational modifications.

Example image of brain gliomas: link

Project Information: Predicting Gliomas

Here I look at three different classifiers:

LogisticRegression
KNeighborsClassifier
RandomForestClassifier

Conclusion:

Of these three models, the best was LogisticRegression, with a recall score of 93.67% after hyperparameter tuning.
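
For readers unfamiliar with tuning for recall, here is a minimal sketch of the general approach using GridSearchCV with scoring="recall". It runs on synthetic data from make_classification, not the UCI glioma data, and the parameter grid is an assumption, so it will not reproduce the score reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption), standing in for the glioma set.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# scoring="recall" selects the setting with the best cross-validated recall.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Evaluate the selected model once on the held-out test split.
test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, f"held-out recall = {test_recall:.3f}")
```

Recall is a reasonable target when missing a positive case (here, a glioma class) is costlier than a false alarm, but it is worth reporting precision alongside it, since recall alone can be inflated by predicting the positive class too often.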

Next Steps:

For the next round of investigation, I will also look at the following models:

ElasticNet
NaiveBayes
SVC
XGBoost

Next, in the Utils directory, I created a processing tools script that contains useful functions for automating visualizations.

johnnys7n/Predictive-Protein-Analysis
