This project aims to predict Autism Spectrum Disorder (ASD) using machine learning models for both adults and toddlers. By analyzing responses to screening questions, the project leverages Random Forest classifiers to assess ASD risk. Visualizations and performance metrics help in interpreting the models' effectiveness.
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition. Early diagnosis is crucial for providing timely intervention and support. This project builds machine learning models, specifically Random Forest classifiers, to predict ASD in both adults and toddlers based on responses to a set of screening questions.
The project also provides a set of visualizations to explore the characteristics of the dataset, as well as the performance of the models.
The dataset includes two parts:
- Adults: ASD screening results for adults.
- Toddlers: ASD screening results for toddlers.
Each dataset includes features like:
- Age
- Gender
- Ethnicity
- Family history of ASD
- Screening results
- Previous use of the screening app
- History of jaundice (Jundice)
Several visualizations were created to explore the dataset and help understand the distribution and relationships between key features.
Purpose: To visualize the distribution of ages in the adult and toddler datasets.
- Method:
  - Create histograms for adults and toddlers to show the frequency of different age groups.
  - Annotations: Add labels, titles, and grid lines for better interpretation.
  - Saving the Plot: Save the histogram as age_distribution.png.
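The histogram step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the column name `age`, the function name `plot_age_distribution`, and the placeholder values are all assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only; not taken from the project's data.
adults = pd.DataFrame({"age": [18, 22, 25, 31, 40, 52]})
toddlers = pd.DataFrame({"age": [1, 2, 2, 3, 3, 3]})


def plot_age_distribution(adults, toddlers, out_path="age_distribution.png"):
    """Draw one age histogram per dataset, side by side, and save the figure."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, (label, df) in zip(axes, [("Adults", adults), ("Toddlers", toddlers)]):
        ax.hist(df["age"].dropna(), bins=10, edgecolor="black")
        ax.set_title(f"Age distribution: {label}")
        ax.set_xlabel("Age")
        ax.set_ylabel("Frequency")
        ax.grid(True, alpha=0.3)
    fig.tight_layout()
    fig.savefig(out_path)
    return fig


fig = plot_age_distribution(adults, toddlers)
```

Separate subplots are used here because adult and toddler ages sit on very different scales; a shared axis would compress one of the two histograms.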
Purpose: To compare the gender distribution between adults and toddlers.
- Method:
  - Use a horizontal bar plot to represent the count of each gender.
  - Annotations: Add labels, titles, and grid lines for better interpretation.
  - Saving the Plot: Save the bar chart as gender_distribution.png.
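One possible way to build this comparison is to place the per-dataset counts in a single table and let pandas draw the horizontal bars. The column name `gender` and the sample values below are assumptions for the sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Placeholder values for illustration only; not taken from the project's data.
adults = pd.DataFrame({"gender": ["m", "f", "f", "m", "m"]})
toddlers = pd.DataFrame({"gender": ["f", "m", "m"]})

# One row per gender, one column per dataset; missing categories become 0.
counts = pd.DataFrame({
    "Adults": adults["gender"].value_counts(),
    "Toddlers": toddlers["gender"].value_counts(),
}).fillna(0)

ax = counts.plot.barh(figsize=(8, 4))
ax.set_xlabel("Count")
ax.set_ylabel("Gender")
ax.set_title("Gender distribution: adults vs. toddlers")
ax.grid(True, axis="x", alpha=0.3)
ax.figure.tight_layout()
ax.figure.savefig("gender_distribution.png")
```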
Purpose: To visualize the relationship between age and the screening result for adults and toddlers.
- Method:
  - Create a scatter plot with age on the x-axis and result on the y-axis.
  - Annotations: Add labels, titles, and grid lines for better interpretation.
  - Saving the Plot: Save the scatter plot as age_vs_result.png.
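A scatter plot of this kind might look like the sketch below. The column names `age` and `result` and the sample rows are assumptions; plotting both groups on one axis is also a choice made for the example, not something the description above specifies.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only; not taken from the project's data.
adults = pd.DataFrame({"age": [19, 24, 33, 41], "result": [3, 8, 5, 9]})
toddlers = pd.DataFrame({"age": [1, 2, 3, 3], "result": [2, 7, 4, 6]})

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(adults["age"], adults["result"], label="Adults", alpha=0.7)
ax.scatter(toddlers["age"], toddlers["result"], label="Toddlers", alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("Screening result")
ax.set_title("Age vs. screening result")
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()
fig.savefig("age_vs_result.png")
```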
Purpose: To analyze the distribution of different ethnicities in the adult and toddler datasets.
- Method:
  - Count the occurrences of each ethnicity using value_counts().
  - Create a vertical bar plot with separate bars for adults (blue) and toddlers (orange).
  - Rotate x-axis labels to avoid overlapping.
  - Annotations: Add labels, titles, and grid lines for better interpretation.
  - Saving the Plot: Save the bar plot as ethnicity_distribution.png.
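The counting and grouped-bar steps could be sketched as follows. The column name `ethnicity` and the category values are assumptions; the key detail is reindexing both count series onto the same set of categories so the blue and orange bars line up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder values for illustration only; not taken from the project's data.
adults = pd.DataFrame({"ethnicity": ["Asian", "White-European", "Asian", "Latino"]})
toddlers = pd.DataFrame({"ethnicity": ["White-European", "Middle Eastern"]})

adult_counts = adults["ethnicity"].value_counts()
toddler_counts = toddlers["ethnicity"].value_counts()

# Align both series on the union of categories; absent categories count as 0.
categories = adult_counts.index.union(toddler_counts.index)
adult_counts = adult_counts.reindex(categories, fill_value=0)
toddler_counts = toddler_counts.reindex(categories, fill_value=0)

x = np.arange(len(categories))
width = 0.4
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(x - width / 2, adult_counts.values, width, color="tab:blue", label="Adults")
ax.bar(x + width / 2, toddler_counts.values, width, color="tab:orange", label="Toddlers")
ax.set_xticks(x)
ax.set_xticklabels(list(categories), rotation=45, ha="right")  # avoid overlapping labels
ax.set_xlabel("Ethnicity")
ax.set_ylabel("Count")
ax.set_title("Ethnicity distribution: adults vs. toddlers")
ax.grid(True, axis="y", alpha=0.3)
ax.legend()
fig.tight_layout()
fig.savefig("ethnicity_distribution.png")
```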
Purpose: To analyze the distribution of jaundice history in the adult and toddler datasets.
- Method:
  - Count the occurrences of jaundice history (Yes/No).
  - Create separate pie charts for adults and toddlers, each representing the proportion of individuals with or without a history of jaundice.
  - Annotations: Add a title to each pie chart and a label to each segment.
  - Saving the Plot: Save the pie charts as jundice_distribution.png.
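A minimal sketch of the two pie charts is shown below. The column name `jundice` is inferred from the output file name and is an assumption, as are the yes/no values used here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only; not taken from the project's data.
adults = pd.DataFrame({"jundice": ["no", "no", "yes", "no"]})
toddlers = pd.DataFrame({"jundice": ["yes", "no", "no"]})

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (label, df) in zip(axes, [("Adults", adults), ("Toddlers", toddlers)]):
    counts = df["jundice"].value_counts()
    # autopct prints each segment's share as a percentage.
    ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%", startangle=90)
    ax.set_title(f"Jaundice history: {label}")
fig.tight_layout()
fig.savefig("jundice_distribution.png")
```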
Purpose: To analyze the distribution of previous app usage in the adult and toddler datasets.
- Method:
  - Count the occurrences of previous app usage (Yes/No).
  - Create a horizontal bar plot with separate bars for adults (blue) and toddlers (orange).
  - Annotations: Add labels, titles, and grid lines for better interpretation.
  - Saving the Plot: Save the bar plot as used_app_before_distribution.png.
A Random Forest classifier is used to predict ASD from the input data. A separate classifier was trained for adults and for toddlers, and each was evaluated using standard metrics: accuracy, precision, recall, and F1 score.
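The train-and-evaluate workflow described above can be sketched with scikit-learn as follows. Everything about the data here is an assumption: the ten binary features, the synthetic label rule, the 80/20 split, and the hyperparameters are placeholders, so the printed numbers say nothing about the project's reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: ten hypothetical yes/no screening answers per person.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 10))
y = (X.sum(axis=1) > 5).astype(int)  # made-up label rule, for illustration only

# Hold out 20% of the rows for evaluation, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a setup like this, the same code would be run once per dataset (adults, toddlers) so that each population gets its own model and its own set of metrics.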
Adult model:
- Accuracy: 86%
- Precision: 89%
- Recall: 86%
These results suggest the model is reasonably effective for the adult population. A precision of 89% means most positive predictions are correct (few false positives), and a recall of 86% means most true ASD cases are identified (few false negatives).
Toddler model:
- Accuracy, Precision, Recall, F1 Score: 97%
The toddler model achieved near-perfect results (97% on every metric), which may be due to the size and simplicity of the dataset. Further testing is required to confirm the model's robustness on larger and more complex datasets.
A confusion matrix was generated for both models, providing deeper insight into their performance by visualizing:
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
This analysis helps identify any potential biases or errors in the model's predictions.
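To make the four counts concrete, the sketch below builds a confusion matrix with scikit-learn for a small hand-made set of labels and derives precision and recall from it. The labels are invented for the example and do not come from either model.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only (1 = ASD, 0 = no ASD).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```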
Feature importance analysis was conducted to identify which features were most influential in predicting ASD. This analysis can help in understanding key indicators of ASD and may guide future research.
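One common way to run such an analysis with a Random Forest is to read the fitted model's `feature_importances_` attribute (impurity-based importances). Whether the project used this attribute or another method is not stated, so treat the sketch below as an assumption; the feature names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, for illustration only.
rng = np.random.default_rng(0)
feature_names = ["question_1", "question_2", "question_3", "question_4"]
X = pd.DataFrame(rng.integers(0, 2, size=(200, 4)), columns=feature_names)
y = X["question_1"]  # made-up label that depends only on question_1

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort so the most influential comes first.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be skewed toward features with many distinct values, so a permutation-based check may be worth considering before drawing conclusions about ASD indicators.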
This project demonstrates the potential of using machine learning, particularly Random Forest classifiers, to predict Autism Spectrum Disorder. The models show promising performance, especially for the toddler dataset. However, more data and testing are required to generalize these findings to broader populations.
Future work includes:
- Refining the models with larger datasets.
- Exploring more complex machine learning algorithms.
- Investigating the implications of feature importance in understanding ASD risk factors.