Techniques such as microarrays can produce large amounts of gene expression data from a limited number of samples. We chose to study brain cancer because of its low incidence (6.3 per 100,000 men and women per year), and we use feature selection to find the optimal features for multiclass classification.
Dataset: "Brain_GSE50161.csv"
Feature Selection:
- our pipeline with variance: "feature_selection_with_variance.ipynb"
  - input: "Brain_GSE50161.csv"
  - output: "df_w_var.csv"
- our pipeline without variance: "feature_selection_with_variance.ipynb"
  - input: "Brain_GSE50161.csv"
  - output: "df_wo_var.csv"
- LASSO: "feature_selection_with_lasso.ipynb"
  - input: "Brain_GSE50161.csv"
  - output: "df_lasso.csv"
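
For orientation, here is a minimal sketch of what a variance filter and an L1-based selection step can look like in scikit-learn. It is not the code in the notebooks. It assumes the "variance" step is a simple variance-threshold filter, it uses an L1-penalized linear SVM as a stand-in for LASSO because the target is a class label, and it runs on synthetic data. The label column name `type`, the thresholds, and all function names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

LABEL_COL = "type"  # assumed name of the class-label column


def make_toy_expression(n_samples=60, n_genes=200, n_classes=3, seed=0):
    """Synthetic stand-in for an expression table (not the real dataset)."""
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % n_classes
    X = rng.normal(size=(n_samples, n_genes))
    X[:, :5] += 3.0 * y[:, None]  # first 5 genes carry class signal
    X[:, -20:] = 1.0              # last 20 genes are constant
    df = pd.DataFrame(X, columns=[f"gene_{i}" for i in range(n_genes)])
    df.insert(0, LABEL_COL, y)
    return df


def variance_filter(df, threshold=0.01):
    """Drop genes whose variance does not exceed `threshold`."""
    X = df.drop(columns=[LABEL_COL])
    keep = VarianceThreshold(threshold=threshold).fit(X).get_support()
    return df[[LABEL_COL, *X.columns[keep]]]


def l1_select(df, C=0.1):
    """Keep genes with a non-zero weight in an L1-penalized linear SVM."""
    X = df.drop(columns=[LABEL_COL])
    X_scaled = StandardScaler().fit_transform(X)
    svc = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000, random_state=0)
    keep = SelectFromModel(svc).fit(X_scaled, df[LABEL_COL]).get_support()
    return df[[LABEL_COL, *X.columns[keep]]]


if __name__ == "__main__":
    df = make_toy_expression()
    filtered = variance_filter(df)
    selected = l1_select(filtered)
    print(df.shape, filtered.shape, selected.shape)
```

A smaller `C` gives a stronger penalty and fewer selected genes. On a real table, the reduced frame could be written out with `selected.to_csv(...)`.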
Classification:
- Run multiclass classification on a dataset generated by one of the three feature selection notebooks: "Classification.ipynb"
  - input: "df_w_var.csv" or "df_wo_var.csv" or "df_lasso.csv"
  - output: accuracy, F1 score, confusion matrices
- Perform PCA and then run multiclass classification: "PCA.ipynb"
  - input: "Brain_GSE50161.csv"
  - output: accuracy, F1 score, confusion matrices
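
The evaluation described above can be sketched as follows. This is an illustration under assumptions, not the contents of "Classification.ipynb" or "PCA.ipynb": the classifier (a linear SVM), the 75/25 stratified split, the macro-averaged F1, the number of principal components, and the synthetic data are all choices made for this example, and the function names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_toy_data(n_samples=120, n_genes=300, n_classes=3, seed=0):
    """Synthetic features and labels (not the real dataset)."""
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % n_classes
    X = rng.normal(size=(n_samples, n_genes))
    X[:, :30] += 3.0 * y[:, None]  # first 30 genes carry class signal
    return X, y


def evaluate(X, y, n_components=None, seed=0):
    """Fit on a stratified train split and score on the held-out split.

    If `n_components` is given, PCA is applied after scaling, and it is
    fitted on the training split only because it sits inside the pipeline.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components, random_state=seed))
    steps.append(SVC(kernel="linear"))  # assumed classifier
    model = make_pipeline(*steps).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


if __name__ == "__main__":
    X, y = make_toy_data()
    print(evaluate(X, y))                   # classification on given features
    print(evaluate(X, y, n_components=10))  # PCA first, then classification
```

Keeping the scaler and PCA inside the pipeline means neither sees the test samples during fitting, which matters when the number of samples is small relative to the number of genes.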