School of Computer Science and Engineering
Nanyang Technological University
Lab: SC14
Group: 6
Members:
- Bernice Koh Jun Yan (@bernicekjy)
- Nepal Aaradh (@ardnep)
- Veeraraghavan Srivathsan Nithyasri (@Veeraraghavan-S-Nithyasri)
This repository contains all the Jupyter Notebooks, datasets, images, video presentations, and the source materials/references we have used and created as part of the Mini Project for SC1015: Introduction to Data Science and AI.
This README briefly highlights what we have accomplished in this project. For a more detailed explanation, please refer to the Jupyter Notebooks in this repository; they contain more in-depth descriptions and smaller details that are not mentioned here in the README. For convenience, we have divided the notebooks into 5 parts which broadly relate to the 5 main sections of this project.
- Problem Formulation
- Data Preparation and Cleaning
- Exploratory Data Analysis
- Dimensionality Reduction
- Clustering
- Data Driven Insights and Conclusion
- References
1. Problem Formulation
Our Dataset: Stack Overflow Developer Survey 2020 on Kaggle
Our Question: Does Being Unconventional Determine Success?
Success: Determined using Salary and Job Satisfaction
Unconventional Individuals: Outliers/anomalies found after clustering individuals based on the technologies they use, such as web frameworks, programming languages, operating systems, etc.
Rationale: We believe that this dataset, as well as the question we pose, is highly relevant to the SCSE community at NTU. As SCSE students, we might become developers ourselves once we graduate. By learning what kinds of developers tend to be more successful, we might better understand what it takes to be successful in the software development world.
2. Data Preparation and Cleaning
In this section of the project, we prepared and cleaned the dataset so that it would be easier to analyze and ready for the machine learning techniques used in the later sections.
We performed the following:
- Preliminary Feature Selection: 8 relevant variables out of 61 were selected.
- Dropping `NaN`s: All the `NaN` values in these 8 variables were dropped.
- Splitting Dataset in Two: The 8 variables were then split into 2 DataFrames, one with the 6 variables relating to conventionality and the other with the 2 variables relating to success.
- Encoding Categorical Variables: The categorical variables in both DataFrames were encoded appropriately (these steps are sketched below).
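For illustration, here is a minimal pandas sketch of these steps. The file name, the specific survey columns, and the encoding choices below are assumptions made for this example; the actual 8 variables and encodings are documented in the Data Preparation notebook.

```python
import pandas as pd

# Load the Stack Overflow Developer Survey 2020 responses (file name is an assumption).
df = pd.read_csv("survey_results_public.csv")

# Preliminary feature selection: hypothetical stand-ins for the 8 variables we kept.
conventionality_cols = ["LanguageWorkedWith", "DatabaseWorkedWith", "PlatformWorkedWith",
                        "WebframeWorkedWith", "MiscTechWorkedWith", "OpSys"]
success_cols = ["ConvertedComp", "JobSat"]
df = df[conventionality_cols + success_cols]

# Drop every row with a NaN in any of the selected variables.
df = df.dropna().reset_index(drop=True)

# Split into two DataFrames: conventionality vs. success.
conv_df = df[conventionality_cols]
succ_df = df[success_cols]

# Encode the conventionality variables: multi-select columns are ";"-separated strings,
# so one-hot encode each selectable option.
conv_encoded = pd.concat(
    [conv_df[col].str.get_dummies(sep=";").add_prefix(col + "_") for col in conventionality_cols],
    axis=1,
)

# JobSat is an ordered categorical; map it to the 0-4 scale used later
# (category names assumed from the 2020 survey).
jobsat_order = ["Very dissatisfied", "Slightly dissatisfied",
                "Neither satisfied nor dissatisfied", "Slightly satisfied", "Very satisfied"]
succ_df = succ_df.assign(JobSat=succ_df["JobSat"].map({v: i for i, v in enumerate(jobsat_order)}))
```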
3. Exploratory Data Analysis
Next, we explored each of our two DataFrames using Exploratory Data Analysis to answer questions such as: Are there any noticeable patterns? What do our success variables look like? What about the conventionality variables? Are there any underlying relationships between them? Can we make any inferences about our question at this stage?
To achieve this we did the following:
- Explored `ConvertedComp`: This variable is the annual compensation in USD (a.k.a. salary). A median of around $54k was seen, and a lot of outliers with high salaries were present.
- Explored `JobSat`: This variable is the job satisfaction (0-4 scale). The most frequent ratings were 2 and 4, and the mean rating was 2.3.
- Explored Relationships Between `JobSat` and `ConvertedComp`: Only a weak correlation was seen between `JobSat` and `ConvertedComp`.
- Explored Variables Related to Conventionality: Studied which options in the 6 variables were more frequently selected by respondents.
For further findings and explanations, please refer to the Jupyter Notebook on EDA.
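For illustration, here is a minimal sketch of how this exploration might look, building on the `succ_df` and `conv_encoded` DataFrames sketched earlier. The plotting choices are ours for the example, not necessarily those used in the notebook.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Salary (ConvertedComp): right-skewed distribution with many high outliers.
sns.boxplot(x=succ_df["ConvertedComp"])
plt.show()
print("Median salary (USD):", succ_df["ConvertedComp"].median())

# Job satisfaction on the encoded 0-4 scale.
sns.countplot(x=succ_df["JobSat"])
plt.show()
print("Mean JobSat rating:", succ_df["JobSat"].mean())

# Relationship between the two success variables (only a weak correlation was observed).
print(succ_df[["ConvertedComp", "JobSat"]].corr())

# Which technology options were selected most often (conventionality variables).
print(conv_encoded.sum().sort_values(ascending=False).head(20))
```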
4. Dimensionality Reduction
Our DataFrame with 6 variables, after encoding, was converted to a DataFrame with 94 columns, which is very high-dimensional data.
This posed a few problems (the curse of dimensionality):
- It would probably not result in well-formed clusters.
- High-dimensional data is difficult to work with because the space and time required to run algorithms on it grow quickly.
- High-dimensional data is difficult to visualize.
So, Multiple Correspondence Analysis (MCA) was used to reduce these dimensions. We chose MCA because the usual choice for dimensionality reduction, Principal Component Analysis (PCA), does not work well with categorical data, which is what we have; MCA, on the other hand, is designed to handle multiple columns of categorical data.
Using MCA, the dimensions were reduced from 94 columns to just 42!
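For illustration, here is a minimal sketch of this step using the prince library (listed in the references), assuming it is applied to the one-hot encoded conventionality DataFrame from earlier; the exact call, and how the number of components was chosen from the explained variance, are in the corresponding notebook.

```python
import prince

# Multiple Correspondence Analysis on the encoded conventionality DataFrame (94 columns).
# n_components=42 mirrors the 42 dimensions we kept.
mca = prince.MCA(n_components=42, random_state=42)
mca = mca.fit(conv_encoded)

# Row coordinates in the reduced 42-dimensional space; these become the clustering input.
reduced = mca.row_coordinates(conv_encoded)
print(reduced.shape)  # expected: (number of respondents, 42)
```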
5. Clustering
With these 42 columns, we then performed clustering. We chose Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).
The reasons for this are:
- This is a density based clustering algorithm which means it does not need the specification of the number of clusters. Essentially, this algorithm will not force any point into a cluster. Instead, points which do not really belong to any cluster are labeled "noise". This is clearly useful for us since we are doing anomaly detection (outlier/noise detection).
- Because it is density-based, the shape of our data does not matter, which is useful since we are working with high-dimensional data and cannot visualize or easily reason about what kind of shapes it might have.
- With a non-hierarchical DBSCAN, there are certain hyperparameters that are difficult to tune. HDBSCAN removes the need to tune some of these parameters.
- Because HDBSCAN is a hierarchical clustering algorithm, we can use dendrograms to somewhat visualize the clusters and make inferences even though our data is high-dimensional.
More details on HDBSCAN and its parameters are presented in the Jupyter Notebook on Clustering.
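For illustration, here is a minimal sketch of the clustering step using the hdbscan library (listed in the references). The parameter values are placeholders, not the tuned values from the notebook, and `reduced` refers to the MCA output sketched above.

```python
import numpy as np
import hdbscan

# Fit HDBSCAN on the 42 MCA components; min_cluster_size / min_samples are placeholder values.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, min_samples=10)
labels = clusterer.fit_predict(reduced)

# Points HDBSCAN cannot confidently assign to any cluster are labeled -1 ("noise");
# in our framing these are the unconventional individuals.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_outliers} outliers out of {len(labels)} points")
```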
In this section, we performed the following:
- Clustering with Random Parameters
- Hyperparameter Tuning with GridSearchCV using DBCV Score
- Readjusting Parameters (GridSearchCV does not work well in this case; a rough sketch of the manual search is shown after this list)
- Clustering with New Parameters
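For illustration, here is a rough sketch of the kind of DBCV-scored search we fell back to after GridSearchCV proved unreliable here. The parameter grid and the use of `hdbscan.validity.validity_index` as the DBCV scorer are assumptions made for this example; the actual procedure and final values are in the Jupyter Notebook on Clustering.

```python
import numpy as np
import hdbscan
from hdbscan.validity import validity_index

# DBCV expects a float64 feature matrix; `reduced` is the MCA output from earlier.
X = reduced.to_numpy(dtype=np.float64)

best_score, best_params = -np.inf, None
for min_cluster_size in (50, 100, 200, 400):        # assumed grid
    for min_samples in (5, 10, 25):                 # assumed grid
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                 min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:  # skip degenerate results (all noise or a single cluster)
            continue
        score = validity_index(X, labels)           # DBCV score: higher is better
        if score > best_score:
            best_score, best_params = score, (min_cluster_size, min_samples)

print("Best DBCV score:", best_score, "for (min_cluster_size, min_samples) =", best_params)
```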
Our final clustering resulted in a total of 3 clusters and 6206 outliers (out of 19362 total points).
6. Data Driven Insights and Conclusion
Here, we re-combined our variables related to success and the clustered variables related to conventionality to see if there are any differences between outliers and non-outliers. We performed a comparative Exploratory Data Analysis on the outliers vs. non-outliers to see if we can infer anything from the similarities and differences.
In this section, we also looked at the characteristics of the individuals in our 3 clusters using the variables related to conventionality. The findings have been presented in the Jupyter Notebook on Data Driven Insights.
Most notably, however, we found that there was no difference in the distribution of Salary or Job Satisfaction between outliers and non-outliers (i.e., unconventional and conventional individuals). So, we concluded that unconventionality might NOT be an indicator of success.
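For illustration, here is a minimal sketch of this outlier vs. non-outlier comparison, reusing the `succ_df` DataFrame and the HDBSCAN `labels` from the sketches above (rows are assumed to stay aligned throughout).

```python
import matplotlib.pyplot as plt

# Re-combine the cluster labels with the success variables.
combined = succ_df.copy()
combined["is_outlier"] = (labels == -1)

# Compare salary and job satisfaction summaries between outliers and non-outliers.
print(combined.groupby("is_outlier")[["ConvertedComp", "JobSat"]].describe())

# Visual check: near-identical distributions suggest unconventionality does not track success.
combined.boxplot(column="ConvertedComp", by="is_outlier")
plt.show()
```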
7. References
- https://bookdown.org/brian_nguyen0305/Multivariate_Statistical_Analysis_with_R/what-is-mca.html
- https://pca4ds.github.io/mechanics.html
- https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/multiple-correspondence-analysis/interpret-the-results/all-statistics-and-graphs/
- https://github.com/MaxHalford/prince
- https://www.researchgate.net/post/What-should-the-minumum-explained-variance-be-to-be-acceptable-in-factor-analysis
- https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
- https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
- https://github.com/christopherjenness/DBCV
- https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
- https://www.youtube.com/watch?v=dGsxd67IFiU
- https://towardsdatascience.com/tuning-with-hdbscan-149865ac2970