- pandas == 0.22.0
- chi2 (2 stars)
- ANOVA (2 stars)
- T-test (2 stars)
- IV (3 stars)
- KS (3 stars)
- Collinearity
- TBD
- Multicollinearity
- Variance Inflation Factor (3 stars)
- PSI (3 stars)
- DataFrame comparison (covered by unit tests)
- Numeric
- Numeric-Categorical
- String-Categorical
- Time
- Tukey's method
- z-test
- Residual threshold method
- Local outlier factor
- HiCS
- Continuous
- mean
- truncated mean
- median
- bin-nan
- Categorical
- most frequent class
- stringify
- onehot_split
- Purity (needs unit tests)
- Accuracy (needs unit tests)
- Equal population binning (3 stars)
- Equal value binning (3 stars)
- Monotonic binning
- ChiMerge
- Dummy (2 stars)
- WOE (2 stars)
- Tree leaves encoding
- Spectral clustering
- Subspace clustering
- Multi-sourced clustering
- Multi-aspect clustering
- Multi-task clustering
- AE + K-means
- AE + Spectral clustering
- AE + Subspace clustering
- BiGAN
- infoGAN
- AAE
Exploring/Summarizing the data distribution
Different processing methods for different types of data
- continuous: Data that can take on any value in an interval
- discrete: Data that can only take on integer values
- categorical: Data that can only take on a specific set of values
- binary: Special case of categorical data that can only take two values
- ordinal: Categorical data that has an explicit ordering
- mean
- truncated mean
- weighted mean
- median
- outliers
- variance: sum of squared deviations from the mean, divided by N-1 (sample variance)
- standard deviation
- range: min/max values
- percentiles
- Interquartile range (IQR): 75th percentile - 25th percentile
- Boxplot
- Frequency table
- histogram
- density plot: kernel density estimate
- Mode: the most commonly occurring category or value
- Expected value: similar to a weighted mean, with each value weighted by its probability of occurrence
- Bar charts: The frequency or proportion for each category plotted as bars
- Pie charts: The frequency or proportion for each category plotted as wedges in a pie
- Entity embeddings
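
As a rough illustration of the location and variability estimates listed in the notes above (mean, truncated mean, weighted mean, median, N-1 variance, range, IQR) and of Tukey's method for outliers, here is a minimal sketch using only Python's standard library. The function names (`truncated_mean`, `weighted_mean`, `summarize`, `tukey_outliers`), the default `trim=1`, the fence multiplier `k=1.5`, and the "inclusive" quantile interpolation are assumptions made for this example; they are not taken from this project's code, which may compute these quantities differently.

```python
import statistics


def truncated_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    if 2 * trim >= len(ordered):
        raise ValueError("trim would remove every value")
    return statistics.mean(ordered[trim:len(ordered) - trim])


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def summarize(values):
    """Collect the basic location and variability estimates in one dict."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "truncated_mean": truncated_mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance, divides by N-1
        "std": statistics.stdev(values),          # square root of the sample variance
        "range": max(values) - min(values),
        "iqr": q3 - q1,                           # 75th percentile - 25th percentile
    }


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [2, 4, 4, 5, 7, 9, 40]
summary = summarize(sample)
outliers = tukey_outliers(sample)
```

On this sample the single extreme value pulls the mean well above the median, while the truncated mean and the IQR are barely affected, which is the usual reason to report robust estimates alongside the mean and variance.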