DSTK, Data Science Toolkit

Dependencies

pandas == 0.22.0

Package structure

Inspection

Bivariate inspection

chi2 (2 stars)
ANOVA (2 stars)
T-test (2 stars)
IV (3 stars)
KS (3 stars)
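As a rough illustration of the KS item above, the sketch below computes the two-sample Kolmogorov-Smirnov statistic between the scores of the positive class and the scores of the negative class. The function name `ks_statistic` and the 0/1 label encoding are assumptions made for this example; they are not taken from DSTK's code, and the package's own implementation may differ.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Largest gap between the empirical CDFs of the two classes.

    `labels` is assumed to hold 1 for the positive class and 0 otherwise.
    """
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")

    best = 0.0
    for threshold in sorted(set(scores)):
        # bisect_right counts how many sorted values are <= threshold
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

A value near 1 means the variable separates the two classes almost perfectly; a value near 0 means the two class distributions are nearly identical.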

Check collinearity

Collinearity
- TBD
Multicollinearity
- Variance Inflation Factor (3 stars)
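The Variance Inflation Factor can be sketched without any dependencies by using the identity that the VIF of feature j equals the j-th diagonal entry of the inverse of the feature correlation matrix. The helper names (`pearson`, `invert`, `vif`) and the dict-of-columns input are hypothetical choices for this example, not DSTK's API.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Map each feature name to its VIF; `columns` maps name -> values."""
    names = list(columns)
    corr = [
        [1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
        for a in names
    ]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}
```

With two features whose correlation is 0.8, both VIFs come out as 1 / (1 - 0.8²), roughly 2.78; uncorrelated features give a VIF of 1.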

OOT inspection

PSI (3 stars)
Dataframe comparison (unit tests-covered)
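For the PSI item above, a minimal sketch of the Population Stability Index is shown below. It assumes the two samples (for example, a training window and an out-of-time window) have already been counted into the same bins; the function name `psi` and the `eps` floor for empty bins are assumptions of this example rather than DSTK's interface.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts.

    PSI = sum((a_i - e_i) * ln(a_i / e_i)), where e_i and a_i are the
    proportions of the expected and actual samples falling in bin i.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the proportions so empty bins do not produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total
```

Identical distributions give a PSI of 0, and the value grows as the actual sample drifts away from the expected one.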

Data type detector (3 stars)

Numeric
Numeric-Categorical
String-Categorical
Time

Outlier detection

Univariate

Tukey's method
z-test
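The two univariate methods above can be sketched as follows. Tukey's method flags points beyond k times the interquartile range from the quartiles; the z-test item is interpreted here as z-score thresholding, which is an assumption about what the package means. Function names, the default `k=1.5`, the default threshold of 3, and the choice of the "inclusive" quantile method are all choices made for this example.

```python
from statistics import mean, quantiles, stdev


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (N-1 denominator)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

On small samples the z-score rule can miss obvious outliers because the outlier itself inflates the standard deviation: for `[1, 2, ..., 9, 100]` Tukey's method flags 100, while the z-score of 100 stays below 3.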

Multivariate

Residual threshold method
Local outlier factor
HiCS

Preprocessing

Imputing (unit tests-covered)

Continuous
- mean
- truncated mean
- median
- bin-nan
Categorical
- most frequent class
- stringify
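A minimal sketch of statistic-based imputing is given below. It assumes missing entries are represented by `None` (in a pandas workflow they would typically be NaN), and the names `impute` and `most_frequent` are hypothetical. The bin-nan and stringify strategies are not shown because the README does not define them.

```python
from collections import Counter
from statistics import mean, median


def most_frequent(values):
    """Most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def impute(values, statistic):
    """Replace None entries with `statistic` computed on observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


# Continuous columns: mean or median. Categorical: most frequent class.
print(impute([1, None, 3], mean))
print(impute([1, 2, None, 100], median))
print(impute(["a", "b", None, "a"], most_frequent))
```

Passing the statistic as a callable keeps the sketch short; a truncated mean or any other summary function can be plugged in the same way.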

MISC

onehot_split

Metric

Response related metrics

Clustering metrics

Purity (need unit tests)
Accuracy (need unit tests)
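Purity can be sketched as the fraction of points that belong to the majority true class of their assigned cluster. The function name and argument order below are assumptions for illustration. Clustering accuracy is not shown, since it additionally requires matching cluster ids to class labels.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Sum of per-cluster majority-class counts divided by N."""
    if len(cluster_ids) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    majority = sum(max(counts.values()) for counts in members.values())
    return majority / len(true_labels)
```

Purity is 1.0 when every cluster contains a single class. It also trends toward 1.0 as the number of clusters grows, so it is usually read alongside the cluster count.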

Transformation

Binning

Equal population binning (3 stars)
Equal value binning (3 stars)
Monotonic binning
ChiMerge
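The difference between equal value binning (bins of equal width) and equal population binning (bins holding roughly the same number of points) can be sketched as below. The function names, the left-closed bin convention, and the "inclusive" quantile method are assumptions of this example; monotonic binning and ChiMerge are not illustrated.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_value_edges(values, bins):
    """Edges that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)] + [hi]


def equal_population_edges(values, bins):
    """Edges placed at quantiles so each bin holds a similar share of points."""
    cuts = quantiles(values, n=bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def assign_bins(values, edges):
    """Bin index per value; a value equal to an inner edge goes to the upper bin."""
    inner = edges[1:-1]
    return [bisect_right(inner, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(assign_bins(values, equal_value_edges(values, 3)))
print(assign_bins(values, equal_population_edges(values, 2)))
```

With a skewed column like this one, equal value binning puts nine of the ten points in the first bin, while equal population binning splits them five and five.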

Encoding

Dummy (2 stars)
WOE (2 stars)
Tree leaves encoding
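As an illustration of WOE encoding, the sketch below computes a Weight of Evidence value per category and the resulting Information Value (IV) for a binary target. The sign convention (log of the non-event share over the event share), the additive `smoothing` term, and the function name `woe_iv` are assumptions of this example; conventions vary between libraries and DSTK's may differ.

```python
import math
from collections import defaultdict


def woe_iv(categories, targets, smoothing=0.5):
    """Return ({category: WOE}, IV) for a binary target (1 = event)."""
    events = defaultdict(float)
    non_events = defaultdict(float)
    for category, y in zip(categories, targets):
        if y == 1:
            events[category] += 1
        else:
            non_events[category] += 1

    levels = list(dict.fromkeys(categories))  # unique, first-seen order
    # Smoothing keeps categories with a zero count from producing log(0).
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)

    woe, iv = {}, 0.0
    for level in levels:
        p_event = (events[level] + smoothing) / total_events
        p_non_event = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(p_non_event / p_event)
        iv += (p_non_event - p_event) * woe[level]
    return woe, iv
```

Each category is then replaced by its WOE value. The IV is a sum of non-negative terms, so a larger IV indicates a stronger relationship between the variable and the target.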

Clustering

K-based clustering

Density-based clustering

Hierarchical clustering

Advanced clustering

Spectral clustering
Subspace clustering
Multi-sourced clustering
Multi-aspect clustering
Multi-task clustering

Deep learning-based clustering

AE + K-means
AE + Spectral clustering
AE + Subspace clustering

Feature learning

Adversarial representation learning

BiGAN
infoGAN
AAE

statistical description of raw data

Exploring and summarizing the data distribution

data type

Different types of data call for different processing methods

numeric

continuous: Data that can take on any value in an interval
discrete: Data that can only take on integer values

categorical

Data that can only take on a specific set of values

Binary: a special case of categorical data that can take only two values

ordinal

Categorical data that has an explicit ordering

Numeric data statistical description

Estimates of Location

mean
truncated mean
weighted mean
median
outliers
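The estimates of location listed above react very differently to outliers, which the following sketch tries to show. The helper names and the 10% trimming proportion are assumptions chosen for the example.

```python
from statistics import mean, median


def truncated_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the points from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(mean(data))            # pulled far upward by the single outlier
print(truncated_mean(data))  # drops the smallest and largest value first
print(median(data))          # depends only on the middle of the sorted data
```

On this sample the mean is 104.5, while the truncated mean and the median are both 5.5.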

Estimates of Variability

variance: sample variance, computed with an N-1 denominator
standard deviation
range: min/max values
percentiles
Interquartile range (IQR): 75th percentile - 25th percentile
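The estimates of variability above can be computed with the standard library as sketched below. The function name `variability` is hypothetical, and the "inclusive" quantile method is one of several percentile definitions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles, stdev, variance


def variability(values):
    """Sample variance (N-1), standard deviation, range, quartiles and IQR."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "variance": variance(values),  # divides by N-1
        "std": stdev(values),          # square root of the sample variance
        "range": max(values) - min(values),
        "quartiles": (q1, q2, q3),
        "iqr": q3 - q1,
    }


print(variability([2, 4, 4, 4, 5, 5, 7, 9]))
```

For this sample the squared deviations from the mean sum to 32, so the sample variance is 32 / 7 rather than the population value of 32 / 8.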

data distribution exploration

Boxplot
Frequency table
histogram
density plot: kernel density estimate

Categorical data statistical description

Mode: the most commonly occurring category or value
Expected value: similar to a weighted mean, with each value weighted by its probability of occurrence
Bar charts: the frequency or proportion of each category plotted as bars
Pie charts: the frequency or proportion of each category plotted as wedges in a pie
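The mode and expected value of a categorical column can be sketched as follows. The category names and the per-category numeric values (`payoffs`) are made-up inputs for illustration, and the function names are hypothetical.

```python
from collections import Counter


def mode_and_proportions(categories):
    """Most common category and the proportion of each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    proportions = {cat: n / total for cat, n in counts.items()}
    return counts.most_common(1)[0][0], proportions


def expected_value(payoffs, proportions):
    """Weighted mean of per-category values, weighted by proportion."""
    return sum(payoffs[cat] * p for cat, p in proportions.items())


plans = ["basic", "basic", "basic", "premium"]
mode, proportions = mode_and_proportions(plans)
print(mode, proportions)
print(expected_value({"basic": 10, "premium": 50}, proportions))
```

The proportions computed here are the same numbers a bar chart or pie chart of the column would display.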

MISC

Entity embeddings
