Encoding-categorical-data

Aug 13, 2024

To convert categorical features to integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature into one new feature of integers (0 to n_categories - 1).
- Suited to ordinal data, whose categories can be ranked or ordered. For example, education levels (high school, bachelor's, master's, Ph.D.) or temperature categories (cold, warm, hot). The intended order is retained only when it is passed through the categories parameter; by default the encoder sorts the category values, so string categories are coded alphabetically.
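As a minimal sketch of the ordinal case, the snippet below encodes a made-up education column. The data and the `levels` list are illustrative assumptions, not taken from the CSV files in this repository.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column of education levels (illustration only).
X = [["bachelor's"], ["high school"], ["Ph.D."], ["master's"], ["high school"]]

# Pass the ranking explicitly. Without `categories`, the encoder would sort
# the values and the integer codes would not follow the intended order.
levels = ["high school", "bachelor's", "master's", "Ph.D."]
encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(X)

print(codes.ravel())  # [1. 0. 3. 2. 0.]
```

Each code is the position of the value in `levels`, so larger integers mean a higher education level.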
Another way to convert categorical features into features that scikit-learn estimators can use is one-of-K encoding, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with exactly one of them set to 1 and all others set to 0.
- Considers the presence or absence of a feature when encoding nominal data, which has categories with no intrinsic order or ranking. For example, colors (red, blue, green), types of animals (mammal, fish, reptile, amphibian, or bird), brand names (Coca-Cola, Pepsi, Sprite), or pizza toppings (pepperoni, mushrooms, onions).
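A short sketch of one-hot encoding on a nominal feature follows. The color values are a made-up example. The encoder returns a sparse matrix by default, so the result is converted with `toarray()` for display.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature: colors have no intrinsic order.
X = [["red"], ["blue"], ["green"], ["blue"]]

encoder = OneHotEncoder()
onehot = encoder.fit_transform(X).toarray()  # sparse by default, made dense here

# Learned categories are sorted; each one becomes a binary column.
print(encoder.categories_[0])  # ['blue' 'green' 'red']
print(onehot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```

Every row contains exactly one 1, in the column of the category that the sample belongs to.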
LabelEncoder encodes target labels with values between 0 and n_classes - 1. It is meant for the target y, not for the input features X.
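For illustration, a minimal LabelEncoder sketch on a hypothetical yes/no target (the labels are assumed, not read from the datasets here):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels.
y = ["yes", "no", "no", "yes"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(label_encoder.classes_)  # ['no' 'yes']
print(y_encoded)               # [1 0 0 1]

# The mapping is reversible, which helps when reporting predictions.
y_decoded = label_encoder.inverse_transform(y_encoded)
```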

ColumnTransformer allows different columns or column subsets of the input to be transformed separately; the features generated by each transformer are concatenated to form a single feature space. This is useful for heterogeneous or columnar data, because it combines several feature extraction mechanisms or transformations into a single transformer.
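The sketch below shows one way the encoders could be combined with ColumnTransformer. The DataFrame, its column names (`review`, `city`, `age`), and the category order are hypothetical and are not read from the CSV files in this repository.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical mixed-type data: one ordinal, one nominal, one numeric column.
df = pd.DataFrame({
    "review": ["poor", "good", "average", "good"],
    "city": ["Delhi", "Mumbai", "Delhi", "Kolkata"],
    "age": [25, 40, 31, 58],
})

transformer = ColumnTransformer(
    transformers=[
        # Ordinal column: the ranking is passed explicitly.
        ("ordinal", OrdinalEncoder(categories=[["poor", "average", "good"]]), ["review"]),
        # Nominal column: one binary feature per city.
        ("onehot", OneHotEncoder(), ["city"]),
    ],
    remainder="passthrough",  # keep the untouched numeric column
    sparse_threshold=0,       # always return a dense array
)

features = transformer.fit_transform(df)
print(features.shape)  # (4, 5)
```

The output columns follow the order of the transformers: one ordinal code, three one-hot city columns, then the passed-through `age` column.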