Latest commit

99b54bf · Aug 13, 2024

Encoding-categorical-data

Encoding categorical features

  1. To convert categorical features to integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature into one new feature of integers (0 to n_categories - 1).

    • Retains the order of categories when encoding ordinal data, which can be ranked or ordered. For example, education levels (high school, bachelor's, master's, Ph.D.) or temperature categories (cold, warm, hot).

  2. Another way to convert categorical features into features that scikit-learn estimators can use is one-of-K encoding, also known as one-hot or dummy encoding. This encoding is provided by the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, exactly one of which is 1 and all others 0.

    • Considers the presence or absence of a feature when encoding nominal data, which has categories with no intrinsic order or ranking. For example, colors (red, blue, green), types of animals (mammal, fish, reptile, amphibian, or bird), brand names (Coca-Cola, Pepsi, Sprite), or pizza toppings (pepperoni, mushrooms, onions).

  3. LabelEncoder encodes target labels (y, not the input features X) with values between 0 and n_classes - 1.

See also: sklearn OrdinalEncoder, sklearn OneHotEncoder, sklearn LabelEncoder
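
The three encoders above can be sketched side by side on toy data. This is a minimal illustration, not a recommended pipeline: the `education`, `color`, and label values are made-up sample data, and the explicit `categories` ranking is an assumption added here so that the integer codes follow the intended order (by default OrdinalEncoder sorts categories alphabetically).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal feature and one nominal feature.
education = np.array([["bachelor's"], ["high school"], ["Ph.D."], ["master's"]])
color = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal data: pass the ranking explicitly so the codes respect it.
ordinal = OrdinalEncoder(
    categories=[["high school", "bachelor's", "master's", "Ph.D."]]
)
education_codes = ordinal.fit_transform(education)
print(education_codes.ravel())  # [1. 0. 3. 2.]

# Nominal data: one binary column per category (sorted: blue, green, red).
onehot = OneHotEncoder()
color_onehot = onehot.fit_transform(color).toarray()  # sparse -> dense
print(onehot.categories_[0])  # ['blue' 'green' 'red']
print(color_onehot)

# Target labels: LabelEncoder takes a 1-D array of class labels.
label = LabelEncoder()
y = label.fit_transform(["spam", "ham", "spam", "eggs"])
print(label.classes_)  # ['eggs' 'ham' 'spam']
print(y)               # [2 1 2 0]
```

Note that OrdinalEncoder and OneHotEncoder expect 2-D input (samples by features), while LabelEncoder expects a 1-D sequence of labels.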

Column Transformer

  • This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

sklearn ColumnTransformer
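
As a rough sketch of how this works in practice, the example below applies a different encoder to each column of a small DataFrame and concatenates the results. The column names (`size`, `color`, `price`) and their values are hypothetical; `remainder="passthrough"` and `sparse_threshold=0` are choices made here to keep the numeric column and get a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical heterogeneous data: ordinal, nominal, and numeric columns.
df = pd.DataFrame(
    {
        "size": ["small", "large", "medium"],
        "color": ["red", "blue", "red"],
        "price": [10.0, 25.0, 17.5],
    }
)

ct = ColumnTransformer(
    transformers=[
        # Ordinal column: explicit ranking small < medium < large.
        ("size", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
        # Nominal column: one binary column per color (sorted: blue, red).
        ("color", OneHotEncoder(), ["color"]),
    ],
    remainder="passthrough",  # keep untransformed columns (here: price)
    sparse_threshold=0,       # always return a dense array
)

features = ct.fit_transform(df)
print(features)
# Columns: size code, color=blue, color=red, price
# [[ 0.   0.   1.  10. ]
#  [ 2.   1.   0.  25. ]
#  [ 1.   0.   1.  17.5]]
```

The output columns follow the order of the `transformers` list, with passthrough columns appended at the end.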