Consider frequency when ordering categories in LabelEncoder #611

Open
npatki opened this issue Jan 12, 2023 · 0 comments
Labels
feature request Request for a new feature

Comments

Contributor

npatki commented Jan 12, 2023

Problem Description

If I have unordered categorical data (aka nominal data), then it doesn't theoretically matter how the LabelEncoder decides to order the categories.

However, in practice, certain orders are better than others. In particular, an ascending-then-descending pattern of frequency makes the encoded data more closely resemble a bell curve, which is useful for data science.

Expected behavior

Add another option for the order_by parameter called 'frequency_inverted_v' (name TBD).

When set, the transformer should:

  1. Compute the frequency of each category
  2. Order the categories by frequency in an ascending-then-descending pattern, such that the most common value is in the middle and the overall pattern is an inverted "V" shape (most similar to a bell-shaped curve)
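
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the name `inverted_v_order` is hypothetical, the input is assumed to be a simple list of category values, and ties in frequency are broken by first appearance.

```python
from collections import Counter


def inverted_v_order(values):
    """Return categories ordered so that frequency rises to a peak, then falls.

    Hypothetical helper for illustration only; not part of the library.
    """
    # Step 1: compute frequencies, most common first
    # (most_common() keeps first-seen order among ties).
    ranked = [category for category, _ in Counter(values).most_common()]

    # Step 2: ranks 0, 2, 4, ... go to the right of the peak (peak included);
    # ranks 1, 3, 5, ... go to the left, reversed so frequency rises toward the peak.
    right = ranked[0::2]
    left = ranked[1::2][::-1]
    return left + right


values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
order = inverted_v_order(values)
print(order)  # ['d', 'b', 'a', 'c', 'e']
```

In this sample the category counts along the resulting order are 2, 4, 5, 3, 1, so the most common value sits in the middle.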

Additional context

Empirically, this seems to produce drastically better results than the default.

Default ordering: Order is assigned first-come, first-served
image

V-shaped ordering: Order is assigned by frequency, in an inverted V shape to resemble a bell-shaped distribution.
image

One way to accomplish this is by sorting the categories by frequency and then assigning them in an alternating fashion from the middle out.

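def get_category_order(data, column_name):
  # value_counts() sorts by descending frequency, so index 0 is the most common category
  sorted_categories = list(data[column_name].value_counts().index)

  # Even ranks (0, 2, 4, ...) start at the most common category and fill the right side
  evens = sorted_categories[0::2]
  # Odd ranks (1, 3, 5, ...) fill the left side, reversed so frequency rises toward the middle
  odds = sorted_categories[1::2]
  odds.reverse()

  return odds + evens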
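
To make the effect on the encoded values concrete, here is a small sketch of how such an order might be turned into integer labels. The `encode` helper, the sample data, and the hard-coded `order` list are assumptions for illustration; how LabelEncoder would actually consume the order is not specified in this issue.

```python
from collections import Counter


def encode(values, order):
    """Map each category to its position in `order` (hypothetical helper)."""
    mapping = {category: code for code, category in enumerate(order)}
    return [mapping[value] for value in values]


# Sample data with counts a=5, b=4, c=3, d=2, e=1, and the inverted-V
# order that the middle-out approach yields for those counts.
values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
order = ["d", "b", "a", "c", "e"]

codes = encode(values, order)

# Count how often each integer label occurs, in label order.
counts = Counter(codes)
histogram = [counts[code] for code in range(len(order))]
print(histogram)  # [2, 4, 5, 3, 1]
```

The histogram over the integer labels rises to a single peak and then falls, which is the bell-like shape the request is after.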
@npatki npatki added the feature request Request for a new feature label Jan 12, 2023