Unpredictable results for FrequencyEncoder(add_noise=True) #528

Closed
npatki opened this issue Jul 18, 2022 · 0 comments · Fixed by #534
Labels: bug (Something isn't working)

npatki commented Jul 18, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • RDT version: 1.1.0
  • Python version: 3.9
  • Operating System: Colab Notebook

Error Description

When noise is added, the observed composition identity of this transformer is False even though it is listed as True. This is causing problems with conditional sampling in the SDV.

There seem to be 2 related issues:

  1. The forward transform can noise values outside the allowable range for a category, and
  2. Some of the reverse-transformed values do not follow the intervals

Steps to reproduce

To replicate, download and use the student_placements dataset.

import numpy as np
import pandas as pd
import rdt

data = pd.read_csv('student_placements.csv')

ht = rdt.HyperTransformer()
ht.detect_initial_config(data)
ht.update_transformers(column_name_to_transformer={
    'gender': rdt.transformers.categorical.FrequencyEncoder(add_noise=True),
    'high_spec': rdt.transformers.categorical.FrequencyEncoder(add_noise=True)
})

np.random.seed(seed=33)

transformed = ht.transform(data)
reversed = ht.reverse_transform(transformed)

Observe that the original data and the reverse-transformed data do not have the same values for two of the rows:

correct_rows = data['gender'] == reversed['gender']
correct_rows.value_counts()

True     213
False      2
Name: gender, dtype: int64

Forward Transform

Observe that M is supposed to be mapped to the interval (0, 0.6465116279069767). Sometimes the forward transform maps a value outside that range (e.g. 0.655336 for row 105).

data[~correct_rows]['gender']
105    M
187    M
Name: gender, dtype: object

transformed[~correct_rows]['gender.value']
105    0.655336
187    0.606273
Name: gender.value, dtype: float64

ht._transformers_tree['gender']['transformer'].intervals
{'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
 'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612)}
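One way to keep a noised value inside its category's range is to clip it after sampling. The following is a minimal sketch, not RDT's implementation: it assumes the four fields of each interval tuple are (start, end, mean, std), which is consistent with the values printed above, and `noised_value` is a hypothetical helper.

```python
import numpy as np

# Interval tuple copied from the output above; reading its four fields as
# (start, end, mean, std) is an assumption.
M_INTERVAL = (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612)


def noised_value(interval, rng):
    """Hypothetical helper: draw a noised value that stays inside the interval."""
    start, end, mean, std = interval
    value = rng.normal(mean, std)  # unbounded draw around the category mean
    return float(np.clip(value, start, end))  # force it back into [start, end]


rng = np.random.default_rng(33)
samples = [noised_value(M_INTERVAL, rng) for _ in range(10_000)]
```

Clipping moves any out-of-range draw onto the interval boundary, so a value exactly equal to a shared boundary is still ambiguous; resampling until the draw lands strictly inside the interval would be an alternative.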

Reverse Transform

Observe that M is supposed to be mapped to the interval (0, 0.6465116279069767). But some values inside it, like 0.606273, are reverse transformed to F.

transformed[~correct_rows]['gender.value']
105    0.655336
187    0.606273
Name: gender.value, dtype: float64

reversed[~correct_rows]['gender']
105    F
187    F
Name: gender, dtype: object

ht._transformers_tree['gender']['transformer'].intervals
{'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
 'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612)}

This is likely due to this line: we take the difference between each value and the average of each category, then choose the category with the minimum distance. That doesn't make sense once the value is noised. We should instead check which interval each value falls within.
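The difference between the two lookups can be sketched with the intervals printed above. This is an illustration under assumptions rather than RDT's code: the tuple fields are assumed to be (start, end, mean, std), and `nearest_mean` and `containing_interval` are hypothetical names.

```python
# Intervals copied from the output above; the (start, end, mean, std)
# reading of each tuple is an assumption.
INTERVALS = {
    'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
    'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612),
}


def nearest_mean(value, intervals):
    """Pick the category whose mean is closest to the value."""
    return min(intervals, key=lambda cat: abs(value - intervals[cat][2]))


def containing_interval(value, intervals):
    """Pick the category whose [start, end) range contains the value."""
    ordered = sorted(intervals.items(), key=lambda item: item[1][0])
    for category, (_start, end, _mean, _std) in ordered:
        if value < end:
            return category
    # Values at or beyond the last end point belong to the last category.
    return ordered[-1][0]


print(nearest_mean(0.606273, INTERVALS))         # F
print(containing_interval(0.606273, INTERVALS))  # M
```

With the interval lookup, row 187 (0.606273) reverses to M as expected. Row 105 (0.655336) still reverses to F under either lookup, because the forward transform already placed it outside M's range.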
