Unpredictable results for `FrequencyEncoder(add_noise=True)` #528

npatki · 2022-07-18T20:26:10Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

RDT version: 1.1.0
Python version: 3.9
Operating System: Colab Notebook

Error Description

When there is added noise, the observed composition identity of this transformer is False even though it is listed as True. This is causing some problems with conditional sampling in the SDV.

There seem to be 2 related issues:

The forward transform can noise values outside of the allowable range for a category, and
Some of the reverse transformed values do not follow the intervals.

Steps to reproduce

To replicate, download and use the student_placements dataset.

import numpy as np
import pandas as pd
import rdt

data = pd.read_csv('student_placements.csv')

ht = rdt.HyperTransformer()
ht.detect_initial_config(data)
ht.update_transformers(column_name_to_transformer={
    'gender': rdt.transformers.categorical.FrequencyEncoder(add_noise=True),
    'high_spec': rdt.transformers.categorical.FrequencyEncoder(add_noise=True)
})

np.random.seed(seed=33)

transformed = ht.transform(data)
reversed = ht.reverse_transform(transformed)

Observe that the original data and the reverse transformed data do not have the same values for two of the rows:

correct_rows = data['gender'] == reversed['gender']
correct_rows.value_counts()

True     213
False      2
Name: gender, dtype: int64

Forward Transform

Observe that M is supposed to be mapped to the interval (0, 0.6465116279069767). Sometimes the forward transform maps a value outside that range (e.g. 0.655336).

data[~correct_rows]['gender']
105    M
187    M
Name: gender, dtype: object

transformed[~correct_rows]['gender.value']
105    0.655336
187    0.606273
Name: gender.value, dtype: float64

ht._transformers_tree['gender']['transformer'].intervals
{'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
 'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612)}
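One possible way to keep the forward transform inside a category's interval is to clip the noised draw to that interval. The sketch below is only an illustration, not RDT's implementation: `noised_value` is a hypothetical helper, the interval tuples are copied from the output above and assumed to be `(start, end, mean, std)`, and drawing from a normal distribution is an assumption about how the noise is generated.

```python
import numpy as np

# Interval tuples copied from the issue output, assumed to be (start, end, mean, std).
intervals = {
    'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
    'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612),
}


def noised_value(category, rng):
    """Hypothetical helper: draw a noised value and keep it inside the interval."""
    start, end, mean, std = intervals[category]
    value = rng.normal(mean, std)
    # Clip to [start, end): the upper bound is the largest float below `end`,
    # so a clipped value never lands on the neighbouring category's start.
    upper = np.nextafter(end, start)
    return float(np.clip(value, start, upper))


rng = np.random.default_rng(33)
samples = [noised_value('M', rng) for _ in range(10_000)]
```

Clipping piles a small amount of probability mass at the interval edges; a truncated normal would avoid that, at the cost of a slightly more involved draw.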

Reverse Transform
Observe that M is supposed to be mapped to the interval (0, 0.6465116279069767). But some values inside it -- like 0.606273 -- are reverse transformed to F.

transformed[~correct_rows]['gender.value']
105    0.655336
187    0.606273
Name: gender.value, dtype: float64

reversed[~correct_rows]['gender']
105    F
187    F
Name: gender, dtype: object

ht._transformers_tree['gender']['transformer'].intervals
{'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
 'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612)}

This is likely due to this line -- we take the difference between the value and the average of each category and choose the category with the minimum distance. This doesn't make sense when the value is noised. We should instead check whether each value falls within the correct interval.
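To make the difference concrete, here is a minimal sketch comparing the two lookup strategies on the values from this report. Both functions are hypothetical illustrations written for this issue, not RDT code, and the interval tuples are again assumed to be `(start, end, mean, std)`.

```python
# Interval tuples copied from the issue output, assumed to be (start, end, mean, std).
intervals = {
    'F': (0.6465116279069767, 1.0, 0.8232558139534883, 0.05891472868217054),
    'M': (0, 0.6465116279069767, 0.32325581395348835, 0.10775193798449612),
}


def reverse_by_nearest_mean(value):
    """Pick the category whose mean is closest to the value (the behaviour described above)."""
    return min(intervals, key=lambda cat: abs(value - intervals[cat][2]))


def reverse_by_interval(value):
    """Pick the category whose half-open interval [start, end) contains the value."""
    for category, (start, end, _mean, _std) in intervals.items():
        if start <= value < end:
            return category
    # Values at or beyond the outer edges fall back to the nearest edge category.
    lowest = min(intervals, key=lambda cat: intervals[cat][0])
    highest = max(intervals, key=lambda cat: intervals[cat][1])
    return lowest if value < intervals[lowest][0] else highest


print(reverse_by_nearest_mean(0.606273))  # closer to the F mean than to the M mean
print(reverse_by_interval(0.606273))      # inside the M interval
```

With the interval check, 0.606273 reverses to M as expected. The value 0.655336 still reverses to F under either strategy, because the forward transform already placed it outside the M interval, so both problems need fixing for the round trip to hold.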


npatki added the bug label Jul 18, 2022

npatki mentioned this issue Jul 18, 2022

Conditional sampling using GaussianCopula inefficient when categories are noised sdv-dev/SDV#910

Closed

fealho mentioned this issue Aug 11, 2022

Ensure reversibility FrequencyEncoder #534

Merged

fealho closed this as completed in #534 Aug 13, 2022

amontanez24 assigned fealho Aug 16, 2022

amontanez24 added this to the 1.2.0 milestone Aug 16, 2022
