Ensure reversibility `FrequencyEncoder` #534

fealho · 2022-08-11T21:54:07Z

Resolve #528.

codecov-commenter · 2022-08-11T21:56:07Z

Codecov Report

Merging #534 (f9c021a) into master (b8e7170) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #534   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines         1452      1453    +1     
=========================================
+ Hits          1452      1453    +1

Impacted Files	Coverage Δ
rdt/transformers/categorical.py	`100.00% <100.00%> (ø)`

pvk-developer

LGTM

amontanez24

I just have a small comment and a question to better understand the operations being done. Besides that this looks good!

amontanez24 · 2022-08-12T19:06:41Z

rdt/transformers/utils.py

 import string

 import numpy as np
+import sre_parse


I think this import causes builds to fail on some machines. On sdv I think we just ended up ignoring it.

I'll fix it in my other RDT PR 👍

amontanez24 · 2022-08-12T19:07:49Z

rdt/transformers/categorical.py

+        diffs = (data >= starts)[:, ::-1]
+        indexes = num_categories - np.argmax(diffs, axis=1) - 1


What exactly are these two lines doing?

I printed each line for an example dataset to clarify things:

    Original Data
    0    0.875
    1    0.625
    2    0.375
    3    0.125
    dtype: float64

    Broadcasted Data
    [[0.875 0.875 0.875 0.875]
     [0.625 0.625 0.625 0.625]
     [0.375 0.375 0.375 0.375]
     [0.125 0.125 0.125 0.125]]

    Starts
    [[0.   0.25 0.5  0.75]
     [0.   0.25 0.5  0.75]
     [0.   0.25 0.5  0.75]
     [0.   0.25 0.5  0.75]]

    Diffs
    [[ True  True  True  True]
     [False  True  True  True]
     [False False  True  True]
     [False False False  True]]

    Indexes
    [3 2 1 0]

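Basically, Diffs marks with True every interval start that our data is greater than or equal to. The columns are reversed, so the first True in each row corresponds to the largest such start, i.e. the interval the value falls in. Indexes just finds this first True value and converts its position back into the index of the category it corresponds to.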
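For anyone who wants to reproduce the printout, here is a minimal standalone sketch of the same two lines. The values of `data` and `starts` are copied from the example above; the explicit broadcasting and the `np.searchsorted` cross-check are assumptions added for illustration and are not part of the PR.

```python
import numpy as np

# Example values taken from the printout above (not from the real transformer state).
data = np.array([0.875, 0.625, 0.375, 0.125])
starts = np.array([0.0, 0.25, 0.5, 0.75])
num_categories = len(starts)

# Compare every value (rows) against every interval start (columns).
# data[:, None] has shape (4, 1) and broadcasts against starts of shape (4,).
# Reversing the columns puts the largest start first.
diffs = (data[:, None] >= starts)[:, ::-1]

# argmax returns the position of the first True in each reversed row;
# subtracting from num_categories - 1 maps it back to the original column.
indexes = num_categories - np.argmax(diffs, axis=1) - 1
print(indexes)

# Cross-check (illustrative alternative, assumes starts is sorted ascending):
# the interval of x is the last start that is <= x.
via_searchsorted = np.searchsorted(starts, data, side="right") - 1
print(via_searchsorted)
```

Both approaches pick the same interval for each value here. The broadcast version mirrors what the diff does; the `searchsorted` version is only a sanity check and relies on `starts` being sorted.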

Cool, can we rename `diffs` to `interval_starts_data_is_greater_than` or something like that, and `indexes` to `interval_indexes`?

amontanez24

Thanks for addressing the comment!

fealho added 4 commits August 10, 2022 19:51

Select intervals using starts instead of means (fixes the bug)

d0b3071

Fix forward transform noise

e28ae55

Add integration test + fix bug

7f6c48d

Fix lint

de3cf50

Remove integration test change

f59fc42

pvk-developer approved these changes Aug 11, 2022

fealho added 2 commits August 11, 2022 18:36

Make it always deterministic

e709269

Remove unnecessary test

6ca09a6

fealho marked this pull request as ready for review August 12, 2022 01:58

fealho requested a review from a team as a code owner August 12, 2022 01:58

fealho requested review from amontanez24 and removed request for a team August 12, 2022 01:58

amontanez24 reviewed Aug 12, 2022

fealho requested a review from amontanez24 August 12, 2022 22:58

Address feedback

f9c021a

amontanez24 approved these changes Aug 13, 2022

fealho merged commit 597a3c8 into master Aug 13, 2022

fealho deleted the issue-528-unpredictable-frequence-encoder branch August 13, 2022 01:27
