Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure reversibility FrequencyEncoder #534

Merged
merged 8 commits into from
Aug 13, 2022

Conversation

fealho
Copy link
Member

@fealho fealho commented Aug 11, 2022

Resolve #528.

@codecov-commenter
Copy link

codecov-commenter commented Aug 11, 2022

Codecov Report

Merging #534 (f9c021a) into master (b8e7170) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #534   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines         1452      1453    +1     
=========================================
+ Hits          1452      1453    +1     
Impacted Files Coverage Δ
rdt/transformers/categorical.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@pvk-developer pvk-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

@fealho fealho marked this pull request as ready for review August 12, 2022 01:58
@fealho fealho requested a review from a team as a code owner August 12, 2022 01:58
@fealho fealho requested review from amontanez24 and removed request for a team August 12, 2022 01:58
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a small comment and a question to better understand the operations being done. Besides that this looks good!

import string

import numpy as np
import sre_parse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this causes builds to fail depending on the machine. I think on sdv we just ended up ignoring it

Copy link
Member Author

@fealho fealho Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix it in my other RDT PR 👍

Comment on lines 201 to 202
diffs = (data >= starts)[:, ::-1]
indexes = num_categories - np.argmax(diffs, axis=1) - 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly are these two like doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I printed each line for an example dataset to clarify things:

Original Data
0    0.875
1    0.625
2    0.375
3    0.125
dtype: float64
Broadcasted Data
[[0.875 0.875 0.875 0.875]
 [0.625 0.625 0.625 0.625]
 [0.375 0.375 0.375 0.375]
 [0.125 0.125 0.125 0.125]]
Starts
[[0.   0.25 0.5  0.75]
 [0.   0.25 0.5  0.75]
 [0.   0.25 0.5  0.75]
 [0.   0.25 0.5  0.75]]
Diffs
[[ True  True  True  True]
 [False  True  True  True]
 [False False  True  True]
 [False False False  True]]
Indexes
[3 2 1 0]

Basically, Diffs marks the first interval our data fits in (the smallest value of starts which our data is greater than) with True. Indexes just finds this first True value and converts its position into the category it corresponds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, can we rename diffs to interval_starts_data_is_greater_than or something like that? and indexes to interval_indexes?

@fealho fealho requested a review from amontanez24 August 12, 2022 22:58
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comment! :shipit:

@fealho fealho merged commit 597a3c8 into master Aug 13, 2022
@fealho fealho deleted the issue-528-unpredictable-frequence-encoder branch August 13, 2022 01:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unpredictable results for FrequencyEncoder(add_noise=True)
4 participants