Ensure reversibility FrequencyEncoder
#534
Codecov Report
@@            Coverage Diff            @@
##            master      #534   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           16        16
  Lines         1452      1453    +1
=========================================
+ Hits          1452      1453    +1
LGTM
I just have a small comment and a question to better understand the operations being done. Besides that this looks good!
rdt/transformers/utils.py
Outdated
import string

import numpy as np
import sre_parse
I think this causes builds to fail depending on the machine. I think on sdv we just ended up ignoring it
I'll fix it in my other RDT PR 👍
rdt/transformers/categorical.py
Outdated
diffs = (data >= starts)[:, ::-1]
indexes = num_categories - np.argmax(diffs, axis=1) - 1
What exactly are these two lines doing?
I printed each line for an example dataset to clarify things:
Original Data
0 0.875
1 0.625
2 0.375
3 0.125
dtype: float64
Broadcasted Data
[[0.875 0.875 0.875 0.875]
[0.625 0.625 0.625 0.625]
[0.375 0.375 0.375 0.375]
[0.125 0.125 0.125 0.125]]
Starts
[[0. 0.25 0.5 0.75]
[0. 0.25 0.5 0.75]
[0. 0.25 0.5 0.75]
[0. 0.25 0.5 0.75]]
Diffs
[[ True True True True]
[False True True True]
[False False True True]
[False False False True]]
Indexes
[3 2 1 0]
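Basically, each row of Diffs compares one data value against every interval start, with the columns reversed. The first True in a row therefore marks the interval our data fits in (the largest value of starts that our data is greater than or equal to). Indexes finds this first True value with np.argmax and converts its position back into the index of the category it corresponds to.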
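For anyone who wants to reproduce the printout, here is a minimal standalone sketch of the same two lines. It assumes `data` is already a column of shape `(n, 1)` and `starts` is a row of interval starts, which is my reading of the broadcasted arrays shown above rather than a copy of the transformer code. The `np.searchsorted` line at the end is only a cross-check I added and is not part of this PR.

```python
import numpy as np

# Example values taken from the printout above.
data = np.array([0.875, 0.625, 0.375, 0.125])
starts = np.array([0.0, 0.25, 0.5, 0.75])
num_categories = len(starts)

# Assumption: compare every data value (column) against every start (row).
# Broadcasting (4, 1) >= (4,) yields a (4, 4) boolean matrix.
# Reversing the columns puts the largest start first in each row.
diffs = (data[:, None] >= starts)[:, ::-1]

# argmax returns the position of the first True in each reversed row;
# subtracting from num_categories - 1 maps it back to the original column.
indexes = num_categories - np.argmax(diffs, axis=1) - 1
print(indexes)  # [3 2 1 0]

# Cross-check (not in the PR): the last start that is <= each data value.
check = np.searchsorted(starts, data, side="right") - 1
```

Both approaches agree on this example; the broadcasted version builds an `n x num_categories` matrix, so the difference would only matter for many categories.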
Cool, can we rename diffs to interval_starts_data_is_greater_than or something like that? And indexes to interval_indexes?
Thanks for addressing the comment!
Resolve #528.