Should float columns be binned in the K-marginal score? #31

Open
AwesomeLemon opened this issue Dec 2, 2024 · 0 comments


Hi,

I'm interested in the K-marginal score. In the implementation (https://github.com/usnistgov/SDNist/blob/main/sdnist/metrics/kmarginal.py), I don't see the float values being binned. Are they binned somewhere else in the code, or not at all? If they are not binned anywhere, why is that?

My reasoning is that without binning, creating marginals via df.groupby puts nearly every record in its own cell for any marginal that includes a float column: e.g., records (0, 1, 0.05) and (0, 1, 0.05000001) would be considered distinct and count as errors in the synthetic data, decreasing the metric value. This seems undesirable to me, since the probability of two float values being exactly equal is vanishingly small. Binning the float values would address this issue.
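
To make the concern concrete, here is a minimal sketch of the effect. It is not the SDNist code: the helper names (`marginal_density`, `marginal_l1_distance`, `with_binned`), the toy data, and the bin edges are all assumptions made up for illustration. It only shows how a groupby-based marginal comparison behaves with and without binning a float column.

```python
import pandas as pd

def marginal_density(df, cols):
    """Share of records in each distinct value combination of `cols`."""
    return df.groupby(cols).size() / len(df)

def marginal_l1_distance(target, synthetic, cols):
    """Sum of absolute density differences over all cells (0 = identical, 2 = disjoint)."""
    t = marginal_density(target, cols)
    s = marginal_density(synthetic, cols)
    # Cells present in only one dataset get density 0 in the other.
    return t.sub(s, fill_value=0).abs().sum()

def with_binned(df, col, edges):
    """Return a copy of `df` with `col` replaced by integer bin codes."""
    out = df.copy()
    out[col] = pd.cut(out[col], bins=edges, labels=False, include_lowest=True)
    return out

# Toy data: the synthetic floats differ from the target floats by 1e-8.
target = pd.DataFrame({"a": [0, 0, 1, 1],
                       "b": [1, 1, 0, 0],
                       "x": [0.05, 0.30, 0.70, 0.90]})
synthetic = target.copy()
synthetic["x"] = [0.05000001, 0.30000001, 0.70000001, 0.90000001]

cols = ["a", "b", "x"]

# Without binning, no cell is shared, so the distance is maximal.
raw_distance = marginal_l1_distance(target, synthetic, cols)

# With two arbitrary equal-width bins on x, the marginals coincide.
edges = [0.0, 0.5, 1.0]
binned_distance = marginal_l1_distance(with_binned(target, "x", edges),
                                       with_binned(synthetic, "x", edges),
                                       cols)

print(raw_distance, binned_distance)
```

In this toy case the unbinned distance is at its maximum even though the two datasets are practically identical, while the binned distance is zero.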

I would appreciate it if you could help me better understand this matter.
