Should float columns be binned in the K-marginal score? #31

AwesomeLemon · 2024-12-02T15:33:11Z

Hi,

I'm interested in the K-marginal score. In the implementation (https://github.com/usnistgov/SDNist/blob/main/sdnist/metrics/kmarginal.py), I don't see float values being binned. Are they binned somewhere else in the code, or not at all? If they are not binned anywhere, why not?

My reasoning is that without binning, creating marginals via df.groupby leads to unique points for any marginal that includes a float value: e.g., records (0, 1, 0.05) and (0, 1, 0.05000001) would be considered distinct and count as errors in synthetic data, decreasing the metric value. This seems undesirable to me since the probability of getting exactly equal float values is vanishingly small. Binning the float values would address this issue.
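To make the concern concrete, here is a minimal sketch in plain Python. It is not SDNist code: `marginal_counts`, `l1_distance`, and the bin width of 0.1 are hypothetical names and values chosen for illustration, and the L1 distance is only a simplified stand-in for the actual score. The counting step groups records by exact equality, which is my understanding of what a `df.groupby` on the raw columns would do.

```python
from collections import Counter


def marginal_counts(records, bin_width=None):
    """Count occurrences of each distinct (a, b, x) record.

    If bin_width is given, the float field x is replaced by its
    bin index floor(x / bin_width) before counting.
    """
    counts = Counter()
    for a, b, x in records:
        key = (a, b, x if bin_width is None else int(x // bin_width))
        counts[key] += 1
    return counts


def l1_distance(p, q):
    """Sum of absolute differences between two normalized count tables."""
    n_p, n_q = sum(p.values()), sum(q.values())
    # Counter returns 0 for missing keys, so cells present in only one
    # table contribute their full density to the distance.
    return sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


# Two datasets whose float values differ only by tiny perturbations.
target = [(0, 1, 0.05), (0, 1, 0.31)]
synthetic = [(0, 1, 0.05000001), (0, 1, 0.31000002)]

# Without binning, no cell is shared between the two tables.
raw = l1_distance(marginal_counts(target), marginal_counts(synthetic))

# With binning (width 0.1 is an arbitrary choice), the cells coincide.
binned = l1_distance(marginal_counts(target, 0.1), marginal_counts(synthetic, 0.1))

print(raw, binned)
```

In this toy case the unbinned distance is the maximum possible, even though the two datasets are nearly identical, while the binned distance is zero.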

I would appreciate it if you could help me better understand this matter.
