Sampling with conditions={column: 0.0} for float columns doesn't work #525

Closed
shlomihod opened this issue Jul 28, 2021 · 2 comments · Fixed by #771

@shlomihod

Environment Details

  • SDV version: 0.11.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04.1 LTS

Error Description

Trying to apply a condition on sampling with the value 0. for a float column leads to an exception.

Steps to reproduce

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula

data = load_tabular_demo('student_placements')
data['experience_years'] = data['experience_years'].astype(float)  # for the demonstration
model = GaussianCopula()
model.fit(data)
model.sample(1, conditions={'experience_years': 0.})

ValueError                                Traceback (most recent call last)
<ipython-input-147-1a9ae5e719f8> in <module>
      6 model = GaussianCopula()
      7 model.fit(data)
----> 8 model.sample(1, conditions={'experience_years': 0.})

/opt/conda/lib/python3.8/site-packages/sdv/tabular/base.py in sample(self, num_rows, max_retries, max_rows_multiplier, conditions, float_rtol, graceful_reject_sampling)

/opt/conda/lib/python3.8/site-packages/sdv/tabular/base.py in _conditionally_sample_rows(self, dataframe, max_retries, max_rows_multiplier, condition, transformed_condition, float_rtol, graceful_reject_sampling)

ValueError: No valid rows could be generated with the given conditions.

Ability to contribute

I think that I can fix the issue and I'll create a pull request.
I suspect this happens because of how rtol behaves at zero: the distance becomes 0, and the method _filter_conditions checks with < instead of <= (NumPy uses <= [source]):

if column_values.dtype.kind == 'f':
    distance = value * float_rtol
    sampled = sampled[np.abs(column_values - value) < distance]
    sampled[column] = value
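
To see why the strict comparison rejects every row when the condition is zero, here is a minimal standalone sketch of the same check in plain Python. `filter_close` and its `inclusive` flag are hypothetical names used for illustration, not part of SDV, and the `0.01` tolerance is an assumed value.

```python
def filter_close(values, target, float_rtol=0.01, inclusive=False):
    """Keep the values within a relative tolerance of target.

    Hypothetical helper mirroring the check quoted above; not SDV code.
    """
    distance = target * float_rtol  # 0.0 whenever target == 0.0
    if inclusive:
        return [v for v in values if abs(v - target) <= distance]
    return [v for v in values if abs(v - target) < distance]


sampled = [0.0, 0.0, 1.0, 2.5]
strict = filter_close(sampled, 0.0)                   # uses <
relaxed = filter_close(sampled, 0.0, inclusive=True)  # uses <=
print(strict)   # []
print(relaxed)  # [0.0, 0.0]
```

With `<`, even an exact match at zero fails because `0 < 0` is false. Switching to `<=` accepts exact matches, although values that are merely near zero are still rejected since the distance stays 0.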
@shlomihod shlomihod added bug Something isn't working pending review labels Jul 28, 2021
@csala csala changed the title from "Sampling with a float constrain doesn't work for the value zero" to "Sampling with conditions={column: 0.0} for float columns doesn't work" Jul 29, 2021
@csala
Contributor

csala commented Jul 29, 2021

Thanks for reporting this @shlomihod
I think you are right about the problem being related to the tolerance, which makes the distance become 0 when the value is 0.

I think that we will revisit this rtol at some point and rather convert it to an atol (absolute tolerance), so the distance does not depend on the actual value, but right now your proposal should work around the problem perfectly fine.
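
As a rough illustration of the absolute-tolerance idea, the standard library's `math.isclose` accepts both `rel_tol` and `abs_tol`. This is only a sketch of the concept, not what SDV implements; `matches_condition` and the tolerance values are assumptions made for the example.

```python
import math


def matches_condition(sampled_value, target, rel_tol=0.01, abs_tol=0.0):
    # Hypothetical helper: True when sampled_value is close enough to target.
    return math.isclose(sampled_value, target, rel_tol=rel_tol, abs_tol=abs_tol)


# Relative tolerance alone collapses to exact equality around zero.
exact_zero = matches_condition(0.0, 0.0)
near_zero_rtol = matches_condition(1e-9, 0.0)
# An absolute tolerance does not depend on the target value.
near_zero_atol = matches_condition(1e-9, 0.0, abs_tol=1e-6)
print(exact_zero, near_zero_rtol, near_zero_atol)  # True False True
```

An absolute tolerance keeps the acceptance window the same width regardless of the conditioned value, which is why it behaves sensibly at zero.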

@katxiao katxiao added the under discussion Issue is currently being discussed label Sep 24, 2021
@katxiao
Contributor

katxiao commented Sep 29, 2021

Hi @shlomihod, we appreciate your interest in contributing and your change looks good! Would you have a chance to make the changes requested? Please let us know if you have any other questions.

@katxiao katxiao added this to the 0.14.1 milestone May 3, 2022
tssbas added a commit to tssbas/SDV that referenced this issue May 11, 2022
@npatki npatki removed the under discussion Issue is currently being discussed label May 13, 2022