Enhance GreaterThan constraint to allow comparing to scalar values #485

amontanez24 · 2021-06-25T21:54:07Z

Resolve #368

amontanez24 · 2021-06-26T06:05:41Z

sdv/constraints/tabular.py

        separator = '#'
-        while not self._valid_separator(table_data, separator, self.constraint_columns):
-            separator += '#'
+        self._diff_column = separator + separator.join(self.constraint_columns)


@csala This doesn't use the _valid_separator method because the self.constraint_columns tuple might now have only one element. In that case, _valid_separator gets stuck in an infinite loop: there is nothing to join the only column to, so the joined name is the column itself, which is always found in the table. Not sure what you think is a good approach. I didn't use a uuid either, because it is useful to know what the column name ends up being in integration tests. I'm also not sure how to keep the uuid from matching one of the table's column names.

I think it is OK to do it here, but I would still go for the {column}####{column} format instead of the #{column}#{column}####} one.
As for the problem of having one column, you can just validate against set(table_data.columns) - set(self.constraint_columns).

I don't understand this. If we only validate against the columns that are not in self.constraint_columns, it is possible to end up with a column name that is already present. If self.constraint_columns is ('a',), then the diff column should be called 'a#'. But if we just join the columns, the result is 'a'.
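The naming problem discussed in this thread can be illustrated with a minimal sketch. The helper `make_diff_column_name` and its `existing_columns` argument are hypothetical names used only for illustration, not the code merged in this PR. The sketch joins the constraint columns with a separator and appends the separator until the name is free, which also terminates when there is only one column.

```python
def make_diff_column_name(existing_columns, constraint_columns, separator='#'):
    """Hypothetical helper: build a column name that is not in existing_columns."""
    existing = set(existing_columns)

    # With a single column, join() returns the column name itself, so the
    # candidate always collides with the table until something is appended.
    name = separator.join(constraint_columns)
    while name in existing:
        name += separator

    return name


columns = ['a', 'b', 'a#b']
single = make_diff_column_name(columns, ('a',))
pair = make_diff_column_name(columns, ('a', 'b'))
```

Because the candidate name grows on every iteration and the set of existing columns is finite, the loop cannot run forever, unlike a check that keeps comparing the unchanged single column name against the table.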

katxiao · 2021-06-29T08:59:43Z

sdv/constraints/base.py

@@ -161,6 +161,8 @@ def fit(self, table_data):
            table_data (pandas.DataFrame):
                Table data.
        """
+        self._fit(table_data)


Could you explain why this call was moved up?

In _fit, _constraint_columns might be changed. If the part below runs before this, it will try to access the wrong _constraint_columns. Previously, no constraint changed _constraint_columns in _fit, so this wasn't a problem.
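The ordering issue can be sketched as follows. The classes below are simplified stand-ins, not the SDV implementation; `ScalarAwareConstraint` and `fitted_columns` are hypothetical names, and the table is a plain dict so the example has no dependencies.

```python
class Constraint:
    """Simplified stand-in for the base class (not the SDV implementation)."""

    constraint_columns = ()

    def _fit(self, table_data):
        pass

    def fit(self, table_data):
        # Run the subclass hook first so it can rewrite constraint_columns.
        self._fit(table_data)

        # Anything that reads constraint_columns afterwards sees the final value.
        self.fitted_columns = tuple(self.constraint_columns)


class ScalarAwareConstraint(Constraint):
    """Hypothetical subclass whose _fit narrows constraint_columns."""

    def __init__(self, low, high):
        self._low = low
        self._high = high
        self.constraint_columns = (low, high)

    def _fit(self, table_data):
        # Keep only the arguments that are real columns of the table.
        self.constraint_columns = tuple(
            name for name in (self._low, self._high) if name in table_data
        )


constraint = ScalarAwareConstraint(low=0, high='age')
constraint.fit({'age': [31, 45]})
```

If `fit` read `constraint_columns` before calling `_fit`, it would still see `(0, 'age')` and try to treat the scalar `0` as a column.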

katxiao · 2021-06-29T09:10:15Z

sdv/constraints/tabular.py

@@ -292,22 +325,31 @@ def reverse_transform(self, table_data):
        """
        table_data = table_data.copy()
        diff = (np.exp(table_data[self._diff_column]).round() - 1).clip(0)
-        if self._diff_is_datetime(table_data):
+        if self._is_datetime:


Could _is_datetime ever be None here? Since it depends on calling fit first.

It could. Similarly, though, self._diff_column could also be None, so I am assuming we want users to fit the constraint before reverse transforming. Not sure what you think.

I think it is fine. reverse_transform should always be called after fit.

I see. If it becomes a usability problem, we could add an explicit check and throw an informative error. Seems fine for now since most constraints probably assume this.

I think if we did that, we should probably just have one in the base class that complains if you try to transform or reverse transform before having fit anything.
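A base-class check of the kind suggested here might look like the sketch below. This is only an illustration of the idea, not something this PR adds: `NotFittedError`, `_fitted`, and `_check_fitted` are hypothetical names, and the hooks are no-ops.

```python
class NotFittedError(Exception):
    """Hypothetical error type used only in this sketch."""


class Constraint:
    """Simplified stand-in for the base class (not the SDV implementation)."""

    _fitted = False

    def _fit(self, table_data):
        pass

    def _transform(self, table_data):
        return table_data

    def _reverse_transform(self, table_data):
        return table_data

    def _check_fitted(self):
        if not self._fitted:
            raise NotFittedError(
                'This constraint has not been fitted yet. Call `fit` first.'
            )

    def fit(self, table_data):
        self._fit(table_data)
        self._fitted = True

    def transform(self, table_data):
        self._check_fitted()
        return self._transform(table_data)

    def reverse_transform(self, table_data):
        self._check_fitted()
        return self._reverse_transform(table_data)
```

Putting the check in the public methods means subclasses that only override the underscore-prefixed hooks get the informative error without any extra code.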

csala · 2021-06-29T15:49:27Z

sdv/constraints/tabular.py

+        if self._low_is_scalar is None:
+            self._low_is_scalar = self._low not in table_data.columns
+
+        if self._low_is_scalar:


I would make this an elif and, before it, check whether both are scalar at the same time and raise an error:

    if self._low_is_scalar and self._high_is_scalar:
        raise TypeError('`low` and `high` cannot be both scalars at the same time')
    elif self._low_is_scalar:
        ...
    elif self._high_is_scalar:
        ...
    else:
        self._dtype = ...
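As a rough, self-contained illustration of this check, the sketch below infers whether each argument is a scalar from the table's columns (as the diff above does for `low`) and rejects the case where both are scalars. The function `resolve_scalars` and its signature are hypothetical, not part of SDV.

```python
def resolve_scalars(low, high, columns, low_is_scalar=None, high_is_scalar=None):
    """Hypothetical helper: decide which of low/high are scalars."""
    # When the caller did not say, treat anything that is not a column as a scalar.
    if low_is_scalar is None:
        low_is_scalar = low not in columns

    if high_is_scalar is None:
        high_is_scalar = high not in columns

    if low_is_scalar and high_is_scalar:
        raise TypeError('`low` and `high` cannot be both scalars at the same time')

    return low_is_scalar, high_is_scalar


flags = resolve_scalars(0, 'age', columns=['age', 'age_joined'])
```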

csala · 2021-06-29T17:02:31Z

sdv/constraints/tabular.py

            invalid = ~self.is_valid(table_data)
-            new_high_values = low_column.loc[invalid] + diff.loc[invalid]
-            table_data[self._high].loc[invalid] = new_high_values.astype(self._dtype)
+            if self._high_is_scalar and not self._low_is_scalar:


I think this if/else block could be simplified by only assigning new_values and then doing:

    if ...:
        new_values = ...
        column = ...
    elif ...:
        ...
    else:
        ...

    table_data[column].loc[invalid] = new_values.astype(self._dtype)

column should also be computed beforehand, in the __init__, since it will never change.
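The shape of this refactor can be sketched without pandas. Everything below is an assumption made for illustration: the helper `repair_invalid_rows`, the dict-of-lists table, and the rule that a scalar `high` means the `low` column is rebuilt as `high - diff`. Only the "pick `column` and `new_values` in the branches, assign once at the end" structure comes from the comment above.

```python
def repair_invalid_rows(table, low, high, diff, invalid, high_is_scalar=False):
    """Hypothetical helper; table maps column names to lists of values."""
    table = {name: list(values) for name, values in table.items()}

    if high_is_scalar:
        # high is fixed, so the low column is the one to rebuild (assumption).
        column = low
        new_values = [high - delta for delta in diff]
    else:
        # low may be a column or a scalar; high is the column to rebuild.
        column = high
        low_values = table[low] if low in table else [low] * len(diff)
        new_values = [value + delta for value, delta in zip(low_values, diff)]

    # Single assignment shared by every branch.
    for index, is_invalid in enumerate(invalid):
        if is_invalid:
            table[column][index] = new_values[index]

    return table


repaired = repair_invalid_rows(
    {'low': [1, 5], 'high': [3, 2]}, 'low', 'high', diff=[2, 4], invalid=[False, True]
)
```

In pandas, writing the final assignment as `table_data.loc[invalid, column] = ...` avoids chained indexing, which may be worth considering alongside this simplification.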

csala

This looks great @amontanez24 ! I love how thorough the tests are!

I added a couple of minor comments about the code itself, and one about the demo and the start_date field that was added, that may end up being confusing.

csala · 2021-06-30T17:56:59Z

sdv/constraints/tabular.py

+            return self._low
+        elif self._low in table_data.columns:
+            return table_data[self._low]
+        return None


I think the elif and this return can be removed, since when this is called _low_is_scalar has already been set to the right value, which means that if it evaluates to false the column must exist.

So this becomes:

    if self._low_is_scalar:
        return self._low

    return table_data[self._low]

No, because even if self._low_is_scalar is false, the column could still have been dropped: that value can be false while self._drop is set to low.
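The three cases in this exchange can be shown with a small stand-alone function. It mirrors the branch structure of the diff above but is only a sketch: `get_low_value` is a hypothetical free function and the table is a plain dict rather than a DataFrame.

```python
def get_low_value(table_data, low, low_is_scalar):
    """Hypothetical stand-in for the lookup discussed above."""
    if low_is_scalar:
        # low is a fixed value such as 0, not a column name.
        return low
    elif low in table_data:
        return table_data[low]

    # low names a column, but it was dropped from the table (e.g. during transform).
    return None


scalar_case = get_low_value({'b': [1, 2]}, 0, low_is_scalar=True)
column_case = get_low_value({'a': [1, 2], 'b': [3, 4]}, 'a', low_is_scalar=False)
dropped_case = get_low_value({'b': [3, 4]}, 'a', low_is_scalar=False)
```

Without the final fallback, the dropped-column case would raise a KeyError instead of returning None.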

csala · 2021-06-30T17:58:01Z

sdv/constraints/tabular.py

+            name = self.constraint_columns[0] + token
+            while name in table_data.columns:
+                name += '#'
+            return name


I would add blank lines above this return (and also the others that come right after an indentation decrease)

csala · 2021-07-01T14:29:39Z

sdv/demo.py

@@ -325,6 +325,11 @@ def _load_tabular_dummy():
    faker = Faker()
    names = [faker.name() for _ in range(12)]
    adresses = [faker.address() for _ in range(12)]
+    start_date = datetime(1980, 1, 1)
+    start_dates = [
+        start_date + timedelta(days=np.random.randint(0, 14600))


I think this field choice is delicate, because we will then need to make years_in_the_company match start_date. Maybe we could use a date that is not related to how long the employee has been in the company?

csala

This looks good @amontanez24 !

amontanez24 force-pushed the sdv-issue-368-greater-than-enhancements branch from cac2289 to 5959803 Compare June 26, 2021 06:00

amontanez24 changed the title ~~Implement GreaterThan constraint enhancements to allow scalars~~ Enhance GreaterThan constraint to allow comparing to scalar values Jun 26, 2021

amontanez24 commented Jun 26, 2021

amontanez24 force-pushed the sdv-issue-368-greater-than-enhancements branch from 8b09909 to 0cb7e32 Compare June 26, 2021 06:26

csala requested a review from katxiao June 28, 2021 19:34

katxiao reviewed Jun 29, 2021

katxiao approved these changes Jun 29, 2021

csala suggested changes Jun 29, 2021

amontanez24 added 4 commits June 29, 2021 16:03

Implement GreaterThan constraint enhancements to allow scalars

d825cf6

adding unit tests

fae8af1

adding docs

13df0cb

pr comments

2f25cbd

amontanez24 force-pushed the sdv-issue-368-greater-than-enhancements branch from 7ec6981 to 2f25cbd Compare June 29, 2021 21:04

removing unused method

b32f8f6

amontanez24 requested a review from csala June 30, 2021 18:32

csala suggested changes Jul 1, 2021

pr comments and demo improvements

5f34d4b

csala approved these changes Jul 1, 2021

csala merged commit b05d8b5 into master Jul 1, 2021

csala deleted the sdv-issue-368-greater-than-enhancements branch July 1, 2021 18:26

amontanez24 mentioned this pull request Jul 1, 2021

GreaterThan Constraint should apply to scalars #410

Closed
