
Applying Unique Constraint errors when calling model.fit() on a subset of data #610

Closed
xamm opened this issue Oct 20, 2021 · 3 comments
Labels: bug Something isn't working

xamm commented Oct 20, 2021

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.12.1
  • Python version: 3.8.10
  • Operating System: Linux

Error Description

I tried to fit a GaussianCopula model on a subset of data with a Unique constraint applied to two columns.
This resulted in an IndexError: index 182 is out of bounds for axis 0 with size 10.
I expected to get a fitted model I could sample from.

The columns the Unique constraint is applied to had been checked beforehand and contain only unique values.

Steps to reproduce

    import pandas as pd
    from sdv.constraints import Unique
    from sdv.tabular import GaussianCopula

    test_df = pd.DataFrame({
        "key": [1, 2, 3, 4, 5],
        "error_column": ["A", "B", "C", "D", "E"],
    })
    unique = Unique(
        columns=["error_column"]
    )
    err_metadata = {
        "name": "error",
        "constraints": [unique],
        "primary_key": "key",
        "fields": {
            "key": {
                "type": "id", "subtype": "string"
            },
            "error_column": {
                "type": "categorical",
            }
        }
    }

    test_df = test_df.iloc[[3]]
    model = GaussianCopula(table_metadata=err_metadata)
    model.fit(test_df)
    model.sample()

My analysis

The problem is that the row indices of a subset are scrambled / incomplete: for example, test_df.iloc[[3]] keeps the original row label 3 even though the subset has only one row.
The Unique constraint then uses these labels as positional indices (valid.iloc[...] in sdv/constraints/tabular.py), which are out of bounds for the subset.
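The mismatch can be demonstrated with plain pandas, independent of SDV (a minimal sketch; the mask construction mirrors what the traceback shows the constraint doing):

```python
import pandas as pd

# Taking a subset keeps the original row label (3) but leaves only one row,
# just like test_df.iloc[[3]] in the report.
subset = pd.DataFrame({"error_column": ["D"]}, index=[3])

# A boolean mask the size of the subset...
valid = pd.Series(False, index=subset.index)

# ...indexed positionally with the ORIGINAL labels is out of bounds.
labels = subset.reset_index().groupby("error_column").first()["index"].values
try:
    valid.iloc[labels] = True  # positional indexer array([3]) on a size-1 mask
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```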

Fix

Calling .reset_index() on the subset before fitting the model solves the problem and results in the expected behavior.

model.fit(test_df.reset_index())
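As a side note (not in the original report): plain .reset_index() keeps the old labels as a new column named "index", which the table metadata does not declare; .reset_index(drop=True) discards them instead:

```python
import pandas as pd

df = pd.DataFrame({"error_column": ["D"]}, index=[3])

kept = df.reset_index()              # old labels become an "index" column
dropped = df.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))     # ['index', 'error_column']
print(list(dropped.columns))  # ['error_column']
print(list(dropped.index))    # [0]
```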

Traceback

./tests/integration/tabular/test_base.py::test_regression_erroring_unique_columns Failed: [undefined]IndexError: index 3 is out of bounds for axis 0 with size 1
def test_regression_erroring_unique_columns():
        test_df = pd.DataFrame({
            "key": [
                1,
                2,
                3,
                4,
                5,
            ],
            "error_column": [
                "A",
                "B",
                "C",
                "D",
                "E",
            ]
        })
        unique = Unique(
            columns=["error_column"]
        )
        err_metadata = {
            "name": "error",
            "constraints": [unique],
            "primary_key": "key",
            "fields": {
                "key": {
                    "type": "id", "subtype": "string"
                },
                "error_column": {
                    "type": "categorical",
                }
            }
        }
    
        test_df = test_df.iloc[[3]]
        model = GaussianCopula(table_metadata=err_metadata)
>       model.fit(test_df)

tests/integration/tabular/test_base.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sdv/tabular/base.py:138: in fit
    transformed = self._metadata.transform(data)
sdv/metadata/table.py:622: in transform
    self._validate_data_on_constraints(data)
sdv/metadata/table.py:593: in _validate_data_on_constraints
    if not constraint.is_valid(data).all():
sdv/constraints/tabular.py:1126: in is_valid
    valid.iloc[groups.first()['index'].values] = True
env/lib/python3.8/site-packages/pandas/core/indexing.py:670: in __setitem__
    iloc._setitem_with_indexer(indexer, value)
env/lib/python3.8/site-packages/pandas/core/indexing.py:1800: in _setitem_with_indexer
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:534: in setitem
    return self.apply("setitem", indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:406: in apply
    applied = getattr(b, f)(**kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = BoolBlock: 1 dtype: bool, indexer = array([3]), value = True

    def setitem(self, indexer, value):
        """
        Attempt self.values[indexer] = value, possibly creating a new array.
    
        Parameters
        ----------
        indexer : tuple, list-like, array-like, slice
            The subset of self.values to set
        value : object
            The value being set
    
        Returns
        -------
        Block
    
        Notes
        -----
        `indexer` is a direct slice/positional indexer. `value` must
        be a compatible shape.
        """
        transpose = self.ndim == 2
    
        if isinstance(indexer, np.ndarray) and indexer.ndim > self.ndim:
            raise ValueError(f"Cannot set values with ndim > {self.ndim}")
    
        # coerce None values, if appropriate
        if value is None:
            if self.is_numeric:
                value = np.nan
    
        # coerce if block dtype can store value
        values = self.values
        if self._can_hold_element(value):
            # We only get here for non-Extension Blocks, so _try_coerce_args
            #  is only relevant for DatetimeBlock and TimedeltaBlock
            if lib.is_scalar(value):
                value = convert_scalar_for_putitemlike(value, values.dtype)
    
        else:
            # current dtype cannot store value, coerce to common dtype
    
            if hasattr(value, "dtype"):
                dtype = value.dtype
    
            elif lib.is_scalar(value) and not isna(value):
                dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True)
    
            else:
                # e.g. we are bool dtype and value is nan
                # TODO: watch out for case with listlike value and scalar/empty indexer
                dtype, _ = maybe_promote(np.array(value).dtype)
                return self.astype(dtype).setitem(indexer, value)
    
            dtype = find_common_type([values.dtype, dtype])
            assert not is_dtype_equal(self.dtype, dtype)
            # otherwise should have _can_hold_element
    
            return self.astype(dtype).setitem(indexer, value)
    
        # value must be storable at this moment
        if is_extension_array_dtype(getattr(value, "dtype", None)):
            # We need to be careful not to allow through strings that
            #  can be parsed to EADtypes
            is_ea_value = True
            arr_value = value
        else:
            is_ea_value = False
            arr_value = np.array(value)
    
        if transpose:
            values = values.T
    
        # length checking
        check_setitem_lengths(indexer, value, values)
        exact_match = (
            len(arr_value.shape)
            and arr_value.shape[0] == values.shape[0]
            and arr_value.size == values.size
        )
        if is_empty_indexer(indexer, arr_value):
            # GH#8669 empty indexers
            pass
    
        elif is_scalar_indexer(indexer, self.ndim):
            # setting a single element for each dim and with a rhs that could
            #  be e.g. a list; see GH#6043
            values[indexer] = value
    
        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical,
            # we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))
    
        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just
            #  substitute in the new array
            return self.make_block(arr_value)
    
        # if we are an exact match (ex-broadcasting),
        # then use the resultant dtype
        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value
    
            values = values.astype(arr_value.dtype, copy=False)
    
        # set
        else:
>           values[indexer] = value
E           IndexError: index 3 is out of bounds for axis 0 with size 1

env/lib/python3.8/site-packages/pandas/core/internals/blocks.py:891: IndexError

xamm added the bug (Something isn't working) and pending review labels on Oct 20, 2021

xamm commented Oct 20, 2021

I added a test in my fork repository in this branch.

The test passes with .reset_index() called before fitting the model; if that call is removed, the test errors again.

This might make it easier to reproduce the problem.


katxiao commented Oct 20, 2021

Hi @xamm, thanks for reporting this issue! The test is definitely helpful.

Would you be interested in contributing a solution for this issue? We would be happy to have your contribution and review your code. If not, we will triage it and take a look in the near future!


xamm commented Oct 21, 2021

Hi @katxiao, I created a PR for this issue.

It simply resets the index of the data passed to the fit method every time.
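A minimal sketch of that idea (the wrapper name is hypothetical; the actual change lives in SDV's fit method, and whether the old index is kept as a column is an implementation detail — drop=True is used here to avoid adding one):

```python
import pandas as pd

def fit_with_clean_index(fit, data: pd.DataFrame):
    """Hypothetical wrapper: renumber rows 0..n-1 before fitting so that
    constraints indexing positionally stay in bounds."""
    return fit(data.reset_index(drop=True))

# The subset keeps the original label 3; after resetting it is positional again.
subset = pd.DataFrame({"error_column": ["D"]}, index=[3])
seen = {}
fit_with_clean_index(lambda df: seen.update(index=list(df.index)), subset)
print(seen["index"])  # [0]
```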
