
Applying Unique Constraint errors when calling model.fit() on a subset of data #610

Closed
xamm opened this issue Oct 20, 2021 · 3 comments
Labels: bug Something isn't working

xamm commented Oct 20, 2021

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.12.1
  • Python version: 3.8.10
  • Operating System: Linux

Error Description

I tried to fit a GaussianCopula model on a subset of data with a Unique constraint applied to two columns.
This resulted in an IndexError: index 182 is out of bounds for axis 0 with size 10.
I expected to get a fitted model I could sample from.

The columns the Unique constraint is applied to had been checked beforehand and contain only unique values.

Steps to reproduce

    import pandas as pd
    from sdv.constraints import Unique
    from sdv.tabular import GaussianCopula

    test_df = pd.DataFrame({
        "key": [1, 2, 3, 4, 5],
        "error_column": ["A", "B", "C", "D", "E"],
    })
    unique = Unique(
        columns=["error_column"]
    )
    err_metadata = {
        "name": "error",
        "constraints": [unique],
        "primary_key": "key",
        "fields": {
            "key": {
                "type": "id", "subtype": "string"
            },
            "error_column": {
                "type": "categorical",
            }
        }
    }

    test_df = test_df.iloc[[3]]
    model = GaussianCopula(table_metadata=err_metadata)
    model.fit(test_df)
    model.sample()

My analysis

The problem is that the row indices of a subset are scrambled / incomplete: for example, test_df.iloc[[3]] keeps the original row label 3 even though the subset has only one row.
The Unique constraint then uses these labels as positional indices (valid.iloc[...] in sdv/constraints/tabular.py), which are out of bounds for the subset.
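The mismatch can be demonstrated with plain pandas, independent of SDV (a minimal sketch; the mask construction mirrors what the traceback shows the constraint doing):

```python
import pandas as pd

# Taking a subset keeps the original row label (3) but leaves only one row,
# just like test_df.iloc[[3]] in the report.
subset = pd.DataFrame({"error_column": ["D"]}, index=[3])

# A boolean mask the size of the subset...
valid = pd.Series(False, index=subset.index)

# ...indexed positionally with the ORIGINAL labels is out of bounds.
labels = subset.reset_index().groupby("error_column").first()["index"].values
try:
    valid.iloc[labels] = True  # positional indexer array([3]) on a size-1 mask
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```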

Fix

Calling .reset_index() on the subset before fitting the model solves the problem and results in the expected behavior.

model.fit(test_df.reset_index())
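As a side note (not in the original report): plain .reset_index() keeps the old labels as a new column named "index", which the table metadata does not declare; .reset_index(drop=True) discards them instead:

```python
import pandas as pd

df = pd.DataFrame({"error_column": ["D"]}, index=[3])

kept = df.reset_index()              # old labels become an "index" column
dropped = df.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))     # ['index', 'error_column']
print(list(dropped.columns))  # ['error_column']
print(list(dropped.index))    # [0]
```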

Traceback

./tests/integration/tabular/test_base.py::test_regression_erroring_unique_columns Failed: [undefined]IndexError: index 3 is out of bounds for axis 0 with size 1
def test_regression_erroring_unique_columns():
        test_df = pd.DataFrame({
            "key": [
                1,
                2,
                3,
                4,
                5,
            ],
            "error_column": [
                "A",
                "B",
                "C",
                "D",
                "E",
            ]
        })
        unique = Unique(
            columns=["error_column"]
        )
        err_metadata = {
            "name": "error",
            "constraints": [unique],
            "primary_key": "key",
            "fields": {
                "key": {
                    "type": "id", "subtype": "string"
                },
                "error_column": {
                    "type": "categorical",
                }
            }
        }
    
        test_df = test_df.iloc[[3]]
        model = GaussianCopula(table_metadata=err_metadata)
>       model.fit(test_df)

tests/integration/tabular/test_base.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sdv/tabular/base.py:138: in fit
    transformed = self._metadata.transform(data)
sdv/metadata/table.py:622: in transform
    self._validate_data_on_constraints(data)
sdv/metadata/table.py:593: in _validate_data_on_constraints
    if not constraint.is_valid(data).all():
sdv/constraints/tabular.py:1126: in is_valid
    valid.iloc[groups.first()['index'].values] = True
env/lib/python3.8/site-packages/pandas/core/indexing.py:670: in __setitem__
    iloc._setitem_with_indexer(indexer, value)
env/lib/python3.8/site-packages/pandas/core/indexing.py:1800: in _setitem_with_indexer
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:534: in setitem
    return self.apply("setitem", indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:406: in apply
    applied = getattr(b, f)(**kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = BoolBlock: 1 dtype: bool, indexer = array([3]), value = True

    def setitem(self, indexer, value):
        """
        Attempt self.values[indexer] = value, possibly creating a new array.
    
        Parameters
        ----------
        indexer : tuple, list-like, array-like, slice
            The subset of self.values to set
        value : object
            The value being set
    
        Returns
        -------
        Block
    
        Notes
        -----
        `indexer` is a direct slice/positional indexer. `value` must
        be a compatible shape.
        """
        transpose = self.ndim == 2
    
        if isinstance(indexer, np.ndarray) and indexer.ndim > self.ndim:
            raise ValueError(f"Cannot set values with ndim > {self.ndim}")
    
        # coerce None values, if appropriate
        if value is None:
            if self.is_numeric:
                value = np.nan
    
        # coerce if block dtype can store value
        values = self.values
        if self._can_hold_element(value):
            # We only get here for non-Extension Blocks, so _try_coerce_args
            #  is only relevant for DatetimeBlock and TimedeltaBlock
            if lib.is_scalar(value):
                value = convert_scalar_for_putitemlike(value, values.dtype)
    
        else:
            # current dtype cannot store value, coerce to common dtype
    
            if hasattr(value, "dtype"):
                dtype = value.dtype
    
            elif lib.is_scalar(value) and not isna(value):
                dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True)
    
            else:
                # e.g. we are bool dtype and value is nan
                # TODO: watch out for case with listlike value and scalar/empty indexer
                dtype, _ = maybe_promote(np.array(value).dtype)
                return self.astype(dtype).setitem(indexer, value)
    
            dtype = find_common_type([values.dtype, dtype])
            assert not is_dtype_equal(self.dtype, dtype)
            # otherwise should have _can_hold_element
    
            return self.astype(dtype).setitem(indexer, value)
    
        # value must be storable at this moment
        if is_extension_array_dtype(getattr(value, "dtype", None)):
            # We need to be careful not to allow through strings that
            #  can be parsed to EADtypes
            is_ea_value = True
            arr_value = value
        else:
            is_ea_value = False
            arr_value = np.array(value)
    
        if transpose:
            values = values.T
    
        # length checking
        check_setitem_lengths(indexer, value, values)
        exact_match = (
            len(arr_value.shape)
            and arr_value.shape[0] == values.shape[0]
            and arr_value.size == values.size
        )
        if is_empty_indexer(indexer, arr_value):
            # GH#8669 empty indexers
            pass
    
        elif is_scalar_indexer(indexer, self.ndim):
            # setting a single element for each dim and with a rhs that could
            #  be e.g. a list; see GH#6043
            values[indexer] = value
    
        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical,
            # we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))
    
        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just
            #  substitute in the new array
            return self.make_block(arr_value)
    
        # if we are an exact match (ex-broadcasting),
        # then use the resultant dtype
        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value
    
            values = values.astype(arr_value.dtype, copy=False)
    
        # set
        else:
>           values[indexer] = value
E           IndexError: index 3 is out of bounds for axis 0 with size 1

env/lib/python3.8/site-packages/pandas/core/internals/blocks.py:891: IndexError

xamm added the bug (Something isn't working) and pending review labels on Oct 20, 2021

xamm commented Oct 20, 2021

I added a test in my fork repository in this branch.

The test passes with .reset_index() called before fitting the model; if that call is removed, the test errors again.

This might make it easier to reproduce the problem.


katxiao commented Oct 20, 2021

Hi @xamm, thanks for reporting this issue! The test is definitely helpful.

Would you be interested in contributing a solution for this issue? We would be happy to have your contribution and review your code. If not, we will triage it and take a look in the near future!


xamm commented Oct 21, 2021

Hi @katxiao, I created a PR for this issue.

It simply resets the index of the data passed to the fit method every time.
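A minimal sketch of that idea (the wrapper name is hypothetical; the actual change lives in SDV's fit method, and whether the old index is kept as a column is an implementation detail — drop=True is used here to avoid adding one):

```python
import pandas as pd

def fit_with_clean_index(fit, data: pd.DataFrame):
    """Hypothetical wrapper: renumber rows 0..n-1 before fitting so that
    constraints indexing positionally stay in bounds."""
    return fit(data.reset_index(drop=True))

# The subset keeps the original label 3; after resetting it is positional again.
subset = pd.DataFrame({"error_column": ["D"]}, index=[3])
seen = {}
fit_with_clean_index(lambda df: seen.update(index=list(df.index)), subset)
print(seen["index"])  # [0]
```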
