Environment Details
Please indicate the following details about the environment in which you found the bug:
SDV version: 0.12.1
Python version: 3.8.10
Operating System: Linux
Error Description
I tried to fit a GaussianCopula model on a subset of data, with a unique constraint applied to two columns.
This resulted in an IndexError: index 182 is out of bounds for axis 0 with size 10.
I expected to get a fitted model that I could sample from.
The columns that the unique constraint is applied to are unique; this was verified beforehand.
My analysis
The problem is that the indices of a subset are scrambled / incomplete.
This leads to the unique constraint using indices that are not contained in the subset.
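The mechanism can be reproduced with plain pandas, mimicking the failing line from the traceback below (`valid.iloc[groups.first()['index'].values] = True`); the DataFrame and column name here are illustrative, not the actual SDV code:

```python
import pandas as pd

# A one-row subset whose index label (3) no longer matches its position (0).
df = pd.DataFrame({"error_column": ["D"]}, index=[3])

valid = pd.Series(False, index=df.index)

# The constraint collects the original index *labels* per group ...
labels = df.reset_index().groupby("error_column").first()["index"].values

# ... but then uses them as *positions* via .iloc, which is out of bounds
# for a subset whose labels exceed its length.
try:
    valid.iloc[labels] = True
except IndexError as err:
    print(type(err).__name__)  # IndexError
```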
Fix
Calling .reset_index() on the subset before fitting the model solves the problem and results in the expected behavior.
model.fit(test_df.reset_index())
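In pure pandas terms (a sketch mirroring the constraint's indexing, not the actual SDV code), resetting the index realigns labels with positions, so the positional write succeeds; `drop=True` is used here only to keep the old labels out of the data:

```python
import pandas as pd

full_df = pd.DataFrame({"error_column": ["A", "B", "C", "D", "E"]})
subset = full_df.iloc[[3]].reset_index(drop=True)  # label 0 now matches position 0

valid = pd.Series(False, index=subset.index)
labels = subset.reset_index().groupby("error_column").first()["index"].values
valid.iloc[labels] = True  # no IndexError after the reset
print(valid.tolist())  # [True]
```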
Traceback
./tests/integration/tabular/test_base.py::test_regression_erroring_unique_columns Failed: [undefined]IndexError: index 3 is out of bounds for axis 0 with size 1
    def test_regression_erroring_unique_columns():
        test_df = pd.DataFrame({
            "key": [
                1,
                2,
                3,
                4,
                5,
            ],
            "error_column": [
                "A",
                "B",
                "C",
                "D",
                "E",
            ]
        })
        unique = Unique(
            columns=["error_column"]
        )
        err_metadata = {
            "name": "error",
            "constraints": [unique],
            "primary_key": "key",
            "fields": {
                "key": {
                    "type": "id", "subtype": "string"
                },
                "error_column": {
                    "type": "categorical",
                }
            }
        }
        test_df = test_df.iloc[[3]]
        model = GaussianCopula(table_metadata=err_metadata)
>       model.fit(test_df)
tests/integration/tabular/test_base.py:133:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sdv/tabular/base.py:138: in fit
transformed = self._metadata.transform(data)
sdv/metadata/table.py:622: in transform
self._validate_data_on_constraints(data)
sdv/metadata/table.py:593: in _validate_data_on_constraints
if not constraint.is_valid(data).all():
sdv/constraints/tabular.py:1126: in is_valid
valid.iloc[groups.first()['index'].values] = True
env/lib/python3.8/site-packages/pandas/core/indexing.py:670: in __setitem__
iloc._setitem_with_indexer(indexer, value)
env/lib/python3.8/site-packages/pandas/core/indexing.py:1800: in _setitem_with_indexer
self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:534: in setitem
return self.apply("setitem", indexer=indexer, value=value)
env/lib/python3.8/site-packages/pandas/core/internals/managers.py:406: in apply
applied = getattr(b, f)(**kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = BoolBlock: 1 dtype: bool, indexer = array([3]), value = True
    def setitem(self, indexer, value):
        """
        Attempt self.values[indexer] = value, possibly creating a new array.

        Parameters
        ----------
        indexer : tuple, list-like, array-like, slice
            The subset of self.values to set
        value : object
            The value being set

        Returns
        -------
        Block

        Notes
        -----
        `indexer` is a direct slice/positional indexer. `value` must
        be a compatible shape.
        """
        transpose = self.ndim == 2

        if isinstance(indexer, np.ndarray) and indexer.ndim > self.ndim:
            raise ValueError(f"Cannot set values with ndim > {self.ndim}")

        # coerce None values, if appropriate
        if value is None:
            if self.is_numeric:
                value = np.nan

        # coerce if block dtype can store value
        values = self.values
        if self._can_hold_element(value):
            # We only get here for non-Extension Blocks, so _try_coerce_args
            # is only relevant for DatetimeBlock and TimedeltaBlock
            if lib.is_scalar(value):
                value = convert_scalar_for_putitemlike(value, values.dtype)
        else:
            # current dtype cannot store value, coerce to common dtype
            if hasattr(value, "dtype"):
                dtype = value.dtype
            elif lib.is_scalar(value) and not isna(value):
                dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True)
            else:
                # e.g. we are bool dtype and value is nan
                # TODO: watch out for case with listlike value and scalar/empty indexer
                dtype, _ = maybe_promote(np.array(value).dtype)
                return self.astype(dtype).setitem(indexer, value)

            dtype = find_common_type([values.dtype, dtype])
            assert not is_dtype_equal(self.dtype, dtype)
            # otherwise should have _can_hold_element
            return self.astype(dtype).setitem(indexer, value)

        # value must be storable at this moment
        if is_extension_array_dtype(getattr(value, "dtype", None)):
            # We need to be careful not to allow through strings that
            # can be parsed to EADtypes
            is_ea_value = True
            arr_value = value
        else:
            is_ea_value = False
            arr_value = np.array(value)

        if transpose:
            values = values.T

        # length checking
        check_setitem_lengths(indexer, value, values)
        exact_match = (
            len(arr_value.shape)
            and arr_value.shape[0] == values.shape[0]
            and arr_value.size == values.size
        )
        if is_empty_indexer(indexer, arr_value):
            # GH#8669 empty indexers
            pass
        elif is_scalar_indexer(indexer, self.ndim):
            # setting a single element for each dim and with a rhs that could
            # be e.g. a list; see GH#6043
            values[indexer] = value
        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical,
            # we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))
        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just
            # substitute in the new array
            return self.make_block(arr_value)
        # if we are an exact match (ex-broadcasting),
        # then use the resultant dtype
        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value
            values = values.astype(arr_value.dtype, copy=False)
        # set
        else:
>           values[indexer] = value
E IndexError: index 3 is out of bounds for axis 0 with size 1
env/lib/python3.8/site-packages/pandas/core/internals/blocks.py:891: IndexError
Hi @xamm, thanks for reporting this issue! The test is definitely helpful.
Would you be interested in contributing a solution for this issue? We would be happy to have your contribution and review your code. If not, we will triage it and take a look in the near future!
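For anyone picking this up: one possible direction (a sketch only, not the actual SDV patch; the function name is hypothetical) is to write through `.loc` with the collected index labels instead of `.iloc`, so subsets with non-default indexes survive:

```python
import pandas as pd

def unique_is_valid(data: pd.DataFrame, columns) -> pd.Series:
    """Label-safe sketch of a Unique-style is_valid: flag each group's first row."""
    valid = pd.Series(False, index=data.index)
    # Collect the original index label of the first row in each group ...
    first_labels = data.reset_index().groupby(list(columns))["index"].first().values
    # ... and assign by label, so a scrambled/incomplete index is fine.
    valid.loc[first_labels] = True
    return valid

subset = pd.DataFrame({"error_column": ["D"]}, index=[3])
print(unique_is_valid(subset, ["error_column"]).tolist())  # [True]
```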