Tabular: ensure that values fall within range #200
Comments
At the moment, this can be achieved with a workaround based on CustomConstraint, which can be used in two ways:
For a later release, we are working on a better method that handles this properly within the modeling process, without the need to add external constraints.
The recent introduction of CopulaGAN solves this problem by transforming each column using its marginal distribution. Here is an example using CopulaGAN on the Census dataset, forcing Gamma as the distribution for the capital-gain and capital-loss columns:

    from sdv.demo import load_tabular_demo
    from sdv.tabular import CopulaGAN

    # Load the Census demo table as a pandas DataFrame.
    census = load_tabular_demo('census')

    # Model these two non-negative columns with a gamma marginal
    # instead of the default Gaussian.
    field_distributions = {
        'capital-gain': 'gamma',
        'capital-loss': 'gamma'
    }

    model = CopulaGAN(field_distributions=field_distributions)
    model.fit(census)
    model.sample()
Hi, is it necessary to specify a gamma distribution for the columns where only positive values are allowed? I trained a CopulaGAN on the UNSW-NB15 dataset without specifying any distribution and found that negative values are still generated in columns where only positive values exist in the real dataset. Thank you.
@tokchinkuan Yes, the gamma distribution specifies that no values outside the original range can exist. If you don't specify gamma, CopulaGAN will default to using Gaussians, which brings back the problem outlined in the first message of this issue thread.
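To illustrate the difference described in this reply, here is a minimal sketch using only the Python standard library. It does not use SDV or reproduce CopulaGAN's internals: the toy data, the method-of-moments gamma fit, and all variable names are assumptions made for the illustration. It fits a Gaussian and a gamma distribution to the same non-negative column and counts how many sampled values fall below zero. Note that a gamma distribution only guarantees the lower bound; it does not cap the upper end of the range.

```python
import random
import statistics

random.seed(0)

# Toy stand-in for a column that is always non-negative with a mean close to 0.
real = [random.expovariate(1.0) for _ in range(1000)]

mean = statistics.fmean(real)
var = statistics.pvariance(real)
std = var ** 0.5

# Method-of-moments gamma fit (an assumption for this sketch):
# shape = mean^2 / variance, scale = variance / mean.
shape = mean ** 2 / var
scale = var / mean

# Sample the same number of values from each fitted marginal.
gaussian_samples = [random.gauss(mean, std) for _ in range(1000)]
gamma_samples = [random.gammavariate(shape, scale) for _ in range(1000)]

negative_gaussian = sum(v < 0 for v in gaussian_samples)
negative_gamma = sum(v < 0 for v in gamma_samples)

print("negative values from the Gaussian fit:", negative_gaussian)
print("negative values from the gamma fit:", negative_gamma)
```

The Gaussian has support on the whole real line, so a fit with a mean near zero places noticeable probability mass on negative values, while the gamma's support is restricted to positive values.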
This topic has been completely covered in the latest releases, so it can be closed.
Can you please explain the detection metrics in detail? They are very confusing. Thank you!
Following up from an issue open in CTGAN: sdv-dev/CTGAN#24 (comment)
Current Tabular model implementations do not properly identify the range in which values should be generated, oftentimes producing values outside of the desired range. This is especially obvious in situations where a value is expected to be always positive but has an average close to 0.
@Baukebrenninkmeijer explained it very well here:
We should find a way to allow the users to indicate that the range in which the values are generated needs to be learned from the training data and then ensure that this value range is respected.
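One way to picture this request is the sketch below, written in plain Python and independent of SDV. It is not the SDV implementation or the CustomConstraint API: `learn_ranges`, `is_valid`, `clip_row`, `reject_sample`, and the toy sampler are hypothetical names introduced for illustration. The observed minimum and maximum of each column are learned from the training rows, and sampled rows are then either clipped into that range or rejected when they fall outside it.

```python
import random


def learn_ranges(rows, columns):
    """Record the observed (min, max) of each listed column in the training rows."""
    return {
        col: (min(r[col] for r in rows), max(r[col] for r in rows))
        for col in columns
    }


def is_valid(row, ranges):
    """Return True if every constrained column lies inside its learned range."""
    return all(lo <= row[col] <= hi for col, (lo, hi) in ranges.items())


def clip_row(row, ranges):
    """Return a copy of the row with out-of-range values moved to the nearest bound."""
    clipped = dict(row)
    for col, (lo, hi) in ranges.items():
        clipped[col] = min(max(row[col], lo), hi)
    return clipped


def reject_sample(sample_row, ranges, n, max_tries=10_000):
    """Draw rows from sample_row() and keep only valid ones, up to n rows."""
    accepted = []
    for _ in range(max_tries):
        row = sample_row()
        if is_valid(row, ranges):
            accepted.append(row)
            if len(accepted) == n:
                break
    return accepted


random.seed(0)

# Toy training data: the column is never negative in the real table.
training_rows = [{"capital-gain": v} for v in (0.0, 10.0, 250.0)]
ranges = learn_ranges(training_rows, ["capital-gain"])


def unconstrained_sampler():
    # Hypothetical stand-in for a model that can produce out-of-range values.
    return {"capital-gain": random.gauss(50.0, 100.0)}


clipped = clip_row({"capital-gain": -5.0}, ranges)
accepted = reject_sample(unconstrained_sampler, ranges, n=20)
print(ranges, clipped, len(accepted))
```

Clipping keeps every sampled row but piles probability mass onto the bounds, whereas rejection preserves the shape of the distribution inside the range at the cost of extra sampling.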