How do you manage an inter-column dependency? #2318

Closed
npatki opened this issue Dec 10, 2024 · 2 comments
Labels: question, resolution:resolved

Comments

npatki commented Dec 10, 2024

I'm filing this issue on behalf of @Pavan-Kalyan1432, who first asked the question in this comment.

Problem description

How do I manage an inter-column dependency?
For example, we have 3 columns: date of birth, date of death, and age. In the synthetic data, these columns do not come out consistent with each other. Please give me the answer for both single-table and multi-table data.

npatki added the question and new labels on Dec 10, 2024

npatki commented Dec 10, 2024

Hi @Pavan-Kalyan1432,

I am assuming that date of birth and date of death are both datetime columns, whereas age is a numerical column. It seems your data has the following logical rules:

  1. The date of death must occur after the date of birth
  2. The age must be exactly equal to the number of years between the date of birth and the date of death

Note that SDV synthesizers use AI to learn from your data, which is inherently probabilistic. So if you have hard-and-fast rules like these (rules that every row must follow), a synthesizer using only the default options will not satisfy them 100% of the time. This is to be expected.
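
To see how often synthetic rows break the two rules above, a small check like the following can help. This is only a sketch using the Python standard library: the column names (date_of_birth, date_of_death, age), the helper names, and the interpretation of "years between" as completed years are all assumptions for illustration, not part of SDV.

```python
from datetime import date


def whole_years_between(start, end):
    """Completed years from start to end, counted the way birthdays are."""
    before_anniversary = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(before_anniversary)


def rule_violations(rows):
    """Return (row_index, rule_name) pairs for rows that break a rule."""
    problems = []
    for i, row in enumerate(rows):
        birth = row["date_of_birth"]
        death = row["date_of_death"]
        if death <= birth:
            # Rule 1: death must occur after birth.
            problems.append((i, "death_after_birth"))
        elif row["age"] != whole_years_between(birth, death):
            # Rule 2: age must match the two dates.
            problems.append((i, "age_matches_dates"))
    return problems


# Hypothetical rows standing in for sampled synthetic data.
sample_rows = [
    {"date_of_birth": date(1950, 6, 15), "date_of_death": date(2020, 6, 14), "age": 69},
    {"date_of_birth": date(1980, 1, 1), "date_of_death": date(1975, 1, 1), "age": 40},
    {"date_of_birth": date(1960, 3, 2), "date_of_death": date(2000, 3, 2), "age": 41},
]
print(rule_violations(sample_rows))
```

Running the same check on the real data first confirms that the rules actually hold there before you try to enforce them in the synthetic data.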

Using Constraints

To enforce hard-and-fast rules like these, I recommend using constraints. Constraints can be applied to both single-table and multi-table datasets. Some resources are below:

  • Demo about using constraints
  • Inequality constraint -- this could be useful to enforce that date of death must occur after birth
  • Custom constraint -- you would probably need to add custom logic to compute the age column.
    • Alternatively, since age can be computed from the other two columns, there is really no need to input it into SDV in the first place. You can just leave it out (drop the column) and recreate it in the synthetic data afterwards.
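
As a rough sketch of the two simpler options above: the dictionary below follows the dict-style constraint format from the SDV 1.x documentation (newer SDV versions may use a different interface), and the drop-and-recreate helpers use only the standard library. The column names and helper names are hypothetical, the add_constraints call is shown as a comment because it needs an existing synthesizer, and age is assumed to mean completed years.

```python
from datetime import date

# Inequality constraint in SDV 1.x dict form (column names are assumptions).
death_after_birth = {
    "constraint_class": "Inequality",
    "constraint_parameters": {
        "low_column_name": "date_of_birth",
        "high_column_name": "date_of_death",
        "strict_boundaries": True,
    },
}
# Assumed usage on an existing synthesizer (not executed here):
#   synthesizer.add_constraints(constraints=[death_after_birth])
# For a multi-table synthesizer, the dict also carries a "table_name" entry.


def drop_age(rows):
    """Remove the derived age column before the data goes into SDV."""
    return [{k: v for k, v in row.items() if k != "age"} for row in rows]


def add_age(rows):
    """Recreate age (completed years) from the two date columns."""
    result = []
    for row in rows:
        birth, death = row["date_of_birth"], row["date_of_death"]
        before_anniversary = (death.month, death.day) < (birth.month, birth.day)
        age = death.year - birth.year - int(before_anniversary)
        result.append({**row, "age": age})
    return result


# Hypothetical row: age is dropped for training, then recomputed afterwards.
real_rows = [
    {"date_of_birth": date(1950, 6, 15), "date_of_death": date(2020, 6, 14), "age": 69},
]
training_rows = drop_age(real_rows)
restored_rows = add_age(training_rows)
print(restored_rows[0]["age"])
```

With this approach only the Inequality rule has to be enforced by the synthesizer; the age column is always consistent because it is derived after sampling.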

npatki added the under discussion label and removed the new label on Dec 10, 2024

npatki commented Jan 6, 2025

Hi @Pavan-Kalyan1432, are you still working on this? I'm closing this issue because it has been inactive for a while and we've provided an answer. Please feel free to reply if there is more to discuss around this topic (inter-column dependency), as I can always re-open it. If you have questions about other topics, please file a new issue. Thanks.

npatki closed this as completed on Jan 6, 2025
npatki added the resolution:resolved label and removed the under discussion label on Jan 6, 2025