generating different synthetic data while training the model multiple times. #299

Closed
Amanhelloworld opened this issue Jan 21, 2021 · 2 comments
Labels
question General question about the software
Milestone

Comments

@Amanhelloworld

Amanhelloworld commented Jan 21, 2021

Hi team,
Thank you very much for this great package. I am using it in one of my projects, where I report results on synthetic data.

I have some issues while training the model.
1. I train a model (CTGAN, for example) to generate synthetic data samples. The quality of the data is different every time I retrain the model.
2. When I retrain the model, there is a huge difference in the generated synthetic data, and when I use this synthetic data for other tasks, there are large performance gaps in the results.

So, is there any way to make the model consistent across multiple runs? I could use a seed or save the model so that my model won't change much, but the problem is that if someone else wants to run the same experiment on their machine, they will get different results that won't match mine.

If you could tell me how this can be solved, it would be very helpful for my project.

@Amanhelloworld Amanhelloworld added feature request Request for a new feature pending review labels Jan 21, 2021
@csala csala added question General question about the software and removed feature request Request for a new feature pending review labels Jan 21, 2021
@csala
Contributor

csala commented Jan 21, 2021

Interesting question @Amanhelloworld

In most cases, you can ensure reproducibility by fixing the numpy and torch seeds, as follows:

np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)

Here's an example:

In [1]: import numpy as np
   ...: import torch
   ...: from sdv.demo import load_tabular_demo
   ...: from sdv.tabular import CTGAN
   ...: 
   ...: data = load_tabular_demo('student_placements')

In [2]: torch.manual_seed(0)
   ...: np.random.seed(0)
   ...: model = CTGAN(epochs=10)
   ...: model.fit(data)
   ...: model.sample(5)
Out[2]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17433      M    69.588847  55.257127  Commerce    68.440825   Comm&Mgmt            False                 0           56.604584   Mkt&HR  51.130539  37699.395301   False 2020-09-10 2020-12-09      NaN
1       17395      M    46.089210  61.477286  Commerce    75.891097      Others            False                 1           61.801707   Mkt&HR  58.038007  32413.742727    True        NaT 2020-08-22      3.0
2       17301      F    72.407853  58.146130      Arts    85.528594   Comm&Mgmt             True                 0           48.795626  Mkt&Fin  67.373889           NaN    True 2020-02-19 2020-06-11      3.0
3       17323      M    70.313107  45.468931  Commerce    57.623638   Comm&Mgmt             True                 0           41.398895   Mkt&HR  55.129773  28015.573652    True 2020-03-07 2019-12-12      NaN
4       17483      M    56.702416  91.571410   Science    76.770451   Comm&Mgmt            False                 1           73.093578   Mkt&HR  59.265596  42264.083767    True 2020-02-12 2020-08-02      6.0

In [3]: torch.manual_seed(0)
   ...: np.random.seed(0)
   ...: model = CTGAN(epochs=10)
   ...: model.fit(data)
   ...: model.sample(5)
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17433      M    69.588847  55.257127  Commerce    68.440825   Comm&Mgmt            False                 0           56.604584   Mkt&HR  51.130539  37699.395301   False 2020-09-10 2020-12-09      NaN
1       17395      M    46.089210  61.477286  Commerce    75.891097      Others            False                 1           61.801707   Mkt&HR  58.038007  32413.742727    True        NaT 2020-08-22      3.0
2       17301      F    72.407853  58.146130      Arts    85.528594   Comm&Mgmt             True                 0           48.795626  Mkt&Fin  67.373889           NaN    True 2020-02-19 2020-06-11      3.0
3       17323      M    70.313107  45.468931  Commerce    57.623638   Comm&Mgmt             True                 0           41.398895   Mkt&HR  55.129773  28015.573652    True 2020-03-07 2019-12-12      NaN
4       17483      M    56.702416  91.571410   Science    76.770451   Comm&Mgmt            False                 1           73.093578   Mkt&HR  59.265596  42264.083767    True 2020-02-12 2020-08-02      6.0
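
For anyone who wants to reuse the seeding step above, here is a minimal sketch of a helper that fixes the seeds in one place. The name `seed_everything` is hypothetical (it is not part of SDV), and the helper assumes that `numpy` and `torch` may or may not be installed, so it seeds them only when they can be imported. The short demo at the end uses only the standard library `random` module to show that re-seeding reproduces the same draws.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators that are available.

    Hypothetical helper, not part of SDV: it always seeds Python's
    `random` module, and seeds numpy and torch only if they are installed.
    """
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not installed; nothing to seed


# Demo: the same seed yields the same sequence of draws.
seed_everything(0)
first_run = [random.random() for _ in range(3)]

seed_everything(0)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```

Calling `seed_everything(0)` right before `CTGAN(epochs=10)` and `model.fit(data)` would play the same role as the two seeding lines in the session above. Note that fixed seeds are not guaranteed to give identical results across different hardware or different library versions, so sharing the seed together with the exact package versions is likely the safest way to let someone else reproduce a run.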

@csala
Contributor

csala commented Sep 9, 2021

Closing this, as the question was answered long ago.
