generating different synthetic data while training the model multiple times. #299

Amanhelloworld · 2021-01-21T16:15:09Z

Hi team,
Thank you very much for a great package like this. I am using it in one of my projects, where I report results on synthetic data.

I have some issues when training the model.
1. I want to train a model to generate synthetic data samples, for example with CTGAN. The quality of the data is different every time I retrain the model.
2. If I retrain the model, there is a huge difference in the generated synthetic data, and when I use this synthetic data for other tasks, there are large performance gaps in the results.

So, is there any way to make the model consistent across multiple runs? I could use a seed or save the model so that my model won't change much, but the problem is that if someone else wants to run the same experiment on their machine, they will get different results that won't match mine.

If you could tell me how this can be solved, it would be very helpful for my project.

csala · 2021-01-21T16:38:33Z

Interesting question @Amanhelloworld

In most cases, you can ensure reproducibility by fixing the numpy and torch seeds, as follows:

np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)

Here's an example:

In [1]: import numpy as np
   ...: import torch
   ...: from sdv.demo import load_tabular_demo
   ...: from sdv.tabular import CTGAN
   ...: 
   ...: data = load_tabular_demo('student_placements')

In [2]: torch.manual_seed(0)
   ...: np.random.seed(0)
   ...: model = CTGAN(epochs=10)
   ...: model.fit(data)
   ...: model.sample(5)
Out[2]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17433      M    69.588847  55.257127  Commerce    68.440825   Comm&Mgmt            False                 0           56.604584   Mkt&HR  51.130539  37699.395301   False 2020-09-10 2020-12-09      NaN
1       17395      M    46.089210  61.477286  Commerce    75.891097      Others            False                 1           61.801707   Mkt&HR  58.038007  32413.742727    True        NaT 2020-08-22      3.0
2       17301      F    72.407853  58.146130      Arts    85.528594   Comm&Mgmt             True                 0           48.795626  Mkt&Fin  67.373889           NaN    True 2020-02-19 2020-06-11      3.0
3       17323      M    70.313107  45.468931  Commerce    57.623638   Comm&Mgmt             True                 0           41.398895   Mkt&HR  55.129773  28015.573652    True 2020-03-07 2019-12-12      NaN
4       17483      M    56.702416  91.571410   Science    76.770451   Comm&Mgmt            False                 1           73.093578   Mkt&HR  59.265596  42264.083767    True 2020-02-12 2020-08-02      6.0

In [3]: torch.manual_seed(0)
   ...: np.random.seed(0)
   ...: model = CTGAN(epochs=10)
   ...: model.fit(data)
   ...: model.sample(5)
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17433      M    69.588847  55.257127  Commerce    68.440825   Comm&Mgmt            False                 0           56.604584   Mkt&HR  51.130539  37699.395301   False 2020-09-10 2020-12-09      NaN
1       17395      M    46.089210  61.477286  Commerce    75.891097      Others            False                 1           61.801707   Mkt&HR  58.038007  32413.742727    True        NaT 2020-08-22      3.0
2       17301      F    72.407853  58.146130      Arts    85.528594   Comm&Mgmt             True                 0           48.795626  Mkt&Fin  67.373889           NaN    True 2020-02-19 2020-06-11      3.0
3       17323      M    70.313107  45.468931  Commerce    57.623638   Comm&Mgmt             True                 0           41.398895   Mkt&HR  55.129773  28015.573652    True 2020-03-07 2019-12-12      NaN
4       17483      M    56.702416  91.571410   Science    76.770451   Comm&Mgmt            False                 1           73.093578   Mkt&HR  59.265596  42264.083767    True 2020-02-12 2020-08-02      6.0
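
If you want to share the seeding step with others, one option is to wrap it in a small helper. This is only a sketch: `seed_everything` is a hypothetical name, not part of SDV, and it assumes that the global `random`, NumPy and PyTorch generators are the only sources of randomness involved. NumPy and PyTorch are imported lazily so the helper still runs where one of them is missing.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the global Python, NumPy and PyTorch random number generators.

    Hypothetical helper (not an SDV API). Libraries that are not
    installed are skipped.
    """
    # Python's built-in generator
    random.seed(seed)

    # NumPy's legacy global generator, as used in the example above
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed

    # PyTorch's global generator
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed, nothing to seed
```

Calling `seed_everything(0)` right before creating and fitting the model would replace the two seeding lines in the session above. Even with fixed seeds, results may still differ across library versions or hardware, so it is worth recording those alongside the seed.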

csala · 2021-09-09T11:32:38Z

Closing this, as the question was answered long ago.

Amanhelloworld added feature request Request for a new feature pending review labels Jan 21, 2021

csala added question General question about the software and removed feature request Request for a new feature pending review labels Jan 21, 2021

csala mentioned this issue Jan 21, 2021

No way to fix the random seed? #157

Closed

tjhallum mentioned this issue Jul 21, 2021

Can you set a random state for the sdv.tabular.ctgan.CTGAN.sample method? #515

Closed

csala closed this as completed Sep 9, 2021

This was referenced Sep 9, 2021

Add random_state arguments wherever relevant #586

Open

Metrics for evaluation in evaluation package #295

Closed

katxiao added this to the 0.14.0 milestone Mar 21, 2022
