Generating samples is taking a lot of time. Is there any way to speed up sample generation? #103
Comments
Hi @imsitu, and thanks for your question. I'm not sure what may be causing this, but there are some details that you could share with us that will help us figure it out. Can you please share:
Beyond that, there is a little detail that you mention:
I'm not really sure what you mean by "sampler is a dict", but if you are interested in storing a fitted version of SDV, you can do it easily using the `save` method.
specs: attached the metadata as an attachment. Sorry, I couldn't share the actual data. Forget about the dictionary part, my bad, which has resulted in confusion. I am saving the SDV object in pickle files and loading it back (which is what SDV `load` and `save` are doing internally).
I think SDV is taking more time when there are more columns and categorical columns.
Hi @imsitu Just out of curiosity: why are you using pickle yourself instead of calling `save` and `load`?
@csala It's basically the same code underneath, and besides that I want to use multiprocessing to speed things up.
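The pickle round trip being discussed can be sketched like this. `FittedModel` is a hypothetical stand-in for a fitted SDV instance (the real object would hold the fitted modeler and sampler state); the point is that the serialized blob can be stored in a database or on a remote server and restored wherever samples are needed.

```python
import pickle

# Hypothetical stand-in for a fitted SDV instance, used here so the
# sketch is self-contained. The real object would carry fitted state.
class FittedModel:
    def __init__(self, tables):
        self.tables = tables

model = FittedModel(tables=["customers"])

# Serialize the fitted model so it can be stored remotely...
blob = pickle.dumps(model)

# ...and later restore it wherever test data must be generated.
restored = pickle.loads(blob)
print(restored.tables)  # ['customers']
```

This is essentially what SDV's own `save`/`load` helpers do internally, per the discussion above, which is why calling them directly is usually the simpler choice.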
@csala, @ManuelAlvarezC,
BUT generating the default 5 samples is taking 173 secs with sample_all().
Why would sample_all take more time even when there are no child tables or foreign-key relations?
@csala, @ManuelAlvarezC, and @imsitu I am also having issues with the sample_all call. I am attaching the meta file with my CSV as an xlsx. Is there any way to speed this up?
@imsitu I have a question for you. Is it possible to reach you via email? Can you email me at kalyanv@mit.edu?
@kveerama you can reach me at situ.wantsyou@gmail.com
This has been resolved in v0.2.0.
@csala May I know the commit ID or PR number, just to see the fix?
It was done in PR #121, but unfortunately I cannot tell you the exact commit, as the change is buried among a lot of other big refactoring changes. But I can explain and point you at the cause of the problem in the old code base: https://github.com/HDI-Project/SDV/blob/v0.1.2/sdv/sampler.py#L470

The problem was that the previous categorical encoding implementation required the internally sampled values to be exactly between 0 and 1, and the way to get there was a loop in which out-of-range values were dropped and re-sampled until all the values were valid.

The CategoricalTransformer from RDT does not have this [0, 1] requirement, so that validate-and-discard loop was removed altogether from the Sampler implementation.
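The validate-and-discard loop described above can be sketched roughly as follows. The Gaussian parameters here are made up for illustration (the real code sampled from the fitted model); the point is that every pass re-draws only the invalid values, so wide distributions need many extra passes before all values land in [0, 1].

```python
import random

random.seed(0)

def sample_with_rejection(n, mu=0.5, sigma=0.8):
    """Old-style loop: draw values, then keep re-sampling any that
    fall outside [0, 1] until every value is valid."""
    values = [random.gauss(mu, sigma) for _ in range(n)]
    rounds = 1
    while any(not (0.0 <= v <= 1.0) for v in values):
        # Re-draw only the out-of-range values; each pass adds cost.
        values = [v if 0.0 <= v <= 1.0 else random.gauss(mu, sigma)
                  for v in values]
        rounds += 1
    return values, rounds

values, rounds = sample_with_rejection(1000)
print(rounds)  # many extra passes just to keep values in range
```

Dropping the [0, 1] requirement, as the RDT CategoricalTransformer did, makes the first draw always valid and removes this loop entirely, which is where the speedup came from.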
Description
I am trying to set up automated test data generation for my testing.
I generated the metadata JSON for the table and fit the model with it.
As the sampler is a dict, I am storing the sampler from data_vault as a pickle.
The goal is to store this pickled sampler in a DB or on a remote server and generate test data wherever and whenever necessary.
The samples are taking too much time to generate for a table of 29 columns and 1800 rows.
Generating 10 samples takes 5 minutes. I tried to generate the whole 1800 rows, but it never completed; I had to kill it.
Please let me know if I am handling things the wrong way, or if there is anything I need to tweak to get a faster response.
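When sampling is unexpectedly slow like this, one way to see where the time goes is to profile the sampling call. Below is a minimal, SDV-agnostic sketch using the standard library's `cProfile`; `sample_rows` is a hypothetical stand-in that you would replace with the real call being investigated (e.g. your `sample_all()` invocation).

```python
import cProfile
import io
import pstats

def sample_rows(num_rows):
    # Hypothetical stand-in for the slow sampling call under
    # investigation; swap in the real sampler invocation here.
    return [{"row": i} for i in range(num_rows)]

profiler = cProfile.Profile()
profiler.enable()
rows = sample_rows(10)
profiler.disable()

# Print the five most expensive call sites by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

A profile like this would show whether the time is dominated by a single hot loop (such as the resample-until-valid loop discussed later in this thread) rather than being spread evenly across the sampler.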
What I Did