Generating samples is taking a lot of time. Is there any way to speed up sample generation? #103
Comments
Hi @imsitu, and thanks for your question. I'm not sure what may be causing this, but there are some details that you could share with us that will help us figure it out. Can you please share:
Beyond that, there is a little detail that you mention:
I'm not really sure what you mean by "sampler is a dict", but if you are interested in storing a fitted version of SDV, you can do it easily using the `save` method.
specs: attached the metadata as an attachment. Sorry, I couldn't share the actual data. Forget about the dictionary part, my bad, which has resulted in confusion. I am saving the SDV object in pickle files and loading it back (which is what SDV `load` and `save` are doing internally).
I think SDV is taking more time when there are more columns and categorical columns.
Hi @imsitu Just out of curiosity: why are you using pickle yourself instead of calling `save` and `load`?
@csala It's basically the same code underneath, and besides that I want to use multiprocessing to speed things up.
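The pickle round trip being discussed can be sketched like this. `FittedModel` is a hypothetical stand-in for a fitted SDV instance (the real object would hold the fitted modeler and sampler state); the point is that the serialized blob can be stored in a database or on a remote server and restored wherever samples are needed.

```python
import pickle

# Hypothetical stand-in for a fitted SDV instance, used here so the
# sketch is self-contained. The real object would carry fitted state.
class FittedModel:
    def __init__(self, tables):
        self.tables = tables

model = FittedModel(tables=["customers"])

# Serialize the fitted model so it can be stored remotely...
blob = pickle.dumps(model)

# ...and later restore it wherever test data must be generated.
restored = pickle.loads(blob)
print(restored.tables)  # ['customers']
```

This is essentially what SDV's own `save`/`load` helpers do internally, per the discussion above, which is why calling them directly is usually the simpler choice.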
@csala, @ManuelAlvarezC,
BUT generating the default 5 samples is taking 173 secs with sample_all().
Why would sample_all take more time even when there are no child tables or foreign-key relations?
@csala, @ManuelAlvarezC, and @imsitu I am also having issues with the sample_all call. I am attaching the meta file with my CSV as an xlsx. Is there any way to speed this up?
@imsitu I have a question for you. Is it possible to reach you via email? Can you email me at kalyanv@mit.edu?
@kveerama you can reach me at situ.wantsyou@gmail.com
This has been resolved in v0.2.0.
@csala May I know the commit ID or PR number, just to see the fix?
It was done in PR #121, but unfortunately I cannot tell you the exact commit, as the change is buried among a lot of other big refactoring changes. But I can explain and point you at the cause of the problem in the old code base: https://github.com/HDI-Project/SDV/blob/v0.1.2/sdv/sampler.py#L470

The problem was that the previous categorical encoding implementation required the internally sampled values to be exactly between 0 and 1, and the way to get there was a loop in which out-of-range values were dropped and re-sampled until all the values were valid.

The CategoricalTransformer from RDT does not have this [0, 1] requirement, so that validate-and-discard loop was removed altogether from the Sampler implementation.
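The validate-and-discard loop described above can be sketched roughly as follows. The Gaussian parameters here are made up for illustration (the real code sampled from the fitted model); the point is that every pass re-draws only the invalid values, so wide distributions need many extra passes before all values land in [0, 1].

```python
import random

random.seed(0)

def sample_with_rejection(n, mu=0.5, sigma=0.8):
    """Old-style loop: draw values, then keep re-sampling any that
    fall outside [0, 1] until every value is valid."""
    values = [random.gauss(mu, sigma) for _ in range(n)]
    rounds = 1
    while any(not (0.0 <= v <= 1.0) for v in values):
        # Re-draw only the out-of-range values; each pass adds cost.
        values = [v if 0.0 <= v <= 1.0 else random.gauss(mu, sigma)
                  for v in values]
        rounds += 1
    return values, rounds

values, rounds = sample_with_rejection(1000)
print(rounds)  # many extra passes just to keep values in range
```

Dropping the [0, 1] requirement, as the RDT CategoricalTransformer did, makes the first draw always valid and removes this loop entirely, which is where the speedup came from.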
Description
I am trying to set up automated test data generation for my testing.
I generated the metadata JSON for the table and fit the model with it.
As the sampler is a dict, I am storing the sampler from data_vault as a pickle.
The goal is to store this pickled sampler in a DB or on a remote server and generate test data wherever and whenever necessary.
The samples are taking too much time to generate for a table of 29 columns and 1800 rows.
Generating 10 samples takes 5 minutes. I tried to generate the whole 1800 rows, but it never completed; I had to kill it.
Please let me know if I am handling things the wrong way, or if there is anything I need to tweak to get a faster response.
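When sampling is unexpectedly slow like this, one way to see where the time goes is to profile the sampling call. Below is a minimal, SDV-agnostic sketch using the standard library's `cProfile`; `sample_rows` is a hypothetical stand-in that you would replace with the real call being investigated (e.g. your `sample_all()` invocation).

```python
import cProfile
import io
import pstats

def sample_rows(num_rows):
    # Hypothetical stand-in for the slow sampling call under
    # investigation; swap in the real sampler invocation here.
    return [{"row": i} for i in range(num_rows)]

profiler = cProfile.Profile()
profiler.enable()
rows = sample_rows(10)
profiler.disable()

# Print the five most expensive call sites by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

A profile like this would show whether the time is dominated by a single hot loop (such as the resample-until-valid loop discussed later in this thread) rather than being spread evenly across the sampler.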
What I Did