When leveraging Spark to generate in parallel, how to make sure the records are unique #1483
Comments
Hi @sahoosan, nice to meet you! It may be helpful to review the current functionality first.

Current Functionality

The SDV sampling is designed to generate synthetic data incrementally. For example:

```python
synthetic_data1 = synthesizer.sample(num_rows=100)
synthetic_data2 = synthesizer.sample(num_rows=100)
```

The first call will contain the first 100 rows of synthetic data, and the second call will continue where the first left off with the next 100. You can reset the state with `reset_sampling`:

```python
synthesizer.reset_sampling()
synthetic_data3 = synthesizer.sample(num_rows=100)
```

Now synthetic_data3 is the same as the first result -- for the IDs as well as any other columns.

Next Steps

I'm not very familiar with Spark, so I'm not sure how this is being set up. I'm guessing each processor has its own copy of the synthesizer, so each will start from the same state? There is no way to "fast forward" the state at the moment -- but we can consider this a feature request if it would be useful to you.

I'm curious what your use case is? I assume using the demo guests/hotels dataset is a trial run for something?

Workaround: I haven't verified this, but it's worth a try.
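For illustration, one unverified way to emulate a fast-forward, given the deterministic incremental behavior described above: have each worker reset, oversample, and keep only its own slice of the output. `worker_id` and `chunk_size` are assumed per-executor values, not part of the SDV API:

```python
# Every copy of the synthesizer starts from the same state, so worker i can
# deterministically reproduce rows 0 .. (i+1)*chunk_size - 1 and keep only
# its own disjoint slice. worker_id is an assumed 0-based executor index.
def sample_slice(synthesizer, worker_id, chunk_size):
    synthesizer.reset_sampling()
    data = synthesizer.sample(num_rows=(worker_id + 1) * chunk_size)
    return data.tail(chunk_size)
```

The obvious cost is that worker i generates and discards i * chunk_size rows, so this won't scale to billions of records, but it may be serviceable at moderate parallelism.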
Hi @sahoosan, I had a similar issue when using multiprocessing with Python, where every process produced the same results when sampling from the PARSynthesizer model. After a long search, I was able to get it to work by using the following line just before the sampling. Hope this helps!
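A minimal sketch of that kind of per-process seeding for a torch-based model such as PARSynthesizer; `worker_id` is a hypothetical unique 0-based process index, and the exact line may have differed:

```python
import torch

def sample_with_seed(synthesizer, worker_id, num_rows):
    # Seed torch's global RNG differently in each process so that sampling
    # diverges instead of replaying identical draws.
    torch.manual_seed(1234 + worker_id)
    return synthesizer.sample(num_rows=num_rows)
```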
Hi @sahoosan, the workaround mentioned by @sanketahegde probably won't work for HMASynthesizer, as it is not a torch-based model. We are checking a few things and will get back to you.
Thanks all for your suggestions. Setting the seed directly on the synthesizer seems to work, but I'm not sure that is the right way, as I think this is the seed it was trained with and it seems to be a protected variable. Hotel IDs still duplicate with this, though the other columns look good.
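Since HMASynthesizer is not torch-based, the numpy counterpart of that workaround is presumably what matters here. A hypothetical sketch of per-worker seeding; whether the synthesizer's sampling actually draws from numpy's global RNG is an assumption (the next comment suggests numpy and torch seeds cover most of the randomness):

```python
import numpy as np

def sample_seeded(synthesizer, worker_id, num_rows):
    # Seed numpy's global RNG differently per worker; assumes sampling
    # draws from this global state.
    np.random.seed(1000 + worker_id)
    return synthesizer.sample(num_rows=num_rows)
```

As observed above, even if this decorrelates the modeled columns, the primary key sequence can still collide across workers.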
Hello all, I'm not sure if you are still working on these projects? Indeed, setting the numpy seed and torch seed would cover most sources of randomness. However, you may notice some different behavior with primary key IDs.

For primary keys, the goal is to generate globally unique IDs (a primary key must be unique across the entire table in order to uniquely identify a particular row). If you want to sample IDs in parallel with Spark, it is difficult to ensure that one executor does not accidentally create the same ID that another one created independently. Some options you may have:
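For illustration, one common option along these lines is to discard the sampled primary keys and re-key each worker's rows into a disjoint ID range after sampling. The `hotel_id` column name and `HID_###` format are taken from the question below; `worker_id` and `chunk_size` are assumed per-worker values:

```python
import pandas as pd

def reassign_ids(df: pd.DataFrame, worker_id: int, chunk_size: int) -> pd.DataFrame:
    """Re-key this worker's rows into a disjoint, globally unique ID range."""
    start = worker_id * chunk_size
    out = df.copy()
    out['hotel_id'] = [f'HID_{start + i:03d}' for i in range(len(out))]
    return out

# e.g. worker 3 of 5, 20 rows each: sampled IDs may collide across workers,
# but after re-keying this worker holds HID_060 ... HID_079.
sampled = pd.DataFrame({'hotel_id': ['HID_000'] * 20})
fixed = reassign_ids(sampled, worker_id=3, chunk_size=20)
```

In a multi-table setup like HMASynthesizer's, any foreign keys referencing the re-keyed column would need the same remapping, which this sketch does not cover.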
As this discussion has been inactive for a while, I will close the issue and mark the overall question as answered. But please feel free to reply if there is more to investigate, and I am happy to reopen for discussion. Thanks.
Environment details
If you are already running SDV, please indicate the following details about the environment in which you are running it:
Problem description
I am using the Hotels and Guests example that uses the HMASynthesizer. Generating from one processor worked fine, but I encountered issues when I tried to run using Spark. The idea behind running from Spark is to see whether the generation process can scale when I need to generate billions of records.
I tried to generate 100 Hotel records from 5 Spark executors, asking them to generate 20 each. All 5 of them returned the same 20 records.
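The setup was presumably something along these lines; this is a hypothetical reconstruction, with `synthesizer` assumed to be the fitted synthesizer captured in the closure and the sampling call kept schematic, following the usage earlier in this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def generate(worker_id):
    # Each executor unpickles its own copy of the fitted synthesizer, so
    # every copy starts sampling from the same initial state -- hence the
    # five identical chunks of 20 records.
    data = synthesizer.sample(num_rows=20)
    return data.to_dict('records')

rows = spark.sparkContext.parallelize(range(5), 5).flatMap(generate).collect()
```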
How do I make them generate unique records? For example, I expect the Hotel_ids in the result to be unique, with values ranging from HID_000 to HID_099, instead of HID_000 to HID_019 repeating 5 times.
Thanks
Sanjeeb