
When leveraging Spark to generate in parallel, how to make sure the records are unique #1483

Closed
sahoosan opened this issue Jun 29, 2023 · 6 comments
Labels
question (General question about the software), resolution:resolved (The issue was fixed, the question was answered, etc.)

Comments

@sahoosan

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.2.0
  • Python version: 3.8.17
  • Operating System: Linux

Problem description

I am using the Hotels and Guests example that uses the HMASynthesizer. Generating from one processor worked fine, but I encountered issues when I tried to run using Spark. The idea behind running from Spark is to see whether the generation process can scale when I need to generate billions of records.

I tried to generate 100 Hotel records from 5 Spark executors, asking each of them to generate 20. All 5 returned the same 20 records.
How do I make them generate unique records? For example, I expect the Hotel_ids in the result to be unique, with values ranging from HID_000 to HID_099, instead of HID_000 to HID_019 repeating 5 times.
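
Roughly, the setup looks like the following (a simplified sketch rather than my exact code; shown with a single-table sample(num_rows=...) call for brevity, and assuming the fitted synthesizer pickles cleanly):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ship the fitted synthesizer to every executor.
bc_synth = spark.sparkContext.broadcast(synthesizer)

def sample_batch(_):
    # Each task samples 20 rows from its own copy of the model.
    return bc_synth.value.sample(num_rows=20).to_dict('records')

# 5 partitions -> 5 parallel sampling calls of 20 rows each.
rows = spark.sparkContext.parallelize(range(5), 5).flatMap(sample_batch).collect()
combined = pd.DataFrame(rows)  # 100 rows, but only 20 unique Hotel_ids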

Thanks
Sanjeeb

@sahoosan sahoosan added the new (Automatic label applied to new issues) and question (General question about the software) labels on Jun 29, 2023
@npatki
Contributor

npatki commented Jun 29, 2023

Hi @sahoosan, nice to meet you! It may be helpful to review the current functionality first.

Current Functionality

The SDV sampling is designed to generate synthetic data incrementally. For example:

synthetic_data1 = synthesizer.sample(num_rows=100)
synthetic_data2 = synthesizer.sample(num_rows=100)

The first call will contain HID_000 to HID_099. The second call will contain HID_100 to HID_199, etc. You can then use reset_sampling to reset to the original state.

synthesizer.reset_sampling()
synthetic_data3 = synthesizer.sample(num_rows=100)

Now synthetic_data3 is the same as the first result -- for the IDs as well as all other columns.

Next Steps

I'm not very familiar with Spark so I'm not sure how this is being set up. I'm guessing each processor has its own copy of the synthesizer, so each will start from the same state? There is no way to "fast forward" the state at the moment -- but we can consider this a feature request if it would be useful to you.

I'm curious what your use case is. I assume using the demo guests/hotels dataset is a trial run for something?

Workaround: I haven't verified this, but it's worth a try.

  1. You can try setting random seeds in numpy according to No way to fix the random seed? #157. Then the data columns will be different.
  2. Unfortunately, you'd have to override the ID columns yourself to prevent duplicates. Maybe each processor can perturb the IDs by some amount? See the sketch after this list.
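
A minimal sketch of that combination (untested; it assumes each Spark task knows its own partition index, that the primary key column is hotel_id, and that SDV picks up the global numpy seed, which may vary by version):

import numpy as np

def sample_partition(synthesizer, partition_id, num_rows=20):
    # Seed numpy differently per partition so the modeled data
    # columns diverge (per issue #157).
    np.random.seed(1000 + partition_id)
    data = synthesizer.sample(num_rows=num_rows)
    # Overwrite the primary key with partition-offset values so the
    # combined output has no duplicates: partition 0 gets
    # HID_000..HID_019, partition 1 gets HID_020..HID_039, and so on.
    offset = partition_id * num_rows
    data['hotel_id'] = [f'HID_{offset + i:03d}' for i in range(num_rows)]
    return data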

@npatki npatki added the under discussion (Issue is currently being discussed) label and removed the new (Automatic label applied to new issues) label on Jun 29, 2023
@sanketahegde

Hi @sahoosan,

I had a similar issue when using multiprocessing with Python, where every process produced the same results when sampling from the PARSynthesizer model.

After a long search, I was able to get it to work by using the following line just before sampling.
torch.manual_seed(torch.seed())
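
In context, the pattern looks roughly like this (an untested sketch; the worker function is hypothetical, the sample(num_sequences=...) call assumes the PARSynthesizer API, and the global synthesizer relies on fork-style multiprocessing):

import torch
from multiprocessing import Pool

def sample_worker(_):
    # Re-seed torch inside each worker; forked workers inherit the
    # parent's RNG state and would otherwise sample identical data.
    torch.manual_seed(torch.seed())
    return synthesizer.sample(num_sequences=20)

if __name__ == '__main__':
    with Pool(5) as pool:
        batches = pool.map(sample_worker, range(5))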

Hope this helps!

@kveerama
Contributor

kveerama commented Jul 3, 2023

Hi @sahoosan, the workaround mentioned by @sanketahegde probably won't work for HMASynthesizer, as it is not a torch-based model. We are checking a few things and will get back to you.

@sahoosan
Author

sahoosan commented Jul 5, 2023

Thanks all for your suggestions. Setting the seed directly on the synthesizer as follows seems to work, but I'm not sure that's the right way: I think this is the seed the model was trained with, and it appears to be a protected variable. Hotel IDs still duplicate with this, though the other columns look good.
import torch
synthesizer._numpy_seed = torch.seed() % 2**32

@npatki
Contributor

npatki commented Jan 17, 2024

Hello all, I'm not sure if you are still working on these projects.

Indeed, setting the numpy seed and torch seed should cover most sources of randomness. However, you may notice some different behavior with primary key IDs.

Hotel IDs still duplicate with this, though the other columns look good.

For primary keys, the goal is to generate globally unique IDs (a primary key must be unique across the entire table in order to uniquely identify a particular row). If you want to sample IDs in parallel with Spark, it is a bit difficult to ensure that one Spark cluster does not accidentally create the same ID that another one created independently.

Some options you may have:

  • You can set up each cluster to append a value to the end of the synthesized ID. For example, cluster A can append to create HID_000(a), HID_001(a), ... and cluster B can append to create HID_000(b), HID_001(b), etc. This will ensure that when you put all the data together (from all clusters), the values will be unique.
  • You can play around with our pre-processing functions. It should be possible to assign an AnonymizedFaker that is capable of generating long strings of random values, such as UUIDs. When these are randomized, it is very unlikely that 2 IDs will collide. If this interests you, see the rough sketch after this list.
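
A rough sketch of the second idea (untested; it assumes SDV 1.x's multi-table auto_assign_transformers/update_transformers methods, RDT's AnonymizedFaker, and the demo's hotels table with a hotel_id column):

from rdt.transformers.pii import AnonymizedFaker
from sdv.multi_table import HMASynthesizer

synthesizer = HMASynthesizer(metadata)
synthesizer.auto_assign_transformers(real_data)

# Swap the default ID transformer for one that emits random UUID4
# strings, which are effectively collision-free across executors.
synthesizer.update_transformers(
    table_name='hotels',
    column_name_to_transformer={
        'hotel_id': AnonymizedFaker(provider_name='misc', function_name='uuid4'),
    },
)

synthesizer.fit(real_data)

Since random UUIDs do not depend on a shared counter, each executor can then sample independently without coordinating ID ranges.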

@npatki
Contributor

npatki commented Jan 17, 2024

As this discussion has been inactive for a while, I will close the issue and mark the overall question as answered. But please feel free to reply if there is more to investigate, and I'd be happy to reopen the discussion. Thanks.

@npatki npatki closed this as completed Jan 17, 2024
@npatki npatki added the resolution:resolved (The issue was fixed, the question was answered, etc.) label and removed the under discussion label on Jan 17, 2024