
When leveraging Spark to generate in parallel, how to make sure the records are unique #1483

Closed
sahoosan opened this issue Jun 29, 2023 · 6 comments
Labels
question (General question about the software), resolution:resolved (The issue was fixed, the question was answered, etc.)

Comments

@sahoosan

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.2.0
  • Python version: 3.8.17
  • Operating System: Linux

Problem description

I am using the Hotels and Guests example that uses the HMASynthesizer. Generating from one processor worked fine, but I encountered issues when I tried to run using Spark. The idea behind running from Spark is to see whether the generation process can scale when I need to generate billions of records.

I tried to generate 100 Hotel records from 5 Spark executors, asking each of them to generate 20. All 5 returned the same 20 records.
How do I make them generate unique records? For example, I expect the Hotel_ids in the result to be unique, with values ranging from HID_000 to HID_099, instead of HID_000 to HID_019 repeating 5 times.
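
Roughly, the setup looks like the following (a simplified sketch rather than my exact code; shown with a single-table sample(num_rows=...) call for brevity, and assuming the fitted synthesizer pickles cleanly):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ship the fitted synthesizer to every executor.
bc_synth = spark.sparkContext.broadcast(synthesizer)

def sample_batch(_):
    # Each task samples 20 rows from its own copy of the model.
    return bc_synth.value.sample(num_rows=20).to_dict('records')

# 5 partitions -> 5 parallel sampling calls of 20 rows each.
rows = spark.sparkContext.parallelize(range(5), 5).flatMap(sample_batch).collect()
combined = pd.DataFrame(rows)  # 100 rows, but only 20 unique Hotel_ids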

Thanks
Sanjeeb

@sahoosan sahoosan added the new (Automatic label applied to new issues) and question (General question about the software) labels on Jun 29, 2023
@npatki
Contributor

npatki commented Jun 29, 2023

Hi @sahoosan, nice to meet you! It may be helpful to review the current functionality first.

Current Functionality

The SDV sampling is designed to generate synthetic data incrementally. For example:

synthetic_data1 = synthesizer.sample(num_rows=100)
synthetic_data2 = synthesizer.sample(num_rows=100)

The first call will contain HID_000 to HID_099. The second call will contain HID_100 to HID_199, etc. You can then use reset_sampling to reset to the original state.

synthesizer.reset_sampling()
synthetic_data3 = synthesizer.sample(num_rows=100)

Now synthetic_data3 is the same as the first result -- for the IDs as well as all other columns.

Next Steps

I'm not very familiar with Spark so I'm not sure how this is being set up. I'm guessing each processor has its own copy of the synthesizer, so each will start from the same state? There is no way to "fast forward" the state at the moment -- but we can consider this a feature request if it would be useful to you.

I'm curious what your use case is. I assume using the demo guests/hotels dataset is a trial run for something?

Workaround: I haven't verified this, but it's worth a try.

  1. You can try setting random seeds in numpy according to No way to fix the random seed? #157. Then the data columns will be different.
  2. Unfortunately, you'd have to override the ID columns yourself to prevent duplicates. Maybe each processor can perturb the IDs by some amount? See the sketch after this list.
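
A minimal sketch of that combination (untested; it assumes each Spark task knows its own partition index, that the primary key column is hotel_id, and that SDV picks up the global numpy seed, which may vary by version):

import numpy as np

def sample_partition(synthesizer, partition_id, num_rows=20):
    # Seed numpy differently per partition so the modeled data
    # columns diverge (per issue #157).
    np.random.seed(1000 + partition_id)
    data = synthesizer.sample(num_rows=num_rows)
    # Overwrite the primary key with partition-offset values so the
    # combined output has no duplicates: partition 0 gets
    # HID_000..HID_019, partition 1 gets HID_020..HID_039, and so on.
    offset = partition_id * num_rows
    data['hotel_id'] = [f'HID_{offset + i:03d}' for i in range(num_rows)]
    return data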

@npatki npatki added the under discussion (Issue is currently being discussed) label and removed the new (Automatic label applied to new issues) label on Jun 29, 2023
@sanketahegde

Hi @sahoosan,

I had a similar issue when using multiprocessing with Python, where every process produced the same results when sampling from the PARSynthesizer model.

After a long search, I was able to get it to work by using the following line just before sampling.
torch.manual_seed(torch.seed())
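
In context, the pattern looks roughly like this (an untested sketch; the worker function is hypothetical, the sample(num_sequences=...) call assumes the PARSynthesizer API, and the global synthesizer relies on fork-style multiprocessing):

import torch
from multiprocessing import Pool

def sample_worker(_):
    # Re-seed torch inside each worker; forked workers inherit the
    # parent's RNG state and would otherwise sample identical data.
    torch.manual_seed(torch.seed())
    return synthesizer.sample(num_sequences=20)

if __name__ == '__main__':
    with Pool(5) as pool:
        batches = pool.map(sample_worker, range(5))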

Hope this helps!

@kveerama
Contributor

kveerama commented Jul 3, 2023

Hi @sahoosan, the workaround mentioned by @sanketahegde probably won't work for HMASynthesizer, as it is not a torch-based model. We are checking a few things and will get back to you.

@sahoosan
Author

sahoosan commented Jul 5, 2023

Thanks all for your suggestions. Setting the seed directly on the synthesizer as follows seems to work, but I'm not sure that's the right way: I think this is the seed the model was trained with, and it appears to be a protected variable. Hotel IDs still duplicate with this, though the other columns look good.
import torch
synthesizer._numpy_seed = torch.seed() % 2**32

@npatki
Contributor

npatki commented Jan 17, 2024

Hello all, I'm not sure if you are still working on these projects.

Indeed, setting the numpy seed and torch seed should cover most sources of randomness. However, you may notice some different behavior with primary key IDs.

Hotel IDs still duplicate with this, though the other columns look good.

For primary keys, the goal is to generate globally unique IDs (a primary key must be unique across the entire table in order to uniquely identify a particular row). If you want to sample IDs in parallel with Spark, it is a bit difficult to ensure that one Spark cluster does not accidentally create the same ID that another one created independently.

Some options you may have:

  • You can set up each cluster to append a value to the end of the synthesized ID. For example, cluster A can append to create HID_000(a), HID_001(a), ... and cluster B can append to create HID_000(b), HID_001(b), etc. This will ensure that when you put all the data together (from all clusters), the values will be unique.
  • You can play around with our pre-processing functions. It should be possible to assign an AnonymizedFaker that is capable of generating long strings of random values, such as UUIDs. When these are randomized, it is very unlikely that 2 IDs will collide. If this interests you, see the rough sketch after this list.
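
A rough sketch of the second idea (untested; it assumes SDV 1.x's multi-table auto_assign_transformers/update_transformers methods, RDT's AnonymizedFaker, and the demo's hotels table with a hotel_id column):

from rdt.transformers.pii import AnonymizedFaker
from sdv.multi_table import HMASynthesizer

synthesizer = HMASynthesizer(metadata)
synthesizer.auto_assign_transformers(real_data)

# Swap the default ID transformer for one that emits random UUID4
# strings, which are effectively collision-free across executors.
synthesizer.update_transformers(
    table_name='hotels',
    column_name_to_transformer={
        'hotel_id': AnonymizedFaker(provider_name='misc', function_name='uuid4'),
    },
)

synthesizer.fit(real_data)

Since random UUIDs do not depend on a shared counter, each executor can then sample independently without coordinating ID ranges.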

@npatki
Contributor

npatki commented Jan 17, 2024

As this discussion has been inactive for a while, I will close the issue and mark the overall question as answered. But please feel free to reply if there is more to investigate, and I'd be happy to reopen the discussion. Thanks.

@npatki npatki closed this as completed Jan 17, 2024
@npatki npatki added the resolution:resolved (The issue was fixed, the question was answered, etc.) label and removed the under discussion label on Jan 17, 2024