Temperature parameter to get more unlikely samples #2240
Comments
Hi @ulfaslak 👋 you're correct that the underlying synthesizers live in separate libraries (e.g. DeepEcho, Copulas, or CTGAN) that we maintain as part of SDV: https://sdv.dev/

SDV sits on top of these libraries and provides useful abstractions, but doesn't expose most of the low level model attributes. By default, SDV Synthesizers are designed to learn the patterns inherent in your data and mirror those patterns in the synthetic data.

In situations like yours where you want more control over the distribution of values in some of your columns, we created conditional sampling features. In the following code snippet, we request that 250 rows be generated for guests that book a Suite and have a rewards account, and 100 for those that book a suite and don't have a rewards account. The imbalance is different in the real data though!

    from sdv.sampling import Condition

    suite_guests_with_rewards = Condition(
        num_rows=250,
        column_values={'room_type': 'SUITE', 'has_rewards': True}
    )
    suite_guests_without_rewards = Condition(
        num_rows=100,
        column_values={'room_type': 'SUITE', 'has_rewards': False}
    )
    synthetic_data = custom_synthesizer.sample_from_conditions(
        conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
        output_file_path='synthetic_simulated_scenario.csv'
    )

This is our recommended approach to make sure rarer results are included in your synthetic data. Give that a try and let us know what you think!
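When many columns need to be conditioned on at once, writing each condition by hand becomes tedious. The sketch below is one possible way to enumerate every combination of values and build the arguments for one condition per combination. The helper name `build_condition_specs` and the `'BASIC'` room type are hypothetical and used only for illustration; the `num_rows` and `column_values` keys mirror the `Condition` arguments shown in the snippet above.

```python
from itertools import product


def build_condition_specs(column_options, rows_per_combo):
    """Return one spec (num_rows + column_values) per combination of values.

    column_options maps a column name to the list of values to condition on.
    """
    columns = list(column_options)
    specs = []
    for combo in product(*(column_options[col] for col in columns)):
        specs.append({
            "num_rows": rows_per_combo,
            "column_values": dict(zip(columns, combo)),
        })
    return specs


# Hypothetical column values; 'BASIC' is an assumed category, not from the thread.
specs = build_condition_specs(
    {"room_type": ["SUITE", "BASIC"], "has_rewards": [True, False]},
    rows_per_combo=100,
)

for spec in specs:
    print(spec)

# With SDV installed, each spec could then be turned into a Condition, e.g.:
#   from sdv.sampling import Condition
#   conditions = [Condition(**spec) for spec in specs]
```

Requesting the same number of rows for every combination deliberately over-represents rare combinations relative to the real data. Combinations that never occur in the training data may be hard or impossible for a synthesizer to satisfy, so the list may need filtering first.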
Yes, using conditional sampling is an option, though, as I mentioned, it gets involved when you have many dimensions you want to condition on. Wouldn't setting the default distribution to "uniform" also have the desired effect?

    from sdv.single_table import GaussianCopulaSynthesizer

    synthesizer = GaussianCopulaSynthesizer(
        metadata,  # required
        enforce_min_max_values=True,
        enforce_rounding=False,
        default_distribution='uniform'  # <-- THIS IS WHAT I MEAN
    )

Or are there some serious drawbacks to this that should be flagged?
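To make the trade-off concrete, here is a simplified, standard-library illustration of what a uniform marginal does to a single numeric column. It is not SDV's implementation. It assumes the usual Gaussian copula recipe: draw a latent standard-normal value, map it to (0, 1) with the normal CDF, then push it through the inverse CDF of the chosen marginal. The "age" data, the seed, and the two-standard-deviation definition of "extreme" are all made up for the example.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)

# Hypothetical "real" column: ages clustered around 40, clipped to [18, 90].
real = [min(90.0, max(18.0, rng.gauss(40, 8))) for _ in range(5000)]
lo, hi = min(real), max(real)
mu, sigma = mean(real), stdev(real)

std_normal = NormalDist()          # latent space of the copula
fitted_normal = NormalDist(mu, sigma)


def sample_column(inverse_cdf, n, rng):
    """Latent normal draw -> probability in (0, 1) -> value via the marginal's inverse CDF."""
    values = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        u = std_normal.cdf(z)
        values.append(inverse_cdf(u))
    return values


def tail_fraction(values):
    """Share of values more than two standard deviations from the real mean."""
    return sum(abs(v - mu) > 2 * sigma for v in values) / len(values)


normal_samples = sample_column(fitted_normal.inv_cdf, 2000, rng)
uniform_samples = sample_column(lambda u: lo + u * (hi - lo), 2000, rng)

tail_normal = tail_fraction(normal_samples)
tail_uniform = tail_fraction(uniform_samples)
print(f"extreme share with normal marginal:  {tail_normal:.3f}")
print(f"extreme share with uniform marginal: {tail_uniform:.3f}")
```

Under these assumptions the uniform marginal spreads values evenly between the observed minimum and maximum, so extremes appear far more often. The cost is that the original shape of the column is discarded entirely, while the dependence between columns is still carried by the latent Gaussian.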
Hi @ulfaslak just following up here :)
@srinify I don't know the sample size, but at a glance this looks exactly like what I need! Thanks for making that plot 💪. Also, I'm not sure I need it (but I might, who knows), but adding support for a horseshoe distribution might be nice. I suppose there's a limit to how poorly the chosen distribution can fit the data before you get sampling issues 🤔. But horseshoe would be cool in cases where you explicitly want to sample "extreme cases", like super old rich people with extreme opinions and rare tastes. They would still be plausible samples, though (respecting the correlations in the training data).
Problem Description
Hi! Awesome library, I literally JUST discovered it and it solves a huge problem for me. I'm working on a project where we are generating synthetic personas as LLMs for various tasks, and one of the problems we have when we sample populations is that the created individuals give "too mid" answers. It's essentially a regression-towards-the-mean type of problem. We deal with this by biasing the synthesized personas to be more extreme (or rare).
Expected behavior
I was hoping there would be something like a temperature parameter in your sampler, like:
which would factor into some decision layer like
    probabilities = softmax(logits / self.temperature)
Additional context
I searched your code for something like the above, but it seems your samplers come from external libraries? At least I couldn't find an implementation of a sample call.
I also figured I could set all distributions to be uniform, or just use conditional sampling (though that quickly gets convoluted and inelegant). Temperature would be amazing. Or something else that achieves more sampling in the extremes.
🙏 And thanks for making awesome OSS!
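For reference, the temperature idea described in this issue can be sketched for a single categorical column as follows. This is a generic illustration of temperature-scaled softmax and not an existing SDV feature. The function names, the category labels, and the probabilities are assumptions chosen for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_category(categories, logits, temperature=1.0, rng=random):
    """Draw one category using temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(categories, weights=probs, k=1)[0]


# Hypothetical learned frequencies for one categorical column.
categories = ["common", "uncommon", "rare"]
logits = [math.log(0.90), math.log(0.09), math.log(0.01)]

base = softmax_with_temperature(logits, temperature=1.0)  # recovers 0.90 / 0.09 / 0.01
hot = softmax_with_temperature(logits, temperature=3.0)   # shifts mass toward rare values

print([round(p, 3) for p in base])
print([round(p, 3) for p in hot])
print(sample_category(categories, logits, temperature=3.0, rng=random.Random(0)))
```

At a temperature of 1 the original frequencies are reproduced. Raising the temperature moves probability mass from frequent categories to rare ones without changing their ranking, which is the "more sampling in the extremes" behavior the issue asks for.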