Temperature parameter to get more unlikely samples #2240

Closed
ulfaslak opened this issue Sep 26, 2024 · 6 comments
Labels: question, resolution:resolved


ulfaslak commented Sep 26, 2024

Problem Description

Hi! Awesome library, I literally JUST discovered it and it solves a huge problem for me. I'm working on a project where we are generating synthetic personas as LLMs for various tasks, and one of the problems we have when we sample populations is that the created individuals give "too mid" answers. It's essentially a regression-towards-the-mean type of problem. We deal with this by biasing the synthesized personas to be more extreme (or rare).

Expected behavior

I was hoping there would be something like a temperature parameter in your sampler, like:

synthetic_data = synthesizer.sample(
    num_rows=1_000_000,
    batch_size=1_000,
    temperature=2
)

which would factor into some decision layer like probabilities = softmax(logits / self.temperature).
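
For reference, here is a minimal sketch of what temperature scaling does to a categorical distribution. It only illustrates the softmax(logits / temperature) idea from the request; it is not how SDV samples, and the function name and the example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: one common category followed by two rarer ones.
logits = [3.0, 1.0, 0.0]
p_default = softmax_with_temperature(logits, temperature=1.0)
p_hot = softmax_with_temperature(logits, temperature=2.0)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_hot])
```

With temperature above 1 the probabilities move closer together, so the rarer categories get picked more often; with temperature below 1 the most likely category dominates even more.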

Additional context

I searched your code for something like the above, but it seems your samplers come from external libraries? At least I couldn't find an implementation of a sample call.

I also figured I could set all distributions to be uniform, or just use conditional sampling (though that quickly gets convoluted and inelegant). Temperature would be amazing. Or something else that achieves more sampling in the extremes.

🙏 And thanks for making awesome OSS!


srinify commented Sep 26, 2024

Hi @ulfaslak 👋 you're correct that the underlying synthesizers live in separate libraries (e.g. DeepEcho, Copulas, or CTGAN) that we maintain as part of SDV (https://sdv.dev/). SDV sits on top of these libraries and provides useful abstractions, but doesn't expose most of the low-level model attributes.

By default, SDV Synthesizers are designed to learn the patterns inherent in your data and mirror those patterns in the synthetic data. In situations like yours where you want more control over the distribution of values in some of your columns, we created conditional sampling features.

In the following code snippet, we request that 250 rows be generated for guests that book a Suite and have a rewards account, and 100 for those that book a suite and don't have a rewards account. The imbalance is different in the real data though!

from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=100,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)

synthetic_data = custom_synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)

This is our recommended approach to make sure rarer results are included in your synthetic data. Give that a try and let us know what you think!
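
When several columns are involved, writing each condition by hand gets long. One possible way to keep it manageable is to generate the condition specs from a small table of column values. The sketch below only builds plain dictionaries with the same `num_rows` / `column_values` keys used above; the helper name `build_condition_specs` and the 'DELUXE' value are assumptions for illustration, not part of SDV.

```python
from itertools import product

def build_condition_specs(column_options, rows_per_combination):
    """Return one {'num_rows', 'column_values'} spec per combination of values."""
    columns = list(column_options)
    specs = []
    for values in product(*(column_options[c] for c in columns)):
        specs.append({
            "num_rows": rows_per_combination,
            "column_values": dict(zip(columns, values)),
        })
    return specs

# Values to oversample. 'DELUXE' is an assumed category, used only as an example.
column_options = {
    "room_type": ["SUITE", "DELUXE"],
    "has_rewards": [True, False],
}
specs = build_condition_specs(column_options, rows_per_combination=100)

for spec in specs:
    print(spec)

# With SDV installed, each spec could then be passed on as
# Condition(num_rows=spec["num_rows"], column_values=spec["column_values"]).
```

The number of specs grows multiplicatively with the number of values per column, so this helps with the typing but not with the combinatorics.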

ulfaslak commented

Yes, conditional sampling is an option, though as I mentioned it gets involved when there are many dimensions to condition on. Wouldn't setting the default distribution to "uniform" also have the desired effect?

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata,  # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    default_distribution='uniform'  # <-- THIS IS WHAT I MEAN
)

Or are there some serious drawbacks to this that should be flagged?


srinify commented Oct 1, 2024

Changing the distribution might work here because the synthesizer will be forced to model it in a way that's closer to your use case.

I set the distribution for just the amenities_fee column in one of our single-table default datasets (fake_hotel_guests) to uniform:

[Plot image: "uniform"]
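
To get a rough feel for why a uniform marginal produces more extreme values, here is a small standard-library sketch. It does not use SDV; it fits a normal and a uniform distribution to the same made-up bell-shaped column and compares how many samples land far from the mean. As far as I know, per-column choices in SDV are made through the `numerical_distributions` argument of `GaussianCopulaSynthesizer`, but check the docs for your version.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Hypothetical stand-in for a bell-shaped numeric column.
real = [random.gauss(50, 10) for _ in range(5000)]
center, spread = mean(real), stdev(real)

# Normal marginal: samples follow the shape of the real column.
normal_samples = NormalDist(center, spread).samples(5000, seed=1)

# Uniform marginal: every value between the observed min and max is equally likely.
low, high = min(real), max(real)
uniform_samples = [random.uniform(low, high) for _ in range(5000)]

def tail_share(samples):
    """Fraction of samples more than two standard deviations from the mean."""
    return sum(1 for x in samples if abs(x - center) > 2 * spread) / len(samples)

normal_tail = tail_share(normal_samples)
uniform_tail = tail_share(uniform_samples)
print(f"normal marginal tail share:  {normal_tail:.3f}")
print(f"uniform marginal tail share: {uniform_tail:.3f}")
```

The trade-off is that the uniform marginal throws away the shape of the column, so the synthetic values no longer mirror how common each value is in the real data. This sketch looks at a single column only and says nothing about correlations between columns.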

I'd be curious to know if this option works for you -- try it and circle back! 🙏


srinify commented Oct 7, 2024

Hi @ulfaslak just following up here :)


ulfaslak commented Oct 7, 2024

@srinify Don't know the sample size, but at a glance this looks exactly like what I need! Thanks for making that plot 💪.

Also, not sure I need it (but I might, who knows), but adding support for a horseshoe distribution might be nice. Though I suppose there's a limit to how poorly the chosen distribution can fit the data before you get sampling issues 🤔.

But horseshoe would be cool in cases where you explicitly want to sample "extreme cases". Like super old rich people, with extreme opinions and rare tastes. Still plausible samples though (respects correlations in training data).
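
As a rough sketch of the idea, assuming "horseshoe" here means a U-shaped density with most of its mass near both ends of the range: a symmetric Beta distribution with shape parameters below 1 behaves that way. The helper name and the age range below are made up for illustration, and this is not an SDV feature.

```python
import random

def sample_u_shaped(low, high, n, shape=0.5, seed=None):
    """Draw n values in [low, high] from a symmetric Beta(shape, shape) density.

    For shape < 1 the density is U-shaped: values near low and high are the
    most likely, values in the middle the least likely.
    """
    rng = random.Random(seed)
    return [low + (high - low) * rng.betavariate(shape, shape) for _ in range(n)]

# Hypothetical age column between 18 and 90.
low, high = 18, 90
ages = sample_u_shaped(low, high, 10_000, seed=0)

# Share of samples in the outer 10% of the range on either side.
margin = 0.1 * (high - low)
outer_share = sum(1 for a in ages if a < low + margin or a > high - margin) / len(ages)
print(f"share in the outer 20% of the range: {outer_share:.2f}")
```

A uniform distribution would put about 20% of samples in those outer bands; the U-shaped one puts noticeably more there. This only covers a single column, so keeping the correlations from the training data would still be the synthesizer's job.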


srinify commented Oct 8, 2024

@ulfaslak Great! Added as a feature request here :) #2258

Let's close this specific issue out then!

srinify closed this as completed Oct 8, 2024