Temperature parameter to get more unlikely samples #2240

ulfaslak · 2024-09-26T05:03:26Z

Problem Description

Hi! Awesome library, I literally JUST discovered it and it solves a huge problem for me. I'm working on a project where we are generating synthetic personas as LLMs for various tasks, and one of the problems we have when we sample populations is that the created individuals give "too mid" answers. It's essentially a regression-towards-the-mean type of problem. We deal with this by biasing the synthesized personas to be more extreme (or rare).

Expected behavior

I was hoping there would be something like a temperature parameter in your sampler, like:

synthetic_data = synthesizer.sample(
    num_rows=1_000_000,
    batch_size=1_000,
    temperature=2
)

which would factor into some decision layer like probabilities = softmax(logits / self.temperature).
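As a minimal sketch of that kind of decision layer (plain Python, not SDV code; the function names `softmax_with_temperature` and `sample_category` are made up for illustration), dividing the logits by a temperature above 1 flattens the probabilities, so rare categories get drawn more often:

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_category(logits, temperature=1.0, rng=random):
    """Draw one category index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for three categories, from common to rare.
logits = [3.0, 1.0, 0.0]
p_default = softmax_with_temperature(logits, temperature=1.0)
p_hot = softmax_with_temperature(logits, temperature=2.0)
```

With these example logits, the rarest category gets a larger share at `temperature=2.0` than at `temperature=1.0`, and the most common one gets a smaller share.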

Additional context

I searched your code for something like the above, but seems your samplers are coming from external libraries? At least I couldn't find an implementation of a sample call.

I also figured I could set all distributions to be uniform, or just use conditional sampling (though that quickly gets convoluted and inelegant). Temperature would be amazing. Or something else that achieves more sampling in the extremes.

🙏 And thanks for making awesome OSS!

srinify · 2024-09-26T13:20:38Z

Hi @ulfaslak 👋 you're correct that the underlying synthesizers live in separate libraries (e.g. DeepEcho, Copulas, or CTGAN) that we maintain as part of SDV (https://sdv.dev/). SDV sits on top of these libraries and provides useful abstractions, but doesn't expose most of the low-level model attributes.

By default, SDV Synthesizers are designed to learn the patterns inherent in your data and mirror those patterns in the synthetic data. In situations like yours where you want more control over the distribution of values in some of your columns, we created conditional sampling features.

In the following code snippet, we request that 250 rows be generated for guests who book a suite and have a rewards account, and 100 rows for guests who book a suite and don't have a rewards account. The imbalance is different in the real data though!

from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=100,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)

synthetic_data = custom_synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)

This is our recommended approach to make sure rarer results are included in your synthetic data. Give that a try and let us know what you think!

ulfaslak · 2024-09-27T05:06:13Z

Yes, using conditional sampling is an option, though as I mentioned it gets involved when you have many dimensions you want to condition on. Wouldn't setting the default distribution to "uniform" also have the desired effect?

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata,  # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    default_distribution='uniform'  # <-- THIS IS WHAT I MEAN
)

Or are there some serious drawbacks to this that should be flagged?

srinify · 2024-10-01T15:31:32Z

Changing the distribution might work here because the synthesizer will be forced to model it in a way that's closer to your use case.
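To make the effect concrete, here is a rough sketch of the general Gaussian copula idea in plain Python. It is not SDV's internal code: the helper names are hypothetical, and the bell-shaped marginal (sigma set to one sixth of the range) is an arbitrary stand-in for a column whose real values cluster around the middle. A latent normal draw is mapped to a value in (0, 1) and then pushed through the inverse CDF of the chosen marginal, so swapping in a uniform marginal spreads the same draws evenly across the range.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def sample_column(n, marginal, low, high, seed=0):
    """Sample one column by mapping latent normal draws through a marginal."""
    rng = random.Random(seed)
    # Assumed bell-shaped marginal: most of its mass sits inside [low, high].
    bell = NormalDist(mu=(low + high) / 2, sigma=(high - low) / 6)
    values = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)   # draw in the latent (copula) space
        u = STD_NORMAL.cdf(z)     # map to the unit interval
        u = min(max(u, 1e-12), 1 - 1e-12)  # keep inv_cdf away from 0 and 1
        if marginal == "uniform":
            values.append(low + u * (high - low))
        else:
            values.append(bell.inv_cdf(u))
    return values


def tail_share(values, low, high, fraction=0.1):
    """Fraction of values in the bottom or top `fraction` of the range."""
    width = (high - low) * fraction
    hits = sum(1 for v in values if v < low + width or v > high - width)
    return hits / len(values)


bell_values = sample_column(10_000, "normal", low=0.0, high=100.0)
flat_values = sample_column(10_000, "uniform", low=0.0, high=100.0)
bell_tails = tail_share(bell_values, 0.0, 100.0)
flat_tails = tail_share(flat_values, 0.0, 100.0)
```

In this sketch the uniform marginal puts far more samples in the outer 10% bands than the bell-shaped one. The trade-off is that the column no longer follows the shape learned from the real data; only the dependence between columns, which a Gaussian copula models in the latent space, is carried over.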

I set the distribution for just the amenities_fee column in one of our single table default datasets (fake_hotel_guests) to uniform:

[Image: plot for the amenities_fee column, not preserved here]

I'd be curious to know if this option works for you -- try it and circle back! 🙏

srinify · 2024-10-07T17:04:59Z

Hi @ulfaslak just following up here :)

ulfaslak · 2024-10-07T21:10:53Z

@srinify Don't know the sample size, but at a glance this looks exactly like what I need! Thanks for making that plot 💪.

Also, not sure I need it (but I might, who knows), but adding support for a horseshoe distribution might be nice. I also suppose there's a limit to how poorly the chosen distribution can fit the data before you get sampling issues 🤔.

But horseshoe would be cool in cases where you explicitly want to sample "extreme cases". Like super old rich people, with extreme opinions and rare tastes. Still plausible samples though (respects correlations in training data).
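One way to picture this, assuming "horseshoe" here means a U-shaped density with most of its mass near both ends of the range (an interpretation, and not an existing SDV option), is a symmetric Beta distribution with both shape parameters below 1. The helper `sample_u_shaped` and its `concentration` parameter are hypothetical names used only for this sketch:

```python
import random


def sample_u_shaped(n, low, high, concentration=0.5, seed=0):
    """Draw values from a U-shaped Beta(c, c) marginal scaled to [low, high].

    With 0 < c < 1 the density rises towards both ends of the range.
    """
    if not 0 < concentration < 1:
        raise ValueError("concentration must be between 0 and 1 for a U shape")
    rng = random.Random(seed)
    return [
        low + rng.betavariate(concentration, concentration) * (high - low)
        for _ in range(n)
    ]


def outer_share(values, low, high, fraction=0.1):
    """Fraction of values in the bottom or top `fraction` of the range."""
    width = (high - low) * fraction
    hits = sum(1 for v in values if v < low + width or v > high - width)
    return hits / len(values)


ages = sample_u_shaped(10_000, low=18.0, high=100.0)
extreme_share = outer_share(ages, 18.0, 100.0)
```

A uniform marginal would put about 20% of samples in the outer 10% bands; with `concentration=0.5` the share is roughly twice that, which is the "more extreme cases" behavior described above.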

srinify · 2024-10-08T17:35:13Z

@ulfaslak Great! Added as a feature request here :) #2258

Let's close this specific issue out then!

ulfaslak added the feature request and new labels Sep 26, 2024

srinify self-assigned this Sep 26, 2024

srinify added the question and under discussion labels and removed the feature request and new labels Sep 26, 2024

srinify added and removed the under discussion label Oct 1, 2024

srinify mentioned this issue Oct 8, 2024

Support distributions in GaussianCopulaSynthesizer that better capture extreme values #2258

Open

srinify closed this as completed Oct 8, 2024

srinify added the resolution:resolved label and removed the under discussion label Oct 8, 2024
