[Bug] Inconsistent behavior: `standardize` vs. `Standardize` with n < 2 #2422

AdrianSosic · 2024-06-24T18:25:55Z

AdrianSosic
Jun 24, 2024

🐛 Bug

I am currently playing around with situations where there's no training data available yet and noticed that the behaviors of utils.standardize and transforms.Standardize are inconsistent.

To reproduce

Here a minimal example adopted from the code on the landing page. When you run the original code for the GP creation (in the comments), everything works fine. However, when you run the displayed version, you get the error shown below.

Code snippet to reproduce

import torch
from botorch.acquisition import UpperConfidenceBound
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize
from botorch.optim import optimize_acqf
from botorch.utils import standardize

# >>>>> changed code
# Unlike in the main page example, we set the number of training points to 0
train_X = torch.rand(0, 2)
# <<<<< changed code

Y = 1 - torch.linalg.norm(train_X - 0.5, dim=-1, keepdim=True)
Y = Y + 0.1 * torch.randn_like(Y)

# >>>>> original version
# train_Y = standardize(Y)
# gp = SingleTaskGP(train_X, train_Y)
# -----
train_Y = Y
gp = SingleTaskGP(train_X, train_Y, outcome_transform=Standardize(1))
# <<<<< version throwing error

# >>>>> changed code
# Because there is no training data, we do not fit the model parameters
gp.eval()
# <<<<< changed code

UCB = UpperConfidenceBound(gp, beta=0.1)
bounds = torch.stack([torch.zeros(2), torch.ones(2)])
candidate, acq_value = optimize_acqf(
    UCB,
    bounds=bounds,
    q=1,
    num_restarts=5,
    raw_samples=20,
)

Stack trace/error message

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Expected Behavior

In both cases, the GP posterior should simply match the GP posterior and the standardization applied along the way should not mess with the computation. I haven't checked what exact logic is applied internally when there is no data / only one data point being passed to the transformation, but my intuition would tell me that what should happen is:

for one data point, standardization should simply use the point's value as mean and impose some default value (e.g. 1) for the estimated standard deviation
for zero data, same as above but using zero as mean value.

System information

Please complete the following information:

BoTorch Version: 0.11.0
GPyTorch Version: 1.11
PyTorch Version: 2.3.0
OS: macOS

esantorella · 2024-06-25T00:56:40Z

esantorella
Jun 25, 2024
Collaborator

This isn't supported as of BoTorch 0.11.1; Standardize now raises an exception when given empty data. standardize isn't recommended; it's generally better to use an input transform rather than standardize the data ahead of time, because then the predictions will be on the original scale. So I don't think there's a good way to do this with both standardization and empty data.

We should update the documentation to avoid references to standardize, and either deprecate it or clarify its behavior with empty data.

As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and optimize_acqf won't be able to make any progress. It will return something random.

To avoid relying on a model whose fit is likely poor, and to ensure diversity, It's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with draw_sobol_samples.

0 replies

AdrianSosic · 2024-06-25T07:37:39Z

AdrianSosic
Jun 25, 2024
Author

Hi @esantorella, thanks for the quick answer ✌🏼. I'm aware of the conceptual problems that arise when working with empty training data, and I'm also clear about the difference between the two scalers, but let me reply point by point:

This isn't supported as of BoTorch 0.11.1; Standardize now raises an exception when given empty data. standardize isn't recommended; it's generally better to use an input transform rather than standardize the data ahead of time, because then the predictions will be on the original scale. So I don't think there's a good way to do this with both standardization and empty data.

We should update the documentation to avoid references to standardize, and either deprecate it or clarify its behavior with empty data.

Thanks for the info, I now also saw your PR that implemented these changes. I think if you plan to discourage users from using standardize, I recommend updating the code on the landing page, which is the first point of contact with the package for many users 🙃

As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and optimize_acqf won't be able to make any progress. It will return something random.

To avoid relying on a model whose fit is likely poor, and to ensure diversity, It's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with draw_sobol_samples.

I think I don't completely agree with this statement. While this is true for most situations and, in the average use case, people are probably better off resorting to alternatives (like Sobol sampling), your claim does not hold in all scenarios:

If you request a single recommendation -> agree. However, if you optimize a batch, the situation looks already different. Here, the effect you'd get is similar to farthest point sampling / determinantal point processes, where your recommendations would be generated to be most "dissimilar", which I think is still a very reasonable design. For instance, when optimizing on a hyperrectangle space, the first few recommendations would land on the corners of the space. This is even true when setting sequential=True, with the exception that the first point of the batch would remain random.
If you use a nonstationary kernel, the above described effect would already happen for a single requested point.

What I'm trying to say here: I do get your point, but because of the way standardization is handled, it's currently impossible to create a single recommendation model that behaves consistently (i.e. applies the exact same logic) across all training data sizes. That is, is you created a plot of the "performance" of that model where on the x-axis you have the number of training data points, you would not be able to get a smooth curve that fully extends to 0 on its left end, because that case currently requires applying a different model logic – even though conceptually the same logic could be applied, as my examples above demonstrate.

Perhaps I'm overlooking an important aspect here, so please feel free to correct me if I'm mistaken 😃

0 replies

esantorella · 2024-06-25T18:08:51Z

esantorella
Jun 25, 2024
Collaborator

That all sounds right to me, with the proviso that you'd need to use an acquisition function that supports q > 1 in the batch case, e.g. use qUpperConfidenceBound rather than UpperConfidenceBound. (You may know this, but not everyone else reading this will.) I especially agree that the documentation should definitely be updated (PRs welcome!).

There are two approaches we could go with here:

Status quo: The Standardize transform can continue to not support the case of empty data. Users should just not use a data-dependent transform like Standardize in this case. This makes the behavior of BoTorch more transparent (because it's not obvious how empty data would be standardized).
To facilitate having a consistent API across data sizes, Standardize can support empty data. It will need to set some mean and standard deviation (say to 0 and 1) in order to be able to untransform predictions. This might make things a little easier on the user, but at the cost of more code complexity and nonobvious behavior within BoTorch.

Would it serve your use case to just not use a transform when data is empty?

0 replies

AdrianSosic · 2024-06-26T11:40:59Z

AdrianSosic
Jun 26, 2024
Author

Simply not transforming in case of empty data would certainly work and would have been my manual workaround for it 👍🏼
And I agree that there are two reasonable options, namely 1) to handle the ill-defined cases internally or 2) to raise errors.

However, I see a discrepancy between my expectation and the current implementation, and I think the current logic of Standardize is a somewhat weird/inconsistent mix of both alternatives.

To explain what I mean, let us consider the two scenarios that can happen:

Either, we are in a regime where the estimators are well defined, which is for N>2. In this case, everything is fine, no need for custom logic / errors.
Or, we are in a regime where the estimators break and it is no longer obvious what to do, i.e. things become a design decision. For N=1, the mean estimate is still intact, but the Bessel corrected standard deviation estimate breaks. For N=0, both estimates break.

The current version applies a manual fix for N=1 by setting the standard deviation to the arbitrary value of 1. But then, in the similarly ill-defined setting of N=0, where the same fix could be applied to the mean by setting it to 0 (since the decision "let's internally handle the breaking cases" has already been made), an error is thrown instead.

Long story short: to me it would seem much more consistent to either say "nope, not applying any custom logic here" for all cases N<2, or to fix all of them. Because I don't see how fixing only the standard deviation is less invasive than fixing both.

0 replies

esantorella · 2024-06-26T12:34:55Z

esantorella
Jun 26, 2024
Collaborator

I agree with this, and found it confusing to understand what was happening in the N=1 and N=0 cases the last time I looked at the Standardize code. I'd be fine with supporting neither (with a clear error message) or both (with clear documentation on each case). I'm not sure I'll have time to do it myself promptly, but PRs are always welcome. :)

0 replies

esantorella · 2024-06-26T15:39:16Z

esantorella
Jun 26, 2024
Collaborator

On second thought, I think Standardize should be supported with one data point -- if model-fitting with one data point is supported at all -- because the priors don't make sense and model-fitting may struggle if the data isn't standardized. Here is an example where failing to use a transform results in a posterior mean that is even farther from zero than the data, which is pretty weird.

On the other hand, the second model seems highly overconfident in the presence of such large noise, so setting the standard deviation to 1 may be a bad choice in the presence of observed noise. It might be more sensible to set it to sqrt(yvar) or something, which would give a posterior that looks like this:

We could also conclude that we are not going to get reasonable predictions with 0 or 1 data points and disallow it entirely.

0 replies

AdrianSosic · 2024-06-27T09:05:02Z

AdrianSosic
Jun 27, 2024
Author

Hi @esantorella, thank for creating the examples. Before we continue or discussion on whether or not to allow the degenerate cases, let's perhaps first figure out what's going wrong here!

First of all, thanks for bringing up the train_Yvar option – I actually didn't have this on my radar at all. Of course, that needs to be taken into account properly, and I also think that something like sqrt(yvar) would probably make sense. But the elephant in the room is definitely: how on earth can the posterior mean be above the data point in the unscaled case!? Clearly something is fishy here (or we are completely overlooking an important aspect)! Before we can continue to talk about scaling, I think we need to find the cause for that. Could it be that something in the fixed noise model is set up incorrectly?

0 replies

esantorella · 2024-06-28T14:20:35Z

esantorella
Jun 28, 2024
Collaborator

Yeah good question. I'm not going to have a chance to look closely into this right away, but my best guess would be that failing to standardize leads to difficulty with model fit, similar to #2392. The prior probability of having a data point at 1000 is very low, especially since the provided yvar nearly rules out the possibility that this is noise. So the marginal likelihood might be very flat, and tiny, near the optimum, causing numerical convergence troubles. Just a guess though!

0 replies

esantorella · 2024-07-10T15:27:59Z

esantorella
Jul 10, 2024
Collaborator

I opened #2421 for removing references to standardize, and I'm going to convert this (interesting!) issue to a discussion, since it's not clear what ought to be done here. Once it's clear what needs to be changed, we can open a new issue for that task.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Inconsistent behavior: `standardize` vs. `Standardize` with n < 2 #2422

{{title}}

Replies: 9 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Bug] Inconsistent behavior: standardize vs. Standardize with n < 2 #2422

AdrianSosic Jun 24, 2024

🐛 Bug

To reproduce

Expected Behavior

System information

Replies: 9 comments

esantorella Jun 25, 2024 Collaborator

AdrianSosic Jun 25, 2024 Author

esantorella Jun 25, 2024 Collaborator

AdrianSosic Jun 26, 2024 Author

esantorella Jun 26, 2024 Collaborator

esantorella Jun 26, 2024 Collaborator

AdrianSosic Jun 27, 2024 Author

esantorella Jun 28, 2024 Collaborator

esantorella Jul 10, 2024 Collaborator

[Bug] Inconsistent behavior: `standardize` vs. `Standardize` with n < 2 #2422

AdrianSosic
Jun 24, 2024

esantorella
Jun 25, 2024
Collaborator

AdrianSosic
Jun 25, 2024
Author

esantorella
Jun 25, 2024
Collaborator

AdrianSosic
Jun 26, 2024
Author

esantorella
Jun 26, 2024
Collaborator

esantorella
Jun 26, 2024
Collaborator

AdrianSosic
Jun 27, 2024
Author

esantorella
Jun 28, 2024
Collaborator

esantorella
Jul 10, 2024
Collaborator