
Do the numeric values of ordinal ChoiceParameter choices affect the model? #750

Closed
sgbaird opened this issue Dec 8, 2021 · 6 comments
Labels
question Further information is requested

Comments

@sgbaird
Contributor

sgbaird commented Dec 8, 2021

TL;DR

The docs seem to suggest that ordinal parameters use a Matérn-5/2 kernel, so I assume the answer is "yes", the numeric values do affect the model. Is there a way to change this to a Hamming distance so that order constraints can be used with categorical variables? See the toy problem below.

Toy Problem

Take some data based on choices, which are used to construct choice parameters (slots) and constraints. The choices can go into any of the slots, and choices are constrained to populate slots in a particular order (e.g. BAC is not allowed, only ABC). The script version is given in ordinal_example.py.

Imports

import numpy as np
import pandas as pd

Choices and Data

data = [["A", "B", "C"], ["D", "C", "A"], ["C", "A", "B"], ["C", "B", "A"]]
choices = list(np.unique(np.array(data).flatten()))
n_choices = len(choices)

Ordinal Encoding

df = pd.DataFrame(data)
# Map each choice string to an integer code, e.g. {"A": 0, "B": 1, "C": 2, "D": 3}
choice_lookup = {choice: choice_num for choice_num, choice in enumerate(choices)}
encoded_df = df.replace(choice_lookup)
encoded_choices = pd.DataFrame(choices)[0].map(choice_lookup).values
encoded_data = encoded_df.values
print(encoded_data)
[[0 1 2]
 [3 2 0]
 [2 0 1]
 [2 1 0]]

Choice Parameters

nslots = 3
slot_names = ["slot_" + str(i) for i in range(nslots)]
slots = [
    {
        "name": slot_name,
        "type": "choice",
        "values": list(encoded_choices),  # integer-encoded choices
    }
    for slot_name in slot_names
]
print(slots)  # then format via black
[
    {"name": "slot_0", "type": "choice", "values": [0, 1, 2, 3]},
    {"name": "slot_1", "type": "choice", "values": [0, 1, 2, 3]},
    {"name": "slot_2", "type": "choice", "values": [0, 1, 2, 3]},
]

Ordered Constraints

constraints = [
    lhs + " <= " + rhs for (lhs, rhs) in list(zip(slot_names[:-1], slot_names[1:]))
]
print(constraints)
["slot_0 >= slot_1", "slot_1 >= slot_2"]

Docs suggest ordinal parameters use Matern 5/2 kernel

Based on Support for mixed search spaces and categorical variables (docs):

The most common way of dealing with categorical variables in Bayesian optimization is to one-hot encode the categories to allow fitting a GP model in a continuous space. In this setting, a categorical variable with categories ["red", "blue", "green"] is represented by three new variables (one for each category). While this is a convenient choice, it can drastically increase the dimensionality of the search space. In addition, the acquisition function is often optimized in the corresponding continuous space and the final candidate is selected by rounding back to the original space, which may result in selecting sub-optimal points according to the acquisition function.

Our new approach uses separate kernels for the categorical and ordinal (continuous/integer) variables. In particular, we use a kernel of the form:
k(x, y) = k_cat(x_cat, y_cat) × k_ord(x_ord, y_ord) + k_cat(x_cat, y_cat) + k_ord(x_ord, y_ord)
For the ordinal variables we can use a standard kernel such as Matérn-5/2, but for the categorical variables we need a way to compute distances between the different categories. A natural choice is to set the distance to 0 if two categories are equal and 1 otherwise, similar to the idea of Hamming distances. This approach can be combined with the idea of automatic relevance determination (ARD), where each categorical variable has its own lengthscale. Rather than optimizing the acquisition function in a continuously relaxed space, we optimize it separately over each combination of the categorical variables. While this is likely to result in better optimization performance, it may lead to slow optimization of the acquisition function when there are many categorical variables.
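To make that distance concrete, here is a minimal numpy sketch of such a Hamming-style kernel with ARD (an illustration only, not Ax's actual implementation; the summed, lengthscale-weighted distance is my assumption):

import numpy as np

def hamming_kernel(x_cat, y_cat, lengthscales):
    # Per-dimension distance: 0 if the categories match, 1 otherwise;
    # per-dimension lengthscales give ARD. A sketch, not Ax's code.
    dists = (np.asarray(x_cat) != np.asarray(y_cat)).astype(float)
    return np.exp(-np.sum(dists / lengthscales))

print(hamming_kernel(["A", "B"], ["A", "C"], np.ones(2)))  # exp(-1) ≈ 0.368

Note that the result depends only on whether categories match, never on any numeric encoding of the categories.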

It seems like ordinal variables will use a Matérn-5/2 kernel by default, in which case I'd assume the numeric values of the ordinal parameters play a significant role. Is this the case? How do I replace this with a Hamming distance instead? Is this a flag that could be incorporated into e.g. ax_client.create_experiment() or the other APIs?

@lena-kashtelyan
Contributor

cc @qingfeng10 (as modopt oncall) and @Balandat

@Balandat
Contributor

When we represent ordered choices internally, we normalize them to equidistant numbers in [0, 1] (this may not be true for the Service API, I will need to check), regardless of their numerical value. So regardless of whether you have an ordinal parameter with values ["A", "B", "C"] or values [1, 2, 1000], both of them will be normalized to the representation [0, 0.5, 1.0].
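A minimal sketch of that normalization behavior (a hypothetical helper, not Ax's actual transform code):

def normalize_ordered_choices(values):
    # Map ordered choice values to equidistant points in [0, 1], ignoring
    # their numeric magnitudes (assumes at least two values).
    n = len(values)
    return {v: i / (n - 1) for i, v in enumerate(values)}

print(normalize_ordered_choices(["A", "B", "C"]))  # {'A': 0.0, 'B': 0.5, 'C': 1.0}
print(normalize_ordered_choices([1, 2, 1000]))     # {1: 0.0, 2: 0.5, 1000: 1.0}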

In general, once a parameter is defined as an ordinal categorical parameter, I do not believe we allow imposing linear constraints across it and another parameter, simply because designating it as categorical means the values are not directly comparable across parameters (within a single parameter, the values are completely ordered by definition). You could try to do something like you're suggesting above, but that would require using integer parameters.

Is the application here essentially to select an ordered subset of elements from some ordered list? It seems that in such cases, with rather complex constraints, it might make sense to think about a more custom way of optimizing this rather than trying to express it in the existing parameter & constraint interface.

cc @dme65

@sgbaird
Contributor Author

sgbaird commented Dec 10, 2021

@Balandat, thank you for getting back to me. For some additional context, please see the initial suggestion #710 (comment) and my response #710 (comment).

I'm a bit surprised to hear that:

["A", "B", "C"] or values [1, 2, 1000], both of then will be normalized to the representation [0, 0.5, 1.0]

Especially based on the docs description (above) that:

For the ordinal variables we can use a standard kernel such as Matérn-5/2, but for the categorical variables we need a way to compute distances between the different categories. A natural choice is to set the distance to 0 if two categories are equal and 1 otherwise, similar to the idea of Hamming distances.

I'm working on a MWE to verify whether ordered constraints can be imposed on numeric ordinal parameters. #710 shows the stack trace for the error that comes with trying to impose an ordered constraint on string-based categorical ordinal parameters.

@sgbaird
Contributor Author

sgbaird commented Dec 10, 2021

I see now that indeed, Parameter Constraints [are] not supported for ChoiceParameter (Reproducer via Google Colab). I put the full stack trace at #710 (comment), since that comment also has the stack trace for the categorical variables.

You mentioned:

You could try to do sth like you're suggesting above, but that would require using integer parameters.

I'm assuming integer parameters probably use a Matérn-like kernel by default rather than a Hamming-like kernel. While not necessarily straightforward for me, it seems possible to swap out the Matérn kernel for the Hamming kernel. However, I'm not sure how this could be done only for certain parameters (e.g. only int parameters) while leaving the rest of the parameters (e.g. all floats) on the default Matérn kernel. Granted, it seems like it should be possible, since there's already a mix of kernels based on the parameter type. I'll probably leave this approach of using integer parameters as if they were categorical parameters on the back burner for now.
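For illustration, a hedged sketch of what such per-parameter kernel mixing might look like one level down, at the GPyTorch/BoTorch layer. I'm assuming CategoricalKernel is importable from botorch.models.kernels.categorical and that active_dims restricts each kernel to its own columns; the column layout is hypothetical:

from botorch.models.kernels.categorical import CategoricalKernel
from gpytorch.kernels import MaternKernel, ScaleKernel

# Hypothetical layout: columns 0-1 are continuous floats, columns 2-4 are the
# integer "slot" parameters we'd like treated with a Hamming-style distance.
cont_dims, cat_dims = [0, 1], [2, 3, 4]

covar_module = ScaleKernel(
    MaternKernel(nu=2.5, ard_num_dims=len(cont_dims), active_dims=cont_dims)
    * CategoricalKernel(ard_num_dims=len(cat_dims), active_dims=cat_dims)
)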

I'm definitely open to other ideas for how to optimize this. The data augmentation approach I've been considering in #710 (comment) will probably work for the small dataset (~O(100)) we're interested in, as long as the dataset remains small and nslots < 7 (I referred to this as max_components elsewhere). But I think it would still be good to know if there's a way to implement it without these limitations.
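For reference, the augmentation idea amounts to enumerating all slot orderings of each observation, e.g. (a hypothetical helper, not Ax code):

from itertools import permutations

def augment_with_permutations(slot_values):
    # Enumerate every ordering of one observation's slots. Grows as nslots!,
    # which is why this only stays tractable for nslots < 7 (720 at nslots=6).
    return list(permutations(slot_values))

print(len(augment_with_permutations(["A", "B", "C"])))  # 6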

@Balandat
Contributor

For the ordinal variables we can use a standard kernel such as Matérn-5/2, but for the categorical variables we need a way to compute distances between the different categories. A natural choice is to set the distance to 0 if two categories are equal and 1 otherwise, similar to the idea of Hamming distances.

So this (the Matérn-5/2 kernel) is for ordered categorical variables. For unordered ones we do indeed use the Hamming distance.

I'm assuming integer parameters probably use a Matérn-like kernel by default rather than a Hamming-like kernel. While not necessarily straightforward for me, it seems possible to swap out the Matérn kernel for the Hamming kernel. However, I'm not sure how this could be done only for certain parameters (e.g. only int parameters) while leaving the rest of the parameters (e.g. all floats) on the default Matérn kernel

We pass down some minimal representation (a SearchSpaceDigest) into the modelbridge layer, based on which we choose what kind of model/kernel to use. E.g., if there are unordered categoricals, we will end up in this branch, which chooses a model that uses both a Matérn and a Hamming distance kernel:

# Mixed optimization case. Note that presence of categorical
# features in search space digest indicates that downstream in the
# stack we chose not to perform continuous relaxation on those
# features.
elif search_space_digest.categorical_features:
    if not all_nan_Yvar:
        logger.warning(
            "Using `MixedSingleTaskGP` despite the known `Yvar` values. This "
            "is a temporary measure while fixed-noise mixed BO is in the works."
        )
    model_class = MixedSingleTaskGP
Currently this happens automatically based on the parameter type, i.e., there isn't an easy way right now to use a Hamming distance kernel for an integer parameter.

Since these are rather complex constraints, I wanted to resurface my previous comment:

Is the application here essentially to select an ordered subset of elements from some ordered list? It seems that in such cases, with rather complex constraints, it might make sense to think about a more custom way of optimizing this rather than trying to express it in the existing parameter & constraint interface.

It may make the most sense to do something custom here, e.g. using a heuristic mixed-discrete optimization strategy operating directly on the set of feasible orderings in the slots. Something like this could potentially be achieved by passing in some callable that serves as a feasibility check for whether a given slot configuration is feasible, or that itself generates the set of feasible orderings.
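As a sketch of what such a callable might look like for the toy problem above (hypothetical, just to make the suggestion concrete):

def is_feasible(slot_config, order=("A", "B", "C", "D")):
    # A slot configuration is feasible only if its entries respect the
    # reference order, so "ABC" is allowed but "BAC" is not.
    ranks = {choice: i for i, choice in enumerate(order)}
    return all(ranks[a] <= ranks[b] for a, b in zip(slot_config, slot_config[1:]))

print(is_feasible(["A", "B", "C"]))  # True
print(is_feasible(["B", "A", "C"]))  # False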

@lena-kashtelyan lena-kashtelyan added the question Further information is requested label Dec 10, 2021
@sgbaird
Contributor Author

sgbaird commented Dec 15, 2021

It may make the most sense to do something custom here, e.g. using a heuristic mixed-discrete optimization strategy operating directly on the set of feasible orderings in the slots. Something like this could potentially be achieved by passing in some callable that serves as a feasibility check for whether a given slot configuration is feasible, or that itself generates the set of feasible orderings.

@Balandat Is this similar to what you mentioned in #745 (comment)? I agree that a more custom approach seems to be in order.

In terms of application, this goes back to the slot-based approach in #727, where there's some degeneracy associated with the space of compositions. 0.5*A + 0.4*B + 0.1*C is equivalent to 0.4*B + 0.5*A + 0.1*C, but (if I'm thinking of this correctly) would be located at a very different region in feature space (i.e. ["A", "B", "C", 0.5, 0.4, 0.1] is far from ["B", "A", "C", 0.4, 0.5, 0.1]). I figured a ChoiceParameter constraint such that ["A", "B", "C", ...] is feasible while ["B", "A", "C", ...] is not could be a workaround, but this would need a Hamming distance (d(A,B) = 1, d(A,C) = 1, d(A,A) = 0). Based on your comments, this seems possible, but as you mentioned, an alternative is probably better.
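One way to see the degeneracy (and a possible way to remove it) is to canonicalize each composition by sorting its slots, e.g. alphabetically by component (a hypothetical helper, not an Ax feature):

def canonicalize(components, fractions):
    # Sort slots into a canonical order so that equivalent compositions map
    # to the same point in feature space.
    return tuple(sorted(zip(components, fractions)))

print(canonicalize(["A", "B", "C"], [0.5, 0.4, 0.1]))
print(canonicalize(["B", "A", "C"], [0.4, 0.5, 0.1]))  # identical output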
