Skip to content
olgavrou edited this page Jan 10, 2023 · 13 revisions

What: Evaluate exploration algorithms

The goal of explore eval is to evaluate different exploration algorithms using the data from a logged policy. The eval policy does not learn from all of the logged examples but there is some rejection sampling in order to do counterfactual simulation:

  • for each example in the logged policy
    • get the pmf of the prediction of the policy being evaluated (eval policy)
    • for the action that was logged (which has a probability of p_log) find the eval policy probability for that action p_eval
    • calculate a threshold p_eval / p_log
    • flip a biased coin (using threshold)
    • have the eval policy learn from this example or skip to the next example

The --multiplier cli argument can be provided which will be applied to the threshold (threshold *= multiplier) and will affect the sampling rejection rate. If not supplied the multiplier starts at 1 and will adjust to the value of the minimum probability ratio (1 / threshold i.e. p_log / p_eval) in order to make rejection sampling fair across examples.

For all examples the average loss is calculated using the IPS technique (logged_cost * (p_eval / p_log)) and reported at the end.

Explore eval once run, gives us some information about the sampling rate:

weighted update count = <N>
average accepted example weight = <E>
violation count = <V>
final multiplier = <M>

where:

  • weighted update count is the number of weighted examples that were used to update the policy being evaluated; the reason they are weighted is because the threshold can be > 1 and the learning weight for that example is set to that threshold, therefore VW can "see" and learn from an example more than one times
  • average accepted example weight is the average weight of examples during learning. If this is > 1 that means that the logging and eval probabilities were such that the threshold was often > 1
  • violation count is the number of examples that had a threshold > 1 which means the eval policy had a larger probability for the logged action than the logged probability, and therefore is for sure used to update the eval policy
  • final multiplier is the final multiplier used

We can see that for eval policies' that are similar to the logged policy, the rejection rate will be lower than if the eval policies are very different to the logged policy. This can result in very different confidence intervals for different eval policies.

One way to tackle this is playing around with the multiplier for different policies. All policies should be evaluated with a similar rejection rate.

Another way to tackle this is to use the --target_rate argument. The target rate will be used to adjust the rejection rate in order to achieve an update count of #examples * target_rate. To effectively set the target rate, there is a need to pass over the data twice.

The steps that are recommended to follow are:

  • Select the ensemble of exploration policies that are to be evaluated (e.g. different epsilon values, squarecb, large action spaces, etc)
  • First pass over the data for each policy in order to gather the weighted example count of each (when just using --explore_eval)
  • Determine target rate as min-weighted-example-count / max-weighted-example-count
    • When the logged policy is being evaluated along with the new policies then it is expected that the logged policy's max-weighted-example-count is the same as the number of logged examples
  • Second pass over the data for each policy using --target_rate <target_rate>

After the second pass the focus lies on the weighted example count and the average accepted example weight. With enough data the weighted example count should be close for all of the evaluated policies and the average accepted example weight should not be much larger than 1. If those assumptions do not hold it is likely that there is a need for more data and/or the target rate needs to be adjusted. It is worth noting that explore eval is very data hungry.

Clone this wiki locally