-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Explore Eval
What: Evaluate exploration algorithms
The goal of explore eval is to evaluate different exploration algorithms using the data from a logged policy. The eval policy does not learn from all of the logged examples but there is some rejection sampling in order to do counterfactual simulation:
- for each example in the logged policy
- get the pmf of the prediction of the policy being evaluated (eval policy)
- for the action that was logged (which has a probability of
p_log
) find the eval policy probability for that actionp_eval
- calculate a threshold
p_eval
/p_log
- flip a biased coin (using threshold)
- have the eval policy learn from this example or skip to the next example
The --multiplier
cli argument can be provided which will be applied to the threshold (threshold *= multiplier
) and will affect the sampling rejection rate. If not supplied the multiplier starts at 1
and will adjust to the value of the minimum probability ratio (1 / threshold
i.e. p_log
/ p_eval
) in order to make rejection sampling fair across examples.
For all examples the average loss is calculated using the IPS technique (logged_cost * (p_eval / p_log)
) and reported at the end.
Explore eval once run, gives us some information about the sampling rate:
weighted update count = <N>
average accepted example weight = <E>
violation count = <V>
final multiplier = <M>
where:
-
weighted update count
is the number of weighted examples that were used to update the policy being evaluated; the reason they are weighted is because the threshold can be> 1
and the learning weight for that example is set to that threshold, therefore VW can "see" and learn from an example more than one times -
average accepted example weight
is the average weight of examples during learning. If this is> 1
that means that the logging and eval probabilities were such that the threshold was often> 1
-
violation count
is the number of examples that had athreshold > 1
which means the eval policy had a larger probability for the logged action than the logged probability, and therefore is for sure used to update the eval policy -
final multiplier
is the final multiplier used
We can see that for eval policies' that are similar to the logged policy, the rejection rate will be lower than if the eval policies are very different to the logged policy. This can result in very different confidence intervals for different eval policies.
One way to tackle this is playing around with the multiplier for different policies. All policies should be evaluated with a similar rejection rate.
Another way to tackle this is to use the --target_rate
argument. The target rate will be used to adjust the rejection rate in order to achieve an update count of #examples * target_rate
. To effectively set the target rate, there is a need to pass over the data twice.
The steps that are recommended to follow are:
- Select the ensemble of exploration policies that are to be evaluated (e.g. different epsilon values, squarecb, large action spaces, etc)
- First pass over the data for each policy in order to gather the weighted example count of each (when just using
--explore_eval
) - Determine target rate as
min-weighted-example-count / max-weighted-example-count
- When the logged policy is being evaluated along with the new policies then it is expected that the logged policy's max-weighted-example-count is the same as the number of logged examples
- Second pass over the data for each policy using
--target_rate <target_rate>
After the second pass the focus lies on the weighted example count
and the average accepted example weight
. With enough data the weighted example count should be close for all of the evaluated policies and the average accepted example weight should not be much larger than 1
. If those assumptions do not hold it is likely that there is a need for more data and/or the target rate needs to be adjusted. It is worth noting that explore eval is very data hungry.
- Home
- First Steps
- Input
- Command line arguments
- Model saving and loading
- Controlling VW's output
- Audit
- Algorithm details
- Awesome Vowpal Wabbit
- Learning algorithm
- Learning to Search subsystem
- Loss functions
- What is a learner?
- Docker image
- Model merging
- Evaluation of exploration algorithms
- Reductions
- Contextual Bandit algorithms
- Contextual Bandit Exploration with SquareCB
- Contextual Bandit Zeroth Order Optimization
- Conditional Contextual Bandit
- Slates
- CATS, CATS-pdf for Continuous Actions
- Automl
- Epsilon Decay
- Warm starting contextual bandits
- Efficient Second Order Online Learning
- Latent Dirichlet Allocation
- VW Reductions Workflows
- Interaction Grounded Learning
- CB with Large Action Spaces
- CB with Graph Feedback
- FreeGrad
- Marginal
- Active Learning
- Eigen Memory Trees (EMT)
- Element-wise interaction
- Bindings
-
Examples
- Logged Contextual Bandit example
- One Against All (oaa) multi class example
- Weighted All Pairs (wap) multi class example
- Cost Sensitive One Against All (csoaa) multi class example
- Multiclass classification
- Error Correcting Tournament (ect) multi class example
- Malicious URL example
- Daemon example
- Matrix factorization example
- Rcv1 example
- Truncated gradient descent example
- Scripts
- Implement your own joint prediction model
- Predicting probabilities
- murmur2 vs murmur3
- Weight vector
- Matching Label and Prediction Types Between Reductions
- Zhen's Presentation Slides on enhancements to vw
- EZExample Archive
- Design Documents
- Contribute: