MCTS alternatives #860
Seems interesting. It'll take time to read and understand all that though. I'm not that good at math. This can also interest @killerducky and @evanroberts85. |
We may achieve what the AlphaGo team couldn't. Maybe they didn't consider changing the tree search algorithm, due to the lack of useful alternatives at the time. |
For your reference, there is an interesting experiment from a Japanese computer-Go developer board, done by the HiraBot developer. He used a negamax algorithm instead of MCTS and found it seemed to work reasonably well even with small thinking time. http://524.teacup.com/yss/bbs/?page=13& |
Here's a detailed overview that covers a wide variety of things I'm unfamiliar with. A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit.
More love for Thompson Sampling; it also covers BART, which I linked earlier, and lots of other problem areas outside of what we do. |
This paper is also interesting: http://ieeexplore.ieee.org/document/7352306/ |
Optimizing the search parameters is not a problem. CLOP is very good at that. Any paper that doesn't compare to it can safely be ignored. |
The CLOP book chapter is here by the way: This was before neural networks, but is a general tuning algorithm. There is this section in the AlphaZero paper ( https://arxiv.org/abs/1712.01815 ):
I'm not sure what exactly that Bayesian optimisation refers to. In the actual AlphaGo Zero paper ( https://www.nature.com/articles/nature24270 / free https://deepmind.com/documents/119/agz_unformatted_nature.pdf ), the only tuneable parameter mentioned is c_puct. Tuning a single parameter doesn't seem particularly involved or require special techniques:
What would one tune in LeelaZero using CLOP? Puct for sure, but what else? Some of the newly introduced variables like regret, risk, opportunity, FPU reduction? That would still be very ad-hoc even if the tuning is rigorous, since the variables themselves were just 'invented' by fiat. |
Right now only PUCT and (if appropriate) FPU reduction. After the learning is finished, the softmax temperature can also be tuned.
I have no idea what that is supposed to mean.
I wonder whether there is some confusion there between the hyperparameters of the training run and those of the search. |
@roy7 I did a lot of experiments with Thompson sampling and other bandit algorithms for my Bachelor's and Master's theses. Implementing it for classical MCTS is as simple as initializing a Beta distribution during node creation and updating it with +wins and -losses. A decision on which action to take is obtained by randomly sampling from each child's distribution and taking the move with the highest sample. It is more difficult if you do not play rollouts to the end of the game but instead get the evaluations from a neural network: you lose the conjugacy of the Beta distribution and have to fit another distribution to the values returned during tree search. One could, for example, think of a normal distribution with certain priors, but one always has to consider the performance of the whole fitting process (remember that for Thompson sampling the most important operation is sampling from the distribution). |
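A minimal C++ sketch of the classical-MCTS variant described above. The Child struct, its fields, and the function names are hypothetical stand-ins, not the leela-zero tree node; Beta samples are drawn via the standard two-gamma construction since <random> has no Beta distribution.

```cpp
#include <random>
#include <vector>

// Hypothetical per-child statistics; not the actual leela-zero UCTNode.
struct Child {
    double alpha = 1.0;  // prior pseudo-wins
    double beta = 1.0;   // prior pseudo-losses
};

// Draw x ~ Beta(a, b) using the standard two-gamma construction.
double sample_beta(double a, double b, std::mt19937& rng) {
    std::gamma_distribution<double> ga(a, 1.0), gb(b, 1.0);
    const double x = ga(rng), y = gb(rng);
    return x / (x + y);
}

// Thompson sampling: sample once from every child's posterior and
// descend into the child with the largest sample.
std::size_t select_child(const std::vector<Child>& children, std::mt19937& rng) {
    std::size_t best = 0;
    double best_sample = -1.0;
    for (std::size_t i = 0; i < children.size(); ++i) {
        const double s = sample_beta(children[i].alpha, children[i].beta, rng);
        if (s > best_sample) {
            best_sample = s;
            best = i;
        }
    }
    return best;
}

// After a playout returns a binary result from this child's perspective:
void update(Child& c, bool won) {
    if (won) c.alpha += 1.0;
    else     c.beta  += 1.0;
}
```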
Aren't the two entangled, because the networks are trained to predict the outcome of the search? Is it possible that the search algorithm which delivers the highest playing strength for a given network is actually not the best tutor for the next generation of networks? Edit: Just found this interesting-sounding paper "Thinking Fast and Slow with Deep Learning and Tree Search" ( https://arxiv.org/abs/1705.08439 ) |
@barrtgt @odeint Bayesian optimization refers to fitting a stochastic response surface model to the outcome of an experiment (usually the performance metric of a machine learning model). scikit-optimize for example is a Python package which implements that. It is possible to fit it on binary outcomes (e.g. Go matches), but for that you need to fit a latent variable model where you have to integrate over the binary squashing function. For that you usually need MCMC machinery. For binary outcomes one therefore usually applies CLOP or SPSA. |
Why does it matter we use the NN instead of rollouts? I thought a Beta distribution can model the NN value outputs, which are real values in the range [0,1]. |
@killerducky When using binary outcomes the parameters α and β can be updated directly. For example, if a subtree won 10 times and lost 3 times, you can directly sample from the resulting posterior distribution. |
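Spelled out for that example, assuming a uniform Beta(1, 1) prior (the prior is not stated in the comment): the posterior is Beta(1 + 10, 1 + 3) = Beta(11, 4), and Thompson sampling then draws one value from each candidate move's posterior and plays the move with the largest draw.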
The interesting thing the authors of the Thompson sampling regret bounds paper do for the stochastic case is randomly draw a win or a loss with probability given by the observed [0,1] result, and store that as a win or a loss instead of using the value directly. It feels like that's throwing away information, yet apparently it works in practice. The problem with tree search is that as new moves are explored and we realize the real value of the board is much different from what the earlier NN eval thought, the win rate of the node can change a lot and basically become a different distribution than the one we were sampling from before. |
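A small sketch of that Bernoulli-ization step, assuming the Beta parameters are stored as plain doubles (hypothetical names, not the branch code):

```cpp
#include <random>

// v is the network's value output for this position, mapped to [0, 1].
// Instead of adding the fraction v to the Beta parameters, draw a
// Bernoulli(v) outcome and record a whole win or a whole loss, which
// keeps the Beta posterior exactly conjugate.
void update_with_eval(double& alpha, double& beta, double v,
                      std::mt19937& rng) {
    std::bernoulli_distribution coin(v);
    if (coin(rng)) {
        alpha += 1.0;
    } else {
        beta += 1.0;
    }
}
```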
@roy7 In the MCMC world, that's referred to as the 'burn in' period. Basically, they spend lots and lots of cycles on finding a very accurate approximation to the posterior probability distribution, whereas we are generally trying to build a system that has massive constraints on the number of cycles we can 'burn' to determine the posterior, i.e. the best move. We are trying to find a 'pretty good' approximation ASAP, so we are inherently going to be running into this problem with not-enough-burn-in all the time, IMHO. |
@roy7 Indeed, that is because we assume a fixed distribution over time. Auer and Chiang (2016) provide a recent improvement on the pseudo regret, but without empirical comparison. TL;DR: We should do experiments on which bandit algorithm improves the match results the most. This already incorporates the disadvantage caused by a slower bandit algorithm. |
@kiudee An idea I want to toy with, once I figure out how in the code, is to adjust the value of a node so that it is not the average of all of the values under the node, but only the average of the values under children whose averages are within an overlapping confidence interval of the best move's.

This is motivated by the other discussion in #855, where the MV% stat showed an obvious warning about a blunder that was made: traps not discovered until deep in the search only slowly lower the win rate, and the blunder move can't be avoided until a different move has more total visits.

In a nutshell, consider a position where a blunder is about to be made and the opponent can play a winning move next turn, but I don't see it at first. My move is AA and the opponent has two choices: LL, after which the opponent loses, and WW, after which they win. LL is much more likely to be searched according to my heatmap, so at first I'm really confident I'm winning. Eventually AA has 1000 visits, almost all of them on LL (90% win rate for me) and only 50 on WW. The bounds on that 50 might be pretty wide since it's only 50 visits, but for now I see this move as about an 86% win rate move. Since it is the opponent's move, they want to minimize my win rate, so future search goes mostly into AA WW. Once WW has accumulated enough visits (settling around 15% for me), I probably have enough visits to realize the confidence intervals around the 90% and the 15% are not touching any longer. In that case, instead of calling this node's value 75%, I should call it 15%. I have enough certainty that the inferior move for the opponent is not from the same distribution as the superior move, so the inferior move will not be played and should just be ignored.

It is sort of like minimax, but not strictly using the value of the best/worst node: normal MCTS methodology, except that nodes we know with some confidence are not similar to the best choice for that player won't skew the score up or down.

Edit: Ducky linked me to http://www.evanmiller.org/how-not-to-sort-by-average-rating.html although I was planning to use this from Boost: http://www.boost.org/doc/libs/1_42_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/binom_eg/binom_conf.html. |
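A rough sketch of that filtering idea, using the normal-approximation interval for simplicity (the Boost binomial routines linked above would give exact or Wilson bounds). The ChildStats struct and the z value are assumptions for illustration only:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct ChildStats {
    double winrate;  // mean value from the parent node's point of view
    int visits;
};

// Normal-approximation confidence bounds on a child's winrate.
double lcb(const ChildStats& c, double z = 1.96) {
    return c.winrate - z * std::sqrt(c.winrate * (1.0 - c.winrate) / c.visits);
}
double ucb(const ChildStats& c, double z = 1.96) {
    return c.winrate + z * std::sqrt(c.winrate * (1.0 - c.winrate) / c.visits);
}

// Average only over children whose interval still overlaps the best child's,
// i.e. ignore moves we are already confident the player to move would avoid.
// If the opponent moves here, "best" means lowest winrate for us.
double filtered_value(const std::vector<ChildStats>& children,
                      bool opponent_to_move) {
    double best_bound = opponent_to_move ? 1.0 : 0.0;
    for (const auto& c : children) {
        if (c.visits == 0) continue;
        best_bound = opponent_to_move ? std::min(best_bound, ucb(c))
                                      : std::max(best_bound, lcb(c));
    }
    double wins = 0.0;
    int visits = 0;
    for (const auto& c : children) {
        if (c.visits == 0) continue;
        const bool overlaps = opponent_to_move ? (lcb(c) <= best_bound)
                                               : (ucb(c) >= best_bound);
        if (overlaps) {
            wins += c.winrate * c.visits;
            visits += c.visits;
        }
    }
    return visits > 0 ? wins / visits : 0.5;
}
```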
Looks like a reasonable idea. A part of me wants to play devil's advocate, though, and say, "This would just be an example where the net needs more training, so that it sees AA WW as higher probability. Then the MCTS would automatically already converge to the right answer more quickly, guided by the smarter net. Should we really be messing around so much with the way the MCTS performs searches, when really we just need to train the net more?" But then again, I can see an argument that supports your proposal, which is that this particular assumption of having an opponent who wants to minimize your score is not specific to Go, and so the minimax utility of such a tweak seems justified in the general case. (Just some random thoughts.) |
@roy7 Yes, this problem can be circumvented by selecting the move with the highest lower confidence bound (or quantile, if you are in the Bayesian setting). Note how this is exactly the reverse of the methodology we use during search: there we want to be optimistic in order to balance exploration and exploitation. When selecting the final move we want to be pessimistic, since we still have uncertainty in each estimate which we want to account for. Selecting based on visits only makes sense if the bandit algorithm has converged. |
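For illustration, a sketch of final move selection by lower confidence bound instead of raw visit count; the RootChild struct and the fixed z = 1.96 are assumptions, and in practice one would likely add a minimum-visit guard:

```cpp
#include <cmath>
#include <vector>

struct RootChild {
    int move;        // move id
    double winrate;  // mean value of the subtree, from the player's view
    int visits;
};

// Pick the root move with the highest lower confidence bound on its winrate,
// skipping moves that have no visits at all.
int select_move_by_lcb(const std::vector<RootChild>& children, double z = 1.96) {
    int best_move = -1;
    double best_lcb = -1.0;
    for (const auto& c : children) {
        if (c.visits == 0) continue;
        const double se = std::sqrt(c.winrate * (1.0 - c.winrate) / c.visits);
        const double lcb = c.winrate - z * se;
        if (lcb > best_lcb) {
            best_lcb = lcb;
            best_move = c.move;
        }
    }
    return best_move;
}
```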
@kiudee Sounds very logical, but how would you change training data if move selection was based on highest lower confidence bound, instead of visit count? At the moment, using visit counts for training is very logical, but with this move selection method there should also be a better way... |
Good question. For the normal search one really should use optimistic methods (UCB, Thompson sampling, etc.), because otherwise you miss even more moves. The pessimistic aspect is only useful when you have to commit to a move in the end. For final move selection one could think about running a shallow minimax-like search that applies the LCB up to depth k, to avoid the situations @roy7 talked about (where we actually have the correct knowledge already). |
In general, would it likely benefit the search if policy priors were a bit broader (i.e. less lopsided toward what the search determined was the best move)? Would something like exp(-c*LCB) with a slope constant still to be determined work well as a policy training target? Or even much simpler, would it benefit the search if the policy targets were simply flattened by applying something like the square root to them? |
From experience, no, it's the other way around. You can also see this in the original AGZ paper, where the optimal softmax_temp = 0.67, i.e. making the probabilities more extreme. All this worrying about exploring enough is nice but practice simply shows that spending a lot of search effort on bad moves makes the program bad. |
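For reference, a small sketch of what that softmax temperature does to the policy prior, assuming raw policy logits as input (the real network code differs): T = 1 is the plain softmax, T < 1 (e.g. 0.67) sharpens the distribution, and T > 1 flattens it.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax with temperature T over raw policy logits.
std::vector<float> softmax_with_temperature(const std::vector<float>& logits,
                                            float T) {
    std::vector<float> out(logits.size());
    if (logits.empty()) return out;
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp((logits[i] - max_logit) / T);  // numerically stable
        sum += out[i];
    }
    for (auto& p : out) {
        p /= sum;
    }
    return out;
}
```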
OK, so flattening doesn't work. What about using LCB for move selection and training targets, as the link posted by @kiudee describes? One thing that always bugged me about using simple visit count for move selection and training is the situation when the search has already determined that a more recently explored move is actually better, but there are not enough playouts left to overtake the PV. |
Theorizing about whether it would work is pretty pointless IMHO. Just implement and test it. (This applies to the things I claim will never work too, although generally I did test many of them before myself 😉) |
I mean, implement LCB for move selection and see what it does for the playing strength. That's the first step. What to do with the training targets is trickier (not sure they need changing? as the tree search still uses UCB). |
I think it's definitely worth testing just improving the move selection. My resources are tied up in FPU testing, but for those who want to make this change and test it out, please test at least 100 games. Thank you! Edit: See here for what is SPRT significant at 100 games. #696 (comment) |
For what it's worth, opening game state with the latest copy of the Thompson Sampling branch after @gcp's changes (updating #860 (comment)):

5.2 average depth, 12 max depth
5.2 average depth, 11 max depth
5.2 average depth, 10 max depth

I didn't test later actual game positions, but the opening search is less deep on average and about half as deep in max depth. For comparison, current /next client: 6.2 average depth, 19 max depth. Just a random observation that TS is searching a bit wider at game start than UCT1. |
@gcp It seems current TS branch has a lot of extra stuff in the algorithm like adjusting the beta priors based on psa / cfg_fpu_reduction, sqrt(policy_ratio), etc. Is that just left over UCT1 stuff or intentionally for the TS testing? The cfg_puct = .1 also means the wins/losses are being magnified 10x when creating the distributions. I'd think the low win rate choices would be excluded rapidly but policy_eval must be offsetting that, since I still see low prior moves get lots of visits at game opening. |
You are confusing the policy prior with the value estimation. In UCT1 the policy prior is a very long term bias towards moves that the network likes, which is quite reasonable as it statically predicts the search move after 3200 visits with >64% accuracy these days... |
Intentional.
That's due to cfg_beta_prior vs cfg_fpu_reduction. A small fpu_reduction and a large beta prior mean that moves are not immediately thrown away after a few losses (nor overly liked after a few wins). These were all tuned for strength. |
Has anyone ever tuned the softmax temperature (for the policy prior) for strength? It's unclear to me that t=1 is optimal. But maybe it's not very different from tuning c_puct. |
I did additional testing with @roy7's implementations and here are all the results I got: -t 11 and -v 1600:
-t 1 and -v 1600 (LCB_full isn't thread safe yet):
-t 1 and -v 200:
LCB_full seems to be at least as strong as Next in all the situations tested, but it's inconclusive whether it's stronger than LCB_root. Its winrate against Next doesn't seem as high, but it might be because I couldn't test it with 11 threads. I was told that increasing thread count makes the search generally wider, and that's where most gains seem to be had with the LCB approach. |
I find the low visit count results fascinating. I was always stuck thinking only about the case where move 1 has tons of visits but move 2 is better and there's no time for move 2 to overtake move 1, and as a result thinking this would only be useful with higher visit counts where there's time for strong confidence bounds, etc. But the -v 200 results show that even with few visits the concept is sound and useful. For my own education I should run a -v 200 game with _root and look over the log to see all the times it overrode the max-visit move, to better understand what is happening. I'm both happy and sad that _full never does much better than _root. _full does a lot more than _root does, yet _root always matches it evenly, and _root is so naively simple by comparison. |
@gcp Is choosing the move with the highest # of visits still correct move selection in the Thompson Sampling branch? If so I wonder how max_lcb_root merged with thompson_sampling would do. TS searches wider, and max_lcb_root would perhaps make other choices than "most visits" more often. |
@tterava asked for someone to verify his -v 200 results, so I ran it myself with "-t 1 -q -r 5 --timemanage off --visits 200 -w 1586.w --noponder --gtp".
|
@bood provided a ladder game to analyze, and I used ELF 224x20 to analyze it with leelaz, using just symmetry 0, with regular search, LCB and Thompson sampling: sgf details
force symmetry 0 patch
diff --git a/src/UCTNode.cpp b/src/UCTNode.cpp
--- a/src/UCTNode.cpp
+++ b/src/UCTNode.cpp
@@ -80,3 +80,3 @@ bool UCTNode::create_children(std::atomic<int>& nodecount,
const auto raw_netlist = Network::get_scored_moves(
- &state, Network::Ensemble::RANDOM_SYMMETRY);
+ &state, Network::Ensemble::DIRECT, 0);
With regular search and 100,000 visits, it finds it after spending a lot of visits with Q6:
Only after visiting Q6 20,139 times does it decide to explore M4 more, and with LCB #883 it would choose M4 after only a few more visits (whereas normal search would require ~20k more visits to switch, so ~40k visits in total):
And with Thompson Sampling ( https://github.com/gcp/leela-zero/tree/thompson_sampling ), it needs fewer than 100 visits to decide on M4:
|
Although we treat the rewards of a move (bandit arm) as if they don't change when doing Thompson Sampling, the truth is the mean is always moving around as the search goes deeper. If we find specifically better or worse moves, the true average of the move will be radically different from what we've collected so far (which is what I try to address in the max_lcb_full branch). I found a paper called Thompson Sampling in Switching Environments with Bayesian Online Change Point Detection, which sort of addresses this by detecting when changes to the bandit arm(s) have happened. |
Speaking of which, here's a paper on UCBT with arms that have changing rewards and overcoming UCBT inertia (the whole point of max_lcb_root): Change Point Detection and Meta-Bandits for Online Learning in Dynamic Environments. It's the source paper for an algorithm called Adapt-EvE, which is referred to in a prior post I made. When we find a refutation in tree search and the win rate dramatically rises or falls, that is the changed reward they are talking about; overcoming the inertia is what max_lcb_full does by outright pruning all of the old data that no longer applies. Whereas the paper says "Adapt-EvE incorporates a change-point detection test based on the Page-Hinkley statistics", I was effectively doing change-point detection in max_lcb_full: if one move's UCB is lower than another move's LCB, the worse move can be completely removed. Propagating that removed information back up the tree is, in effect, change-point detection happening. |
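A sketch of that pruning rule in isolation; the ArmStats struct and z value are illustrative assumptions, and the real max_lcb_full change also touches how the pruned statistics are backed up:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct ArmStats {
    double winrate;   // mean value of the move's subtree
    int visits;
    bool pruned = false;
};

// Prune every arm whose upper confidence bound falls below the best arm's
// lower confidence bound: we are now confident it cannot be the best move,
// so its accumulated statistics should stop influencing the parent.
void prune_dominated(std::vector<ArmStats>& arms, double z = 1.96) {
    double best_lcb = -1.0;
    for (const auto& a : arms) {
        if (a.visits == 0) continue;
        const double se = std::sqrt(a.winrate * (1.0 - a.winrate) / a.visits);
        best_lcb = std::max(best_lcb, a.winrate - z * se);
    }
    for (auto& a : arms) {
        if (a.visits == 0) continue;
        const double se = std::sqrt(a.winrate * (1.0 - a.winrate) / a.visits);
        if (a.winrate + z * se < best_lcb) {
            a.pruned = true;   // drop its statistics from the parent's average
        }
    }
}
```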
@gcp Since Thompson Sampling goes so wide, and we have so many moves and a high branching factor... an idea you might try is to prune all moves with a policy prior of 0.01 or lower? Or even lower than that. At game start we don't need to sample all 361 moves: the heatmap only shows 16 possible moves we should be considering, the rest being rounded to 0. So instead of trying to use the policy to adjust the Beta() of the moves, we'd use the policy to prune the search away from super unlikely moves. |
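A minimal sketch of that kind of prior-based pruning, with a hypothetical Candidate list carrying the policy head's probability; the 0.01 cutoff is just the number floated above:

```cpp
#include <vector>

struct Candidate {
    int move;
    float policy_prior;  // probability from the policy head, sums to ~1.0
};

// Keep only moves the policy network gives at least `threshold` probability,
// so Thompson sampling never wastes samples on the ~0-prior moves.
std::vector<Candidate> prune_by_prior(const std::vector<Candidate>& all,
                                      float threshold = 0.01f) {
    std::vector<Candidate> kept;
    for (const auto& c : all) {
        if (c.policy_prior >= threshold) {
            kept.push_back(c);
        }
    }
    // Safety net: never prune everything away.
    return kept.empty() ? all : kept;
}
```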
"An Empirical Evaluation of Thompson Sampling" has some discussion of the use of priors. See also "Prior-free and prior-dependent regret bounds for Thompson Sampling". |
I did a number of MCTS search experiments on Leela Zero chess (lc0). KillerDucky asked me to cross-post this here, as some of it might be relevant for Go. |
Implementation of a Thompson Sampling + MCTS paper: https://github.com/aijunbai/thompson-sampling Also a paper, "CONVOLUTIONAL MONTE CARLO ROLLOUTS FOR THE ...": https://openreview.net/pdf?id=wVqz42X2wfG0qV7mtLqv
Paper "Adapting Improved Upper Confidence Bounds for Monte-Carlo Tree Search": |
I didn't realize it at first, but the core of the last paper in my prior post says:
Which was the "full max_lcb" idea, the step to follow after #883. |
@roy7 What I gather from this issue is that your branch "LCB-root" is 100 elo stronger than the master. |
@MaurizioDeLeo The results are inconsistent. It's been tested quite extensively and it might be stronger on some systems and engine parameters. However it's not clear whether it provides any benefit overall. |
@MaurizioDeLeo It was reliable before the time management feature was added, but with time management the results were less consistent. Time management without max_lcb was stronger than max_lcb without time management, so time management is in and max_lcb is a fond memory I do hope to experiment with more in the future. :) |
@roy7, any chance of having both LCB and time management? Maybe instead of stopping when a move cannot be overtaken in terms of visits, we could stop based on some other criterion? |
Yeah it's possible in theory. I was just thinking about it the other night, but I'm not sure what stopping criteria would make sense. |
Was scrolling down the list of Deepmind papers and found this interesting one. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
|
kiudee shared a couple new papers with me: Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling
Monte-Carlo Tree Search by Best Arm Identification
|
New Deepmind/Google paper. Empirical Bayes Regret Minimization
|
So the first problem is choosing the best move from A and B. We have some statistics available, like the number of visits N, the evaluation itself, and some measure of uncertainty. We used to pick the highest N, but we came to the conclusion that we pick better lines with max LCB as long as N > 10% of all visits. But this means our search is NOT actually doing what it should when it knows it has N visits. There should be three things it should seek to do:
The second problem is back-propagating the values from the nodes we have. We currently average. We can try the "mix" strategy of also including the value of the best node, as per Rémi Coulom. This should sharply drop the value of a losing line, or quickly improve a sharp tesuji line once it is found. This is actually not a problem with a lot of visits, but at 1600 visits a line found at the end of the search doesn't have time to earn win rate quickly enough. |
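A sketch of that "mix" backup, blending the usual visit-weighted average with the best child's value from the parent's point of view. The lambda weight and struct are illustrative assumptions, not values from Rémi Coulom or the LZ code:

```cpp
#include <algorithm>
#include <vector>

struct NodeStats {
    double value;  // mean value of the child subtree, parent's point of view
    int visits;
};

// Mixed backup: (1 - lambda) * visit-weighted average + lambda * best child.
// lambda = 0 is the usual MCTS average; a larger lambda lets a newly found
// strong (or losing) line move the parent's value much faster.
double mixed_backup(const std::vector<NodeStats>& children,
                    double lambda = 0.25) {
    double wins = 0.0;
    int visits = 0;
    double best = 0.0;
    for (const auto& c : children) {
        if (c.visits == 0) continue;
        wins += c.value * c.visits;
        visits += c.visits;
        best = std::max(best, c.value);
    }
    if (visits == 0) return 0.5;
    const double average = wins / visits;
    return (1.0 - lambda) * average + lambda * best;
}
```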
INFINITE ARMS BANDIT: OPTIMALITY VIA CONFIDENCE BOUNDS
|
Since a lot of people are working on tuning FPU at the moment and some people are exploring tweaks to the search algorithm I wanted to share a few areas of research I was looking over this evening, in case the more mathematical programmers among us were interested in trying these with LZ.
After reviewing Multi Armed Bandits and Exploration Strategies I was interested in the mention of Thompson Sampling. It's an alternative to UCB that also has a logarithmic bound on regret and has been known since the 1930s, but the theoretical groundwork on it was only recently finished, in the 2012 paper Analysis of Thompson Sampling for the Multi-armed Bandit Problem. That paper also covers the stochastic case, which makes it more applicable for people like us.
So I searched around some more to see if anyone has used this with MCTS yet, and it turns out this is a pretty hot current research area. There's a 2014 paper Thompson Sampling Based Monte-Carlo Planning in POMDPs, and two 2017 master's theses I found, Thompson Sampling for Monte Carlo Tree Search and Maxi-min Action Identification and Monte Carlo Tree Search with Thompson Sampling in the Settlers of Catan (two languages).
A recent paper following up on these things is Enhancements in Monte Carlo Tree Search
Algorithms for Biased Game Trees. This paper discusses a way to deal with biases in game trees that might be useful for us, applying it to three types of MCTS (UCT, TS, and KL-UCB). I hadn't heard of KL-UCB before; this paper mentions that it performs better and has better theoretical bounds on regret than UCT does. The source for KL-UCB is the paper Kullback–Leibler Upper Confidence Bounds for Optimal Sequential Allocation.
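For the curious, a sketch of the KL-UCB index itself, computed by bisection. This follows the standard Bernoulli formulation from that line of work rather than anything in LZ, and drops the optional c·log(log t) term:

```cpp
#include <algorithm>
#include <cmath>

// Bernoulli KL divergence kl(p, q), clamped away from 0 and 1.
double bernoulli_kl(double p, double q) {
    const double eps = 1e-12;
    p = std::min(std::max(p, eps), 1.0 - eps);
    q = std::min(std::max(q, eps), 1.0 - eps);
    return p * std::log(p / q) + (1.0 - p) * std::log((1.0 - p) / (1.0 - q));
}

// KL-UCB index for an arm with empirical mean `mean` after `pulls` pulls,
// out of `total` pulls overall: the largest q >= mean such that
// pulls * kl(mean, q) <= log(total). Found by bisection, since kl(mean, q)
// is increasing in q on [mean, 1].
double kl_ucb_index(double mean, int pulls, int total) {
    if (pulls == 0) return 1.0;
    const double bound =
        std::log(static_cast<double>(std::max(total, 2))) / pulls;
    double lo = mean, hi = 1.0;
    for (int i = 0; i < 50; ++i) {
        const double q = 0.5 * (lo + hi);
        if (bernoulli_kl(mean, q) > bound) {
            hi = q;
        } else {
            lo = q;
        }
    }
    return lo;
}
```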
So for those interested in recent research, you might find some of these interesting. I don't know of any Go projects that use anything other than UCT for MCTS; I imagine most research effort goes into the neural network side of things rather than into improving the tree search. Perhaps this is an area someone would like to do some original research into.
Edit: @alreadydone mentioned the new paper Memory-Augmented Monte Carlo Tree Search.