
MCTS alternatives #860

Open
roy7 opened this issue Feb 11, 2018 · 136 comments

Comments

@roy7
Collaborator

roy7 commented Feb 11, 2018

Since a lot of people are working on tuning FPU at the moment and some are exploring tweaks to the search algorithm, I wanted to share a few areas of research I was looking over this evening, in case the more mathematically inclined programmers among us are interested in trying these with LZ.

After reviewing Multi Armed Bandits and Exploration Strategies I was interested in the mention of Thompson Sampling. It's an alternative to UCB that also has a logarithmic bound on regret and has been known since the 1930s, but the theoretical groundwork for it was only completed in 2012 in the paper Analysis of Thompson Sampling for the Multi-armed Bandit Problem. That paper also covers the stochastic case, which makes it more applicable to a setting like ours.

So I searched around some more to see if anyone has used this with MCTS yet, and it turns out this is a pretty hot current research area. There's a 2014 paper Thompson Sampling Based Monte-Carlo Planning in POMDPs, and two 2017 master's theses I found, Thompson Sampling for Monte Carlo Tree Search and Maxi-min Action Identification and Monte Carlo Tree Search with Thompson Sampling in the Settlers of Catan (two languages).

A recent paper following up on these is Enhancements in Monte Carlo Tree Search Algorithms for Biased Game Trees. It discusses a way to deal with biases in game trees that might be useful for us, applied to three types of MCTS (UCT, TS, and KL-UCB). I hadn't heard of KL-UCB before; the paper mentions that it performs better and has better theoretical regret bounds than UCT. The source for KL-UCB is the paper Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation.
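
For anyone curious what KL-UCB actually computes, here is a minimal, self-contained sketch of the Bernoulli KL-UCB index from the Garivier and Cappé paper. This is not LZ code; the function names and the exploration constant are just the standard textbook choices.

// Minimal KL-UCB index for Bernoulli rewards (Garivier & Cappé, 2011).
// The index is the largest q >= the empirical mean whose KL divergence from
// that mean stays within the exploration budget.
#include <algorithm>
#include <cmath>

static double bernoulli_kl(double p, double q) {
    const double eps = 1e-12;
    p = std::min(std::max(p, eps), 1.0 - eps);
    q = std::min(std::max(q, eps), 1.0 - eps);
    return p * std::log(p / q) + (1.0 - p) * std::log((1.0 - p) / (1.0 - q));
}

// n = pulls of this arm, wins = observed successes, t = total pulls so far.
double klucb_index(int n, int wins, int t) {
    if (n == 0) return 1.0;                       // unvisited arms get the maximum index
    const double p_hat = double(wins) / n;
    const double budget =
        (std::log(t) + 3.0 * std::log(std::max(std::log(t), 1.0))) / n;
    double lo = p_hat, hi = 1.0;
    for (int i = 0; i < 50; ++i) {                // bisection: kl is increasing in q for q >= p_hat
        const double mid = 0.5 * (lo + hi);
        if (bernoulli_kl(p_hat, mid) > budget) hi = mid; else lo = mid;
    }
    return lo;
}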

So for those interested in recent research, you might find some of these interesting. I don't know of any Go projects that use anything other than UCT for MCTS; I imagine most research effort goes into the neural network side of things rather than into improving the tree search. Perhaps this is an area someone would like to do some original research in.

Edit: @alreadydone mentioned the new paper Memory-Augmented Monte Carlo Tree Search.

@remdu
Contributor

remdu commented Feb 11, 2018

Seems interesting. It'll take time to read and understand all that, though; I'm not that good at math. This might also interest @killerducky and @evanroberts85.

@wpstmxhs
Contributor

We may achieve what the AlphaGo team couldn't do. Maybe they didn't think about changing the tree search algorithm, due to the lack of useful alternatives at the time.

@wpstmxhs
Contributor

wpstmxhs commented Feb 12, 2018

For your reference, there is an interesting experiment from a Japanese computer Go developer board, done by the HiraBot developer. He used a negamax algorithm instead of MCTS and found it seems to work reasonably well, even with a small amount of thinking time.

http://524.teacup.com/yss/bbs/?page=13&
Translated into English with Google Translate.

@roy7
Collaborator Author

roy7 commented Feb 13, 2018

Here's a detailed overview that covers a wide variety of things I'm unfamiliar with.

A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit.

Alongside KL-UCB, Bayes-UCB [81] — an explicitly Bayesian variant of UCB — represents the current state of the art of UCB algorithms. It is an asymptotically efficient advanced algorithm with empirical results which outperforms KL-UCB.

[81] show that this Bayesian-derived UCB has a cumulative regret that empirically outperforms the strongest of the original UCB algorithms by a substantial margin in a handful of selected problems while having the advantage of being distribution agnostic and showing the early-iterate flexibility of a Bayesian approach to knowledge acquisition.

More love for Thompson Sampling; it also covers BART, which I linked earlier, and lots of other problem areas outside of what we do.

@barrtgt

barrtgt commented Feb 13, 2018

This paper is also interesting: http://ieeexplore.ieee.org/document/7352306/
It was cited in the Zero paper. They optimized the search parameters through a Gaussian process using data from a preliminary run.

@gcp
Member

gcp commented Feb 13, 2018

Optimizing the search parameters is not a problem. CLOP is very good at that. Any paper that doesn't compare to it can safely be ignored.

@odeint

odeint commented Feb 13, 2018

The CLOP book chapter is here by the way:
https://link.springer.com/chapter/10.1007/978-3-642-31866-5_13
Free version:
https://www.remi-coulom.fr/CLOP/CLOP.pdf

This was before neural networks, but is a general tuning algorithm.
It doesn't have that many citations though:
https://scholar.google.ch/scholar?cites=6152564073462446443&as_sdt=2005&sciodt=0,5&hl=en

There is this section in the AlphaZero paper ( https://arxiv.org/abs/1712.01815 ):

AlphaGo Zero tuned the hyper-parameter of its search by Bayesian optimisation. In AlphaZero we reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration (29); this is scaled in proportion to the typical number of legal moves for that game type.

I'm not sure what exactly that Bayesian optimisation refers to. In the actual AlphaGo Zero paper ( https://www.nature.com/articles/nature24270 / free https://deepmind.com/documents/119/agz_unformatted_nature.pdf ), the only tuneable parameter mentioned is c_puct. Tuning a single parameter doesn't seem particularly involved or require special techniques:

[...] where c_puct is a constant determining the level of exploration; this search control strategy initially prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action-value

What would one tune in LeelaZero using CLOP? Puct for sure, but what else? Some of the newly introduced variables like regret, risk, opportunity, FPU reduction? That would still be very ad-hoc even if the tuning is rigorous, since the variables themselves were just 'invented' by fiat.

@gcp
Member

gcp commented Feb 13, 2018

What would one tune in LeelaZero using CLOP? Puct for sure, but what else?

Right now only PUCT and (if appropriate) FPU reduction.

After the learning is finished, the softmax temperature can also be tuned.

That would still be very ad-hoc even if the tuning is rigorous, since the variables themselves were just 'invented' by fiat.

I have no idea what that is supposed to mean.

Tuning a single parameter doesn't seem particularly involved or require special techniques:

I wonder if there is some confusion there between the hyperparameters of the training run and those of the search.

@kiudee

kiudee commented Feb 13, 2018

@roy7 I did a lot of experiments with Thompson sampling and other bandit algorithms for my Bachelor’s and Master’s thesis.
In a nutshell, Thompson sampling outperforms other stochastic bandit algorithms (KL-UCB, BayesUCB, UCB, ε-greedy, etc.) on most bandit problems.

Implementing it for classical MCTS is as simple as initializing a Beta distribution at node creation and updating it with the observed wins and losses. A decision on which action to take can then be obtained by randomly sampling from each child's distribution and taking the move with the highest sampled value.

It is more difficult if you do not evaluate rollouts until the end of the game but instead get the evaluations from a neural network. You then lose the conjugacy of the Beta distribution and have to fit another distribution to the values returned during tree search. One could for example think of a normal distribution with certain priors, but one always has to consider the performance of the whole fitting process (remember that for Thompson sampling the most important operation is sampling from the distribution).
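
A minimal sketch of the binary-outcome case described above, assuming each child simply stores integer win/loss counts (the struct and function names are illustrative, not LZ code). Since the C++ standard library has no Beta distribution, the sample is built from two Gamma draws:

// Thompson sampling over children with Beta(1 + wins, 1 + losses) posteriors.
// A Beta(a, b) sample is Gamma(a) / (Gamma(a) + Gamma(b)).
#include <cstddef>
#include <random>
#include <vector>

struct ChildStats {
    int wins = 0;
    int losses = 0;
};

double sample_beta(double a, double b, std::mt19937& rng) {
    std::gamma_distribution<double> ga(a, 1.0), gb(b, 1.0);
    const double x = ga(rng), y = gb(rng);
    return x / (x + y);
}

std::size_t thompson_select(const std::vector<ChildStats>& children, std::mt19937& rng) {
    std::size_t best = 0;
    double best_sample = -1.0;
    for (std::size_t i = 0; i < children.size(); ++i) {
        // Beta(1,1) prior (uniform) plus the observed wins/losses of this subtree.
        const double s = sample_beta(1.0 + children[i].wins, 1.0 + children[i].losses, rng);
        if (s > best_sample) { best_sample = s; best = i; }
    }
    return best;
}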

@odeint

odeint commented Feb 13, 2018

I wonder if there is some confusion there between the hyperparameters of the training run and those of the search.

Aren't the two entangled, because the networks are trained to predict the outcome of the search? Is it possible that the search algorithm which delivers the highest playing strength for a given network is actually not the best tutor for the next generation of networks?

Edit: Just found this interesting-sounding paper "Thinking Fast and Slow with Deep Learning and Tree Search" ( https://arxiv.org/abs/1705.08439 )

@kiudee

kiudee commented Feb 13, 2018

@barrtgt @odeint Bayesian optimization refers to fitting a stochastic response surface model to the outcome of an experiment (usually the performance metric of a machine learning model). scikit-optimize for example is a Python package which implements that.
This usually works best for real-valued outcomes, since Gaussian processes fit a smooth function.

It is possible to fit it on binary outcomes (e.g. Go matches), but for that you need to fit a latent variable model where you have to integrate over the binary squashing function. For that you usually need MCMC machinery. For binary outcomes one therefore usually applies CLOP or SPSA.
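
For reference, here is a minimal SPSA step for tuning search parameters from match outcomes. Everything in it (the gain schedules, the play_match hook) is an illustrative sketch under my own assumptions, not an existing tuning script:

// One SPSA iteration: perturb all parameters simultaneously, estimate the
// gradient of the match win rate from two perturbed runs, and step along it.
// 'play_match' is a placeholder hook that plays a batch of games with the
// given parameters and returns the measured win rate in [0, 1].
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

void spsa_step(std::vector<double>& theta, int k,
               const std::function<double(const std::vector<double>&)>& play_match,
               std::mt19937& rng) {
    const double a_k = 0.1 / std::pow(k + 1.0, 0.602);    // classic SPSA gain schedules
    const double c_k = 0.05 / std::pow(k + 1.0, 0.101);
    std::bernoulli_distribution coin(0.5);

    std::vector<double> delta(theta.size()), plus(theta), minus(theta);
    for (std::size_t i = 0; i < theta.size(); ++i) {
        delta[i] = coin(rng) ? 1.0 : -1.0;                 // Rademacher perturbation
        plus[i]  += c_k * delta[i];
        minus[i] -= c_k * delta[i];
    }
    const double y_plus = play_match(plus);
    const double y_minus = play_match(minus);
    for (std::size_t i = 0; i < theta.size(); ++i) {
        const double g_i = (y_plus - y_minus) / (2.0 * c_k * delta[i]);
        theta[i] += a_k * g_i;                             // ascend the win rate
    }
}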

@killerducky
Contributor

@kiudee

It is more difficult if you do not evaluate rollouts until the end of the game but instead get the evaluations from a neural network. You then lose the conjugacy of the Beta distribution and have to fit another distribution to the values returned during tree search.

Why does it matter that we use the NN instead of rollouts? I thought a Beta distribution could model the NN value outputs, which are real values in the range [0,1].

@kiudee

kiudee commented Feb 14, 2018

@killerducky When using binary outcomes the parameters α and β can be updated directly. For example: If a subtree won 10 times and lost 3 times you can directly sample from the posterior distribution Beta(1+10, 1+3) (the initial ones constitute a uniform prior distribution over the winrate).
For real-valued outputs in [0,1] you would have to fit the parameters using maximum likelihood or similar inference methods.
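
One cheap alternative to a full maximum-likelihood fit, purely as an illustration of the idea (this is my own simplification, not something proposed above): moment-match a Beta to the observed real-valued evaluations, so Thompson sampling can keep drawing cheaply from a Beta.

// Moment-match a Beta(alpha, beta) to NN evaluations in [0, 1], using the
// method of moments: alpha + beta = m(1 - m)/v - 1.
#include <algorithm>
#include <cmath>
#include <vector>

struct BetaParams { double alpha, beta; };

BetaParams fit_beta_moments(const std::vector<double>& evals) {
    if (evals.empty()) return {1.0, 1.0};                   // fall back to a uniform prior
    double mean = 0.0, var = 0.0;
    for (double v : evals) mean += v;
    mean /= evals.size();
    for (double v : evals) var += (v - mean) * (v - mean);
    var /= evals.size();
    var = std::max(var, 1e-6);                              // guard against zero variance
    const double common = std::max(mean * (1.0 - mean) / var - 1.0, 1e-3);
    return { mean * common, (1.0 - mean) * common };
}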

@roy7
Collaborator Author

roy7 commented Feb 14, 2018

The interesting thing the authors of the Thompson sampling regret bounds paper do for the stochastic case is to randomly choose a win or a loss, with probability given by the [0,1] result observed, and store that as a win or a loss instead of using the value directly. It feels like that's throwing away information, yet apparently it works well in practice.

The problem with tree search is that as new moves are explored and we realize the real value of the position is much different from what the earlier NN eval suggested, the win rate of a node can change a lot and basically become a different distribution than the one we were sampling from before.
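
For concreteness, the randomization described above is just a Bernoulli draw on the observed value; a sketch:

// Convert a real-valued outcome r in [0, 1] into a pseudo win/loss so the
// Beta(1 + wins, 1 + losses) posterior update stays exact.
#include <random>

void update_with_real_reward(int& wins, int& losses, double r, std::mt19937& rng) {
    std::bernoulli_distribution coin(r);   // win with probability equal to the value
    if (coin(rng)) ++wins; else ++losses;
}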

@wctgit

wctgit commented Feb 14, 2018

@roy7 In the MCMC world, that's referred to as the 'burn in' period. Basically, they spend lots and lots of cycles on finding a very accurate approximation to the posterior probability distribution, whereas we are generally trying to build a system that has massive constraints on the number of cycles we can 'burn' to determine the posterior, i.e. the best move. We are trying to find a 'pretty good' approximation ASAP, so we are inherently going to be running into this problem with not-enough-burn-in all the time, IMHO.

@kiudee

kiudee commented Feb 14, 2018

@roy7 Indeed, that is because we assume a fixed distribution over time.
One could circumvent this problem by modeling each result to be generated by an adversary and use adversarial bandit algorithms (e.g. Exp3) to solve it. But these algorithms have very slow convergence due to the need to be robust against the adversary. There was work done by Sébastien Bubeck and Aleksandrs Slivkins (2012) on finding good hybrid algorithms. Seldin and Slivkins (2014) improve on that by proposing the algorithm Exp3++.
Here is the figure from the paper; you can see that there is still quite a big gap between Exp3++ and Thompson Sampling when applied to stochastic data:
[Figure: regret comparison from Seldin & Slivkins, "One Practical Algorithm for Both Stochastic and Adversarial Bandits"]
How it works out in MCTS needs to be determined though.
Seldin and Lugosi (2017) further optimized the parameters of Exp3++, but did not perform empirical comparisons.

Auer and Chiang (2016) provide a recent improvement on the pseudo regret, but without empirical comparison.

TL;DR: We should do experiments on which bandit algorithm improves the match results the most. This already incorporates the disadvantage caused by a slower bandit algorithm.

@roy7
Collaborator Author

roy7 commented Feb 14, 2018

@kiudee An idea I want to toy with, once I figure out how in the code, is to make a node's value not the average of all of the values under the node, but only the average of values under children whose averages lie within an overlapping confidence interval of the best move. This is motivated by the other discussion in #855, where the MV% stat showed an obvious warning about a blunder that was made: traps not discovered until deep in the search only slowly lower the win rate, and the blunder move can't be avoided until a different move has more total visits.

In a nutshell, consider a position where a blunder is about to be made and the opponent can play a winning move next turn, but I don't see it at first. So my move is AA and the opponent has two choices: LL, which loses for the opponent, and WW, which wins. LL is much more likely to be searched according to my heatmap, so at first I'm really confident I'm winning. Eventually I'm at:

1000 visits AA LL (90% win rate for me)
50 visits AA WW (15% win rate for me)

The bounds on that 50 might be pretty wide since it's only 50 visits, but for now I see this move as about an 86% win rate move. Since this is the opponent's move, they want to minimize my win rate, so future search goes mostly into AA WW. But once we're at this point:

1000 visits AA LL (90%)
250 visits AA WW (15%)

I probably have enough visits to realize the confidence intervals around the 90% and the 15% no longer overlap. In that case, instead of calling this node's value 75% I should call it 15%. I have enough certainty that the inferior move for the opponent does not come from the same distribution as the superior move, so the inferior move will not be played and should just be ignored.

Sort of like minimax, but not strictly using the value of the best/worst node. It's still normal MCTS methodology... except that nodes which we know, with some confidence, aren't comparable to the best choice for that player no longer skew the score up or down.

Edit: Ducky linked me to: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html although I was planning to use this from Boost: http://www.boost.org/doc/libs/1_42_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/stat_tut/weg/binom_eg/binom_conf.html.
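
Here is a rough sketch of the exclusion rule described above, written from the point of view of the player to move (an opponent node would mirror it), and using the Wilson score interval rather than the Boost routine; the struct and helper names are mine, not LZ code:

// Wilson score interval for a child's winrate, plus the exclusion test:
// drop a child from the parent's backed-up average once its upper bound
// is below the lower bound of the best child.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

std::pair<double, double> wilson_interval(int wins, int visits, double z = 1.96) {
    if (visits == 0) return {0.0, 1.0};
    const double p = double(wins) / visits;
    const double z2 = z * z;
    const double denom = 1.0 + z2 / visits;
    const double center = (p + z2 / (2.0 * visits)) / denom;
    const double half =
        z * std::sqrt(p * (1.0 - p) / visits + z2 / (4.0 * visits * visits)) / denom;
    return {center - half, center + half};
}

struct Child { int wins; int visits; };

// Average only over children whose interval still overlaps the best child's.
double pruned_parent_eval(const std::vector<Child>& children) {
    double best_lcb = 0.0;
    for (const auto& c : children)
        best_lcb = std::max(best_lcb, wilson_interval(c.wins, c.visits).first);
    double w = 0.0, n = 0.0;
    for (const auto& c : children) {
        if (wilson_interval(c.wins, c.visits).second < best_lcb) continue;  // provably inferior
        w += c.wins;
        n += c.visits;
    }
    return n > 0.0 ? w / n : 0.0;
}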

@wctgit

wctgit commented Feb 15, 2018

Looks like a reasonable idea. A part of me wants to play devil's advocate, though, and say, "This would just be an example where the net needs more training, so that it sees AA WW as higher probability. Then the MCTS would automatically already converge to the right answer more quickly, guided by the smarter net. Should we really be messing around so much with the way the MCTS performs searches, when really we just need to train the net more?" But then again, I can see an argument that supports your proposal, which is that this particular assumption of having an opponent who wants to minimize your score is not specific to Go, and so the minimax utility of such a tweak seems justified in the general case. (Just some random thoughts.)

@kiudee

kiudee commented Feb 15, 2018

@roy7 Yes, this problem can be circumvented by selecting the move with the highest lower confidence bound (or quantile if you are in the Bayesian setting).

Note how this is exactly the reverse of the methodology we use during search. There we want to be optimistic in order to balance exploration/exploitation. When selecting the final move we want to be pessimistic, since we still have uncertainty in each estimate which we want to account for. Selecting based on visits only makes sense if the bandit algorithm has converged.

@jkiliani

@kiudee Sounds very logical, but how would you change training data if move selection was based on highest lower confidence bound, instead of visit count? At the moment, using visit counts for training is very logical, but with this move selection method there should also be a better way...

@kiudee

kiudee commented Feb 15, 2018

Good question. For the normal search one really should use optimistic methods (UCB, Thompson Sampling, etc.), because otherwise you miss even more moves. The pessimistic aspect is only useful when you have to make a move in the end.

For final move selection one could think about running a shallow "minimax"-like search that applies the LCB down to depth k, to avoid the situations @roy7 talked about (where we actually have the correct knowledge already).

@jkiliani

In general, would it likely benefit the search if policy priors were a bit broader (i.e. less lopsided toward what the search determined was the best move)? Would something like exp(-c*LCB) with a slope constant still to be determined work well as a policy training target? Or even much simpler, would it benefit the search if the policy targets were simply flattened by applying something like the square root to them?

@kiudee

kiudee commented Feb 15, 2018

@jkiliani That is actually what the adversarial bandit algorithms are doing. The rewards are exponentially weighted and uniform noise is mixed in to be robust. Here is a good overview.
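
For reference, here is a sketch of the standard Exp3 update being alluded to (exponential weights with uniform exploration mixed in); the class layout is illustrative, not anyone's production code:

// Exp3: exponentially weighted arms with uniform exploration mixed in.
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct Exp3 {
    std::vector<double> weights;
    double gamma;                         // exploration rate in (0, 1]

    Exp3(int arms, double g) : weights(arms, 1.0), gamma(g) {}

    std::vector<double> probabilities() const {
        const double sum = std::accumulate(weights.begin(), weights.end(), 0.0);
        std::vector<double> p(weights.size());
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = (1.0 - gamma) * weights[i] / sum + gamma / weights.size();
        return p;
    }

    int draw(std::mt19937& rng) const {
        const auto p = probabilities();
        std::discrete_distribution<int> d(p.begin(), p.end());
        return d(rng);
    }

    // reward must be in [0, 1]; importance-weight it by the play probability.
    void update(int arm, double reward) {
        const double p = probabilities()[arm];
        const double xhat = reward / p;
        weights[arm] *= std::exp(gamma * xhat / weights.size());
    }
};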

@gcp
Member

gcp commented Feb 15, 2018

In general, would it likely benefit the search if policy priors were a bit broader (i.e. less lopsided toward what the search determined was the best move)?

From experience, no, it's the other way around. You can also see this in the original AGZ paper, where the optimal softmax_temp = 0.67, i.e. making the probabilities more extreme.

All this worrying about exploring enough is nice but practice simply shows that spending a lot of search effort on bad moves makes the program bad.

@jkiliani

From experience, no, it's the other way around. You can also see this in the original AGZ paper, where the optimal softmax_temp = 0.67, i.e. making the probabilities more extreme.

OK, so flattening doesn't work. What about using LCB for move selection and training targets, as the link posted by @kiudee describes? One thing that always bugged me about using simple visit count for move selection and training is the situation when the search has already determined that a more recently explored move is actually better, but there are not enough playouts left to overtake the PV.

@gcp
Member

gcp commented Feb 15, 2018

Theorizing about whether it would work is pretty pointless IMHO. Just implement and test it.

(This applies to the things I claim will never work too, although generally I did test many of them before myself 😉)

@gcp
Member

gcp commented Feb 15, 2018

I mean, implement LCB for move selection and see what it does for the playing strength. That's the first step. What to do with the training targets is trickier (not sure they need changing? as the tree search still uses UCB).

@ghost

ghost commented Feb 15, 2018

I think it's definitely worth testing just improving the move selection. My resources are tied up in FPU testing, but for those who want to make this change and test it out, please test at least 100 games. Thank you!

Edit: See here for what is SPRT significant at 100 games. #696 (comment)

@roy7
Collaborator Author

roy7 commented Apr 24, 2018

For what it's worth, opening game state with latest copy of the Thompson Sampling branch after @gcp's changes (updating #860 (comment)):

5.2 average depth, 12 max depth
1611 non leaf nodes, 1.99 average children
3201 visits, 1145259 nodes, 3200 playouts, 1265 n/s

5.2 average depth, 11 max depth
1629 non leaf nodes, 1.96 average children
3201 visits, 1145255 nodes, 3200 playouts, 1077 n/s

5.2 average depth, 10 max depth
1601 non leaf nodes, 2.00 average children
3201 visits, 1145314 nodes, 3200 playouts, 1285 n/s

I didn't test later actual game positions, but the opening search is less deep on average and about half as deep in max depth.

For comparison, current /next client:

6.2 average depth, 19 max depth
2219 non leaf nodes, 1.44 average children
3201 visits, 1142023 nodes, 3200 playouts, 1123 n/s

Just a random observation that TS is searching a bit wider at game start than UCT1.

@roy7
Collaborator Author

roy7 commented Apr 24, 2018

@gcp It seems current TS branch has a lot of extra stuff in the algorithm like adjusting the beta priors based on psa / cfg_fpu_reduction, sqrt(policy_ratio), etc. Is that just left over UCT1 stuff or intentionally for the TS testing? The cfg_puct = .1 also means the wins/losses are being magnified 10x when creating the distributions. I'd think the low win rate choices would be excluded rapidly but policy_eval must be offsetting that, since I still see low prior moves get lots of visits at game opening.

@gcp
Member

gcp commented Apr 24, 2018

But isn't our prior just a "better than nothing" value we have to use instead of using the Beta(1,1) non-informative prior? In UCT 1 might use parent eval or parent-.2 or whatever, but it's all an arbitrary way to try and adjust exploration?

You are confusing the policy prior with the value estimation.

In UCT1 the policy prior is a very long term bias towards moves that the network likes, which is quite reasonable as it statically predicts the search move after 3200 visits with >64% accuracy these days...

@gcp
Member

gcp commented Apr 24, 2018

intentionally for the TS testing

Intentional.

policy_eval must be offsetting that, since I still see low prior moves get lots of visits at game opening.

That's due to cfg_beta_prior vs cfg_fpu_reduction. A small fpu_reduction and a large beta prior mean moves are not immediately thrown away after a few losses (nor overly liked after a few wins).

These were all tuned for strength.

@alreadydone
Contributor

alreadydone commented Apr 25, 2018

Has anyone ever tuned the softmax temperature (for the policy prior) for strength? It's unclear to me that t=1 is optimal. But maybe it's not very different from tuning c_puct.

@tterava
Contributor

tterava commented Apr 26, 2018

I did additional testing with @roy7's implementations and here are all the results I got:

-t 11 and -v 1600:

                 wins             black          white
LCB_root       249 62.25%       124 62.00%     125 62.50%
Next           151 37.75%       75  37.50%     76  38.00%

-t 1 and -v 1600 (LCB_full isn't thread safe yet):

LCB_full vs LCB_root: 194 - 192 (50.26 %)
LCB_full vs Next: 221 - 181 (54.98 %)

-t 1 and -v 200:

              wins            black           white
lcb_root    56 70.89%       28 70.00%       28 71.79%
next        23 29.11%       12 30.00%       11 28.21%
                            40 50.63%       39 49.37%


              wins            black           white
lcb_full   227 49.35%      105 49.30%      122 49.39%
lcb_root   233 50.65%      108 50.70%      125 50.61%
                           213 46.30%      247 53.70%

              wins            black           white
lcb_full    64 68.09%       31 68.89%       33 67.35%
next        30 31.91%       14 31.11%       16 32.65%
                            45 47.87%       49 52.13%

LCB_full seems to be at least as strong as Next in all the situations tested, but it's inconclusive whether it's stronger than LCB_root. Its winrate against Next doesn't seem as high, but it might be because I couldn't test it with 11 threads. I was told that increasing thread count makes the search generally wider, and that's where most gains seem to be had with the LCB approach.

@roy7
Collaborator Author

roy7 commented Apr 26, 2018

I find the low visit count results fascinating. I was always stuck thinking only about the case where move 1 has tons of visits but move 2 is better and there's no time for move 2 to overtake move 1. As a result, I also assumed this would only be useful at higher visit counts, where there's time for strong confidence bounds, etc.

But the -v 200 results show that even with few visits the concept is sound and useful. For my own education I should run a -v 200 game with _root and look over the log to see all the times it overrode the max-visit move, to better understand what is happening.

I'm both happy and sad that _full never does much better than _root. _full does a lot more than _root does, yet somehow _root always matches it evenly, and _root is so naively simple by comparison.

@roy7
Collaborator Author

roy7 commented Apr 26, 2018

@gcp Is choosing the move with the highest # of visits still correct move selection in the Thompson Sampling branch? If so I wonder how max_lcb_root merged with thompson_sampling would do. TS searches wider, and max_lcb_root would perhaps make other choices than "most visits" more often.

@roy7
Collaborator Author

roy7 commented Apr 27, 2018

@tterava asked for someone to verify his -v 200 results, so I ran it myself with "-t 1 -q -r 5 --timemanage off --visits 200 -w 1586.w --noponder --gtp".

board size: 19   komi: 7.5
                                    wins              black          white        avg cpu
leelaz-next-e01e187-tmoff            151 37.75%       55  27.50%     96  48.00%     16.45
leelaz-max_lcb_root-8883dac-tmoff    249 62.25%       104 52.00%     145 72.50%     16.42
                                                      159 39.75%     241 60.25%

@Mardak
Collaborator

Mardak commented May 5, 2018

@bood provided a ladder game to analyze, and I used elf 224x20 to analyze with leelaz using just symmetry 0 with regular search, lcb and thompson sampling:

sgf details


(;B[pp];W[pd];B[cp];W[dd];B[cf];W[fc];B[fp];W[qq];B[qp];W[pq]
;B[nq];W[oq];B[op];W[nr];B[rq];W[rr];B[or];W[rp];B[qr];W[sq]
;B[pr];W[rq];B[mr];W[np];B[mq];W[qn];B[no];W[mp])

force symmetry 0 patch

diff --git a/src/UCTNode.cpp b/src/UCTNode.cpp
--- a/src/UCTNode.cpp
+++ b/src/UCTNode.cpp
@@ -80,3 +80,3 @@ bool UCTNode::create_children(std::atomic<int>& nodecount,
     const auto raw_netlist = Network::get_scored_moves(
-        &state, Network::Ensemble::RANDOM_SYMMETRY);
+        &state, Network::Ensemble::DIRECT, 0);
 

The play for black move 29 should be M4.

With regular search and 100,000 visits, it finds it after spending a lot of visits on Q6:

  M4 ->   79831 (V: 87.72%) (N:  7.95%) PV: M4 C11 C9 D3 D4 F3 E3 E2 E4 G2 G3 F2 D2 H3 G4 K3 K5 D9 C10 D10 D11 C12 D12 C8 B8 D13 C13 B11 B10 A10 B9 B13 A12 B12 E13 B14
  Q6 ->   20139 (V: 51.34%) (N: 88.32%) PV: Q6 N5 R7 S6 Q7 L4 K3 S7 R9 S8 N7 L7 L3 L9 N9 C11 C9 E11 E9 E3 E4 C3

Only after visiting Q6 20,139 times does it decide to explore M4 more. With LCB #883, it would choose M4 after only a few more visits (whereas normal search would require ~20k more visits to switch, so ~40k visits total):

20,200 visits just before search decides to switch to M4
  Q6 ->   20102 (V: 51.34%) (N: 88.32%) (LCB: 50.25%) (UCB: 52.44%) PV: Q6 N5 R7 S6 Q7 L4 K3 S7 R9 S8 N7 L7 L3 L9 N9 C11 C9 E11 E9 E3 E4 C3
  M4 ->      68 (V: 38.74%) (N:  7.95%) (LCB: 21.60%) (UCB: 58.10%) PV: M4 N5 Q6 O6 Q7 R7 Q8 M5 L5 L6 L4

20,300 visits with rapidly rising win rate and LCB
  Q6 ->   20139 (V: 51.34%) (N: 88.32%) (LCB: 50.25%) (UCB: 52.43%) PV: Q6 N5 R7 S6 Q7 L4 K3 S7 R9 S8 N7 L7 L3 L9 N9 C11 C9 E11 E9 E3 E4 C3
  M4 ->     131 (V: 61.63%) (N:  7.95%) (LCB: 47.78%) (UCB: 74.26%) PV: M4 N5 N6 M5 L5 M6 M7 L6 K6 L7 L8 K7 J7 K8 K9 J8 H8 J9 J10 H9 G9 H10 H11 G10 F10 G11 G12 F11 E11 F12 F13 E12 D12 E13 E14 D13 C13 D11 D14 T5 S1 T1

20,400 visits to switch to M4
  M4 ->     231 (V: 77.56%) (N:  7.95%) (LCB: 68.12%) (UCB: 85.36%) PV: M4 N5 N6 M5 L5 M6 M7 L6 K6 L7 L8 K7 J7 K8 K9 J8 H8 J9 J10 H9 G9 H10 H11 G10 F10 G11 G12 F11 E11 F12 F13 E12 D12 E13 E14 D13 C13 D11 D14 T5 S1 T1 R1 P1 O1
  Q6 ->   20139 (V: 51.34%) (N: 88.32%) (LCB: 50.25%) (UCB: 52.43%) PV: Q6 N5 R7 S6 Q7 L4 K3 S7 R9 S8 N7 L7 L3 L9 N9 C11 C9 E11 E9 E3 E4 C3

And with Thompson Sampling https://github.com/gcp/leela-zero/tree/thompson_sampling, fewer than 100 visits to decide on M4:

  M4 ->      81 (V: 76.03%) (N:  7.95%) PV: M4 N5 N6 M5 L5 M6 M7 L6 K6 L7 L8 K7 J7 K8 K9 J8 H8 J9 J10 H9 G9 H10 H11 G10
  Q6 ->      18 (V: 40.44%) (N: 88.32%) PV: Q6 R7 M4 N5

@roy7
Collaborator Author

roy7 commented May 6, 2018

Although we treat the rewards of a move (bandit arm) as if they don't change when doing Thompson Sampling, the truth is the mean is always moving around as the search goes deeper. If we find specifically better or worse moves, the true average of the move will be radically different from what we've collected so far (which is what I try to address in the max_lcb_full branch). I found a paper called Thompson Sampling in Switching Environments with Bayesian Online Change Point Detection which addresses this, in a way, by detecting when changes to the bandit arm(s) have happened.

@roy7
Collaborator Author

roy7 commented May 6, 2018

Speaking of which, here's a paper on UCBT with arms that have changing rewards and overcoming UCBT inertia (the whole point of max_lcb_root).

Change Point Detection and Meta-Bandits for Online Learning in Dynamic Environments.

It's the source paper for an algorithm called Adapt-EvE, which is referred to in a prior post I made.

When we find a refutation in tree search and the win rate dramatically rises or falls, that is the changed reward they are talking about; overcoming the inertia is what max_lcb_full does by outright pruning all of the old data that no longer applies.

As the paper puts it, "Adapt-EvE incorporates a change-point detection test based on the Page-Hinkley statistics". I was effectively doing change-point detection in max_lcb_full: if one move's UCB is lower than another move's LCB, the worse move can be completely removed. Propagating that removal back up the tree is, in effect, change-point detection.
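
For reference, the Page-Hinkley test mentioned above is simple to state. Here is a sketch for flagging a drop in a move's observed winrate; the delta and lambda thresholds are illustrative, not tuned values:

// Page-Hinkley test for a drop in the running mean of a move's results.
// Raise a change-point alarm when the cumulative deviation exceeds lambda.
#include <algorithm>

class PageHinkley {
    double mean_ = 0.0;      // running mean of observations
    double cum_ = 0.0;       // cumulative sum of (mean - x - delta)
    double min_cum_ = 0.0;
    long long n_ = 0;
    double delta_, lambda_;
public:
    PageHinkley(double delta = 0.005, double lambda = 0.5)
        : delta_(delta), lambda_(lambda) {}

    // Feed one result in [0, 1]; returns true when a drop is detected.
    bool add(double x) {
        ++n_;
        mean_ += (x - mean_) / n_;
        cum_ += mean_ - x - delta_;              // grows when x falls below the running mean
        min_cum_ = std::min(min_cum_, cum_);
        return (cum_ - min_cum_) > lambda_;
    }
};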

@roy7
Collaborator Author

roy7 commented May 10, 2018

@gcp Since Thompson Sampling goes so wide, and we have so many moves and a high branching factor... an idea you might try is to prune all moves with a policy prior of .01 or lower? Or even lower than that. At game start, for example, we don't need to sample all 361 moves. The heatmap only shows 16 possible moves we should be considering, the rest being rounded to 0.

So instead of trying to use the policy to adjust the Beta() of the moves, we'd use the policy to prune the search away from super-unlikely moves.
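
As a concrete illustration of that pruning (the 0.01 cutoff is the one mentioned above; the struct is a stand-in for however the scored moves are actually stored, not LZ code):

// Skip expanding children whose policy prior is below a small cutoff, so
// Thompson sampling does not have to sample hopeless moves at all.
#include <vector>

struct ScoredMove { float policy; int vertex; };

std::vector<ScoredMove> prune_by_policy(const std::vector<ScoredMove>& moves,
                                        float cutoff = 0.01f) {
    std::vector<ScoredMove> kept;
    for (const auto& m : moves)
        if (m.policy >= cutoff) kept.push_back(m);
    // Fall back to the full list if everything was below the cutoff.
    return kept.empty() ? moves : kept;
}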

@roy7
Collaborator Author

roy7 commented May 29, 2018

@Videodr0me

I did a number of MCTS search experiments on Leela Zero chess (lc0). KillerDucky asked me to cross-post this here as some of it might be relevant for Go.
https://github.com/Videodr0me/leela-chess-experimental

@roy7
Collaborator Author

roy7 commented Aug 21, 2018

Implementation of a Thompson Sampling + MCTS paper:

https://github.com/aijunbai/thompson-sampling

Also a paper "CONVOLUTIONAL MONTE CARLO ROLLOUTS FOR THE
GAME OF GO" draft summary basically:

https://openreview.net/pdf?id=wVqz42X2wfG0qV7mtLqv

In this work, we present a Monte Carlo tree search-based program for playing Go which uses convolutional rollouts. Our method performs MCTS in batches, explores the Monte Carlo tree using Thompson sampling and a convolutional policy network, and evaluates convnet-based rollouts on the GPU.

Paper "Adapting Improved Upper Confidence Bounds for Monte-Carlo Tree Search":

https://arxiv.org/pdf/1505.02830.pdf

@roy7
Collaborator Author

roy7 commented Sep 18, 2018

I didn't realize it at first, but the core of the last paper in my prior post says:

Next, the algorithm proceeds to remove the arms whose upper bounds of estimated expected reward are less than the lower bound of the current best arm.

Which was the "full max_lcb" idea step to follow after #883.

@MaurizioDeLeo

@roy7 What I gather from this issue is that your branch "LCB-root" is 100 Elo stronger than master.
Any reason why this has not been merged back into gcp's master?

@tterava
Contributor

tterava commented Sep 26, 2018

@MaurizioDeLeo The results are inconsistent. It's been tested quite extensively and it might be stronger on some systems and engine parameters. However it's not clear whether it provides any benefit overall.

@roy7
Collaborator Author

roy7 commented Sep 26, 2018

@MaurizioDeLeo It was reliable before the timemanagement feature was added, but with timemanagement the results were less consistent, and timemanagement without max_lcb was stronger than max_lcb without timemanagement. So timemanagement is in, and max_lcb is a fond memory I do hope to experiment with more in the future. :)

@MaurizioDeLeo

@roy7, any chance of having both LCB and time management? Maybe the improvements are additive?

Maybe instead of stopping when a move cannot be overcome in terms of visits, we could stop based on some other criterion?

@roy7
Collaborator Author

roy7 commented Sep 26, 2018

Yeah it's possible in theory. I was just thinking about it the other night, but I'm not sure what stopping criteria would make sense.

@roy7
Collaborator Author

roy7 commented Dec 22, 2018

Was scrolling down the list of Deepmind papers and found this interesting one. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

We propose a multi-armed bandit algorithm that explores based on randomizing its history. The key idea is to estimate the value of the arm from the bootstrap sample of its history, where we add pseudo observations after each pull of the arm. The pseudo observations seem to be harmful. But on the contrary, they guarantee that the bootstrap sample is optimistic with a high probability. Because of this, we call our algorithm Giro, which is an abbreviation for garbage in, reward out. We analyze Giro in a K-armed Bernoulli bandit and prove a O(KΔ⁻¹ log n) bound on its n-round regret, where Δ denotes the difference in the expected rewards of the optimal and best suboptimal arms. The main advantage of our exploration strategy is that it can be applied to any reward function generalization, such as neural networks. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that Giro is comparable to or better than state-of-the-art algorithms.

@roy7
Collaborator Author

roy7 commented Feb 19, 2019

kiudee shared a couple new papers with me:

Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling

Learning the minimum/maximum mean among a finite set of distributions is a fundamental sub-problem in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling. Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees by experiments showing that MS works best in practice.

Monte-Carlo Tree Search by Best Arm Identification

Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and are promising the resolution of a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth, that operate by summarizing all deeper levels of the tree into confidence intervals at depth one, and applying a best arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.

@roy7
Collaborator Author

roy7 commented Apr 10, 2019

New Deepmind/Google paper.

Empirical Bayes Regret Minimization

The prevalent approach to bandit algorithm design is to have a low-regret algorithm by design. While celebrated, this approach is often conservative because it ignores many intricate properties of actual problem instances. In this work, we pioneer the idea of minimizing an empirical approximation to the Bayes regret, the expected regret with respect to a distribution over problems. This approach can be viewed as an instance of learning-to-learn, it is conceptually straightforward, and easy to implement. We conduct a comprehensive empirical study of empirical Bayes regret minimization in a wide range of bandit problems, from Bernoulli bandits to structured problems, such as generalized linear and Gaussian process bandits. We report significant improvements over state-of-the-art bandit algorithms, often by an order of magnitude, by simply optimizing over a sample from the distribution.

@iopq

iopq commented Apr 20, 2019

So the first problem is choosing the best move from A and B.

We have some statistics available, like N of visits, the evaluation itself and some measure of uncertainty. We used to pick the highest N, but we came to the conclusion that we pick better lines with max LCB as long as N > 10% of all visits.

But this means our search is NOT actually doing what it should, given that it knows it has N visits in total. There are three things it should seek to do:

  1. Get a second move to 10% of visits and keep it there if there's a chance it has the highest win rate. There's no point in searching the main line when it's increasingly unlikely to change in value after a lot of visits.
  2. Spread visits to low-prior moves when the top few options don't yield promising candidates. We do this VERY poorly; sometimes it takes 300,000 visits to find the sharp line when its prior is low.
  3. When dumping visits into another move because it seems more promising, also give a third candidate up to 10% of visits in case the alternative move doesn't pan out. And so on: I expect a lot of moves to reach the 10% mark in case we find a refutation of the current main line. That way we have alternatives.

The second problem is back-propagating the values from the nodes we have. We currently average. We could try the "mix" strategy of also including the value of the best node, as per Remi Coulom. This should sharply drop the value of a losing line or quickly improve a sharp tesuji line once found. This is actually not a problem with a lot of visits, but at 1600 a newly found line doesn't have time to gain win rate quickly enough if it's found at the end of the search.
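
For reference, one simple reading of that "mix" backup is to blend the subtree's average with the value of its best child; here is a sketch with an illustrative mixing weight, not Remi Coulom's exact formulation:

// "Mix" backup: blend the subtree's average value with its best child's
// value instead of backing up the plain average.
#include <algorithm>
#include <vector>

struct NodeEval { double mean; int visits; };

double mixed_value(const NodeEval& node, const std::vector<NodeEval>& children,
                   double lambda = 0.25) {
    if (children.empty()) return node.mean;
    double best = children.front().mean;
    for (const auto& c : children)
        best = std::max(best, c.mean);           // best child from this node's perspective
    return (1.0 - lambda) * node.mean + lambda * best;
}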

@roy7
Collaborator Author

roy7 commented Aug 9, 2019

Infinite Arms Bandit: Optimality via Confidence Bounds

The infinite arms bandit problem was initiated by Berry et al.(1997). They derived a regret lower bound of all solutions for Bernoulli rewards with uniform priors, and proposed bandit strategies based on success runs, but which do not achieve this bound. Bonald and Proutiere (2013) showed that the lower bound was achieved by their two-target algorithm, and extended optimality to Bernoulli rewards with general priors. We propose here a confidence bound target (CBT) algorithm that achieves optimality for general, unknown reward distributions. For each arm we require the mean and standard deviation of its rewards to compute a confidence bound. We play the arm with the smallest bound provided it is smaller than a target mean. If the bounds are all larger, then we play a new arm. We show how for a given prior, the target mean can be computed to achieve optimality. In the absence of information on the prior, the target mean can be determined empirically, and we show that the regret achieved is still comparable to the regret lower bound. Numerical studies show that CBT is versatile and outperforms its competitors.
