
AIs calibrated to kyu/dan strength with easier to understand settings #44

Closed
sanderland opened this issue May 30, 2020 · 80 comments

@sanderland
Owner

Current options are rather mathematical; calibrating settings to a kyu/dan rank and adding a slider to set them would improve usability.

@sanderland sanderland added the 1.2 label May 30, 2020
@bale-go
Contributor

bale-go commented Jun 2, 2020

Katrain is an amazing piece of software. Having weaker opponents is a major selling point in my opinion.
I wanted to try out how the weaker AIs fare against GnuGo 3.8, which is 8 kyu.
ScoreLoss uses visits, which makes it slower, and its strength depends on max_visits. I figured it would be better to have a policy-based method in this kyu range.
I opted for P:Pick since it seemed to be the most straightforward way to adjust the strength.

I set up a GnuGo AI with level 10 strength (8 kyu) as the opponent. My goal was to find settings where the game is even between the two AIs from the initial play to the endgame.
With the default P:Pick settings (pick_override=0.95, pick_n=5, pick_frac=0.33), katago was still too strong.
I ran several games. In the beginning, P:Pick always gained a huge advantage; however, in the endgame it made obvious blunders that a DDK would clearly spot.

GnuGo 3.8 was black, default P:Pick was white, komi = 6.5
[images: BGnuGo-WDefaultPPick, BGnuGo-WDefaultPPick2]

Interestingly, in the beginning a pick_override of 0.8 was enough to get rid of the obvious blunders, but in the endgame a value of 0.6 was needed.
To account for this, I changed pick_override to 0.8 and modified line 56 of ai.py to decrease it over the game:

    elif policy_moves[0][0] > (ai_settings["pick_override"] * (1 - (361 - len(legal_policy_moves)) / 361.0 * 0.5)):

This needed an earlier definition of legal_policy_moves (I put it at line 46).
After the patch, no more obvious (meaning DDK-level) blunders were seen. However, P:Pick still seemed to be stronger in the beginning than in the endgame, so I decided against decreasing the number of moves considered (originally, it shrinks as legal_policy_moves decreases).
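
A minimal sketch of that scaled override, using the names from ai.py (the helper function itself is hypothetical):

    def scaled_pick_override(base_override, n_legal, board_squares=361):
        # Scale the override threshold from base_override on an empty board
        # down to half of it on a full board, so the top policy move is
        # played more readily as the game progresses.
        return base_override * (1 - (board_squares - n_legal) / board_squares * 0.5)

    # With base_override = 0.8:
    #   move 1 (361 legal moves)  -> threshold 0.80
    #   endgame (150 legal moves) -> threshold ~0.57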

I wrote a little script to iteratively find the parameter (total number of moves seen by katago) that gives even games with different strengths of GnuGo. At least 10 games were played to estimate the parameter.
GnuGo 3.8 at level 8 (ca. 10 kyu): 24
GnuGo 3.8 at level 10 (ca. 8 kyu): 30
GnuGo 3.8 at level 10 with 4 handicap stones and 0.5 komi (ca. 4 kyu): 49

A simple linear regression gives: (total number of moves seen by katago) = -4.25 * kyu + 66
This equation - with the changing pick_override setting - might be used to make the AI strength slider.
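
A quick sketch of this provisional fit (it is superseded by the log fit later in the thread; the function name is mine):

    def moves_seen_linear(kyu):
        # provisional linear calibration: moves "seen" at a given kyu rank
        return int(round(-4.25 * kyu + 66))

    # moves_seen_linear(4) -> 49, matching the ~4 kyu calibration point above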

With the changes listed above I could get an even game against GnuGo at various strengths.
I also tried it at my own level, and the games were very enjoyable. GnuGo, even when given handicap stones, is too defensive; the modified P:Pick was much more creative.

Black: GnuGo 3.8 level 10
White: Modified PPick
(elif policy_moves[0][0] > (0.8*(1-(361-len(legal_policy_moves))/361.*.5)))
total number of moves seen by katago: 30

[image: BGnuGo-WmodifiedPPick]

@sanderland
Owner Author

@bale-go amazing work!
If you were setting up these matches manually, you may be interested in https://github.com/sanderland/katrain-bots for the self-play scripts and gtp-esque connectors.

One issue I have with formulas like this is deciding which parameters to expose to users and how to explain them -- or whether to hide them altogether.

@bale-go
Contributor

bale-go commented Jun 3, 2020

Thank you for the suggestion about katrain-bots. I used PyAutoGUI to automate the games.
I wanted to test the modified P:Pick against stronger bots, so I opted for the open-source Pachi.
"pachi -t =5000 --nodcnn" is 3k.
"pachi -t =5000:15000" is currently 3d on KGS.

More than 10 games were run for each bot. After iteratively finding the correct parameter (total number of moves seen by katago), the games were quite balanced, without any serious blunders. Neither bot had an extra advantage in the beginning, middle, or end of the game.

Black: pachi -t =5000:15000 (rank=3d)
White: Modified PPick
(elif policy_moves[0][0] > (0.8*(1-(361-len(legal_policy_moves))/361.*.5)))
total number of moves seen by katago: 115

[image: BPachi3d-WModPPick]

Even games were achieved with different bots at different values of the total number of moves seen by katago:

GnuGo 3.8 at level 8 (ca. 10 kyu): 24
GnuGo 3.8 at level 10 (ca. 8 kyu): 30
GnuGo 3.8 at level 10 with 4 handicap stones and 0.5 komi (ca. 4 kyu): 49
pachi -t =5000 --nodcnn (3 kyu): 66
pachi -t =5000:15000 (3 dan): 115

Linear regression did not give a good fit over this wider rank range. Theoretically, it makes more sense to use the log10 of the total number of moves seen; that way a negative number of seen moves is impossible.

[image: regression of log10(total moves seen) vs. rank]

The equation: (total number of moves seen by katago) = int(round(10**(-0.05737*kyu + 1.9482)))

The equation works for ranks from 12 kyu to 3 dan, which covers more than 90% of active players.
It should be noted that since there is no 0 kyu/dan, 3 dan = -2 kyu.
This equation - with pick_override changing as (0.8*(1-(361-len(legal_policy_moves))/361.*.5)) - might be used to build the AI strength slider for 90% of players.

The equation has another nice feature: extrapolating the line gives ca. 10 dan for perfect play, where the total number of moves seen equals the size of the go board (361).
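
A minimal sketch of the calibration equation (the function name is mine):

    def moves_seen(kyu):
        # log-linear calibration; dan ranks map to kyu <= 0 (1 dan = 0 kyu, 3 dan = -2 kyu)
        return int(round(10 ** (-0.05737 * kyu + 1.9482)))

    # moves_seen(10) -> 24   (matches the 10 kyu calibration point above)
    # moves_seen(-2) -> 116  (close to the 115 measured for the 3 dan bot)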

> One issue I have with formulas like this is deciding which parameters to expose to users and how to explain them -- or whether to hide them altogether.

I think it would be nice to have a simple view where one could set the kyu level directly.
Maybe a tournament mode could be added later, where one starts at a user-set rank: when the human player wins, they gain a rank; when they lose, they drop one.

@sanderland
Owner Author

Any way you could have a look at this for P:Weighted and ScoreLoss? I think they're my preferred AIs, and I'm curious how they perform on blunders in the early game vs. the endgame.

@bale-go
Contributor

bale-go commented Jun 3, 2020

The reason I did not use ScoreLoss is that it heavily depends on max_visits and is much slower.

Theoretically, I find the approach of P:Pick better. The value of the NN policy seems rather arbitrary in many cases. One can see this by comparing the values of score_loss and NN_policy for a given position: the absolute value of NN_policy does not directly reflect the score_loss. For example, NN_policy(best) = 0.71 and score_loss(best) = 0 points, while NN_policy(second_best) = 0.21 and score_loss(second_best) = 1 point. However, I found that the order of the moves from best to worst is very similar for score_loss and NN_policy. P:Weighted relies on the absolute value of NN_policy; P:Pick relies only on the ordering of moves by NN_policy.
The latter is more robust.

@sanderland
Owner Author

The conflation of compute cost and visits is definitely an issue. However, a major weakness of pick over weighted is regularly being blind to 'all' good moves and playing some policy <<1% move, at which point the ordering is quite random.

@bale-go
Contributor

bale-go commented Jun 3, 2020

I guess what I am trying to argue here is that having a policy value of less than 1% is not a problem per se.

If you check amateur human vs. human games, there are multiple moves below 1% or even below 0.1% policy. The obvious blunders can be removed by using a shifting pick_override setting (80% initially to 50% in the endgame).
I looked in the logs of the modified P:Pick against GnuGo; the total number of moves seen was 30, a seemingly low value. But P:Pick did not make clearly bad moves (that GnuGo could take advantage of). Only 8% of the moves were below 0.1% policy, and all of them were in the first 25 moves, where the decrease of NN_policy from best to worst is steep: placing a stone on the 4th line instead of the 3rd, for example, can result in an NN_policy << 1%. Only one third of the modified P:Pick moves were below NN_policy = 1%.

In the end, the user experience is the most important thing. The runs against different bots show that the modified P:Pick policy makes a balanced opponent across a wide range of ranks.
I guess you might add a condition to remove moves with NN_policy < 0.1%, but I think humans around 10 kyu make those too from time to time.

@sanderland
Owner Author

<<1% is more like 0.1%, which is more often problematic (first-line moves and such).
Anyway, could you PR this as a new AI option into the v1.1.1 branch? If we have them side by side we can see where it leads.

@bale-go
Contributor

bale-go commented Jun 3, 2020

This is the first time I have used GitHub (I only registered to participate in this fascinating project).
I will try my best.

@bale-go bale-go mentioned this issue Jun 3, 2020
@sanderland
Owner Author

Refactored a bit after the merge and added tests, since it was turning into quite the spaghetti. It goes all the way from losing by 80 points to near jigo against P:Weighted, and looks nice -- what bounds do you think there are on the rank?
Will see about running a couple of OGS bots on this and see where their ranks end up.

@bale-go
Contributor

bale-go commented Jun 4, 2020

The upper limit currently is the strength of the policy network, around 4d.
I played with it at 20k to check that everything works at lower strengths; it seemed to play like a beginner, as expected. But I do not know of any bots that play in that range to test for balanced play from opening to endgame, as I did with the 3d-12k range bots.
Running OGS bots at different kyu settings (maybe 8k, 5k, 2k, 2d?) is a great idea. Let's see some real-life data.

@sanderland
Owner Author

Got them working on OGS -- seems to work nicely, but it really shows how bad P:Local is after adding an endgame setting to it!

[image]

@sanderland
Owner Author

sanderland commented Jun 5, 2020

 * ai:p:rank(kyu_rank=2): ELO 1326.7 WINS 249 LOSSES 39 DRAWS 0
 * ai:p:territory(): ELO 1302.0 WINS 185 LOSSES 65 DRAWS 3
 * ai:p:tenuki(): ELO 1276.5 WINS 192 LOSSES 60 DRAWS 0
 * ai:p:weighted(): ELO 1156.7 WINS 206 LOSSES 81 DRAWS 1
 * ai:p:local(): ELO 1106.5 WINS 130 LOSSES 162 DRAWS 0
 * ai:p:pick(): ELO 1044.6 WINS 125 LOSSES 127 DRAWS 2
 * ai:p:rank(kyu_rank=6): ELO 1026.3 WINS 167 LOSSES 120 DRAWS 2
 * ai:p:rank(kyu_rank=10): ELO 761.2 WINS 86 LOSSES 202 DRAWS 0
 * ai:p:rank(kyu_rank=14): ELO 582.2 WINS 45 LOSSES 247 DRAWS 0
 * ai:p:rank(kyu_rank=18): ELO 417.3 WINS 3 LOSSES 289 DRAWS 0

@bale-go
Contributor

bale-go commented Jun 5, 2020

Pretty cool!
The estimated kyu_rank for ai:p:rank changes linearly with its ELO, a pretty good sign.

@sanderland
Owner Author

[image]
Some spot-on ranks there, though the sample size is still small.

@sanderland
Owner Author

After 320 games:
[image]

Some really weird stuff in the weaker one though (e.g. moves 153/155):
https://online-go.com/game/24495021

@Dontbtme
Contributor

Dontbtme commented Jun 6, 2020

Isn't KaTrain just trying to start a capturing race in the corner? B18 makes an eye at A19, and C17 takes a liberty from White.

@bale-go
Contributor

bale-go commented Jun 6, 2020

I didn't think it would work out so well. All of the ranks are within one stone, except for katrain-6k, which was still 5k in the morning.
The 18k bot is probably at the limit of the range of usefulness of this method. It is pretty surprising that the 3d-12k calibration worked so well at lower kyu ranks.

I was thinking about using this method to assess the overall playing strength of a player. I saw something similar in GNU Backgammon, where it is possible to estimate your skill by looking at your moves. Currently the analysis mode can help you discover certain very bad decisions, but I think it might also be important to see the consistency of all of your moves.

I'm currently working on dividing the game into 50-move pieces and calculating the kyu rank for each part of the game (opening (moves 0-50), early middle game (50-100), late middle game (100-150), endgame (150-)) from the median rank of the moves (best move is 1st, second best is 2nd, etc.), as sketched below.
It could give you feedback on which part of your game needs improvement. For example, I tested it on a few of my games: my opening is better than my rank by two kyu, but my late middle game is terrible (3 kyu weaker).
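
A minimal sketch of that per-phase median-rank idea (move_ranks is hypothetical: the policy rank of each of the player's moves, in order):

    from statistics import median

    def phase_ranks(move_ranks, chunk=50):
        # median policy rank per 50-move phase of the game
        return [median(move_ranks[i:i + chunk])
                for i in range(0, len(move_ranks), chunk)]
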
What do you think?

@sanderland
Owner Author

@bale-go I went a bit lower, since especially at those ranks people seem to looooove bots.
Interesting idea on ranking parts of the game. I'm not sure how indicative policy rank is (playing the #5 move could be dying horribly, or could be one of many equally good opening moves, right?) -- it may be worth trying out some different metrics and seeing what makes sense. Still, the median over 50 moves should stabilize it a lot.

@sanderland
Owner Author

sanderland commented Jun 6, 2020

https://github.com/sanderland/katrain-bots/blob/1.2/sgf_ogs/katrain_KaTrain%20aiprank%20vs%20OGS%202020-06-06%2010%2049%2041_W+30.4.sgf

Move 153: B B18
Score: W+40.9
Win rate: W 98.8%
Predicted top move was A17 (W+40.1).
PV: BA17 B19 C19
Move was #11 according to policy (2.42%).
Top policy move was A17 (18.1%).

AI thought process: Using policy based strategy, base top 5 moves are A17 (18.12%), F19 (13.48%), E16 (10.26%), A10 (8.22%), D18 (6.56%). Picked 8 random moves according to weights. Top 5 among these were B18 (2.42%), R7 (0.12%), S11 (0.01%), P15 (0.01%), T12 (0.00%) and picked top B18.

Move 155: B C17
Score: W+36.6
Win rate: W 98.4%
Estimated point loss: 15.9
Predicted top move was F19 (W+17.9).
PV: BF19 E16
Move was #38 according to policy (0.04%).
Top policy move was F19 (25.0%).

AI thought process: Using policy based strategy, base top 5 moves are F19 (24.98%), H18 (24.70%), E16 (15.19%), L6 (10.72%), B13 (6.61%). Picked 8 random moves according to weights. Top 5 among these were C17 (0.04%), Q15 (0.02%), S3 (0.01%), P2 (0.01%), G10 (0.01%) and picked top C17.

didn't realize n=8 at this level, makes more sense now :)
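
For reference, a rough sketch of the pick-style selection these logs describe (simplified: the real ai.py logic also supports weighted picking and the scaled override discussed above):

    import random

    def pick_move(policy_moves, n_pick, override):
        # policy_moves: (policy_value, move) pairs, sorted best-first
        if policy_moves[0][0] > override:
            return policy_moves[0][1]  # dominant top move: play it outright
        picked = random.sample(policy_moves, min(n_pick, len(policy_moves)))
        return max(picked, key=lambda pm: pm[0])[1]  # best policy move among those picked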

@bale-go
Contributor

bale-go commented Jun 6, 2020

The success in covering a wide range of strengths with the policy pick method shows me that it captures some important aspect of the difference between beginner and expert understanding of the game.
In the policy pick method, the neural network is only used to rank the moves from best to worst (the policy value is only used to weed out really bad moves).

In line with the p-pick-rank method, it is not far-fetched to assert - according to the bot calibration and OGS data - that a 3k player chooses the best move from ~60 possible moves (M).
The total number of legal moves on an empty board is 361 (N).
We can use statistical tools to show that the median rank of the best move (mbr) is:
mbr = ceil(N / (sqrt(exp(-1)) + (2 - sqrt(exp(-1))) * M)) = ceil(361 / (0.6065 + 1.3935 * 60)) = 5

In other words 3k players will find the 5th best move on average (on median ;) ) during their games.

But we can reverse the question: if analysis by a much stronger player (katago) shows that the median rank of a player's moves is 5, we can argue that the player is ca. 3 kyu.
An important advantage is that this rank estimation does not need further calibration.
If the median rank of played moves is 5 and the median number of legal moves is 300, it is possible to calculate how many moves the player "sees" (M ~ 60). We can then use the calibration equation (total number of moves seen by katago) = int(round(10**(-0.05737*kyu + 1.9482))) to calculate the rank.
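
A sketch of the median-best-rank formula above (the function name is mine):

    import math

    def median_best_rank(n_legal, m_seen):
        # median rank of the best of m_seen moves picked from n_legal legal moves
        c = math.sqrt(math.exp(-1))  # ~0.6065
        return math.ceil(n_legal / (c + (2 - c) * m_seen))

    # median_best_rank(361, 60) -> 5: a ~3k player finds the 5th best move on median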

As I mentioned earlier, we can use this method to evaluate parts of the game.

I wrote a script to calculate the ranks by this method. Here are two examples to showcase it.
GnuGo 3.8 -level 10 (8 kyu)
moves; rank
0-50; 5.5 kyu
50-100; 18 kyu
100-150; 10.5 kyu
150-end; 0.5 kyu

It seems that GnuGo developers did a terrific job with the opening (hardcoding josekis etc.) and the endgame, but the middle game needs some improvement.

pachi -t =5000 --nodcnn (3 kyu):
moves; rank
0-50; 0 kyu
50-100; 1 kyu
100-150; 7 kyu
150-end ; 7 kyu

Pachi was ahead in the first 100 moves in the game with katrain3k, but it made a bad move and MCTS bots are known for playing weirdly when losing. The changing ranks show this shift.

Please let me know if you are interested in a PR.

@sanderland
Owner Author

18k seems suspect, no? That's a huge rank difference. Then again, pachi doing well... is it just biased towards MCTS 'style'?
Feel free to PR and we can see where this fits in. It may not make it in as easily as the calibrated rank bot, but it's really interesting to play around with and see how we can integrate it.

@bale-go
Contributor

bale-go commented Jun 7, 2020

Indeed, 18k is a huge difference. In the long run, maybe it would be better to color-code the parts of the game, similarly to point loss. The calculated rank for the whole game would be the reference: if a part of the game is much worse (worse than -5k) it would be purple; -2k to -5k red; -2k to +2k green (neutral); better than +2k blue (congrats!).

However, this scale would be independent of the score loss of individual moves; it would assess the overall quality of that part of the game. Due to the use of the median, the calculated ranks are resistant to outliers (blunders, lucky guesses, etc.). Indeed, it could show that player A was better than player B in quality of play, but made a blunder and lost the game.

@sanderland
Owner Author

sanderland commented Jun 8, 2020

What do you think of a short text-based report at the end of a game, to start with? It could go into SGFs and even be sent in chat on OGS.

@bale-go
Contributor

bale-go commented Jun 8, 2020

I think that would be awesome.

I made analyses of two recent games on OGS.
katrain-6k (W) lost the first game; it did not play at 6k level during the game.
The rank consistency analysis correctly evaluated the rank of the user elplatypus[9k].
It seems that B is pretty good at joseki, but the endgame might need some improvement.

File name: elplatypus vs katrain-6k B.csv
Player: elplatypus[9k] (B)
Move quality for the entire game: 9 kyu
Move quality from move 0 to 50: 3 kyu
Move quality from move 50 to 100: 9 kyu
Move quality from move 100 to 150: 5 kyu
Move quality from move 150 to 200: 13 kyu
Move quality from move 200 to the end: 14 kyu

File name: elplatypus vs katrain-6k W.csv
Player: katrain-6k (W)
Move quality for the entire game: 9 kyu
Move quality from move 0 to 50: 9 kyu
Move quality from move 50 to 100: 6 kyu
Move quality from move 100 to 150: 10 kyu
Move quality from move 150 to 200: 13 kyu
Move quality from move 200 to the end: 9 kyu

katrain-10k (B) won the second game in a very close match (B+1.5). It played at ca. 7k level during the game.
The rank consistency analysis showed that W played a really strong game; W was better than their rank over the entire game.

File name: katrain-10k vs LadoTheBored B.csv
Player: katrain-10k (B)
Move quality for the entire game: 7 kyu
Move quality from move 0 to 50: 6 kyu
Move quality from move 50 to 100: 12 kyu
Move quality from move 100 to 150: 10 kyu
Move quality from move 150 to the end: 5 kyu

File name: katrain-10k vs LadoTheBored W.csv
Player: LadoTheBored[10k] (W)
Move quality for the entire game: 7 kyu
Move quality from move 0 to 50: -0 kyu
Move quality from move 50 to 100: 9 kyu
Move quality from move 100 to 150: 7 kyu
Move quality from move 150 to the end: 8 kyu

@sanderland
Owner Author

It's strange that the bots don't play at their level -- are you sure you're not off by some factor due to it being 'the best among n moves' and not 'this rank'?

@bale-go
Contributor

bale-go commented Jun 8, 2020

I think it is due to the underlying randomness of the p-pick-rank method.
I tested the consistency analysis on some test cases. For example, when I take out the randomness by fixing the move rank of every single move to a certain number (this number slowly decreases along with the number of legal moves), the calculated kyu level does not change over the game.
At higher ranks (lower kyus) the analysis becomes noisier due to the use of the median, which can only be an integer (except for lists with an even number of elements).
I will upload the test cases, a gnumeric spreadsheet with the equations, and a small fix for the script.

File name: 12k_not_random.csv
Move quality for the entire game: 12 kyu
Move quality from move 0 to 50: 12 kyu
Move quality from move 50 to 100: 12 kyu
Move quality from move 100 to 150: 12 kyu
Move quality from move 150 to 200: 12 kyu
Move quality from move 200 to the end: 12 kyu

File name: 8k_not_random.csv
Move quality for the entire game: 8 kyu
Move quality from move 0 to 50: 8 kyu
Move quality from move 50 to 100: 8 kyu
Move quality from move 100 to 150: 8 kyu
Move quality from move 150 to the end: 8 kyu

File name: 4k_not_random.csv
Move quality for the entire game: 4 kyu
Move quality from move 0 to 50: 4 kyu
Move quality from move 50 to 100: 3 kyu
Move quality from move 100 to 150: 5 kyu
Move quality from move 150 to the end: 5 kyu

@bale-go
Contributor

bale-go commented Jun 14, 2020

@SimonLewis7407 Thank you for the kind words!
You can try the calibrated rank bots by cloning the v1.2 branch, as @sanderland has already merged it. You can also play against them on OGS.
About the distribution of best moves: that is the important question, isn't it?
We know for sure that it is not Gaussian (a bell curve): the rank of the best move cannot be negative, and a Gaussian distribution, being symmetric, would produce negative ranks from time to time.
It seems that the distribution of the best of M moves selected from N legal ones (aka p-pick) captures something about human play, as it works so well for a wide range of strengths while adjusting only a single parameter. You can check it out in @sanderland's plot (In [26]): https://github.com/sanderland/katrain-bots/blob/master/analyze_games.ipynb

@sanderland
Owner Author

sanderland commented Jun 14, 2020

Found a mistake in the mean/median curves (copy-pasted var), updated ones below.
I think I'd like to try a 'score loss' calibrated AI in the next version.

[image]

Even more so, there is a difference in the top policy value for human and bot moves: i.e. bots and humans leave the game in a significantly different state for their opponent.

[image]

@sanderland
Owner Author

> And I'll attempt a specific comment. If you have established that an n-kyu player selects the Kth best alternative according to the policy system on average, then could you simply build an AI player by casting a probability distribution (or bell curve) around that?
> Like, I think you were saying a 2-kyu player plays on average the 5th best policy move? (For this particular edition of Katago.) Well then could you make the 5th-best policy move the average, and the mode, with a standard deviation of, uh, 2 or so, and then turn it loose? So on any given turn, the AI would play maybe the 5th best policy move, or with slightly less probability the 4th or 6th, and with still less probability the 3rd or 7th, etc.? So it would be consistently playing at a 2-kyu level, with the variance and blips that we expect from humans?

I think the 'move picking' effectively does this; it has some expected value (which is in the thread) and a deviation (which we don't know).

@SimonLewis7407

Thank you bale-go. Yeah, I shouldn't have said Gaussian; what I should have said was some curve or formula (or even a brute-force list of approximations) that matches those blue distribution charts (the six that correspond to humans) shown by sanderland a little higher up on this page. If those distributions represent human play, and the bots can approximate those distributions in their own play, that is great!

I will check them out on OGS like you suggested, or maybe clone v1.2 like you mentioned.

P.S. That said, there is still some unavoidable change that needs to be made to a bot's play when you reach the endgame, right? I don't know how to define "endgame", but up until then the average difference in quality between the best policy move and the second-best policy move [or among the top n moves for small values of n] is very small, and then it swells very large in the endgame.

@bale-go
Contributor

bale-go commented Jun 14, 2020

@sanderland In the move rank vs. kyu plots you need to use the same x-axis.
If you plot the users between -5 and 20 (like the bots), you will find that bots and users do not have significantly different medians and 20-80 percentile means. That is why removing outliers for the kyu rank estimation is so important (it effectively removes the >40 rank moves). Using a simple mean would predict worse kyu ranks.

@sanderland sanderland changed the title from "AI strength slider" to "AIs calibrated to kyu/dan strength with easier to understand settings" Jun 14, 2020
@sanderland
Owner Author

> @sanderland In the move rank vs. kyu plots you need to use the same x-axis.
> If you plot the users between -5 and 20 (like the bots), you will find that bots and users do not have significantly different medians and 20-80 percentile means. That is why removing outliers for the kyu rank estimation is so important (it effectively removes the >40 rank moves). Using a simple mean would predict worse kyu ranks.

aaahhh! of course, fixed :)

@sanderland
Owner Author

> P.S. That said, there is still some unavoidable change that needs to be made to a bot's play when you reach the endgame, right? I don't know how to define "endgame", but up until then the average difference in quality between the best policy move and the second-best policy move [or among the top n moves for small values of n] is very small, and then it swells very large in the endgame.

We decrease the override, so it more readily plays the top move when there are fewer available moves:

    override = 0.8 * (1 - 0.5 * (board_squares - len(legal_policy_moves)) / board_squares)

@SimonLewis7407

Thanks sanderland! I am not very sophisticated about these things, but I see that this formula decreases the override gently, over a large number of moves. That definitely points things in the right direction. But my suspicion is that in real go play there is more of a "quantum jump" that might need to be reflected: when the bot recognizes some combination of conditions (fewer available moves, for sure, but maybe some other factors too), it needs to make an extra, further decrease of the override for the remainder of the game.
Anyway, you are doing fantastic work!

@bale-go
Contributor

bale-go commented Jun 14, 2020

@SimonLewis7407 Decreasing the override is not the only strategy p-pick-rank uses to improve the endgame.
The number of legal moves also decreases over the game, while the number of moves seen by katago is kept constant, so the best move in the selection improves over the game.
For example, in the opening there are 361 legal moves and an 8k player "observes" 31 moves; the 20-80 percentile average of the best move's rank is 8.86.
In the endgame the number of legal moves decreases to, say, 150, while the observed moves stay at 31, so the 20-80 percentile average of the best move's rank is 3.78.
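
A Monte Carlo sketch of those trimmed-mean figures, under the assumption that p-pick samples moves uniformly and plays the best-ranked one:

    import random
    from statistics import mean

    def trimmed_mean_best_rank(n_legal, m_seen, trials=100_000):
        # 20-80 percentile mean of the best (lowest) rank among m_seen
        # moves sampled without replacement from n_legal legal moves
        bests = sorted(min(random.sample(range(1, n_legal + 1), m_seen))
                       for _ in range(trials))
        return mean(bests[int(trials * 0.2):int(trials * 0.8)])

    # trimmed_mean_best_rank(361, 31) -> ~9.0 and trimmed_mean_best_rank(150, 31) -> ~3.8,
    # in the same ballpark as the 8.86 and 3.78 above (exact values depend on model details)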

@bale-go
Contributor

bale-go commented Jun 15, 2020

@sanderland The new score loss histograms are amazing!
There is virtually no difference between the score loss of users and bots at the same kyu rank level. I had to double-check that there was no mistake and the data hadn't accidentally been copied over. :)

Updated calculated kyu vs. ogs kyu with 2182 games for users.
[image: Figure_2, calculated kyu vs. OGS kyu]

@sanderland
Owner Author

[image]
I think they show the 18k bot is definitely a tad too weak, and the 2d bot a tiiiny bit too strong, but other than that they line up amazingly well in both shape and mean.

@sanderland
Owner Author

sanderland commented Jun 15, 2020

I spent some time looking at sgf_ogs/katrain_BZSolarDot (28k) vs katrain-18k (21k) 2020-06-14 10 43 52_W+60.4.sgf and why it has so many -60 point moves, but it seems it really confused the 15b net!

@sanderland
Owner Author

sanderland commented Jun 15, 2020

This one is amazing, but a bit strange -- is everybody bad at the endgame?

[image]

@sanderland
Owner Author

Same plots when 20b is doing the analysis -- this is weird!

[image]

@SimonLewis7407

With the charts just above, what is the difference between the two charts on the left and the two on the right? The legend doesn't say, sorry it's not obvious to me.

@SimonLewis7407

Oh crap, they do say! It's mean and median, sorry, I missed that.

@bale-go
Contributor

bale-go commented Jun 16, 2020

Again it is pretty nice how the plots for users and bots line up.

It seems that point-loss estimation with 1 visit is not perfect for 15b. It is not terribly bad, though; a 1-point miscalculation is pretty tolerable in most <3d games.

I wanted to check whether move ranks are more robust to a change of models. They certainly are.
The plot also shows what we discussed earlier: namely, that users memorize josekis, so the first 50 or so moves are better than expected.

[image: ranks-15b-20b]

I think in the next version we could use the move rank vs. # of legal moves curves of users to create bots that mimic human play even better (e.g. we could let katago see slightly more moves in the opening and slightly fewer in the middle game).

However, I would not change the kyu rank estimation script. I think it is really important for the user to know objectively which part of the game they excel at.

BTW, @sanderland, do you plan to introduce the kyu rank estimation as a short text message in the current branch?

@sanderland
Owner Author

@bale-go I think it's better to polish it a bit more and put it into 1.3

@sanderland
Owner Author

Thanks for checking how robust move ranks are to model changes. I was afraid the 20b was getting overfitted / too narrow, but it seems we don't have to worry.

@bale-go
Contributor

bale-go commented Jun 16, 2020

If only the p-pick-rank bots are included, the move rank vs. # of legal moves curves are even more similar between users and bots (except for the first 50 moves).
[image: ebbe-rank20b_rankbot-out]

@sanderland
Owner Author

Split up into #72, #73, and #74, as this is becoming a bit long and going in different directions.

@killerducky

@sanderland
Regarding the histograms in: #44 (comment)

Do you have the raw data for them? I'm looking to analyze them a bit and derive some rough guidelines for users on how much they need to reduce various error types/sizes to reach the next rank level. This could provide some context for how bad mistakes of various sizes are.
