Improving rank feedback to user #73

sanderland · 2020-06-16T11:51:38Z

Give the user feedback on their game , such as 'your opening/middle game/endgame was around 8k'.

bale-go · 2020-06-17T09:13:25Z

I did some coding to test the move quality prediction for consecutive segments of the script.
It seems that the predictions are in line with what we have learned from the move_rank/point_loss of users. Between 0 - 50 moves the bulk of user games are below the y=x line, meaning that memorizing joseki gives an additional 2-3 kyu to our strength. In the middle game most games are near the y=x line. While in the late middle game/endgame users play slightly weaker than their kyu rank.
It is important to note that the mean quality of moves never shifts more than 3 kyu ranks from y=x line for any segment.

sanderland · 2020-06-17T12:03:04Z

What's the code you are using for these histograms? I built a little NN to predict rank and would be curious how it compares. It did immediately find one cheater who was OGS 5d but around 8k and timing out all his losses.

sanderland · 2020-06-18T09:25:53Z

thanks for the code. fitting to <25k gives me:

bale-go · 2020-06-18T09:40:43Z

Pretty cool!
However, I would not give up on the move rank based kyu estimation.
It is based on a pretty solid mathematical background, where we simply invert the working calibration of the p-pick-rank bots to get the kyu rank. (no further training/calibration is necessary)

On the other hand, a neural network is always a blackbox. It can be really useful as a universal function approximator for complex functions, where symbolic/analytic functions are not available. But I think that is not the case here.

sanderland · 2020-06-18T10:58:08Z

It was mostly to see what the difference is / how well it performs. It is only a 2 layer, 15->20->10->1 NN using the histogram features as inputs, as I'm trying to avoid overfitting.

bale-go · 2020-06-18T11:01:47Z

It is interesting that the NN prediction deviates from y=x similarly to the move rank based kyu estimation.

sanderland · 2020-06-21T11:16:36Z

Let's try to get your estimate into v1.3 - any ideas on the user interface? just putting it in the 'info' box will be easiest, but maybe a bit hidden?

bale-go · 2020-06-21T11:36:28Z

Maybe a spline connected curve next to the score and win rate (under the timer).
With kyu rank on the y axis and move number on the x axis.
I would suggest to use PCHIP interpolator, since it does not allow unrealistic max and min values that plague regular spline interpolation.

And under the plot, next to Score/Win Rate/Point Loss there could be the overall kyu rank of the game from move 1 to the current move.

bale-go · 2020-06-21T14:43:29Z

I created a PR that generates the following plot.

sanderland · 2020-06-21T16:46:07Z

frankly that looks like we could do with linear interpolation (=just save the points and let the graphics primitives deal with it -- the score graph is definitely not matplotlib based)

bale-go · 2020-06-21T16:55:04Z

I don't think spline is crucial for the plot either.
Spline might be closer to the underlying function, but the stdev is large anyway.

sanderland · 2020-06-22T21:05:58Z

start of a layout

Dontbtme · 2020-06-23T08:31:43Z

What about an option to be able to show dots according to their rank estimates rather than point losses?
Say I'm an overall 2kyu, I could look for bad moves that were weaker than my rank, or double digits kyu blunders and so on.

sanderland · 2020-06-23T08:34:11Z

@Dontbtme A single move does not have a rank estimate -- it's a statistical estimate that requires many moves to be even close.

bale-go · 2020-06-23T08:34:16Z

The problem is that kyu estimation is not a single move statistic.
It needs at least 25/50 moves to get a reasonable estimation for a given segment of the game.

Dontbtme · 2020-06-23T08:44:57Z

Gotcha. That's a shame :p

Dontbtme · 2020-06-23T09:14:35Z

I'll give you one last idea before I stop wasting your time and then I'll call it a day :p
Maybe a rank loses a given number of points per move on average depending on the stage of the game.
Suppose 2kyu players lose 2pts per move on average in the opening, 3pts per move on average in the middle game, and 1pt per move on average in the endgame... then that would give an order of magnitude to rank mistakes according to the stage of the game.
That would be useful I think, because the closer we are to the end of the game, the bigger of a mistake a 3 points loss would be, for example.

bale-go · 2020-06-23T09:23:53Z

The idea here is that average move rank is a more robust statistic than the average score loss.
It seems that it is possible to build a rank adjustable bot on top of the p-pick bot, which chooses the best of M (which depends on the kyu rank) out of N legal moves.
The cool thing about it is that this method gives us a score loss distribution that is very similar to the human play (see #44 (comment))

The rank estimation "simply" inverts the method used in the p-pick bot (after removing outliers).

sanderland · 2020-06-23T10:26:04Z

The idea here is that average move rank is a more robust statistic than the average score loss.

There is no particular evidence for this though, you can probably do well with score loss as well, or both. What is definitely true though is that the 15b single visit scoreLoss is very noisy/biased in endgame.

bale-go · 2020-06-23T11:21:46Z

I am pretty sure that using score loss alone or in combination with move rank can be used to create a human-like player that is even closer to the human style than the current bot.
The advantage of using move ranks (p-pick) is that it gives us a really simple and surprisingly well working first model. But it is definitely not the last word in the quest for mimicking human play.

What is definitely true though is that the 15b single visit scoreLoss is very noisy/biased in endgame.

This is exactly what I meant by more robust. It works even at policy level throughout the game.

sanderland · 2020-06-23T20:06:04Z

random 5k vs 6k game which black won by 131 points -- something definitely seems off with the rank estimate

Dontbtme · 2020-06-23T22:22:00Z

ROFL
Look at the winrate and score graphs: they're all over the place! I wouldn't worry, though. Seems like White played better overall but messed up big time at some point, so the rank graph might be working just fine but you picked the wrong game to try it on :D

bale-go · 2020-06-23T22:36:13Z

I cloned the v1.3 branch. I really like the way you solved to show the rank estimate.
I checked a random [sgf](sgf_ogs/katrain_power66 (7k) vs katrain-6k (4k) 2020-06-15 11 24 28_W+26.3.sgf).
The analyze_rank.py does not give the same result as the generated plot. The first and last segments were different by 4-5 kyus.

bale-go · 2020-06-24T07:50:55Z

I found a game that shows the issue nicely.
It is played between two really strong bots on computer go server.

The first segment is only 30 moves long (15 each) resulting in a poor estimation of that part of the game.

I fixed it by not plotting estimations that are made from less than 75% of the total segment length.

Using 20b model the estimation of strong bots is even more accurate. Please notice that the calibration of calibrated rank bot does not apply for 20b, but the trends are the same.

sanderland · 2020-06-24T09:47:02Z

yeah I was a bit to aggressive in wanting the line to look nice across the whole length. maybe we can fake it by extrapolating the first point backward ;)

bale-go · 2020-06-25T09:00:17Z

What do you mean by taking it all the way?
Making a histogram of the move ranks and using that to predict kyu rank might work for the full game, but I'm afraid that it would not be accurate for segments. Average helps with uncertainty of the prediction a lot (divides it by sqrt(N)).

sanderland · 2020-06-25T09:25:56Z

Well a course histogram with a few bins may be better, it's worth considering

bale-go · 2020-06-25T12:02:24Z

I did some number crunching.
I analysed 3000 sgfs (19x19 at least 100 moves).
The 2D kernel density plots are pretty good to show the effect of the patch.

2D kernel density plot without the move_cap_patch:

2D kernel density plot with the move_cap_patch, factor used is 0.07:

2D kernel density plot with the move_cap_patch, factor used is 0.09:

From the results I think 0.07 factor is a little bit too small, since it hinders the accurate estimation of sub 15k ranks.
I made histograms for different kyu games to show the effect of the patch in an other way (moves between 75 and 125).

As expected the patch does not affect SDK players too much. However at DDK levels it really helps with the spread of the distribution.

bale-go · 2020-06-26T16:50:13Z

The user data based AI (#74 (comment)) can be used for estimating user ranks more accurately.
It works much better at higher strengths than the previous calibrated rank bot. I created a PR, which includes the changes.

bale-go · 2020-06-27T14:45:01Z

The rank estimation of segments became much more accurate with the new user data based AI.
Due to the non-linearity of "outlier free mean of move ranks vs. the # of legal moves on board" function (#74 (comment)) for the new bot it is not possible to easily calculate the kyu rank of the entire game. Thus I used the median of the move quality of the segments to calculate the estimated kyu rank/game kyu rank.

I think it is pretty convincing, especially comparing it to previous estimations of user kyu ranks.

sanderland · 2020-06-27T21:59:40Z

It looks pretty good, and there are some impressive outputs when I try it on the OGS games. Incredible noise as well though, two games from the same player.

Move quality for moves 1 to 178 B: 10.4k W: 4.1k
Move quality for moves 1 to 285 B: 3.4d W: 4.6d

bale-go · 2020-06-28T15:00:53Z

I think a way to further decrease the noise could be by running the policy analysis a few times (3-5). The rank estimation function would be fed by the average of the reported move ranks.

sanderland · 2020-06-28T16:12:53Z

I think a way to further decrease the noise could be by running the policy analysis a few times (3-5). The rank estimation function would be fed by the average of the reported move ranks.

The policy is deterministic!

bale-go · 2020-06-28T16:23:45Z

That is interesting.
When I run analyse_rank.py multiple times on the same sgf, I get slightly different result each time.
1st run:

File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
Move quality for moves 1 to 324 B: 6.4k W: 6.8k
Move quality for moves 1 to 80 B: 3.2k W: 5.0k
Move quality for moves 41 to 120 B: 6.4k W: 7.0k
Move quality for moves 81 to 160 B: 6.4k W: 6.8k
Move quality for moves 121 to 200 B: 9.2k W: 4.7k
Move quality for moves 161 to 240 B: 7.2k W: 6.1k
Move quality for moves 201 to 280 B: 4.1k W: 8.7k
Move quality for moves 241 to 320 B: 2.9k W: 9.0k

2nd run:

File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
Move quality for moves 1 to 324 B: 6.7k W: 6.7k
Move quality for moves 1 to 80 B: 3.3k W: 4.9k
Move quality for moves 41 to 120 B: 7.2k W: 7.2k
Move quality for moves 81 to 160 B: 7.2k W: 6.7k
Move quality for moves 121 to 200 B: 8.6k W: 3.7k
Move quality for moves 161 to 240 B: 6.7k W: 5.2k
Move quality for moves 201 to 280 B: 6.1k W: 8.3k
Move quality for moves 241 to 320 B: 3.9k W: 7.9k

3rd run:

File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
Move quality for moves 1 to 324 B: 7.3k W: 5.9k
Move quality for moves 1 to 80 B: 3.9k W: 3.7k
Move quality for moves 41 to 120 B: 7.3k W: 5.4k
Move quality for moves 81 to 160 B: 7.8k W: 5.9k
Move quality for moves 121 to 200 B: 9.4k W: 4.4k
Move quality for moves 161 to 240 B: 7.9k W: 5.9k
Move quality for moves 201 to 280 B: 4.9k W: 9.7k
Move quality for moves 241 to 320 B: 3.1k W: 7.2k

sanderland · 2020-06-28T16:28:06Z

Aha, that may be because of the random rotations it does! Other than rotations it's deterministic, and there is no real way to force them currently.

bale-go · 2020-06-28T16:34:04Z

I see. Probably using the stronger 20b model will help in this respect.

sanderland · 2020-06-28T16:53:12Z

Yes, it's more consistent in that respect

sanderland · 2020-07-01T18:19:42Z

20b

lee sedol vs alphago game 4

alphazero vs alphago master

(15b) on more alphazero vs alphago master

alphago ddk confirmed

bale-go · 2020-07-02T07:02:20Z

Yes, it is weird to see so many yellow and red dots in games like that.
Then again, alphago style bots play strange moves when they are losing as they only care about winning the game.
In all of the above mentioned games the losing player played weaker moves.

Katago on the other hand plays to maximize the score, making its style more similar to humans.
I checked several Katago vs. Katago games on CGOS and they were all 5d+
If you look at Lee Sedol vs. AlphaGo games where Alphago won, you do not see the decrease in move quality near the end of the game.

1st game (Lee Sedol black):

2nd game (Lee Sedol white):

3rd game (Lee Sedol black):

5th game (Lee Sedol black):

Only in the last game did AlphaGo dip below 5d (except for the 4th game where it lost).
According to Michael Redmond it is where alphago missed the tombstone squeeze.

sanderland · 2020-07-03T18:43:35Z

I was trying to estimate the effect of the strength parameter of score loss and found even a fairly low number to beat the highest calibrated rank. Even though on OGS I found strength=0.5 to be around 5k maybe -- something is weird!

SGF: https://gokibitz.com/kifu/BJ4N3xpCI?path=341

bale-go · 2020-07-04T11:59:10Z

I could also reproduce the behaviour.
Interestingly, the rank estimation, which is based on the same calibration, is quite accurate for score loss.
After checking the log of several games I found that calibrated rank plays a quite strong game overall, but at certain moves it makes serious blunders leading to losing the game. That is why the rank estimation could not take them into account, the really bad moves were treated as outliers.

I found that it almost exclusively happens when two equally good moves are possible with high policy values, for example: top move: 50%, second best 42% (third best 3% etc.)
The obvious move detector does not see it, since the top move is only 50%. The P:Pick then chooses randomly the third or the fourth best. See red ellipses in the game between a 5d calibrated rank (B) bot and 0.5 score loss (W).

I decided to solve the issue by using the top policy when the sum of the two best policy value was over a certain threshold. After the patch, calibrated rank bot (5d) won all of its games against ScoreLoss.

I tested calibrated rank bot (2k) against score loss (0.5) too. The rank estimation worked and the two bots played an even game.

I created a PR for ai.py. Unfortunately, I could not simply apply the patch for the rank estimation since the policy value of the second best move is not available in graph.py.

sanderland · 2020-07-04T12:20:11Z

Let's give it a try. I'm a bit concerned about whether this eventually goes to obvious_n or something. Also, whether this is appropriate for the lower strength ais as I'm sure this case happens a lot in joseki, and it won't even play the second best move!

sanderland · 2020-07-04T15:47:37Z

still looks kind of similar, strength=0.5 murders calibrated rank 5d while being judged ddk :o

bale-go · 2020-07-04T16:09:06Z

Strange. I ran 6 games, 5d calibrated rank won all of them.
I use 15b model and 500 max visits.

sanderland · 2020-07-04T16:31:04Z

I ran two more and calibrated rank won, perhaps it was a fluke

bale-go · 2020-07-05T12:06:24Z

It seems that the bots became too strong (OGS ranks).
I run several games to see the percentage of overrides.
Obvious best moves get overridden in 16% of the moves. The best two moves patch caused an additional 12% of the moves overridden (28% in total). That caused a significant change in the strength.
I opted to set a constant and pretty high threshold for the two best moves to keep the original calibration, but remove obvious cases. Now only 3% of the moves are overridden due to two obvious moves.

sanderland · 2020-07-05T12:10:14Z

Yes, particularly the weaker bots are getting a lot stronger. the effect on the higher ranks seems less, perhaps because policy alone becomes a less good strategy. updated ogs bots, let's see them plummet

sanderland added the 1.3 label Jun 16, 2020

sanderland mentioned this issue Jun 16, 2020

AIs calibrated to kyu/dan strength with easier to understand settings #44

Closed

bale-go mentioned this issue Jun 17, 2020

plot_segment_quality sanderland/katrain-bots#3

Merged

bale-go mentioned this issue Jun 21, 2020

move-eval-plot sanderland/katrain-bots#4

Merged

bale-go mentioned this issue Jun 24, 2020

segment_patch #97

Closed

bale-go mentioned this issue Jun 27, 2020

user_data_based_AI_message_at_the_end_of_sgf sanderland/katrain-bots#5

Merged

bale-go mentioned this issue Jul 4, 2020

obvious_two #121

Merged

bale-go mentioned this issue Jul 5, 2020

two_obvious_stricter #125

Merged

sanderland removed the 1.3 label Jul 7, 2020

sanderland changed the title ~~Rank feedback to user~~ Improving rank feedback to user Jul 7, 2020

sanderland added the rank-calibration label Jul 7, 2020

portkata mentioned this issue Sep 27, 2020

Android binary lightvector/KataGo#321

Open

sanderland closed this as completed Oct 27, 2020

Improving rank feedback to user #73

Improving rank feedback to user #73

Comments

sanderland commented Jun 16, 2020

bale-go commented Jun 17, 2020

sanderland commented Jun 17, 2020

sanderland commented Jun 18, 2020

bale-go commented Jun 18, 2020

sanderland commented Jun 18, 2020

bale-go commented Jun 18, 2020

sanderland commented Jun 21, 2020

bale-go commented Jun 21, 2020 • edited Loading

bale-go commented Jun 21, 2020

sanderland commented Jun 21, 2020 • edited Loading

bale-go commented Jun 21, 2020

sanderland commented Jun 22, 2020

Dontbtme commented Jun 23, 2020 • edited Loading

sanderland commented Jun 23, 2020

bale-go commented Jun 23, 2020

Dontbtme commented Jun 23, 2020 • edited Loading

Dontbtme commented Jun 23, 2020 • edited Loading

bale-go commented Jun 23, 2020

sanderland commented Jun 23, 2020

bale-go commented Jun 23, 2020 • edited Loading

sanderland commented Jun 23, 2020

Dontbtme commented Jun 23, 2020 • edited Loading

bale-go commented Jun 23, 2020

bale-go commented Jun 24, 2020 • edited Loading

sanderland commented Jun 24, 2020 • edited Loading

bale-go commented Jun 25, 2020 • edited Loading

sanderland commented Jun 25, 2020

bale-go commented Jun 25, 2020 • edited Loading

bale-go commented Jun 26, 2020

bale-go commented Jun 27, 2020 • edited Loading

sanderland commented Jun 27, 2020

bale-go commented Jun 28, 2020

sanderland commented Jun 28, 2020

bale-go commented Jun 28, 2020

sanderland commented Jun 28, 2020

bale-go commented Jun 28, 2020

sanderland commented Jun 28, 2020

sanderland commented Jul 1, 2020

bale-go commented Jul 2, 2020 • edited Loading

sanderland commented Jul 3, 2020

bale-go commented Jul 4, 2020

sanderland commented Jul 4, 2020

sanderland commented Jul 4, 2020

bale-go commented Jul 4, 2020

sanderland commented Jul 4, 2020

bale-go commented Jul 5, 2020 • edited Loading

sanderland commented Jul 5, 2020

bale-go commented Jun 21, 2020 •

edited

Loading

sanderland commented Jun 21, 2020 •

edited

Loading

Dontbtme commented Jun 23, 2020 •

edited

Loading

Dontbtme commented Jun 23, 2020 •

edited

Loading

Dontbtme commented Jun 23, 2020 •

edited

Loading

bale-go commented Jun 23, 2020 •

edited

Loading

Dontbtme commented Jun 23, 2020 •

edited

Loading

bale-go commented Jun 24, 2020 •

edited

Loading

sanderland commented Jun 24, 2020 •

edited

Loading

bale-go commented Jun 25, 2020 •

edited

Loading

bale-go commented Jun 25, 2020 •

edited

Loading

bale-go commented Jun 27, 2020 •

edited

Loading

bale-go commented Jul 2, 2020 •

edited

Loading

bale-go commented Jul 5, 2020 •

edited

Loading