Regression analysis of the eval-result relationship #4150
Replies: 9 comments 9 replies
-
The optimism patch tries to remove optimism right away (though probably not in the best way).
Analysis of optimism-master (STC)
Number of games: 23428
master-97860cb575
Sample size: 1797654
Data:
optimism-c2b9ffbf55
Sample size: 1797515
Data:
-
Now I've tested a version that removes all the parameters that make the eval-result relationship asymmetric. As can be seen, the new function is much more symmetric, although still not perfectly symmetric. Maybe the NNUE evaluation is asymmetric too? Despite this, the new evaluation is much better.
Analysis of symmetric_eval vs master
New-55cd95d3d2
Sample size: 8314374
Data:
Base-97860cb575
Sample size: 8314220
Data:
-
Here is the analysis of the master based on two tests, where it plays against itself. Note that the first test finished with an unusual result:
Analysis of 160000 games - master vs master and SPRT master vs master
Sample size: 42135786
Data:
-
This is an LTC version of this test.
Analysis of symmetric_eval vs master (LTC)
New-af6b887ffd
Sample size: 3193181
Data:
Base-82bb21dc7a
Sample size: 3193059
Data:
-
The same engine versions as above, but with the data from two tests.
SPRT symmetric_eval vs master, LTC: one and two
New-af6b887ffd
Sample size: 13339501
Data:
Base-82bb21dc7a
Sample size: 13339355
Data:
-
The test with the recent version shows that the relationship is still asymmetric. I don't understand the reason yet. The graph plots the prediction function f(x) along with 1 - f(-x) to demonstrate the asymmetry.
160000 games - master vs master (STC)
master-dc0c441b7c
Sample size: 25035456
Data:
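The comparison of f(x) with 1 - f(-x) described in this comment can be sketched as follows. This is only an illustration: the helper names `mirror` and `predict` and the three-row table are hypothetical, and the rows use the (start, end, predicted result) layout of the tabular data. An eval that falls between two intervals is assigned to the interval before it, which is a choice made here for simplicity.

```python
import bisect
from typing import List, Tuple

Row = Tuple[int, int, float]  # (eval_start, eval_end, predicted_result)

def mirror(table: List[Row]) -> List[Row]:
    """Return the table of g(x) = 1 - f(-x) for a step function f."""
    # Negating the evals reverses their order, so walk the rows backwards.
    return [(-hi, -lo, 1.0 - p) for lo, hi, p in reversed(table)]

def predict(table: List[Row], ev: int) -> float:
    """Look up the predicted result for an eval in a table sorted by eval."""
    starts = [lo for lo, _, _ in table]
    index = max(bisect.bisect_right(starts, ev) - 1, 0)
    return table[index][2]

# Hypothetical table, not taken from any of the tests in this thread.
f_table = [(-100000, -101, 0.1), (-100, 99, 0.5), (100, 100000, 0.8)]
g_table = mirror(f_table)

for ev in (-200, 0, 200):
    gap = predict(f_table, ev) - predict(g_table, ev)
    print(ev, predict(f_table, ev), predict(g_table, ev), gap)
```

For a perfectly symmetric prediction function the two curves coincide and the gap is zero at every eval.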
-
The version with optimism removed plays against itself. The prediction function is still asymmetric, which suggests the source of the asymmetry is elsewhere. However, it is evident that optimism significantly modifies the shape of the function.
Analysis of 100000 games - optimism vs optimism
optimism-ac8fe925d5
Sample size: 14916048
Data:
-
The version without optimism is tested against master. As the results show, optimism significantly changes the shape of the prediction function, while the mean square deviation (MSD) is lower for master (both in the asymmetric and symmetric models).
Analysis of 100000 games - optimism vs master
New-5ec8691e82
Sample size: 7401187
Data:
Base-da937e219e
Sample size: 7400780
Data:
-
I'm currently analyzing PGNs of games played on Fishtest to get an idea of how well the displayed eval of an engine predicts the game result.
The method is as follows. I collect all the games played in one test and analyze each of the two engine versions separately. For every move the engine in question plays, in every game, the eval (from that engine's perspective) is recorded along with the result of the game (again from that engine's perspective). The result of one game therefore appears in the data multiple times, once for each move the engine played in that game. Then I run an isotonic regression on the data, with the evals as the independent variable and the game results as the dependent variable. That is, we obtain a non-decreasing function which, applied to the evals, best predicts the game results.
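The regression step can be sketched with the pool-adjacent-violators algorithm, a standard way to compute a least-squares isotonic fit. This is a minimal sketch, not the actual tooling used for these analyses: the function names `isotonic_fit` and `mean_square_deviation` are hypothetical, results are assumed to be encoded as 1, 0.5 and 0 for win, draw and loss, and the mean square deviation is assumed to be the average squared difference between the game result and the fitted value.

```python
import bisect
from typing import Dict, List, Tuple

Sample = Tuple[int, float]    # (eval in centipawns, game result)
Row = Tuple[int, int, float]  # (eval_start, eval_end, predicted_result)

def isotonic_fit(samples: List[Sample]) -> List[Row]:
    """Fit a non-decreasing step function to (eval, result) samples."""
    # Group samples that share an eval so each eval gets a single value.
    sums: Dict[int, Tuple[float, int]] = {}
    for ev, res in samples:
        total, count = sums.get(ev, (0.0, 0))
        sums[ev] = (total + res, count + 1)

    # Each block is [eval_start, eval_end, sum_of_results, sample_count].
    blocks: List[list] = []
    for ev in sorted(sums):
        total, count = sums[ev]
        blocks.append([ev, ev, total, count])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][2] / blocks[-2][3] > blocks[-1][2] / blocks[-1][3]):
            _, end, merged_total, merged_count = blocks.pop()
            blocks[-1][1] = end
            blocks[-1][2] += merged_total
            blocks[-1][3] += merged_count
    return [(lo, hi, total / count) for lo, hi, total, count in blocks]

def mean_square_deviation(samples: List[Sample], table: List[Row]) -> float:
    """Average squared difference between each result and its fitted value."""
    starts = [lo for lo, _, _ in table]
    total = 0.0
    for ev, res in samples:
        predicted = table[bisect.bisect_right(starts, ev) - 1][2]
        total += (res - predicted) ** 2
    return total / len(samples)

# Made-up samples, only to show the shape of the input and output.
samples = [(-100, 0.0), (-100, 0.5), (0, 1.0), (0, 0.0), (50, 0.0), (200, 1.0)]
table = isotonic_fit(samples)
for row in table:
    print(row)
print(mean_square_deviation(samples, table))
```

In the made-up data the evals 0 and 50 end up in one interval, because the mean result at 50 is lower than at 0 and a non-decreasing fit has to pool them.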
The tabular data of an analysis is interpreted as follows. The first two columns show the start and end of an eval interval in centipawns. The mate score +M<n> is encoded as 100000 - n and -M<n> as -100000 + n. The third column shows the predicted game result on this interval, which lies in the range 0..1.
This analysis highlights problems with the current version. The function is clearly asymmetric, which is a problem when using the engine for analysis. For instance, if the eval is 200 cp, the expected result is 0.8014, while in the -200 cp case it is 0.1258, whereas a symmetric function would give 1 - 0.8014 = 0.1986. This means that when we make a move while analysing a game, the eval will jump even when the position doesn't actually improve or become worse. The cause of this asymmetry is the optimism parameter in the evaluation function of the Stockfish code. I've already described the issue here: #4142
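The mate-score encoding above can be written as a small conversion helper. This is a sketch under assumptions: the function name `eval_to_int` is hypothetical, and the input is assumed to be a string holding either a centipawn value or a mate score in the +M<n> / -M<n> form. How the evals are actually extracted from the PGNs is not shown here.

```python
import re

MATE_BASE = 100000  # from the post: +M<n> is 100000 - n, -M<n> is -100000 + n

def eval_to_int(text: str) -> int:
    """Convert a displayed eval such as '200', '-35', '+M5' or '-M12' to an integer."""
    match = re.fullmatch(r"([+-]?)M(\d+)", text.strip())
    if match:
        sign, moves = match.group(1), int(match.group(2))
        # A shorter mate is a more extreme score on either side.
        return -MATE_BASE + moves if sign == "-" else MATE_BASE - moves
    # Anything else is assumed to be a plain centipawn value.
    return int(text)

print(eval_to_int("+M5"), eval_to_int("-M12"), eval_to_int("-200"))
```

With this encoding every mate score sorts beyond any ordinary centipawn eval, so mates fit into the same ordered eval axis that the regression uses.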
I'm planning to test the engine against the same version to get purer results.
Analysis of SPRT sqlmrNt1 vs master (STC)
master-97860cb575
Sample size: 15621508
Mean square deviation: 0.03362709949940663
Data:
sqlmrNt1-335f369d4c
I'm not including the graph, since it is nearly identical to that of the first engine, shown above.
Sample size: 15621571
Mean square deviation: 0.033600383699658516
Data: