Performance comparison between TPE multivariate and CMAES

A. Setup

The goal is to find out which sampler is better, tpe multivariate or cmaes, when optimizing the knight and rook values of the Deuterium engine. The objective value is calculated from the result of an engine vs engine match of 200 games at a time control of 2s+50ms. Each study runs 100 trials. Once the two studies are completed, the optimized parameter values from each sampler will be tested against the default parameter values of Deuterium.

Default knight and rook values of Deuterium

KnightValueOp = 325
KnightValueEn = 315
RookValueOp = 493
RookValueEn = 525

Op stands for Opening while En stands for Ending.

The optimized parameter values will be verified by a round-robin tournament among the default, tpe multivariate and cmaes parameter sets. Each pair plays a match of 2000 games at a time control of 10s+50ms.

Optuna version 2.6.0 is used in this optimization.

B. Parameters to be optimized

The default rook values are close to 500, yet the max of RookValueEn is set to 700. This tests how the samplers handle such a wide range when both are limited to only 100 trials.

--input-param "{'KnightValueOp': {'default':225, 'min':200, 'max':400, 'step':1}, 'KnightValueEn': {'default':215, 'min':200, 'max':400, 'step':1}, 'RookValueOp': {'default':400, 'min':300, 'max':600, 'step':1}, 'RookValueEn': {'default':625, 'min':400, 'max':700, 'step':1}}"
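To put the 100-trial budget in perspective, the following minimal sketch counts how many distinct parameter combinations the ranges above allow. It only parses the --input-param value as a Python dict literal; the helper name grid_sizes is made up for this illustration and is not part of tuner.py.

```python
import ast
from math import prod

# The --input-param value from this section, copied verbatim.
INPUT_PARAM = (
    "{'KnightValueOp': {'default':225, 'min':200, 'max':400, 'step':1}, "
    "'KnightValueEn': {'default':215, 'min':200, 'max':400, 'step':1}, "
    "'RookValueOp': {'default':400, 'min':300, 'max':600, 'step':1}, "
    "'RookValueEn': {'default':625, 'min':400, 'max':700, 'step':1}}"
)


def grid_sizes(input_param):
    """Return the number of allowed values per parameter (hypothetical helper)."""
    params = ast.literal_eval(input_param)
    return {
        name: (spec['max'] - spec['min']) // spec['step'] + 1
        for name, spec in params.items()
    }


sizes = grid_sizes(INPUT_PARAM)
total = prod(sizes.values())

print(sizes)   # values per parameter
print(total)   # size of the full search grid
```

Each knight value has 201 candidates and each rook value has 301, so the full grid holds roughly 3.66 billion combinations, of which each sampler sees only 100.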

C. The initial best value

This optimization requires that new best parameter values beat the old best parameter values with a score of more than 55%.
--initial-best-value=0.55
That means that in a match of 200 games the new best parameter values should score more than 110 points. This high winning margin increases the probability that the parameter values output by the optimization are truly the best.
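The arithmetic behind the threshold can be sketched as follows. The conversion of a score fraction into a rating difference uses the standard logistic Elo formula, which is an assumption added here for illustration; the tuner itself works with the raw score.

```python
import math


def points_needed(games, threshold):
    """Points a candidate must exceed in a match of the given length."""
    return games * threshold


def score_to_elo(score):
    """Rating difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score / (1.0 - score))


print(f"points to exceed: {points_needed(200, 0.55):.1f}")
print(f"implied Elo lead: {score_to_elo(0.55):.1f}")
```

Under that model a 55% score corresponds to a lead of about 35 Elo, which shows how demanding the requirement is for a change in piece values.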

D. Command line

tpe multivariate batch file

set study_name=tpe_multivariate
set engine_file=./engines/deuterium/deuterium.exe
set threshold_pruner_result=0.35

python tuner.py --study-name %study_name% --sampler name=tpe multivariate=true ^
--initial-best-value=0.55 ^
--engine %engine_file% --common-param "{'Hash': 128}" ^
--concurrency 6 --opening-file ./start_opening/ogpt_chess_startpos.epd --opening-format epd ^
--input-param "{'KnightValueOp': {'default':225, 'min':200, 'max':400, 'step':1}, 'KnightValueEn': {'default':215, 'min':200, 'max':400, 'step':1}, 'RookValueOp': {'default':400, 'min':300, 'max':600, 'step':1}, 'RookValueEn': {'default':625, 'min':400, 'max':700, 'step':1}}" ^
--games-per-trial 200 --trials 100 ^
--base-time-sec 2 --inc-time-sec 0.05 ^
--pgn-output %study_name%.pgn ^
--threshold-pruner result=%threshold_pruner_result%

cmaes batch file

set study_name=cmaes
set engine_file=./engines/deuterium/deuterium.exe
set threshold_pruner_result=0.35

python tuner.py --study-name %study_name% --sampler name=cmaes ^
--initial-best-value=0.55 ^
--engine %engine_file% --common-param "{'Hash': 128}" ^
--concurrency 6 --opening-file ./start_opening/ogpt_chess_startpos.epd --opening-format epd ^
--input-param "{'KnightValueOp': {'default':225, 'min':200, 'max':400, 'step':1}, 'KnightValueEn': {'default':215, 'min':200, 'max':400, 'step':1}, 'RookValueOp': {'default':400, 'min':300, 'max':600, 'step':1}, 'RookValueEn': {'default':625, 'min':400, 'max':700, 'step':1}}" ^
--games-per-trial 200 --trials 100 ^
--base-time-sec 2 --inc-time-sec 0.05 ^
--pgn-output %study_name%.pgn ^
--threshold-pruner result=%threshold_pruner_result%

E. Optimization Results

tpe multivariate best parameters

The best trial found was trial 88. That means none of the succeeding trials (89 onward) could beat trial 88 with a score of more than 55% in the engine vs engine match.

study best param: {'KnightValueEn': 293, 'KnightValueOp': 329, 'RookValueEn': 526, 'RookValueOp': 543}
study best value: 0.5511718750000001
study best trial number: 88

cmaes best parameters

study best param: {'KnightValueEn': 337, 'KnightValueOp': 329, 'RookValueEn': 508, 'RookValueOp': 496}
study best value: 0.5511718750000001
study best trial number: 7

F. Game verification

Test condition

format: round-robin
games per pair: 2000
start pgn: mabigat.pgn
each start position is played twice, side reversed: Yes
time control: 10s+50ms
tournament manager: cutechess-cli

Final result table

cmaes is better with a score of 2008/4000, while tpe multivariate scored 1970/4000. In terms of rating cmaes leads by +4.5, which is not statistically significant, as this lead is still within the error margin of +/- 9 at the 95% confidence level.

In the head-to-head encounter cmaes won over tpe multivariate with a result of 2000 ( 519, 1010, 471), that is 2000 games, 519 wins, 1010 draws and 471 losses.

Both optimized parameter sets are still behind the default values, but the default's lead is also not statistically significant, as the error of +/- 9 is still larger than its lead.
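As a rough cross-check of the significance claim, the head-to-head result of cmaes vs tpe multivariate can be turned into a score with a 95% interval. This is a minimal sketch using a normal approximation on the win/draw/loss counts. It is not the method used by the rating tool that produced the table below, and the function name score_interval is chosen only for this example.

```python
import math


def score_interval(wins, draws, losses, z=1.96):
    """Return (mean score, lower bound, upper bound) from match counts.

    Uses the per-game variance of the outcomes 1, 0.5 and 0 and a
    normal approximation; z=1.96 gives a 95% interval.
    """
    games = wins + draws + losses
    mean = (wins + 0.5 * draws) / games
    variance = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / games
    margin = z * math.sqrt(variance / games)
    return mean, mean - margin, mean + margin


# cmaes vs tpe multivariate: 519 wins, 1010 draws, 471 losses
mean, low, high = score_interval(519, 1010, 471)
print(f"score={mean:.3f}, 95% interval=({low:.3f}, {high:.3f})")
print("significant" if low > 0.5 or high < 0.5 else "not significant")
```

The interval contains 0.5, so under this approximation the head-to-head lead of cmaes is also not statistically significant.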

Summary:

   # PLAYER             :  RATING  ERROR  POINTS  PLAYED   (%)
   1 default            :     0.0   ----  2022.0    4000    51
   2 cmaes              :    -1.6    8.7  2008.0    4000    50
   3 tpemultivariate    :    -6.1    9.1  1970.0    4000    49

Head to head statistics:

1) default          0.0 :   4000 (+997,=2050,-953),  50.5 %

   vs.                    :  games (   +,    =,   -),   (%) :    Diff,    SD, CFS (%)
   cmaes                  :   2000 ( 506, 1020, 474),  50.8 :    +1.6,   4.5,   64.4
   tpemultivariate        :   2000 ( 491, 1030, 479),  50.3 :    +6.1,   4.7,   90.5

2) cmaes           -1.6 :   4000 (+993,=2030,-977),  50.2 %

   vs.                    :  games (   +,    =,   -),   (%) :    Diff,    SD, CFS (%)
   default                :   2000 ( 474, 1020, 506),  49.2 :    -1.6,   4.5,   35.6
   tpemultivariate        :   2000 ( 519, 1010, 471),  51.2 :    +4.5,   4.5,   83.9

3) tpemultivariate -6.1 :   4000 (+950,=2040,-1010),  49.3 %

   vs.                    :  games (   +,    =,    -),   (%) :    Diff,    SD, CFS (%)
   default                :   2000 ( 479, 1030,  491),  49.7 :    -6.1,   4.7,    9.5
   cmaes                  :   2000 ( 471, 1010,  519),  48.8 :    -4.5,   4.5,   16.1

One reason why the samplers could not beat the default is that the objective value is calculated from only 200 games, whereas in the game verification each parameter set plays 4000 games. There are surely some positions in the game verification that are not represented during objective measurement. Also, the default values were tested in more than 10k games during development, so it would be difficult to defeat them easily.

The samplers are not doing badly: they managed to perform close to the default even though they were given a wider search space.
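The effect of the match length on the noise of the objective value can be estimated with a small sketch. It assumes two equally strong engines and a draw ratio of 50%. The draw ratio is an assumption loosely based on the verification matches, and the actual ratio at 2s+50ms may differ.

```python
import math


def score_margin(games, draw_ratio, z=1.96):
    """95% margin of the match score for two equally strong engines.

    With equal strength the win and loss probabilities are both
    (1 - draw_ratio) / 2, so the per-game variance is (1 - draw_ratio) / 4.
    """
    variance = (1.0 - draw_ratio) / 4.0
    return z * math.sqrt(variance / games)


for games in (200, 2000):
    margin = score_margin(games, draw_ratio=0.5)
    print(f"{games} games: score margin +/- {margin:.3f}")
```

Under these assumptions a 200-game match has a margin of about +/- 0.049, which is almost as large as the 0.05 lead required by --initial-best-value=0.55, so a parameter set of equal strength can occasionally pass the threshold by chance. At 2000 games the margin shrinks to about +/- 0.015.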

--input-param "{'KnightValueOp': {'default':225, 'min':200, 'max':400, 'step':1}, 'KnightValueEn': {'default':215, 'min':200, 'max':400, 'step':1}, 'RookValueOp': {'default':400, 'min':300, 'max':600, 'step':1}, 'RookValueEn': {'default':625, 'min':400, 'max':700, 'step':1}}"
