
SPRT parameters improvement #1859

Closed
mcostalba opened this issue Dec 9, 2018 · 132 comments

@mcostalba

mcostalba commented Dec 9, 2018

After a very long and interesting discussion #1804 (comment) I think it's time to summarize in a quantitative way what people have found and proceed in a practical / executive fashion.

The topic is the improvement of SPRT parameters aimed at:

  • Improving precision, i.e. discarding zero or negative patches and accepting positive ones with a higher margin.

  • Managing resource consumption.

It is immediately clear that these two goals pull in opposite directions; anyhow, from the discussion there seems to be some consensus to allow a reasonable increase in test resources in exchange for stricter parameters.

To keep the discussion quantitative and to the point, I'd like people to post here only simulation results, in a simple and standardized format:

Limits SPRT[0, 5] STC + SPRT[0, 5] LTC
0 ELO pass prob xxx
1 ELO pass prob xxx
<= 0 ELO pass prob xxx
>= 1 ELO fail prob xxx
Avg. STC cost xxx
Avg. STC + LTC Cost xxx

The first two records give a taste of the sensitivity of the scheme, the third and fourth of its failure rate, and the last two of its cost.

Now, on the cost. Patch ELO is not uniformly distributed: most tested patches fall within [-2, 0] ELO (neutral or slightly negative). We have to take this into account to compute a standardized cost.

I propose the following. Given:

ag(x) = Average number of games for patches with x ELO

We define the cost as:

STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)

For the STC + LTC cost we have to consider that:

LTC Cost = STC Cost * 6
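
As a minimal sketch (function names and the `ag` interface are illustrative, not part of any existing tool), the cost metric above could be computed like this, assuming the simulator provides ag(x):

```python
# Minimal sketch of the proposed cost metric. `ag` is assumed to be a function
# supplied by the simulator: average number of games for a patch of the given ELO.
def stc_cost(ag):
    # weights roughly encode the assumed ELO distribution of submitted patches (in %)
    weights = {-2: 10, -1: 35, 0: 40, 1: 25, 2: 5}
    return sum(w * ag(elo) for elo, w in weights.items())

def ltc_cost(ag_ltc):
    # per the relation above, an LTC game is weighted 6x an STC game
    return 6 * stc_cost(ag_ltc)
```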

Given that all the above is not trivial to implement, I would split this project into 2 phases:

  1. First find a standard tool to reliably compute all the above. Publish it and let people refer to the same measuring tool.

  2. Once we have an open, shared and validated tool, compute the different schemes.

The standard tool should be sound. I would prefer a tool based on simulations rather than on formulas: it seems simpler and easier to review for a wider audience, and also more flexible. Everybody is encouraged to submit their script / simulator. Once the process has stabilized on a shared and validated single tool, we will proceed to the second phase.
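
To illustrate the simulation-based approach (this is not the official tool, just a rough sketch: it assumes i.i.d. win/draw/loss games, a fixed draw ratio and plain logistic Elo for the bounds, whereas fishtest uses BayesElo), the pass probability of a single SPRT test could be estimated like this:

```python
# Rough sketch: estimate the pass probability of one SPRT test by simulation.
import math, random

def sprt_pass_prob(elo, lower, upper, alpha=0.05, beta=0.05,
                   draw_ratio=0.58, n_sims=1000, max_games=200000):
    def wdl(e):
        score = 1.0 / (1.0 + 10.0 ** (-e / 400.0))   # expected score from Elo
        w = score - draw_ratio / 2.0                  # score = w + d/2
        return w, draw_ratio, 1.0 - w - draw_ratio

    upper_llr = math.log((1.0 - beta) / alpha)        # accept H1 (pass)
    lower_llr = math.log(beta / (1.0 - alpha))        # accept H0 (fail)
    w0, d0, l0 = wdl(lower)
    w1, d1, l1 = wdl(upper)
    passes = 0
    for _ in range(n_sims):
        w, d, l = wdl(elo)
        llr = 0.0
        for _ in range(max_games):                    # unfinished runs count as fails
            r = random.random()
            if r < w:
                llr += math.log(w1 / w0)
            elif r < w + d:
                llr += math.log(d1 / d0)
            else:
                llr += math.log(l1 / l0)
            if llr >= upper_llr:
                passes += 1
                break
            if llr <= lower_llr:
                break
    return passes / n_sims

# e.g. chance that a 1 Elo patch passes SPRT[0, 5]:
# print(sprt_pass_prob(1.0, 0.0, 5.0))
```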

It won't be quick, but this is an engineering approach to the problem: solid, slow and boring :-)

@vondele
Member

vondele commented Dec 9, 2018

@mcostalba did you have a look at the actual tool I made available #1804 (comment) ? Just click the mybinder icon, and have a look.

yes, it took about two weeks to write, so at least the slow bit is there ;-)

@mcostalba
Author

@vondele Great job! The tool is here: http://nbviewer.jupyter.org/github/vondele/Stockfish/blob/tools/tools/Fishtest_SPRT_opimization.ipynb

Some questions:

  • You use a Gaussian distribution for the ELO of the tested patch, but this is not a good model because the ELO distribution is clearly asymmetric: there are more patches in [-2, 0] than in [0, 2]. So could your tool be changed to account for the real distribution in [-4, 4]? Outside of this range we don't care in practice.

  • You lost me on the total cost calculation: I understand the cost for a patch of a given X ELO, but I don't understand whether you weight the total cost by the real ELO distribution.

  • Would it be possible to fill the above table (at the top of the post) with the data from your tool? (For the cost it is OK to use the avg. number of games as you have defined it.)

@Yery

Yery commented Dec 9, 2018

In addition to changing the bounds, it would probably also make sense to increase the time control of the STC, to make the difference between STC and LTC smaller and to reduce the need for, and surprises with, "speculative LTCs". I would propose 20 + 0.2.

@Alayan-stk-2

@mcostalba Having done tweaks and bound testing with vondele's tool, I think I can give decent answers to your questions:

  • The gaussian distribution is indeed not a good model because elo-losing patches outnumber elo-gaining patches. The tool can absolutely be updated to use a different model. As I wrote here: [merged] Pawn and Piece Values Tuned #1804 (comment), I could easily replace the simple gaussian with a function returning the gaussian value for x/3 when x < 0 (see the sketch after this list). This quick hack helped to better approximate the real distribution, and the rest of the tool adapted fine to the change. I think we can devise a better formula than this for the patch elo distribution, and then it's very easy to put it in the code and have everything use the new distribution.
  • The total cost in vondele's tool depends on the elo distribution of the patches. Doing my quick hack mentioned above to get a better approximation of the elo distribution of the patches did change the total cost result, and in the expected way. I also had a look at that code and noticed nothing wrong.
  • All the values in the table can be computed with it.
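
For reference, a rough sketch of the quick hack described in the first bullet above (mu and sigma here are illustrative placeholders, not the values used in the notebook):

```python
# Keep a gaussian density, but stretch the negative side by evaluating it at x/3,
# so that elo-losing patches dominate. Returns relative weights only.
import math

def gaussian(x, mu=0.5, sigma=1.5):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def skewed_patch_density(x):
    return gaussian(x / 3.0) if x < 0 else gaussian(x)
```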

Also note that the tool focuses on the cost of 5xSTC + LTC rather than STC + LTC (the LTC only costs resources if the STC passes; the tool accounts for this), because in practice several similar variations of a tweak often get submitted to fishtest. But this can be changed easily.

@Yery While a longer TC for testing is attractive for reducing low-depth flukes, it is also more costly in resources. As much as I'd like fishtest to use a longer STC and LTC, we have to be careful with this. A ratio of only 3 between STC (which is supposed to act first as a filter) and LTC is likely not a good compromise between resource use and result quality. STC 15 / LTC 75 would likely consume similar or smaller resources than 20/60, with better overall results.

While discussing a TC increase fits here (the optimal bounds depend directly on the time ratio between STC and LTC), the increased precision that is the main focus here will already increase resource usage. Adding another multiplicative resource increase on top of it may be too much.

@mcostalba
Author

mcostalba commented Dec 9, 2018

@Yery thanks for your comment, but I'd first like to land on a validated and shared tool. Once we have it we can have all the discussion we want, but for the moment, to avoid losing focus, please let's stick to the tool implementation details.

@Alayan-stk-2 for the real distribution I would suggest a real array of values (we just need fewer than 10 of them) that correspond to the real distribution. Easy and explicit.

@vondele could you please copy / paste your tool and simplify it down to just compute the values of a single SPRT combination? We will call it the official simulation tool, and it should be simple: no need to compute both new and old SPRT parameters, just one set, whose results we will report here.

@Alayan-stk-2

> @Alayan-stk-2 for the real distribution I would suggest a real array of values (we just need fewer than 10 of them) that correspond to the real distribution. Easy and explicit.

Vondele's tool works with continuous functions in many places. Turning it into the evaluation of a few discrete patch elo values would likely be more work for a worse result.

The array idea based on observed fishtest data is still applicable, however. We can get a table with the values hardcoded, then do a linear interpolation between the two closest values in the table.

Trivial example in pseudocode to get the value of the patch elo probability function at x:

array[] = {0.1, 0.15, 0.08}     // density values at -1, 0, 1 elo

if (x < -1 || x > 1)
    return 0

x++                             // shift x into [0, 2] so it lines up with the array indices
for (i = 0; i < 2; i++)         // 2 intervals between the 3 sample points
{
    if (x >= i && x <= i + 1)
    {
        // linear interpolation between the two surrounding sample values
        return ((i + 1 - x) * array[i] + (x - i) * array[i + 1])
    }
}

@vondele
Member

vondele commented Dec 9, 2018

@mcostalba

let me have a look, the table can be 'easily' computed with the notebook.

Your STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)
is quite close to a slightly different Gaussian model (mu = -0.21, sigma = 1.112) of the patch distribution, which I will use (it is quite realistic, even if optimistic). It is easier to work with distributions.
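
Purely as an illustration (not taken from the notebook), the closeness of the discrete weights to that Gaussian can be checked like this:

```python
# Compare Marco's discrete weights with the Gaussian patch-ELO model quoted above.
import math

def gauss(x, mu=-0.21, sigma=1.112):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

weights = {-2: 10, -1: 35, 0: 40, 1: 25, 2: 5}
norm = sum(gauss(x) for x in weights)
for x, w in weights.items():
    print(f"elo {x:+d}: weight {w}%  vs  gaussian {100.0 * gauss(x) / norm:.1f}%")
```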

@ghost

ghost commented Dec 9, 2018

What are the arguments against tighter time controls like 1+0.2 and 2+1? In theory they should use less time thanks to shorter games and have less variation in searched depth (more reliable results for SPRT).
#1804 (comment)

@ghost

ghost commented Dec 9, 2018

I'm currently running this test at low throughput to see how variable win-loss records lead to SPRT distortion (random depths searched due to the time slice in 10+0.1 being widely different):
http://tests.stockfishchess.org/tests/view/5c0c43c10ebc5902bceed541
and its comments here:
https://github.com/Chess13234/Stockfish/commit/23c7a4101afd6e407ced0b2e92ecd5002cc19434

@mcostalba
Author

@vondele thanks. Actually my numbers are just out of a hat. I'd guess the best ones are the histograms you plotted in the "ELO estimates of STC tests" graph, assuming you got them from real tests (indeed I don't know their source).

@mcostalba
Author

@vondele I am not an expert in statistics, but maybe we could find a distribution that better resembles the data, for instance a skewed one: https://en.wikipedia.org/wiki/Skewness

@vondele
Member

vondele commented Dec 9, 2018

@mcostalba yes, I got them from the fishtest analysis (elo estimates of the past X-thousand tests or so). However, there is no need to be overly concerned with fitting that data (it is easy); the part below zero matters little (it is quickly rejected with the testing schemes we have, and most resources go to the positive part of the elo spectrum).

The most important aspect is that the distribution of patches tested on fishtest is not absolute, but depends on the test scheme anyway (i.e. unavoidably, patches close to passing will be tweaked and resubmitted). Therefore the notebook is just testing reasonable assumptions and can guide a policy decision, but can't predict the output of a bounds change (that's more a social experiment than natural science).

I can have the notebook with the table implemented in a couple of minutes, but I will avoid simplifying it; that's more work, and as you mentioned the graphs are IMO more informative than the table.

@vondele
Member

vondele commented Dec 9, 2018

I've updated the notebook; at the end there is the table Marco asked for. (The links posted previously should still work, but note that they are cached for a few minutes.) Feel free to point out possible errors in the notebook.

Now, the table for a couple of bounds is below.
(rows three and four are slightly redefined, and all results are based on 1 STC try and Marco's patch distribution)

| Limits | [0, 5] STC + [0, 5] LTC | [1, 4] + [0, 3] | [0.5, 4] + [0, 3] |
| --- | --- | --- | --- |
| 0 ELO pass prob | 0.0025 | 0.00036 | 0.001 |
| 1 ELO pass prob | 0.0743 | 0.08051 | 0.145 |
| +1 ELO rejectance ratio | 0.705 | 0.613 | 0.532 |
| -0 ELO acceptance ratio | 0.0004 | 3.5e-05 | 0.0001 |
| Avg. STC cost | 23622 | 47822 | 41507 |
| Avg. STC + LTC Cost | 47505 | 80060 | 90712 |

@vdbergh
Contributor

vdbergh commented Dec 9, 2018

@vondele How did you compute the elo estimates? I am asking since there are basically three options: the MLE one (which is the naive one), the median unbiased estimator (which is the one currently used and which is too high/low in 50% of the cases) and the mean unbiased estimator (whose expectation value is equal to the true elo value). For a fixed length test these three estimators coincide but they are (quite) different for SPRT tests.

The median unbiased estimator (which you proposed!) is intuitively quite satisfying and easy to compute. However, for further statistical processing a mean unbiased estimator is in principle more desirable. I have a tool which computes the mean unbiased estimator for SPRT tests, but I have not released it yet since currently it only works robustly for small elo differences. However, it could already be useful in this case.

To have an idea of what I am talking about here are estimates for the expected values of the three estimators for an SPRT(0,5) with elo=-1 (they were obtained by simulation so they are not completely accurate).

mean=-0.99 (2.27) median=-1.52 (2.36) MLE=-2.04 (2.56)

(the values between parentheses are the RMS errors between the estimator and the true value; tautologically, the mean unbiased estimator has minimal RMS error).

Here is a sample for elo=2. Now the difference between the median/mean unbiased estimators is less dramatic.

mean=2.00 (2.00) median=2.15 (2.31) MLE=2.33 (2.66)

@mcostalba
Author

mcostalba commented Dec 9, 2018

@vondele thanks. Interesting results.

I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

For complex patches +1 ELO is really borderline, because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view, of course.

Also, going from STC[0, 5] to STC[1, 4] doubles the requested STC resources, so maybe it makes sense to keep STC[0, 5] and then go for a stricter window for the LTC.

@vondele could you please:

  • I don't understand what this means: 1 ELO pass prob = 0.0743. Does it mean 7%? Could you please print % instead of floating point values?

  • Compute also the following cases: STC[0, 5] + LTC[1, 5] and STC[0, 5] + LTC[2, 5]

  • Add a row in the table:
    +2 ELO pass prob

Indeed, the +2 ELO patches are the threshold between good stuff and almost useless but burdensome stuff. So I am curious what happens there.

@vondele
Member

vondele commented Dec 9, 2018

@vdbergh, the elo estimates are based on your script (i.e. sprt_elo.py simply run on all results), so the median unbiased estimator... IIUC. Note that the challenge is even more difficult than the estimation of the elo value of a given test. We would like to get the underlying distribution of all tests.

@vondele
Member

vondele commented Dec 9, 2018

@mcostalba the notebook is there to do all kinds of experimentation... there are unlimited possibilities. With the mybinder link it will open in your browser, no need to install anything.

Concerning 0.0743: yes, converted to percent it means 7.43%.

BTW, I think there are essentially no 2 Elo patches that are still being found.

@vdbergh
Contributor

vdbergh commented Dec 9, 2018

@vondele I agree it is an interesting challenge. Do you still have the raw data (or is there an easy way to get it)?

EDIT. With whatever estimator we use we can calculate the transformation

true elo distribution -> (expected) observed elo distribution

So the challenge is to invert this. I wrote earlier that it is a convolution but this is not true. But in any case it seems to be just linear algebra (since we can discretize the problem) which numpy can do very efficiently.

Less naive and probably more robust would be to compute the MLE for the elo distribution directly. I have not thought about how hard this would be.
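
A rough numpy sketch of the discretized inversion idea (the estimator noise model, here a gaussian with an assumed RMS error of 2 Elo, and the toy "true" distribution are illustrative assumptions only):

```python
import numpy as np

bins = np.arange(-4.0, 4.01, 0.5)                 # shared grid of true / observed ELO bins
sd = 2.0                                          # assumed RMS error of the ELO estimator

# A[i, j] = P(observed estimate lands in bin i | true ELO is bins[j])
A = np.exp(-0.5 * ((bins[:, None] - bins[None, :]) / sd) ** 2)
A /= A.sum(axis=0)                                # make each column a probability vector

true_dist = np.exp(-np.abs(bins + 1.0))           # toy skewed "true" distribution
true_dist /= true_dist.sum()
observed = A @ true_dist                          # expected observed histogram

# invert: least squares is preferable to a direct inverse when A is ill-conditioned
recovered, *_ = np.linalg.lstsq(A, observed, rcond=None)
```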

@Alayan-stk-2

Alayan-stk-2 commented Dec 9, 2018

> @vondele thanks. Interesting results.
>
> I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

You have to divide the elo gains by the resource costs.

Also, as was pointed out in the other thread, the elo-losers have practically no importance when it comes to the elo impact of passing tests, but they do matter when it comes to resource usage.

The proposed new bounds would reject elo-losing patches as fast as or faster than now, so the relative increase in total resource cost is smaller than in vondele's data in the table.

You can get about +50% elo passing for +80% resource usage (and good patches are the bottleneck much more than resources at fishtest currently, especially as resources sitting unused push people to try random things).

> For complex patches +1 ELO is really borderline, because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view, of course.

This is certainly valid reasoning, but what defines a "complex patch"? Is a one-liner a complex patch?

I think it is a better approach to ask for more tests when a complex change gets submitted and passes testing (which is not often) than to make many simple good patches fail.

7.43% of +1 elo patches passing is quite low, and it also means that a majority of +1.5 and +2 elo patches fail. In practice we will often get multiple minor variations, which increase the likelihood of passing but also the resource usage of testing (which isn't more efficient in the end).

And as SF improves, elo gains get harder. Regression tests are showing that the average elo-gaining patch is much closer to +1 elo than to +3 elo (count the number of elo gainers during SF10's development cycle, subtract some for simplifications, and compare to the total elo gain).

This indicates that the elo-gaining patches that get merged are very often +1/+2 elo gainers which get lucky during STC/LTC rather than outstanding +3/+4 elo gainers. Having a lottery decide which good patches get merged and which get rejected isn't great.

> Also, going from STC[0, 5] to STC[1, 4] doubles the requested STC resources, so maybe it makes sense to keep STC[0, 5] and then go for a stricter window for the LTC.

See my points above. The elo/resource ratio goes down with the newer bounds proposal, but not as much as a division by 2, far from it.

> Compute also the following cases: STC[0, 5] + LTC[1, 5] and STC[0, 5] + LTC[2, 5]

This would reject even more good patches while not improving resource usage (closer bounds usually mean more games, but higher bounds mean quicker rejection for patches below the lower bound, so I'm not sure which effect dominates). It's 100% guaranteed that those bounds would be awful.

@vondele
Member

vondele commented Dec 9, 2018

@vdbergh
Contributor

vdbergh commented Dec 9, 2018

@vondele Thx. This is in principle already useful. But you did not keep the original (W,D,L) data?

EDIT. I guess I can collect this data myself but some indication how you did it might be useful (then I do not have to think).

@vondele
Member

vondele commented Dec 9, 2018

here we go https://github.com/vondele/Stockfish/blob/tools/tools/stc_tests
that was obtained from downloading all results and grepping for the useful part

@vdbergh
Contributor

vdbergh commented Dec 9, 2018

@vondele Great! Thx.

EDIT: This is fascinating data. I did some preliminary processing. It appears that the average submitted patch is -2.5 elo (this should be a statistically correct result but it is probably heavily influenced by outliers). Next weekend or so I will try to get an estimate for the true underlying elo distribution (I am not yet sure if it is feasible).

@xoto10
Contributor

xoto10 commented Dec 9, 2018

> BTW, I think there are essentially no 2 Elo patches that are still being found.

+1

I don't think there are many "complex" patches, either, although different people might have different views on what counts as "complex".

@Vizvezdenec
Contributor

Well, we just had 2 elo-gaining patches resulting in +4.4 elo in the RT, so at least one of them is 2 elo :)

@vondele
Member

vondele commented Dec 9, 2018

@Vizvezdenec ... hopefully, but don't forget the error bars on the RT. It is 4.4 +- 2

@Vizvezdenec
Contributor

I've just counted, actually.
We had 80 elo gainers in sf9->sf10 that resulted in 55 elo on 8 moves.pgn and probably slightly more than 60 on 2 moves.pgn.
So, IMHO, the borderline for a garbage patch should be 1 elo or maybe even lower, since on average our recent elo gainers are passing with this elo or less.

@mcostalba
Author

| Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0, 3.5] | [1, 4] + [0, 3] | [0.5, 4] + [0, 3] |
| --- | --- | --- | --- | --- |
| 0 ELO pass prob | 0.0024 | 0.0012 | 0.00036 | 0.001 |
| 1 ELO pass prob | 0.0744 | 0.1229 | 0.08051 | 0.145 |
| +2 ELO rejectance ratio | 0.3101 | 0.2034 | 0.1423 | 0.1224 |
| -0 ELO acceptance ratio | 0.0004 | 0.0001 | 3.5e-05 | 0.0001 |
| Avg. STC cost | 23622 | 32401 | 47822 | 41507 |
| Avg. STC + LTC Cost | 47505 | 67451 | 80060 | 90712 |

I would say that STC[0.5, 4.5] + LTC[0, 3.5] is a good compromise between resources used and performance. It increases average test time by 42% (which is not small) but detects 4 good patches out of 5 instead of the current 2 out of 3 (reject ratio 0.2034 vs 0.3101).

@Alayan-stk-2

The tool is still using a gaussian curve to simulate elo patch distribution

I went ahead and used the array method for the patch elo distribution:

def patch_elo_dist(x):
    # sampled density values from -5.25 to 5.25 elo, every 0.5 elo (22 points)
    elo_array = [140, 150, 200, 230, 260, 300, 330, 410, 530, 570, 600, 680,
                 610, 490, 80, 45, 35, 30, 25, 20, 15, 10]
    z = (x + 5.25) * 2              # map x in [-5.25, 5.25] onto the index range [0, 21]
    for i in range(0, 21):          # 21 intervals between the 22 sample points
        if i < z <= i + 1:
            # linear interpolation between the two surrounding sample values
            return ((i + 1 - z) * elo_array[i] + (z - i) * elo_array[i + 1]) / 5690

    return 0

The granularity is quite problematic around 2 elo, but the result is still interesting.

| Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0, 3.5] | [1, 4] + [0, 3] | [0.5, 4] + [0, 3] |
| --- | --- | --- | --- | --- |
| 0 ELO pass prob | 0.0025 | 0.00123 | 0.00036 | 0.0011 |
| 1 ELO pass prob | 0.0744 | 0.1002 | 0.0805 | 0.1451 |
| +1 ELO rejectance ratio | 0.6647 | 0.5972 | 0.5950 | 0.5207 |
| +1.5 ELO rejectance ratio | 0.3705 | 0.2866 | 0.2563 | 0.2116 |
| +2 ELO rejectance ratio | 0.1589 | 0.0968 | 0.0620 | 0.0553 |
| total ELO gain ratio | 1.0 | 1.1354 | 1.1436 | 1.2841 |
| -0 ELO acceptance ratio | 0.00022 | 8.1e-05 | 2.0e-05 | 0.0001 |
| Avg. STC cost | 10107 | 13799 | 20293 | 17595 |
| Avg. STC + LTC Cost | 20332 | 28959 | 34261 | 38942 |

Note that while the elo-loser acceptance ratio goes down, the number of elo-losers goes up with this distribution.

It appears that while most of the potential elo gain in submitted patches is in +1 elo patches, the bulk of the "easy to grab" elo gain is still in the +2 elo patches.

@noobpwnftw
Contributor

noobpwnftw commented Jan 10, 2019

Why is having a full queue of crap better than letting it go idle?
One of the benefits of having spare resources is that you can test something you wouldn't try when there is something more useful to test. If people are spending more time to prove crap is crap more confidently, then this is going in the wrong direction.

@NKONSTANTAKIS

For a multi-step strategy (pre-filter + filter) it would be in our best interest to gradually increase confidence and narrow the selection at the lower end. For example, a nice 3-step plan would be:

  1. Prefilter, (-3, 3) at 5" + 0.05" // Cheaply cutting off about half of tested patches.
  2. Filter, (-2, 3.5) at 20" + 0.2" // Gradual rise of requirements + quality.
  3. Selection, (-1, 4) at 80" + 0.8" // High quality selection.

This scheme is also very healthy for scaling. Starting with a fast precondition of merely being more likely positive at low TC and gradually raising the requirement, it gives good scalers the opportunity to show their worth. The filters do their job of saving resources but with minimal suppression. In the end the selection is of very high 80" quality with middle-ground confidence.

In general this scheme seems to me both cheaper and better than the current one. I would welcome some validations with the tools. @vondele @Alayan-stk-2

@NKONSTANTAKIS

NKONSTANTAKIS commented Jan 11, 2019

For the current 2-phase scheme, I regard the new LTC bounds as very appropriate, but the STC ones as unnecessarily expensive: too thick a filter and hostile to scaling.

We are currently spending so much in order to get a very accurate estimate of very low quality STC information. Why?

The STC would be better in the opposite direction, faster and less strict, to save resources in general and to block as little potentially good stuff as possible.

@NKONSTANTAKIS

NKONSTANTAKIS commented Jan 11, 2019

But, to give credit where it's due, for neutral-scaling stuff the confidence derived from the STCs is of equal worth and obtained cheaply. So the sheer amount of increased confidence gives solid, fluke-free progress in this patch category.

The target group that benefits is patches with positive elo and neutral/mildly negative scaling. The (0.5, 4.5) STC is more demanding than the (0, 3.5) LTC, but the LTC exists, safeguarding us from negative scalers.

It feels like the new STC is not a filter at all, but the basis of our selection process, being harder to pass than the LTC. In this view the LTC acts as a precautionary measure, maybe an overly expensive one. So the change has been from STC filter + LTC selection
to STC selection + LTC verification.

I am not aware of the distribution regarding neutral/negative/positive scaling, but we have observed far too many STC greens + LTC reds with the old bounds. Naturally, part of this inconsistency is due to statistical flukes and part is due to bad scaling. Our methodology was and is hostile towards good scalers. I believe that this category, even if uncommon, is the most important one. But how to spot it? Speculative LTCs have been a mild workaround for STC suppression, the equivalent of lowering the STC requirements. Now the combination of not using speculative LTCs and high STC confidence will completely block the slow-starting, high-scaling category. This is very concerning. I can think of 2 solutions:

1. To lower the STC elo requirements. This will allow more patches to be tested at LTC, but it will also lower the elo value of STC-derived confidence.

2. To use a VSTC scaling check after mediocre STC results. Instead of costly speculative LTCs, a VSTC check can cheaply spot positive scaling. The more negative the result, the better for scaling. The new STC offers a very solid confidence base for comparison with the VSTC. Here not only would negative results be promising, but positive results would also be alarming. Maybe it's a good idea to check how often STC greens which fail LTC perform too well at VSTC. If this is the case (and I think it's illogical for the scaling curve to behave chaotically), we can save ourselves too-expensive LTCs by locating scaling through cheap 2" VSTCs.

@vondele
Member

vondele commented Jan 11, 2019

@noobpwnftw I agree that there is no big problem having the queue go idle, and there is no need to fill it with crap. I think it is good (and fun!) to take some time to think about a patch.

However, the increase in time is relatively small for patches that are crap. Patches < 0 fail quickly also in the new scheme. The patches that now run (much) longer are those in the middle of the interval (i.e. near 2.5 BayesElo at STC). The hope is that the more careful testing will increase the certainty with which patches > 2.5 Elo pass.

What I think we need to do now is to look into how fishtest assigns its priorities. The current situation in which patches are in a pending state for days just because they need >80k games (which is now more common, and actually means they are likely good) is annoying. Fishtest should be more like a FIFO system.... possibly with a priority related to the number of patches by the same author that are running simultaneously.

@vondele
Member

vondele commented Jan 11, 2019

Let me post a link to fishcooking where this discussion on priorities is taking place:
https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/BxWmrJ9yK64

From this, it looks like just increasing the throughput would weight the starting time more, and avoid the issues mentioned above. Still needs a closer look.

@mcostalba
Author

@vondele I agree on limiting the resubmission of (almost) the same patch again and again.

I see many people are going wild with repeated attempts at the same idea. Limiting this misbehavior should be the first target, both to decrease the average waiting time and to keep people from abusing the donated, free CPU resources.

Sorry to be rude, I intend to be provocative here, but I don't care about complex schemes to handle the quantity of tests; I care about keeping people from creating useless tests in the first place. The latter point is more valuable to me because it goes beyond pure technical merit and addresses the fair and conscious usage of freely donated resources. Many people seem not to care about this, and that is bad.

@NKONSTANTAKIS

There is a rationale for testing the same idea many times: if the idea is promising but just not good enough to pass, the quest is to find the perfect scheme that will make it.

Inspiration is limited, and if we abandon ideas after only a couple of tests, we will never know whether there was elo gain in some form of the idea or not. IMO it's much better to be thorough and sure.

Everybody shares the same goal, to make SF better; a lot of people are working 24/7 for this, and SF is indeed progressing in one way or another. Not only are the resources free, but people are also working for free; let's not forget that.

Having said that, there are of course cases where resources are put to bad use. Those, IMO, are not used maliciously but unconsciously, i.e. the intention is good but there is a lack of judgment, so it's good for everyone that such cases are addressed specifically and to the point, exactly when they take place. A generic call for all to behave is usually interpreted subjectively; everyone may think that he is using resources well :)

@ghost

ghost commented Jan 11, 2019

The problem isn't that people run lots of tests of the same idea; the real problem is running them all at the same time.
There could be a mechanism to reduce the priority of a single user's tests depending on their number, so resources are shared equally.
I.e. in the current scheme:
user A runs 10 tests
user B runs 1 test
90% of resources will go to A.
Instead of allocating resources by test, it should allocate resources by user, so that A and B both get 50% of the resources, and in the case of N users they each get 1/N of the resources.

@noobpwnftw
Contributor

noobpwnftw commented Jan 11, 2019

To not discourage patch writing, I think we can emphasize proper priority when submitting tests: if they are just wild ideas, put them on -1; if you are trying a few variants, run one at a time. For long, promising tests, we can manually put them on high priority and finish them first.

Since we have more confident STC tests now, I would recommend all LTC tests get high priority by default.

@xoto10
Contributor

xoto10 commented Jan 11, 2019 via email

@Alayan-stk-2

-1 priority tests may take a lot of time to run, because as long as there is at least one 0 priority test in the queue, they won't be started.

A potential solution is indeed to run variants one after another to get insight from the first attempts. Otherwise, setting low throughput should work too.

I agree LTC tests should be run with a higher priority to avoid them being stalled for days (like for the voting2 patch).

I will rerun some more 3-4 stage simulations with more parameter tweak attempts. The formula for the elo distribution of submitted patches is not perfect (and doesn't attempt to simulate scaling, because we don't have data on this), but it is good enough for the results to be relevant.

@Alayan-stk-2
Copy link

Here are a few simulation results.

The table contains:

  • The old and current bounds for comparison
  • [-0.5, 4.5] + [0, 3.5] as requested by James_TCEC. As can be seen, this improves passing rates noticeably, but increases the regression/useless patch probability and makes LTC costs explode
  • My first 3-staged and 4-staged parameters (VSTC: 4+0.04; STC: 10+0.10; MTC (4-staged only): 25+0.25; LTC: 60+0.6)
  • Two new 3-staged parameters tests where I sacrificed the non-regression probability in favor of reducing resource cost and increasing expected elo gain.

No new 4-staged parameter test yet, as the table wouldn't fit here.

| Limits | [0,5] + [0,5] | [-0.5, 4.5] + [0, 3.5] (new) | [0.5, 4.5] + [0, 3.5] | [-1.0, 3.5] + [-0.5, 3.5] + [0, 3.5] | [-1.3, 3.0] + [-0.9, 3.25] + [-0.5, 3.5] (new) | [-1.5, 3.2] + [-1.0, 3.5] + [-0.7, 3.8] (new) | [-1.5, 3.0] + [-1.1, 3.2] + [-0.9, 3.5] + [-0.9, 3.7] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 ELO pass prob | 0.0025 | 0.00433 | 0.00123 | 0.00081 | 0.00374 | 0.00486 | 0.00112 |
| 0.5 ELO pass prob | ? | 0.03773 | ? | 0.01593 | 0.04674 | 0.04663 | ? |
| 1 ELO pass prob | 0.0744 | 0.1928 | 0.1002 | 0.1443 | 0.2551 | 0.2258 | 0.1945 |
| 1.5 ELO pass prob | 0.2460 | 0.4831 | 0.3422 | 0.4667 | 0.5928 | 0.5323 | 0.5291 |
| 2 ELO pass prob | 0.5135 | 0.7320 | 0.6438 | 0.7538 | 0.8284 | 0.7790 | 0.7919 |
| total ELO gain ratio | 1.0 | 2.0706 | 1.3192 | 1.7646 | 2.5934 | 2.3489 | 2.1326 |
| -0 ELO acceptance ratio | 2.5e-04 | 3.56e-03 | 9.1e-05 | 4.7e-05 | 2.4e-04 | 3.5e-04 | 5.8e-05 |
| Avg. STC cost | 18431 | 20586 | 24456 | VSTC: 10887; STC: 5803; LTC: 16727 | VSTC: 12649; STC: 7537; LTC: 21119 | VSTC: 11001; STC: 6707; LTC: 16449 | VSTC: 11991; STC: 7820; MTC: 8096; LTC: 10114 |
| Avg. total Cost (in STC games) | 27931 | 50640 | 38039 | 33418 | 41306 | 34157 | 38021 |

I have not done computations for the scaling behavior of the newly proposed bounds. I expect the "favors better scalers but has a bias towards shorter TC" to stay true. For 3-staged testing, 15+0.15 STC would only moderately increase the overall cost (it's the cheapest phase) and would improve scaling, but maybe not enough to justify the cost.

For scaling confirmation, rather than running an SPRT at a very long TC (which can be very expensive), it could be advantageous to run a fixed number of games and only reject patches falling below a certain threshold (see the sketch below). This step would be a safety net and would only reject a small number of patches/simplifications.
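
A minimal sketch of such a fixed-length safety net (the threshold and the interface are illustrative placeholders, not proposed values):

```python
import math

def passes_scaling_gate(wins, draws, losses, threshold_elo=-1.0):
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    elo = -400.0 * math.log10(1.0 / score - 1.0)   # logistic elo estimate of the match
    return elo >= threshold_elo                    # reject only clearly bad scalers
```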

@NKONSTANTAKIS

NKONSTANTAKIS commented Jan 25, 2019

I think we have all experienced a painful resource bottleneck in the workflow: everything takes too long, while creative human availability is in abundance.
It seems that @snicolet and @Rocky640 were right to be concerned about this.

On the other hand, the elo progress looks decent, at the same level as before.

I think that the new LTC is nice, but the STC should resolve faster so that developers are not limited.
Such high STC confidence is IMO not worth it, because the higher quality LTC test acts as a safeguard.
STC flukes can be a window for good scalers. An occasional double STC+LTC fluke will not be a disaster: 0.5 elo gainers are not generally wanted, but they are not too bad a deal either; they will eventually be simplified away. Also, with @protonspring so active on simplifications, adding code for lower elo is less dangerous, maybe even beneficial, whereas with a very high ratio of passed simplifications to elo gainers slow regression is a bigger concern.

With the old bounds the framework was often idle and poor @Vizvezdenec had to throw in whatever he could think of, but the new LTC already uses many more resources than the old one, and rightfully with lower requirements, because with the old bounds the ratio of LTC greens to STC greens was incredibly low.

Contemplating all this, and considering that enough time has passed to evaluate the new scheme, the proposals are:

1. Roll back the STC to (0, 5) while keeping the LTC at (0, 3.5), and/or

2. Research a faster-pacing STC, ideally with less suppression of slow-starting good scalers.

@Alayan-stk-2

@vondele Could you add to your tool something to simulate patches not having the same elo at STC and LTC?

Because we only submit to LTC patches that have passed STC, we have very limited data on the actual distribution, but even with this caveat a parametrized formula allowing us to test a few hypotheses would help in designing bounds better suited to good scalers.

Also, maybe the tool could use an update considering @vdbergh's findings on correlation in game pairs, which trinomial frequencies miss? I'd expect, e.g., that the STC confidence level being in reality higher than predicted by the tool greatly reduces the rate of "flukes" for patches between 1 and 1.5 elo, and so the expected elo gain. This invalidates pretty much all the previous tables...

Ideally too, native 3-staged testing support (with one stage easy to make inactive via a guaranteed pass, to support 2-staged) so I don't have to paste my code tweaks manually. 🙂

@vondele
Copy link
Member

vondele commented Jan 26, 2019

@Alayan-stk-2 right now I want to focus on other stuff (cluster), but feel free to fork that branch. This should be relatively easy to add.

However, TC scaling of patches is not quite clear.. there is some data here:
https://github.com/glinscott/fishtest/wiki/UsefulData#elo-change-with-respect-to-tc
showing it can go either way.

@vdbergh
Copy link
Contributor

vdbergh commented Jan 26, 2019

And unfortunately these tests do not necessarily show that the average scaling ratio may be assumed to be 1:1, because of selection bias (well-scaling patches are more likely to make it into master). So absolutely nothing concrete is known about scaling. It would require a dedicated effort consuming a large amount of resources to get reliable information.

@NKONSTANTAKIS

@vdbergh Actually, what is helped into master is neutral scaling. Both good scalers and bad scalers are blocked, the former at STC and the latter at LTC. For two patches with equal LTC performance, the better scaler would most likely be the one which performs worse at STC, but from a statistical confidence perspective we prefer two positive results.

I find it very annoying that two beneficial concepts, scaling promotion and cheap confidence, pull in opposite directions.

@vdbergh
Contributor

vdbergh commented Jan 30, 2019

@NKONSTANTAKIS You make a good point but you are talking about something different than I was. I assume you are trying to compare the behaviour with respect to scaling of the new bounds versus the old bounds. I didn't say anything about this.

I was pointing out that within a given set of bounds a better scaling patch is more likely to pass STC+LTC. This seems a tautology to me. The pool of patches submitted to LTC is independent of the scaling (obviously) but then the LTC filter has a preference for the better scaling patches (again obviously).

@Alayan-stk-2

> I was pointing out that within a given set of bounds a better scaling patch is more likely to pass STC+LTC. This seems a tautology to me. The pool of patches submitted to LTC is independent of the scaling (obviously) but then the LTC filter has a preference for the better scaling patches (again obviously).

I think you are mistaken.

Yes, for patches having equal STC performance, better scalers will be selected by LTC. This is completely obvious.

But if you look another step back, to the patches which are submitted at STC, you will find that patch scaling matters for passing STC. Basically, this is wrong:

> The pool of patches submitted to LTC is independent of the scaling (obviously)

It is not.

A patch giving +1 elo at STC and +2 elo at LTC will fail much more often at the STC step than a patch giving +2 elo at STC and +1 elo at LTC. Hence compared to the patches submitted for STC (all fishtest patches), the pool of patches submitted at LTC will be biased towards better STC performance.

@vdbergh
Contributor

vdbergh commented Jan 30, 2019

@Alayan-stk-2 I understand your point. It depends on the model. If you assume that STC and LTC elo are negatively correlated (e.g. STC elo + LTC elo = constant, as in your example) then you are right. But this is an extremely pessimistic assumption. I find it more reasonable to assume that LTC elo = r*STC elo + (some random variable with expectation value 0), where r is some constant near 1.

So for me it is more realistic to assume that the elo distribution is something like (1,0.5),(1,1),(1,1.5),(2,1.5),(2,2),(2,2.5). Among those the ones with higher scaling are more likely to pass.

But I am ready to be proven wrong :)

@vdbergh
Contributor

vdbergh commented Jan 30, 2019

If SE, LE are the STC and LTC elo, then the scaling model I would like to propose is

LE = r*SE + X

where X ~ N(0, sigma^2) and r is a constant near 1.

Assuming additivity of elo, then given sigma^2 and the elo prior one may get an estimate for r by doing STC+LTC tests of sf10 vs sf9 (as I have already been doing). However, the difficulty lies in determining sigma^2. I don't see an approach which would not cost a large amount of resources.

In any case the total number of parameters in the "fishtest model" would be 5.

  • mu of the elo prior
  • sigma^2 of the elo prior
  • the nTries parameter (to compensate for the fact that many patches are trivial variations of each
    other).
  • the average scaling ratio r.
  • sigma^2(scaling), to account for the variation in scaling behaviour.

Sadly 5 parameters is too many to fit with the available data I think.

https://web.ma.utexas.edu/users/mks/statmistakes/ovefitting.html
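
For illustration, a minimal sketch of drawing (SE, LE) pairs under the model above; r, sigma and the STC prior parameters are illustrative placeholders, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)
r, sigma = 1.0, 0.5                                  # assumed scaling ratio and scaling noise
se = rng.normal(-0.21, 1.112, size=10_000)           # SE drawn from a gaussian elo prior
le = r * se + rng.normal(0.0, sigma, size=se.size)   # LE = r*SE + X, X ~ N(0, sigma^2)
```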

@Alayan-stk-2

Of course STC elo + LTC elo is not a constant. I've never assumed patches submitted at fishtest to follow this, and my point doesn't depend on this.

This was to give a quick example of how two patches with different scaling behavior get (with old bounds) equal STC+LTC passing behaviour with higher STC passing (and so higher proportion of LTC runs) for the bad scalers. With new (current) bounds, the STC/LTC 2/1 is actually more likely to pass STC+LTC than the 1/2.

Your fundamental assumption here seems to be:

"For any given STC level of performance, the average scaling behavior at LTC of patches submitted at fishtest is ~ neutral."

Let's suppose, for the purposes of the demonstration, that SF is at the ceiling of STC efficiency (the best elo achievable given this amount of computation). We know that this code will be different from the optimal code to reach the LTC ceiling. A functional code change improving LTC behavior will then hurt STC behavior, and the neutral-average-scaling assumption can't hold. Tons of things losing STC elo will lose LTC elo, but a handful of things losing STC elo will gain LTC elo.

Now, of course, SF is not at that ceiling, and is likely still far from it. But it is also very far from a random mover, so the tension between the optima exists. Tuning also commonly shows that 20+0.2 results can be noticeably different from 60+0.6. Protonspring's recent series of TM patch tries, with several passing STC before failing LTC, and one LTC tune result passing LTC clearly but failing STC, is a good example.

Things which help SF spot something a ply earlier, but not many plies earlier, will do better at STC, where each ply of search is critical, while things which help SF spot things many plies in advance (often allowing it to spot them at all when shuffling can push the truth beyond the horizon) do better with a longer TC.

@Alayan-stk-2

I forked @vondele's SPRT simulation tool as suggested. So far I have only added a few more pass probabilities in the final results, to avoid adding them manually, and basic support for up to 4-staged testing (VSTC and MTC can be disabled by switching a bool, so 2-staged and 3-staged are easy to test too; I actually set the default to the current fishtest bounds to confirm I get the same results as with vondele's tool in that case).

The binder is available here: https://mybinder.org/v2/gh/Alayan-stk-2/Stockfish/SPRT_tool?filepath=tools%2FFishtest_SPRT_opimization.ipynb

I will update it later to get more stats related to the different stages at the end and to have adapted graphs.

@Alayan-stk-2

Recent patches merged into Stockfish confirm the trend: progress is driven by small gainers. There were 5 elo gainers between the 22nd January regression test and the one of 3rd February, with an overall gain of ~4 elo (big error bars). A similar picture emerges over a longer time frame.

The new LTC bounds have definitely helped to validate and merge some of these recent gainers, but the STC requirements remain very high: at least 1.67 elo with a high degree of confidence.

A large majority of STC tests do not even manage to break even and get a yellow result. Hence, laxer bounds at 5+0.05 TC would certainly not reject more good patches (especially good scalers) than the current STC bounds, and would be much cheaper resource-wise as a first filter. Then better STC and LTC testing could be done.
