
[merged] Pawn and Piece Values Tuned #1804

Closed

Conversation

SFisGOD
Contributor

@SFisGOD SFisGOD commented Nov 7, 2018

Pawn and Piece Values Tuned at LTC

Failed STC
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 27487 W: 5846 L: 5903 D: 15738
http://tests.stockfishchess.org/tests/view/5be1d3190ebc595e0ae2e5b8

Passed 1st LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 38503 W: 6270 L: 5999 D: 26234
http://tests.stockfishchess.org/tests/view/5be1f5ef0ebc595e0ae2e750

Passed 2nd LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 34016 W: 5584 L: 5326 D: 23106
http://tests.stockfishchess.org/tests/view/5be2a1970ebc595e0ae2f1b4

Bench: 3460731

@SFisGOD
Contributor Author

SFisGOD commented Nov 7, 2018

Related discussion #1511
My 2nd LTC test is ready in case @snicolet wants another LTC.

@FauziAkram
Contributor

For both of the patches we have that failed STC but passed LTC, I would suggest two runs with (-3,1) bounds, one at STC and one at VLTC, just to make sure that we are not losing any elo there.
If they pass, then merge them, because they have already proved to be very good in the LTC test; if they fail either of the two (-3,1) runs, then reject the patch.

@NKONSTANTAKIS

NKONSTANTAKIS commented Nov 7, 2018

@FauziAkram STC carries no validity whatsoever, and a (-3,1) VLTC is a total waste of resources, as it will surely pass.

@mcostalba

First the easy part: @SFisGOD, please in the future also add the STC result even if negative. We (all the people) would like to have the full information; there is no need to "hide" it (we are grown up enough to handle a failed STC and a successful LTC).

Now the difficult part.

To me STC is a way to quickly filter out bad patches so that we go through the resource-consuming LTC only with good candidates. So to me an LTC success is a strong result, technically speaking.

On the other hand, if we apply this PR blindly, nobody can prevent people from submitting failed STC patches to LTC, leading to a nightmare.

So we have to deal with 2 topics here: a technical one (is a good LTC enough?) and a guideline-breaking one (is it OK to roll your dice again, going for an LTC after you failed an STC?)

@SFisGOD
Contributor Author

SFisGOD commented Nov 7, 2018

@mcostalba Thanks for the feedback. It was not my intention at all to hide that this patch failed STC; I thought it was enough that vondele had pointed it out in his comment. However, I understand that people might want the full information, so I edited the PR to include the missing failed STC result.

On the other hand, if we apply this PR blindly, nobody can prevent people from submitting failed STC patches to LTC, leading to a nightmare.

LTC-tuning patches should be an exception, in my opinion: they should at least be tried at LTC even if they fail STC.

@Vizvezdenec
Contributor

Vizvezdenec commented Nov 7, 2018

From my point of view, 1 passed LTC is not enough. Usually we try speculative LTCs for patches that "almost" passed STC, which have an LOS around 94% and had LLR > 2, basically guaranteeing non-regression with 90+% probability, especially for stuff that can scale well (based mostly on intuition :) ).
But committing a patch that had <50% LOS at STC on the strength of one passed LTC is, imho, a big stretch - the 5% flukes are real. My last committed patch was the 15th LTC I ran after passed STCs, so I basically had ~15 STCs in a row that passed yet were not elo gainers! So a positive LTC can for sure be a fluke in the case of this patch.
Last time we committed such a patch, we ran a 2nd LTC to verify it was not a fluke.
But now I suggest a modification: run something like a 120+1.2 VLTC to verify not only that it's a gain, but that it will still be a gain at a higher TC, and not some weird artifact that is good only at average depth 21 and not at higher depths.
I think we can for sure afford 120+1.2 for 2 tests - after all, fishtest is not extremely busy now.
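To put a number on that fluke risk (a hypothetical back-of-envelope, assuming all 15 runs had been of exactly 0 Elo patches and were independent):

# If all 15 LTC runs had been of 0 Elo patches, each passing 5% of the time:
print(1 - 0.95 ** 15)  # ~0.54: better-than-even odds of at least one false pass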

@Alayan-stk-2

So we have to deal with 2 topics here: a technical one (is a good LTC enough?) and a guideline-breaking one (is it OK to roll your dice again, going for an LTC after you failed an STC?)

When the patch is a tuning done with tons of LTC games, the waste of resources would be to not do LTC testing because of a failed STC.

LTC tunings should always be tested at LTC no matter the STC result.

To me STC is a way to quickly filter out bad patches so that we go through the resource-consuming LTC only with good candidates. So to me an LTC success is a strong result, technically speaking.

If STC is only a first filter, it doesn't need to be excessively stringent. The statistical requirements there could be relaxed.

On the other hand, the 95% confidence at LTC seems not to generate enough confidence in some cases, so there should be guidelines for additional testing when needed.

Imho, a 120+1.2 test in cases such as this one would bring more information than a second 60+0.6 run, so the additional resource use serves a purpose. Patches failing STC but clearly succeeding at LTC are not very common; this won't overwhelm the testing framework.

@ElbertoOne

I'm also of the opinion that we should have 2 successful tests to be more sure that the patch isn't a fluke. Those tests should be on 2 different time controls, the 60+0.6 run being one of them.

@MichaelB7
Contributor

The current STC of 10 + 0.1 has outlived its usefulness. I favor the two-pass method: perhaps 30 + 0.3 and 60 + 0.6, but my preference would be 60 + 0.6 and 90 + 0.9. A double pass at those two time controls would lessen the possibility of a "rogue" patch getting through.

This change should not be considered lightly - we would need to re-run some tests with both failed and passed patches to see if the results would be different. I would especially be interested in patches approved with yellow or failed STC results. I suspect you will find that some patches that failed previously would now pass, and some that passed would now fail. Historically (from many years ago), it was always thought that patches impacting the search function needed to be tested at longer time controls, while patches impacting evaluation could be tested at shorter time controls.

Also, I believe when a patch passes very quickly - say in less than 20,000 games - we may need to keep it running until a fixed number of games has been run. My thought is that 30,000 or 40,000 games should be the minimum. My own anecdotal experience is that the ones that pass the quickest - say in only 10,000 games or so - are the most prone to be a lucky result.

@SFfan876

SFfan876 commented Nov 7, 2018

What do we have? One problem: people tune at whatever TC they want - 20 sec, 30 sec, 60 sec. But the rules (https://github.com/glinscott/fishtest/wiki/Creating-my-first-test) clearly state that SPSA tests should set the time control to 20+0.2. So if we want to prevent such problems in the future, everyone should follow the rules.

@xoto10
Contributor

xoto10 commented Nov 7, 2018

The current STC of 10 + 0.1 has outlived its usefulness.

I disagree; I don't think we have the hw to run all tests at 30+ or 60+.

... I favor the two-pass method. Perhaps 30 + 0.3 and 60 + 0.6, but my preference would be 60 + 0.6 and 90 + 0.9. ... Also, I believe when a patch passes very quickly - say in less than 20,000 games - we may need to keep it running until a fixed number of games has been run. My thought is that 30,000 or 40,000 games should be the minimum. My own anecdotal experience is that the ones that pass the quickest - say in only 10,000 games or so - are the most prone to be a lucky result.

These 2 ideas seem contradictory to me. I seriously doubt we have enough hw to run 30+0.3 for STC and 60+0.6 for LTC, certainly not 60 and 90, and clearly not to run a longer TC AND use more games per test.

But I agree that more games would give more reliable tests and that flukes often come from the shortest tests. Because more games would be good, I think we should aim for that, presumably by amending the SPRT parameters somehow, and keep the existing STC and LTC time controls.

If tuning runs are best run and tested at LTC, then maybe they are the one case where there is no point in using STC. I defer to others on what extra tests (if any) are required in this case.

... Historically (from many years ago), it was always thought that patches impacting the search function needed to be tested at longer time controls, while patches impacting evaluation could be tested at shorter time controls.

Interesting - if senior devs think this is still a useful idea, perhaps they could say so, and it could form part of the "grey area" guidance from them around any parts of the test rules that are not completely automatic.

@NKONSTANTAKIS

  1. In regards to scaling (and not to statistical flukes), a failed STC + passed LTC is a superior result to a passed STC + passed LTC. Or, if you prefer, a twice-failed STC + twice-passed LTC is superior to a twice-passed STC + twice-passed LTC. For the scaling curve, it is an asset to start low.

  2. The draw rate increases from STC to LTC to VLTC, making it harder to pass regardless of scaling. @Vizvezdenec's 15 green STCs in a row failing LTC is due not only to scaling but to the increased draw rate as well. For Vondele's randomizer patch (Randomize draw eval, #1770) the draw rates were: VVLTC: 75.4%, VLTC: 72.33%, LTC: 67.47%, STC: 56.83%; the VVLTC failed there. It's harder to prove elo superiority at a high draw rate.

  3. The current STC is a cheap, low-quality filter with extremely low credibility. Depth 13 and 0.1 sec per move is not chess, sorry. There is a reason the last 200+ attempts to tune at this TC all failed. There is a problem when passing or failing this filter is given too much importance.

  4. I understand the need for rules to prevent abuse, but we select every patch on the basis of 1 passed LTC, and we accept the 5% flukes as well. STC has nothing to do with this: it does NOT bring more or less confidence in regards to the LTC test. If we need more confidence, we can raise the 95%. Mixing STC and LTC confidence is plain wrong.

@mcostalba Mr Costalba, I remember like yesterday the fight against the people who advocated that 10" games are adequate for everything, a fight which facilitated the increase of the LTC from 40" to 60" and, in my opinion, upgraded the chess quality of SF. What do you suggest now? There have been many discussions on improving the TCs, but it seems to me that there are too many different good ideas.
https://groups.google.com/forum/#!topic/fishcooking/f1yHu2fhrbI
https://groups.google.com/forum/#!topic/fishcooking/oaJQEM4wck0
It seems very difficult to agree on something, while someone like you could facilitate a beneficial readjustment. Stephane Nicolet expressed a concern about the resources needed to raise the STC quality. Besides the fishcooking recommendations, how about if, instead of the STC filter, we used an LTC filter of much lower confidence and slightly easier bounds, so as to use the same resources we currently use for STC but without the lowered game quality? Something like (-1,3) at 90%.

We have much more computational power than in the past; I think it's a great opportunity to make good use of it and form a superior product.

@Vizvezdenec
Contributor

The 2nd statement is actually NOT true.
An increasing draw rate doesn't make a positive patch pass with lower probability: we use the BayesElo model, in which the elo needed to pass is LOWERED at longer TC.
A patch worth 1.2 elo at STC will pass with the same probability as a patch worth 1.0 elo at LTC; 3.7 at STC ~ 3.0 at LTC, and so on.
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test
So if a patch scales well (by "well" I mean that it gives ~the same elo at LTC as at STC), it will pass the longer TC more easily than the shorter TC because of the elo model fishtest uses.
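A minimal sketch of the mechanism, using the standard BayesElo model; the drawelo values below are illustrative assumptions only (higher drawelo means a higher draw rate, as at longer TC):

import math

def bayeselo_to_score(elo, drawelo):
    # Win/loss probabilities under the BayesElo model, then the expected score
    p_win  = 1.0 / (1.0 + 10 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10 ** (( elo + drawelo) / 400.0))
    return p_win + 0.5 * (1.0 - p_win - p_loss)

def logistic_elo(score):
    # Convert an expected score back to ordinary (logistic) Elo
    return -400.0 * math.log10(1.0 / score - 1.0)

# The same 5 BayesElo is worth less logistic Elo where draws are more frequent:
for label, drawelo in (("STC-like", 220), ("LTC-like", 300)):
    print(label, round(logistic_elo(bayeselo_to_score(5, drawelo)), 2))
# STC-like 3.41, LTC-like 2.57 -> a fixed BayesElo bound demands less real Elo at LTC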

@Rocky640

Rocky640 commented Nov 7, 2018

Let's look at the test. The piece values have been changed by ±2 - not significant.

However, PawnValueMg has decreased by 6. This may have a significant (and interesting) impact.

This is also used in search.cpp:

// Quiet best move: update move sorting heuristics
// (PawnValueMg sets the threshold for the extra stat bonus depth)
if (!pos.capture_or_promotion(bestMove))
    update_quiet_stats(pos, ss, bestMove, quietsSearched, quietCount,
                       stat_bonus(depth + (bestValue > beta + PawnValueMg ? ONE_PLY : DEPTH_ZERO)));

This is not adding any code, and the STC was almost yellow.

I don't see any problem committing this one.

@NKONSTANTAKIS

It is not clear whether the compensation we get through BayesElo outweighs the effect of the increased draw rate, because this also depends on many other factors. The BayesElo formula is linked not to the draw rate but to the TC. Depending on the condition of SF, it could be a lot easier or a lot harder to pass: SF with a base negative contempt, zero contempt, or SF5 would each have a different easiness-of-passing relation between STC and LTC. We cannot theoretically conclude that LTC is easier to pass, because in practice we observe the opposite effect for SFdev.

But if I am wrong on this and you are right that with neutral scaling it is easier to pass LTC than STC, then we are doing something terribly wrong, because only a very small % of our passed STCs go on to pass LTC. Under your theory we would pass >50%; how much do we actually pass - not even 20%? Does someone have this info? You simply cannot fail 15 in a row at 50%. Maybe the terrible thing we are doing is that, by demanding an STC pass first, we catch bad scalers and exclude good scalers, as a good scaling curve starts low and rises while a bad one starts high and falls.

No matter how someone sees it, a change of the STC is direly needed. Given the amount of our resources, maybe a more expensive methodology is sensible (as we did in the past with 40"->60", and as @IIvec proposes), but if we want to keep the same game flow we still have many options:

  1. Increase the quality of the filter by making it more LTC-like, compensating for the resources by lowering confidence and requirements. Keep the same LTC.
  2. Completely dump the filter, making it single-run testing, compensating for the resources by using an intermediate TC like 40" or 50", plus all the savings from not running double tests. Since it's a single test, confidence should be higher.
  3. Keep the low-quality low-TC filter, but lower its requirements so that it doesn't block as many good-scaling patches, and lower its confidence too to save resources. Increase both the TC and the confidence of the LTC so that we increase the long-term quality, but also lower its requirements slightly, because the success rate at the current bounds is too low to justify the increase in resources. Hence higher confidence, smaller elo steps.

@mcostalba

@SFisGOD we could introduce an additional rule:

LTC-tuning patches should be tested at LTC only

Indeed, the STC step for this kind of patch, which is green-lit for LTC in any case, is a waste of time/resources.

For all other kinds of patches, if they fail STC they should be dropped. Indeed, the general rule still applies:

A failed STC patch can be tested at LTC only in special cases after approval

To summarize:

  • A standard patch is tested at STC and, if successful, at LTC
  • If a patch fails at STC it should be dropped. Exceptions are allowed only for special cases and only after approval.
  • LTC-tuning patches should be tested at LTC only

@snicolet what do you think?

@SFisGOD SFisGOD changed the title from "Piece Values Tuned" to "Pawn and Piece Values Tuned" Nov 8, 2018
@SFisGOD
Contributor Author

SFisGOD commented Nov 8, 2018

@mcostalba PR updated with Passed 2nd LTC

@NKONSTANTAKIS

@mcostalba Is the proposition for LTC-tuning patches that they be tested at LTC and pass once, or twice?

@MichaelB7
Contributor

@SFisGOD good work, I was pretty confident that this would pass a 2nd LTC.

@vondele
Member

vondele commented Nov 8, 2018

@mcostalba : concerning your suggestions

  1. "If a patch fails at STC then it should be dropped. Exceptions are allowed only for special cases and only after approval." ---> does approval here refer to maintainer approval or fishtest-approver approval?

I personally think that a semi-automatic rule for these "speculative LTC" tests is more practical (less discussion, quicker, etc.), e.g. require the STC to have an LOS of >85% (or 90%) at the point of failing (85% is ~50000 games or more; a quick sketch of such an LOS computation follows below).

  2. LTC-only testing after LTC tuning is likely going to flood fishtest with a lot of tuning runs and corresponding tests. This is not bad, but we should have a rule for the throughput of these tunes and tests. I think a throughput of ~100 would be appropriate... this is good stuff for background filling.
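A quick sketch of how such a cutoff could be computed, using the usual normal-approximation LOS (draws ignored; an illustration, not necessarily fishtest's exact code), fed with the failed STC of this very PR:

import math

def los(wins, losses):
    # Likelihood of superiority via the normal approximation, ignoring draws
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(round(los(5846, 5903), 3))  # ~0.30 for this PR's STC: far below an 85% cutoff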

@Vizvezdenec
Contributor

I agree about speculative LTCs, especially if the STC at some point had an LLR of about 2.7 - it's obvious that such a patch is, with 90+% probability, not a regression, so if it passes LTC it's a non-regression with something like 99.6% confidence instead of the usual 99.75%.

@NKONSTANTAKIS

NKONSTANTAKIS commented Nov 8, 2018

@vondele This is a good idea, but if the preconditions for testing at LTC are met anyway, it should not waste any more games. I mean, why manually interpret the SPRT result afterwards in a wider manner that serves us better, when we can just use slightly easier STC bounds to get the same effect, fully automatically, terminating at the right point instead of running in full? This way we would be more accurate, more economical, and in less need of discussions. This can easily be achieved with the bounds - are we in the stone age or something?

@vondele
Member

vondele commented Nov 8, 2018

@NKONSTANTAKIS why not - a nice mathematical derivation and corresponding patches are welcome. My suggestion merely matches what has been the common approach of some of the longstanding contributors when doing speculative LTCs.

@vondele
Member

vondele commented Nov 8, 2018

Concerning testing LTC-tuned patches once or twice to remove flukes: it is important to realize that while testing twice is very effective at removing patches with negative elo, it is equally effective at removing patches with (small) positive elo. A patch that brings ~2.5 BayesElo (~1 Elo at LTC) has roughly a 50% chance of passing [0,5], but only a 25% chance of passing two consecutive LTC tests. (Obviously the same holds for testing STC and then LTC, assuming the Elo performance of the patch is the same at STC and LTC.)

To give a fair chance to small elo-gaining patches and still remove flukes, the better approach would be to increase the confidence levels. There are some estimates in fishcooking of the increase in cost this implies.
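The arithmetic, taking the numbers above at face value:

p_gain  = 0.50   # assumed single-run pass probability of a ~1 Elo (LTC) patch on [0,5]
p_fluke = 0.05   # single-run pass probability of a 0 Elo patch (the confidence level)

print(p_gain ** 2, p_fluke ** 2)  # 0.25 and 0.0025
# Two consecutive LTCs halve the genuine gainer's chances while crushing flukes;
# raising the confidence of a single run can reject flukes with less collateral damage.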

snicolet pushed a commit that referenced this pull request Nov 8, 2018
Failed STC
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 27487 W: 5846 L: 5903 D: 15738
http://tests.stockfishchess.org/tests/view/5be1d3190ebc595e0ae2e5b8

Passed 1st LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 38503 W: 6270 L: 5999 D: 26234
http://tests.stockfishchess.org/tests/view/5be1f5ef0ebc595e0ae2e750

Passed 2nd LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 34016 W: 5584 L: 5326 D: 23106
http://tests.stockfishchess.org/tests/view/5be2a1970ebc595e0ae2f1b4

This pull request led to an interesting discussion about testing
methodology for Stockfish:
#1804

Bench: 3647775
@NKONSTANTAKIS

@Vizvezdenec @vondele @IIvec Yes, let's do this, instead of having to ask and decide for every patch when to do a speculative LTC while waiting for tons of games for the STC to "fail" yellow, which would then be translated into "passing for LTC". We can just use the bounds and the LLR, which will do it for us. The goal is to make STC more flexible about what we test at LTC, so that we don't cut off as many promising patches and don't have to run as many speculative LTCs.

@mcostalba @snicolet What do you think about that?

@vondele I am not so worried about confidence, because our current bounds are high. So it does not matter much if we accept a regression 1 in 20 times; we definitely move forward. Confidence is very expensive. If we wanted to accept smaller elo gains, then we would indeed need higher confidence; the need for this would arise when we are unable to pass LTCs. My worry is completely locking ourselves out of patches which scale well but start negative/neutral/not-positive-enough at this STC.

@snicolet
Member

snicolet commented Nov 8, 2018

I have merged the change after the second LTC run (congrats and thanks!) in commit cd732c0, but I leave the PR open to let the discussion flow...

@FauziAkram
Contributor

By the way, we can continue this discussion for as long as you want, but shouldn't this pull request be merged in the meantime?

@Alayan-stk-2

It already is; snicolet wrote:

I have merged the change after the second LTC run (congrats and thanks!) in commit cd732c0, but I leave the PR open to let the discussion flow...

@FauziAkram
Contributor

Oh yes, sorry - I meant to write this message in the 1809 pull request, not here.

@snicolet snicolet changed the title from "Pawn and Piece Values Tuned" to "Pawn and Piece Values Tuned [merged]" Nov 20, 2018
@vondele
Member

vondele commented Nov 25, 2018

I'm still thinking about this thread, despite slow progress. Just for future reference, these are the current pass probabilities for a patch (STC, LTC, and LTC after STC, including repeated STC testing, direct LTC == LTC, and double LTC)

[image: pass probability vs. patch Elo for the STC, LTC, LTC-after-STC, and double-LTC schemes]

@Alayan-stk-2

Is something planned to evaluate a change? So far, it seems ignored because SF is still able to make progress, even if that progress is harder.

I also question [-3, 1] for functional simplifications. Obviously, a simplification is not expected to win elo, and an elo loss is worth it when it leads to greater maintainability and easier improvements in the future. However, a -1 elo simplification is about as likely to pass as a +2 elo patch. When the code that is removed takes 1 or 2 lines, a 1 elo loss is actually very significant. If it is suspected that the code simplified away could be replaced by a more efficient, more general version, it should not be taken out until such a better version is submitted and proves superior.

I am almost certain that, going through the code, selecting some minor eval factors at random, taking them out, and testing the result on fishtest, there is a non-trivial likelihood of getting a pass...

@vondele
Member

vondele commented Nov 27, 2018

After extensive testing, I propose new bounds [1,4]@STC(10+0.1) and [0,3]@LTC(60+0.6). These bounds significantly reduce (5x) the probability of passing 0 Elo patches, while at the same time improving the probability of passing 1 or 2 Elo patches by 35% and 20% respectively. The number of patches applied, as well as the Elo gained, is expected to increase by 30%. The testing resources needed increase by roughly 2x, assuming the number of patches tested remains the same. To illustrate the proposal, the pass probability under the old and the new scheme can be compared:

[image: pass probability vs. patch Elo, old scheme vs. proposed scheme]

There is quite a bit more detail in the notebook that I used to derive this proposal. You can view it here:
http://nbviewer.jupyter.org/github/vondele/Stockfish/blob/tools/tools/Fishtest_SPRT_opimization.ipynb
or actually play with the bounds and see the effect via Binder.

@Alayan-stk-2

Using higher bounds at STC than at LTC, to counter the repeated testing of minor variations while minimizing harm to real elo-gainers (which go on to LTC), is clever.

And your page detailing your testing procedure and allowing to easily try variants is awesome.

I tested some possibilities. Getting the gap between the two bounds below 3 elo quickly increases the resource cost, making it not worthwhile.

The two most important variables are probably "elo gain of accepted patches" and "cost estimate".

Your proposed bounds give 1.3057 and 1.9594, which I will use as the reference; the goal is to squeeze more elo at a similar resource cost, or to keep the same elo gain at a lower resource cost. The elo/resource ratio of your proposal is 0.6663, which as expected is lower than the current setup's (each increase in precision tends to lower this ratio), but it allows squeezing more elo out of the submitted patches, which is beneficial as long as the available resources are not a strong bottleneck (which they currently are not). Hence, at a similar ratio, the higher the elo gain compared to the current bounds, the better.

Here are the results for some interesting combinations I tried, tweaking with decimal bounds; I include some bounds that gave poor results:

  • STC [0.8, 4] LTC [0,3]: 1.3781; 1.9984; 0.6896
  • STC [0.8, 4] LTC [0.1,3.2]: 1.3272; 1.9707; 0.6735 (inferior to the variant above)
  • STC [0.6, 3.9] LTC [0,3]: 1.4511; 2.1042; 0.6896
  • STC [0.7, 3.8] LTC [0,3]: 1.4345; 2.1473; 0.6680 (inferior to the variant above)
  • STC [0.5, 3.8] LTC [0,3]: 1.4883; 2.1851; 0.6811
  • STC [0.5, 3.8] LTC [0,3.2]: 1.4403; 2.0649; 0.6975
  • STC [0.5, 3.8] LTC [-0.1,3.2]: 1.5284; 1.9944; 0.72846 - @vondele Can you check if there is a blind spot in your formulas here? I suspect, looking at the area under the graph, that your formula fails to account for elo-losing patches which manage to get in. Relaxing the LTC bounds results in a noticeable decrease in resource usage, but when testing with LTC [-10,3] (which any junk patch would pass), I get results indicating an even stronger gain, which is obviously wrong.
  • STC [0.3, 3.8] LTC [0,3.1]: 1.5064; 2.1564; 0.6986
  • STC [0.5, 4] LTC [0,3.2]: 1.4193; 1.9557; 0.7257
  • STC [0.4, 4] LTC [0,3.2]: 1.4417; 1.9741; 0.7303

The last bounds in my list are the strongest proposals I found, but with the elo-losing patches not taken into account in the elo-gain ratio, and the number of patches between -1 and 0 elo underestimated by the gaussian estimate, this is inconclusive.

All of this does not take into account patches which get better/worse depending on TC. Arguably, while more favorable than the previous bounds, these new bounds are still not ideal for good scalers, which will have difficulty clearing the STC barrier if they don't bring enough elo at low TC. But I don't see a great solution for those, and in any case there is no way the current STC/LTC tests can tell us whether an apparently small gain is actually a loss at VLTC, or an apparently small loss a gain at VLTC. The filtering process itself makes detecting discrepancies between STC and LTC results (which could prompt additional testing) more or less impossible.

@vondele
Member

vondele commented Nov 28, 2018

@Alayan-stk-2 thanks for testing. I tried to stick to integer sprt bounds, but that's just for convenience.

If you want to explore formulas that allow a non-negligible pass rate for negative elo patches, you have to adjust the range of values (used for both the plots and the numerical integration in the summary stats) defined in input cell 5:

 # basic graphics in the 0-5 logistic Elo range, also used for quick numerical integration below.
x = np.linspace(0, 5, 1000)

so that it covers the range below zero. All the variants I've tested safely allowed truncating the below-zero range.
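For instance (an illustrative choice, keeping the original grid density of 200 points per Elo):

# basic graphics extended to the -2..5 logistic Elo range
x = np.linspace(-2, 5, 1400)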

@Alayan-stk-2

Alayan-stk-2 commented Nov 28, 2018

I extended it to a range including elo losers. This is actually quite interesting from a resource-cost perspective: there are many patches in that range, so how quickly they are rejected, depending on the bounds, also influences the bounds' overall effectiveness, and the simple increase in resource usage here reduces the relative increase overall.

The gaussian distribution is also a quite rough approximation: it significantly overestimates patches in the 1.5-2.5 range and significantly underestimates patches in the -2 to -0.5 range, both of which are quite relevant.

I used a quick hack: I put mu at 0.15 to lower the proportion of expected elo-gaining patches, and for x < 0 I return the gaussian value at x/3. This is rough, but at least a better match for the data; a sketch follows below.
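A sketch of that hack as a density function; SIGMA here is a placeholder for whatever width the notebook actually uses, and the result is no longer a normalized distribution:

import numpy as np

MU, SIGMA = 0.15, 1.0  # SIGMA is a placeholder, not the notebook's actual value

def patch_density(x):
    # Gaussian centered at 0.15; for x < 0 evaluate at x/3, which stretches
    # the negative side 3x and fattens the tail of elo-losing patches.
    x = np.asarray(x, dtype=float)
    z = np.where(x < 0.0, x / 3.0, x)
    return np.exp(-0.5 * ((z - MU) / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))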

With (1,4) and (0,3):

[image: elo-graph]

I got an actually surprising result: using stupidly loose bounds increases the proportion of elo-gaining patches that pass by so much that, even with the losses from the many regressions let in, the simulation counts it as gaining more elo for less resource usage than the current setup. This obviously ignores the questions of code quality and ease of future improvements, but it speaks volumes about just how much the current bounds reject.

Now, here are the results I got for some previously tested bounds (elo gain ratio; resource cost ratio; elo gain/resource), testing on the [-2,5] range (the lower bound is helpful to better appreciate resource usage; the ratio is noticeably better than at [0,5]):

  • STC [1, 4] LTC [0,3]: 1.3255; 1.8241; 0.7267
  • STC [0.6, 3.9] LTC [0,3]: 1.5290; 1.9448; 0.7862
  • STC [0.5, 3.8] LTC [0,3.2]: 1.519; 1.9444; 0.7812
  • STC [0.4, 4] LTC [0,3.2]: 1.5237; 1.8332; 0.8312

Tuning the bounds is inaccurate because the distribution of patches in the simulation lacks accuracy.

I got this:

  • STC [0.3, 3.9] LTC [0,2.8]: 1.7015; 2.0661; 0.8235

It achieves this by being less strict about rejecting bad patches than vondele's proposal (36% fewer regressions than now, instead of 85% fewer - still stricter than the current bounds), and by accepting 80% more +0.5 elo patches than now, 145% more +1 elo patches (not a typo), 70% more +1.5 elo patches, and 25% more +2 elo patches (basically accepting almost all of them). All these numbers carry significant inaccuracy, but the point stands.

This makes the importance of the bounds change even more obvious.

@xoto10
Contributor

xoto10 commented Dec 1, 2018

After extensive testing, I propose new bounds [1,4]@STC(10+0.1) and [0,3]@LTC(60+0.6).

Great news! I am surprised that bounds only 3 Elo apart appear possible, but it would be great if that works.
Do you think a phased approach would be best, e.g. using [0,4] for both STC and LTC for a while, and then possibly changing to STC [1,4] and LTC [0,3] if we are still happy? Or would you suggest moving straight to your proposed scheme?

@Alayan-stk-2

If we want to keep things "simple" (not using too many decimals in the bounds), STC [0.5, 4] and LTC [0, 3] would still give better results/resource efficiency than sticking to integers with [1,4] and [0,3].

I know your question is for vondele, but I don't see the point of that intermediate phase. According to vondele's tool, [0,4] for both would give around two-thirds of the elo benefit at an efficiency similar to the bounds I tested, which is in itself worse (in theory the two would perform similarly if fishtest resources became a bottleneck, though that setup would discourage the useless random tests more). An additional downside is that the confidence that a passing patch is not just lucky is not increased, while debate about the lack of confidence in passing patches is what prompted this whole discussion.

Increasing the lower bound of STC is an important part of the proposed change, to filter out more effectively the random variations which don't add elo.

Arguably, "speculative LTCs", or LTC tests for LTC tunings, should have to pass something like [0.5, 3.5] rather than [0, 3].

And basic simplifications should probably have to pass something like [-2, 1] (massive simplifications are another beast); there is no point in adding 1 line of code gaining 1 elo if that same line can be removed 6 months later as a simplification.

@snicolet snicolet changed the title from "Pawn and Piece Values Tuned [merged]" to "[merged] Pawn and Piece Values Tuned" Dec 1, 2018
@xoto10
Contributor

xoto10 commented Dec 1, 2018

Besides, the current bounds have been chosen to ensure that any added code
is worth a lot of Elo, because we would see the code quality slowly deteriorate
over time if we accepted small Elo-gainer patches leading to spaghetti code.

But is this actually a problem in practice?
I think the SF code is very neat, with almost no "spaghetti".

What I do think is that simplifications need to be made harder and/or discouraged a bit.
A big problem for SF in the last TCEC season was Kf8/Kf1-type moves, which contributed to a number of losses. A couple of weeks ago vondele made a patch on extensions to possibly help with this, but it has already been simplified so that it no longer addresses those moves.
There is a PR for 8x8 pawn PSQT tables, which may help, but fishtest resources are already being used by a tuning job to simplify them away. What is the point in trying to pass constructive patches if people are just going to try to simplify them away immediately?
I think new patches should be left in for a while to bed in. This holds in general, but especially when they are aimed at particular weak spots in SF's play.

The problem we have currently is too much simplification, not too much spaghetti.

@Alayan-stk-2

Alayan-stk-2 commented Dec 1, 2018

What I do think is that simplifications need to be made harder and/or discouraged a bit.

I absolutely agree.

I'm almost tempted to randomly choose some code line somewhere, remove it, and get the result to pass [-3, 1], just to prove this point.

If a single code line adds 0.5 elo to Stockfish, that's huge. But with the current setup for simplifications, it is very easy to get it simplified away. If one simplification doesn't pass, just try a few at random and at some point you'll remove useful code.

The simplification tests make no distinction between trivial or minor simplifications, which only help the project if they are almost elo-neutral, and big simplifications, where losing 1 or 2 elo is justifiable by making room for a better replacement.

There is a PR for 8x8 pawn PSQT tables, which may help, but fishtest resources are already being used by a tuning job to simplify them away. What is the point in trying to pass constructive patches if people are just going to try to simplify them away immediately?

I saw the same thing on fishtest, and it rubbed me the wrong way.

Chess is not a fully symmetrical game, and it stands to reason that the kingside and queenside aren't identical. 8x8 PSQTs are somewhat harder to tune, as there are more values, but that's pretty much the only downside. The execution time to poll a value from an 8x8 PSQT is virtually identical (the operations are the same; the 8x8 table only takes a slightly bigger amount of cache, but if it's sitting in L2 that's very unlikely to matter).

Now that fishtest has a lot of resources at its disposal, 8x8 PSQTs are an easy way to grab a few more elo.

Simplification for the sake of simplification is misguided.

@IIvec
Contributor

IIvec commented Dec 2, 2018

I just want to thank everybody for the fruitful discussion here.

@xoto10
Contributor

xoto10 commented Dec 4, 2018

After grumbling about simplifications earlier, I ran a test on the simplifications merged over the last 2 months:
http://tests.stockfishchess.org/tests/view/5c03d48b0ebc5902bcee1a79
ELO: 0.73 +-1.4 (95%) LOS: 84.4%
Total: 100000 W: 21550 L: 21341 D: 57109

There was one Elo-gaining patch included, vondele's castling extension change (around +1 Elo), so that the simplifications to it would apply. Given the margin of error, we have to consider this essentially zero gain/loss in Elo from a fair number of simplifications, which seems like a good result. With that, I'm much more comfortable with simplifications now, at least as regards them not having a negative effect on SF's Elo rating. I do think it was a mistake to simplify vondele's castling patch, but what's done is done. I tried a test to reinstate it, but it was nowhere near passing.
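As a sanity check, the headline figure follows from the usual logistic conversion of the match score:

import math

W, L, D = 21550, 21341, 57109
score = (W + 0.5 * D) / (W + L + D)
print(round(-400 * math.log10(1 / score - 1), 2))  # 0.73, as reported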

@snicolet
Member

snicolet commented Dec 6, 2018

I just want to thank everybody for the fruitful discussion here.

Me too!

@miguel-l
Contributor

miguel-l commented Dec 8, 2018

I was playing around with the notebook shared by @vondele (great work btw!). I got interesting results with the following:

I kept vondele's bounds for LTC [0,3] and STC [1,4], but for STC I changed the confidence to 0.10 (from 0.05), while keeping LTC at 0.05. Here are the summary statistics from the bottom of the page:

Summary statistic                        | STC [1,4], LTC [0,3] | STC [1,4] conf. 0.10, LTC [0,3]
Ratio of the number of patches applied   | 1.2917               | 1.4696
0 ELO pass prob ratio vs now             | 0.1598               | 0.5264
1 ELO pass prob ratio vs now             | 1.3493               | 1.8246
2 ELO pass prob ratio vs now             | 1.2278               | 1.2250
Ratio of ELO gain                        | 1.3057               | 1.4093
Cost estimate                            | 1.9594               | 1.8398

It's not strictly better, since it doesn't reduce 0 ELO patches as much (although it still does well compared to the current values), but the 1 ELO pass ratio goes up quite a bit, and in terms of resource use and ELO gain ratio it does well enough. I think there's something to be said for tweaking the STC confidence level (since LTC is at 0.05 anyway). Maybe this will encourage others to play around with it some more and see if there's a better way, varying the STC confidence jointly with the bounds. I see that others have already proposed other bounds, so maybe later on I'll start playing with this again.

@Alayan-stk-2

@miguel-l Technically, the degree of confidence and the bounds are interchangeable: you can achieve a result mathematically equivalent to increasing/lowering the confidence by increasing/lowering the bounds, because the bounds and the confidence together determine a single pair of decision values. The lower one is a value A under which the SPRT test is considered failed; the upper one is a value B over which the SPRT test is considered successful. A must be smaller than B. Then if x < A, the test fails; if x > B, the test passes; if A < x < B, the result is still uncertain, so keep testing.

All that to say that your confidence tweak is not fundamentally different from my tweaks on bounds a few messages above 🙂
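For concreteness: on fishtest the quantity being tracked is the log-likelihood ratio (LLR) of the elo1 hypothesis against the elo0 hypothesis, and the two cutoffs come directly from alpha and beta; this is where the (-2.94, 2.94) shown in the test results above comes from:

import math

alpha = beta = 0.05
reject = math.log(beta / (1 - alpha))  # -2.944, the -2.94 on the test pages
accept = math.log((1 - beta) / alpha)  # +2.944, the 2.94
print(reject, accept)
# elo0/elo1 set how fast the LLR drifts for a patch of given strength, while
# alpha/beta set the cutoffs: two knobs with largely interchangeable effects.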

I think at this point, tweaking for the best possible bounds is limited by our assumptions about the distribution of patch Elo values.

Also, something I have had in mind lately: how many of the +1 elo patches are duplicates? If it is a significant amount (including minor variations), then the elo benefit of increased +1 elo passing may be smaller than expected, which changes the optimal bounds.

@snicolet It is nice that you appreciate the discussion here, but can you tell us whether there are plans to test new bounds based on vondele's work and additional tweaking? If you have some reservations, what are they? 🙂

@miguel-l
Contributor

miguel-l commented Dec 8, 2018

Ahh, interesting. I should probably review statistics. Now that I think about it more, it makes sense that the tweaks are interchangeable and that tweaking only the bounds might be easier. Thanks for taking the time to explain.

Regarding the elo distribution, I can see how it affects the weight we give to patches, but at the same time I can't really think of a better way than the current estimate using a normal distribution. But we can also hope that duplicates will be lessened/discouraged by the new bounds. After all, fishtest is dynamic, and people are encouraged or discouraged from testing more depending on how empty or full it is; new bounds which use more resources may actually affect this behavior as well.

@vdbergh
Contributor

vdbergh commented Dec 8, 2018

The precise result appears to be that if

(elo0+elo1)/2=(elo0'+elo1')/2

then to first order an SPRT(elo0,elo1,alpha,beta) is equivalent to an SPRT(elo0',elo1',alpha',beta') for suitable alpha',beta'.

The relation between (alpha,beta) and (alpha',beta') is as follows: let

theta=(elo1'-elo0')/(elo1-elo0).

Then

(beta/(1-alpha))^theta=beta'/(1-alpha'), 
((1-beta)/alpha)^theta=(1-beta')/alpha'.

Note that these are approximations, but they should be quite accurate for the type of SPRTs that are run on fishtest.

Here is a graph that gives the equivalence between bounds and confidence (I am assuming as reference alpha=beta=0.05 so that confidence=alpha'=beta').

http://hardy.uhasselt.be/Toga/ratio_vs_confidence.png

E.g. going from SPRT(0,5) to SPRT(0.5,4.5) (ratio = 0.8) entails relaxing alpha=beta from 0.05 to 0.087 to maintain the same pass probabilities.

EDIT: I don't think there are more equivalences between tests than those indicated by the above formulas. If this is true then it shows that there is a limit to the interchangeability of confidence levels and bounds: a symmetric test (alpha=beta) can not be equivalent to an asymmetric one (alpha<>beta).

EDIT2: The graph also shows that increasing confidence is actually quite cheap. To go from alpha=beta=0.05 to alpha=beta=0.01 one may equivalently decrease the length of the elo interval by a factor of 1.561 (and keep alpha=beta=0.05). So resource usage increases by a factor of 1.561^2 = 2.43. This is much lower than I had expected.
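A small sketch that solves those two relations for (alpha', beta'), written directly from the formulas above; it reproduces the 0.087 figure:

def equivalent_alpha_beta(alpha, beta, theta):
    # Solve (beta/(1-alpha))^theta = beta'/(1-alpha')
    # and   ((1-beta)/alpha)^theta = (1-beta')/alpha'   for (alpha', beta')
    u = (beta / (1.0 - alpha)) ** theta
    v = ((1.0 - beta) / alpha) ** theta
    beta_p = u * (v - 1.0) / (v - u)
    alpha_p = (1.0 - beta_p) / v
    return alpha_p, beta_p

# SPRT(0,5) -> SPRT(0.5,4.5): same midpoint 2.5, theta = (4.5-0.5)/(5-0) = 0.8
print(equivalent_alpha_beta(0.05, 0.05, 0.8))  # ~(0.087, 0.087)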

@ghost

ghost commented Dec 8, 2018

The randomness of STC comes from the time management code interacting with eval: different patches push time management to allocate very different time slices, which influences searched depth. LTC allows some minimum depth and plenty of hash space, so it's naturally more stable.
STC time-slice variability can be reduced with tighter time controls like 1+0.2, which leave less room for move-time changes.
Example game durations (seconds) at moves 10, 20, 30, ..., 250:

10+0.1: 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70
1+0.15: 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50, 53, 56, 59, 62, 65, 68, 71, 74, 77
1+0.2: 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 90, 94, 98, 102
60+0.6: 132, 144, 156, 168, 180, 192, 204, 216, 228, 240, 252, 264, 276, 288, 300, 312, 324, 336, 348, 360, 372, 384, 396, 408, 420
2+0.9: 22, 40, 58, 76, 94, 112, 130, 148, 166, 184, 202, 220, 238, 256, 274, 292, 310, 328, 346, 364, 382, 400, 418, 436, 454
2+1: 24, 44, 64, 84, 104, 124, 144, 164, 184, 204, 224, 244, 264, 284, 304, 324, 344, 364, 384, 404, 424, 444, 464, 484, 504

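All of the lists above follow one simple formula, assuming both engines spend their entire clock (real games of course end earlier):

def game_seconds(moves, base, inc):
    # Each side gets base + moves * inc seconds of clock; two sides per game
    return 2 * (base + moves * inc)

print([game_seconds(m, 10, 0.1) for m in (10, 20, 30)])  # [22.0, 24.0, 26.0]
print(game_seconds(250, 60, 0.6))                        # 420.0, as in the 60+0.6 row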
I've made a test here to check if the 10+0.1 time control has ELO effects (silly, but it would be interesting to see whether random move times lead to ELO changes):
http://tests.stockfishchess.org/tests/view/5c0c41430ebc5902bceed4a4

@mcostalba

Thanks, people!

I have opened a specific issue: #1859

Please keep the discussion moving there. Thanks.

@mcostalba mcostalba closed this Dec 23, 2018
CounterPly pushed a commit to CounterPly/Stockfish that referenced this pull request May 19, 2019