[merged] Pawn and Piece Values Tuned #1804
Conversation
For both of the patches we have that failed STC but passed LTC, I would suggest two runs with (-3,1) bounds, one at STC and one at VLTC, just to make sure that we are not losing any elo there.
@FauziAkram STC carries no validity whatsoever and (-3,1) VLTC is a total waste of resources as it will surely pass
Firstly the easy part: @SFisGOD please in the future also add the STC result even if negative; we (all the people) would like to have the full information, no need to "hide" it (we are grown enough to stand a failed STC and a successful LTC).

Now the difficult part. To me STC is a way to quickly filter out bad patches so that we go through the resource-consuming LTC only with good candidates. So to me an LTC success is a strong result, technically speaking. On the other hand, if we apply this PR blindly, nobody can prevent people from submitting failed STC patches to LTC, leading to a nightmare. So we have to deal with 2 topics here: a technical one (is a good LTC enough?) and a guideline-breaking one (is it ok to roll your dice again, going for an LTC after you failed an STC?)
@mcostalba Thanks for the feedback. It was not my intention at all to hide that this patch failed STC. I thought it was enough that vondele pointed it out in his comment. However, I understand that people might want to know the full information so I edited the PR to include this missing information about the failed STC.
LTC-tuning patches should be an exception, in my opinion: they should at least be tried at LTC even if they failed STC.
From my point of view 1 passed LTC is not enough, because usually when we try speculative LTCs it is for patches that "almost" passed STC, which have LOS like 94% and had LLR > 2, thus basically guaranteeing being a non-regression with 90+% probability, especially if it's some stuff that can scale well (based on intuition mostly :) ).
When the patch is a tuning done with tons of LTC games, the waste of resources would be to not do LTC testing because of a failed STC. LTC tunings should always be tested at LTC no matter the STC result.
If STC is only a first filter, it doesn't need to be excessively stringent; the statistical requirements there could be relaxed. On the other hand, the 95% at LTC seems to not generate enough confidence in some cases, so there should be guidelines for additional testing when needed. Imho, the 120+1.2 test in cases such as this one would bring more information than a second 60+0.6 run, so the additional resource use serves a purpose. Patches failing STC but succeeding clearly at LTC are not very common; this won't overwhelm the testing framework.
I'm also of the opinion that we should have 2 successful tests to be more sure that the patch isn't a fluke. Those tests should be on 2 different time controls, the 60+0.6 run being one of them.
The current STC of 10+0.1 has outlived its usefulness. I favor the two-pass method. Perhaps 30+0.3 and 60+0.6, but my preference would be 60+0.6 and 90+0.9. A double pass at those two time controls would lessen the possibility of a "rogue" patch getting through. This change should not be considered lightly - we need to re-run some tests with both failed and passed patches to see if the result would be different. I would especially be interested in patches approved with yellow or failed STC results. I suspect that you will find some patches that failed previously would now pass, and some that passed would now fail. Historically (from many years ago), it was always thought that patches impacting the search function needed to be tested at longer time controls, and patches that impacted evaluation could be tested at shorter time controls. Also, I believe when a patch passes very quickly - say in fewer than 20,000 games - we may need to keep it running until a fixed number of games has been reached. My thought is that 30,000 or 40,000 games should be the minimum. My own anecdotal experience is that the ones that pass the quickest - say in only 10,000 games or so - are most prone to be a lucky result.
what do we have? one problem.
I disagree, I don't think we have the hw to run all tests at 30+ or 60+
These 2 ideas seem contradictory to me. I seriously doubt we have enough hw to run 30+0.3 for stc and 60+0.6 for ltc, and certainly not for 60 and 90. And clearly not to run longer tc AND use more games per test. But I agree that more games would give more reliable tests and that flukes often come from the shortest tests. Because more games would be good, I think we should aim for that, presumably by amending the SPRT parameters somehow, and keep the existing stc and ltc time controls. If tuning runs are best run and tested at ltc, then maybe they are the one case where there is no point in using stc. I defer to others on what extra tests (if any) are required in this case.
Interesting - if senior devs think this is still a useful idea, perhaps they could say so, and it could form part of the "grey area" guidance from them around any parts of the test rules that are not completely automatic.
@mcostalba Mr Costalba, like yesterday I remember the fight against people who were advocating that 10" games are adequate for everything, and which facilitated the increase of the LTC from 40" to 60", which in my opinion upgraded the chess quality of SF. What do you suggest now? There have been many discussions on improving the TCs, but it seems to me that there are too many different good ideas. We have much more computational power than in the past; I think it's a great opportunity to make good use of it and form a superior product.
2nd statement is NOT true actually.
Let's look at the test. The piece values have been changed by +-2. Not significant. However, pawnValueMg has decreased by 6. This may have a significant (and interesting) impact. It is also used in search.cpp.
This is not adding any code, and STC was almost yellow. I don't see any problem committing this one.
It is not clear whether the compensation that we get through BayesElo outweighs the effect of the increased draw rate, because this also depends on many other factors. The BayesElo formula is not linked to the draw rate but to the TC. Depending on the condition of SF it could be a lot easier or a lot harder to pass; i.e. SF with a base negative contempt, or zero contempt, or SF5 would each have a different easiness-of-passing STC-LTC relation.

We can not theoretically conclude that LTC is easier to pass, because in practice we observe the opposite effect for SFdev. But if I am wrong on this and you are right, and it is indeed the case that with neutral scaling it is easier to pass LTC than STC, then we are doing something terribly wrong for passing only a very small % of our passed STCs at LTC. With your theory we would pass >50%, but we pass how much? Not even 20%? Does someone have this info? You simply can not fail 15 in a row at 50%.

Maybe the terrible thing we are doing is that by starting with an STC pass demand we catch bad scalers and exclude good scalers, as a good scaling curve would start low and rise, and a bad one would start high and fall. No matter how someone sees it, a change of the STC is direly needed. Given the amount of our resources maybe a more expensive methodology is sensible (like we did in the past with 40"->60", and as @IIvec proposes), but if we want to keep the same gameflow we still have many options:
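The "fail 15 in a row at 50%" claim above is one line of arithmetic; a quick check assuming independent tests that would each pass with probability 0.5:

```python
# probability of failing 15 independent SPRTs in a row
# if each one actually had a 50% chance to pass
p_pass = 0.5
p_15_fails = (1 - p_pass) ** 15
print(p_15_fails)  # 3.0517578125e-05, i.e. about 1 in 32768
```

So if LTC really were a coin flip for these patches, a streak of 15 failures would be essentially impossible by chance.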
@SFisGOD we could introduce an additional rule: LTC-tuning patches should be tested at LTC only. Indeed, the STC step for these kinds of patches, which are green-lighted to LTC in any case, is a waste of time/resources. For all the other kinds of patches, if they fail STC they should be dropped. Indeed the general rule still applies: a failed STC patch can be tested at LTC only in special cases after approval. To summarize:
@snicolet what do you think?
@mcostalba PR updated with Passed 2nd LTC
@mcostalba Is the proposition for LTC-tuning patches to be tested at LTC only, and to pass once or twice?
@SFisGOD good work, I was pretty confident that this would pass a 2nd LTC.
@mcostalba: concerning your suggestions
I personally think that a semi-automatic rule for these 'speculative LTC' tests is more practical (less discussion, quicker, etc.). E.g. require the STC to have an LOS of >85% (or 90%) at the point of failing (85% is ~50000 games or more).
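As a sketch of how such an LOS figure can be computed from a test's win/loss counts (draws cancel out), using the usual normal approximation; the function name is mine, not fishtest's:

```python
import math

def los(wins, losses):
    """Likelihood of superiority via the normal approximation:
    LOS = Phi((W - L) / sqrt(W + L))."""
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# e.g. the failed STC of this PR (W: 5846, L: 5903) gives an LOS of roughly 0.30
print(round(los(5846, 5903), 3))
```

A rule like "LOS > 85% at the point of failing" would then be a mechanical check on these counts rather than a judgment call.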
I agree about speculative LTCs, especially if the STC at some point had an LLR like 2.7 - it's obvious that such a patch is, with 90+% probability, not a regression, so if it passes LTC it's a non-regression with confidence like 99.6% instead of the usual 99.75%.
@vondele This is a good idea, but if the preconditions for testing LTC are met anyway, it should not waste any more games. I mean, why manually interpret the SPRT result afterwards in a wider manner which serves us better, when we can just use slightly easier STC bounds to have the same effect, fully automatic, terminating at the right point instead of running in full? This way we will be more accurate, more economical, and with less need for discussions. This can be easily achieved with the bounds; are we in the stone age or something?
@NKONSTANTAKIS why not, nice mathematical derivation and corresponding patches welcome. My suggestion merely matches what has been the common approach of some of the longstanding contributors when doing speculative LTC.
Concerning testing LTC-tuned patches once or twice to remove flukes: it is important to realize that while testing twice is very effective at removing patches with negative elo, it is equally effective at removing patches with (small) positive elo. A patch that brings ~2.5 BayesElo (~1 Elo at LTC) has roughly a 50% chance of passing [0,5], but only 25% of passing two consecutive LTC tests. (Obviously the same holds similarly for testing STC and LTC, assuming the Elo performance of the patch is the same at STC and LTC.) To give a fair chance to small elo-gaining patches and to remove flukes, the better approach would be to increase the confidence levels. There are some estimates for the increase in cost of this in fishcooking.
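The point that a single-run pass probability p becomes p^2 over two consecutive runs can be illustrated with a small Monte Carlo SPRT sketch. This is not fishtest's actual model: it assumes a plain trinomial game model with logistic Elo, a fixed 60% draw rate, and a deliberately wide hypothetical [0, 10] Elo interval so the simulation stays fast:

```python
import math
import random

def result_probs(elo, draw_rate=0.6):
    # expected score from logistic Elo, split into win/draw/loss probabilities
    score = 1 / (1 + 10 ** (-elo / 400))
    return score - draw_rate / 2, draw_rate, 1 - score - draw_rate / 2

def sprt_pass_prob(true_elo, elo0, elo1, bound=2.94, trials=300):
    """Monte Carlo estimate of the probability that a patch of true_elo
    passes an SPRT of H0: elo0 vs H1: elo1 with LLR bounds (-bound, bound)."""
    w0, d0, l0 = result_probs(elo0)
    w1, d1, l1 = result_probs(elo1)
    # per-game log-likelihood-ratio increments for win/draw/loss
    inc_w, inc_d, inc_l = math.log(w1 / w0), math.log(d1 / d0), math.log(l1 / l0)
    w, d, _ = result_probs(true_elo)
    passes = 0
    for _ in range(trials):
        llr = 0.0
        while -bound < llr < bound:
            r = random.random()
            llr += inc_w if r < w else inc_d if r < w + d else inc_l
        passes += llr >= bound
    return passes / trials

random.seed(1)
p = sprt_pass_prob(5, 0, 10)  # a patch sitting mid-interval passes roughly half the time
print(p, p * p)               # one run vs. two consecutive runs
```

A patch at the midpoint of the interval passes a single run about half the time, so requiring two consecutive passes cuts it to about a quarter, matching the argument above.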
Failed STC
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 27487 W: 5846 L: 5903 D: 15738
http://tests.stockfishchess.org/tests/view/5be1d3190ebc595e0ae2e5b8

Passed 1st LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 38503 W: 6270 L: 5999 D: 26234
http://tests.stockfishchess.org/tests/view/5be1f5ef0ebc595e0ae2e750

Passed 2nd LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 34016 W: 5584 L: 5326 D: 23106
http://tests.stockfishchess.org/tests/view/5be2a1970ebc595e0ae2f1b4

This pull request led to an interesting discussion about testing methodology for Stockfish: #1804

Bench: 3647775
@Vizvezdenec @vondele @IIvec Yes, let's do this, instead of having to ask and decide for every patch when to do a speculative LTC, while having to wait for tons of games for the STC to "fail" yellow, which would be translated into "passing for LTC". We can just use the bounds and LLR, which will do it for us. The goal is to make STC more flexible about what we test at LTC, so that we don't cut off as many promising patches and don't have to run so many speculative LTCs. @mcostalba @snicolet What do you think about that? @vondele I am not so worried about confidence, because our current bounds are high. So it does not matter much if we accept a regression 1 in 20; we definitely move forward. Confidence is very expensive. If we wanted to accept smaller elo gains then we would indeed need higher confidence. The need for this would arise when we are unable to pass LTCs. My worry is completely locking ourselves out of patches which scale well but start negative/neutral/not-positive-enough at this STC.
I have merged the change after the second LTC run (congrats and thanks!) in commit cd732c0, but I leave the PR open to let the discussion flow... |
By the way, we can continue this discussion for as long as you want, but shouldn't this pull request be merged meanwhile?
It is already; snicolet wrote:
Oh yes, sorry, I wanted to write this message in pull request #1809, not here.
Is something planned to evaluate a change? So far it seems ignored, because SF is still able to make progress, even if that progress is harder.

I also question [-3,1] for functional simplifications. Obviously, a simplification is not expected to win elo, and an elo loss is worthwhile when it leads to greater maintainability and easier improvements in the future. However, a -1 elo simplification is about as likely as a +2 elo patch to pass. When the code that is removed takes 1 or 2 lines, a 1 elo loss is actually very significant. If it is suspected that the code simplified away could be replaced by a more efficient, more general version, it should not be taken out until such a better version is submitted and proves superior. I am almost certain that by going through the code, selecting some minor eval factors at random, taking them out, and testing the result on fishtest, there is a non-trivial likelihood of getting such a removal to pass...
After extensive testing, I propose new bounds [1,4]@stc (10+0.1) and [0,3]@ltc (60+0.6). These bounds significantly reduce (5x) the probability of passing 0 Elo patches, while at the same time improving the probability of passing 1 or 2 Elo patches by 35-20%. The number of patches applied, as well as the Elo gained, is expected to increase by 30%. The testing resources needed increase by roughly 2x, assuming the number of patches tested remains the same. To illustrate the proposal, the pass probability between the old and new scheme can be compared. There is quite a bit more detail in the notebooks that I used to derive this proposal. You can view it here:
The higher bounds at STC than at LTC, to counter the repeated testing of minor variations while minimizing harm to real elo-gainers which go on to LTC, is clever. And your page detailing your testing procedure and allowing one to easily try variants is awesome. I tested some possibilities. Getting a range smaller than 3 elo between both bounds quickly increases the resource cost, making it not worthwhile. The two most important variables are probably "elo gain of accepted patches" and "cost estimate". Your proposed bounds give 1.3057 and 1.9594, which will hence be used as a reference - the goal is to squeeze more elo at a similar resource cost, or to keep the same elo gain at a lower resource cost. The resource/elo ratio of your proposal is 0.6663, which as expected is lower than the current setup (each increase in precision tends to lower this ratio), but allows squeezing more elo out of the submitted patches, which is beneficial if the available resources are not a strong bottleneck (which they currently are not). Hence, at a similar ratio, the higher the elo gain compared to the current bounds, the better. Here are the results for some interesting combinations I tried, tweaking with decimal bounds. I include some bounds resulting in poor results:
The last bounds in my list are the strongest proposals I found, but with the elo-losing patches not taken into account in the elo-gain ratio, and the number of patches between -1 and 0 elo underestimated by the gaussian estimate, this is inconclusive. All of this does not take into account the case of patches which get better/worse depending on TC. Arguably, while more favorable than the previous bounds, these new bounds are not ideal for good scalers which will still have difficulty clearing the STC barrier if they don't bring enough elo at low TC. But I don't see a great solution for those, and in any case, there is no way that the current STC/LTC tests can inform us if an apparently small gain is not a loss at VLTC, or if an apparent small loss is not a gain at VLTC. The filtering process itself makes detecting discrepancies between STC and LTC results (which could prompt additional testing) more or less impossible. |
@Alayan-stk-2 thanks for testing. I tried to stick to integer SPRT bounds, but that's just for convenience. If you want to explore formulas that allow a non-negligible pass rate for negative elo patches, you have to adjust the range of values (used for both plots and numerical integration in the summary stats) defined in input cell 5 to cover the range below zero:

    # basic graphics in the 0-5 logistic Elo range, also used for quick numerical integration below.
    x = np.linspace(0, 5, 1000)

All variants I've tested safely allowed truncating the below-zero range.
I extended it to a range including elo losers. This is actually quite interesting from a resource cost perspective, because there are many patches in that range, so how quickly they are rejected depending on bounds also influences the bounds' overall effectiveness, and the simple increase in resource usage here reduces the relative increase overall.

The gaussian distribution is also quite a rough approximation. It significantly overestimates patches in the 1.5-2.5 range and significantly underestimates patches in the -2 to -0.5 range, both of which are quite relevant. I used a quick hack: I put the mu at 0.15 to lower the proportion of expected elo-gaining patches, and if x < 0, I return the gaussian distribution value for x/3. This is rough, but at least a better match for the data.

With (1,4) and (0,3):

I got an actually surprising result: using stupidly loose bounds increases the proportion of elo-gaining passing patches so much that, even with the loss from many regressions, the setup is estimated to gain more elo for less resource usage than the current setup. This obviously ignores the question of code quality and ease of future improvements, but speaks volumes about just how much the current bounds reject.

Now, I got these results for some previously tested bounds (elo gain ratio; resource cost ratio; elo gain/resource), testing on the [-2,5] range (the lower bound is helpful to better appreciate resource usage; the ratio is noticeably better than at [0,5]):
Tuning the bounds is inaccurate because the distribution of patches in the simulation lacks accuracy. I got this:
It achieves this by being less strict about rejecting bad patches than vondele's proposal (36% fewer regressions than now, instead of 85% fewer - still stricter than the current bounds); and by accepting 80% more +0.5 elo patches than now, 145% more +1 elo patches (not a typo), 70% more +1.5 elo patches, and 25% more +2 elo patches (basically accepting almost all of them). All these numbers carry a significant inaccuracy, but the point stands. This makes the importance of the bounds change even more obvious.
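The modified prior described a few comments up (gaussian with mu at 0.15, negative side evaluated at x/3) might look something like the following sketch; the sigma value and the function name are my assumptions, the thread does not give them:

```python
import numpy as np

def patch_density(x, mu=0.15, sigma=1.0):
    # rough prior over patch Elo: a gaussian centered at mu, but with
    # the negative side evaluated at x/3 to fatten the tail of
    # elo-losing patches, as described in the comments above
    z = np.where(x < 0, x / 3.0, x)
    return np.exp(-((z - mu) ** 2) / (2 * sigma ** 2))
```

Weighting the pass-probability curves by such a density is what turns per-elo pass rates into the aggregate "elo gain ratio" and "resource cost" numbers quoted above.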
Great news! I am surprised that bounds only 3 Elo apart appear possible, but it would be great if that works.
If we want to keep things "simple" (not using too many decimals in the bounds), STC [0.5,4] and LTC [0,3] would still give better results/resource efficiency than sticking to integers with [1,4] and [0,3].

I know your question is for vondele, but I don't see the point of that intermediate phase. According to vondele's tool, a [0,4] [0,4] would give around two-thirds of the elo benefit at a similar efficiency to the bounds I tested, which is in itself worse (in theory, the two would perform similarly if fishtest resources became a bottleneck, but that would discourage more of the useless random tests); an additional downside is that the confidence that the patch is not just a lucky one is not increased, while debate about the lack of confidence in passing patches is what prompted this whole discussion.

Elo bounds: increasing the lower bound of STC is an important part of the proposed change, to filter out more effectively random variations which don't add elo. Arguably, "speculative LTC", or LTC tests for LTC tunings, should pass something like [0.5,3.5] rather than [0,3]. And basic simplifications should probably pass something like [-2,1] (massive simplifications are another beast); there is no point in adding 1 line of code gaining 1 elo if this same line can be removed 6 months later as a simplification.
But is this actually a problem in practice? What I do think is that simplifications need to be made harder and/or discouraged a bit. The problem we have currently is too much simplification, not too much spaghetti.
I absolutely agree. I'm almost tempted to randomly choose some code line somewhere, remove it, and get the result to pass [-3,1] to prove this point. If a single code line adds 0.5 elo to Stockfish, that's huge. But with the current setup for simplifications, it is very easy to get it simplified away. If one simplification doesn't pass, just try a few at random and at some point you'll remove useful code. The simplification tests make no distinction between trivial or minor simplifications, which only help the project if they are almost elo-neutral, and big simplifications, where losing 1 or 2 elo is justifiable by making room for a better replacement.
I saw the same thing on fishtest, and it rubbed me the wrong way. Chess is not a fully symmetrical game, and it stands to reason that kingside and queenside aren't identical. The 8x8 PSQT are somewhat harder to tune as there are more values, but that's pretty much the only downside. The execution time to poll a value from the 8x8 PSQT is virtually identical (the operations are the same; the 8x8 table only takes a slightly bigger amount of cache, but if it's sitting in L2 that's very unlikely to matter). Now that fishtest has a lot of resources at its disposal, 8x8 PSQT are an easy way to grab a few more elo. Simplification for the sake of simplification is misguided.
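To make the lookup-cost point concrete, here is a sketch of the two table styles; the names are illustrative, not Stockfish's actual code. The mirrored table halves storage by assuming kingside and queenside are scored identically, which is exactly the assumption being questioned:

```python
def psqt_mirrored(table_8x4, rank, file):
    # classic file-mirrored PSQT: files e-h (4-7) reuse the a-d (0-3) columns
    f = file if file < 4 else 7 - file
    return table_8x4[rank][f]

def psqt_full(table_8x8, rank, file):
    # full 8x8 PSQT: kingside and queenside squares can score differently
    return table_8x8[rank][file]
```

Both probes are just an index or two into a small array, which is why the runtime difference is negligible; only the table size (and tuning effort) differs.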
I just want to thank everybody for fruitful discussion here. |
After grumbling about simplifications earlier, I ran a test on simplifications merged over the last 2 months. There was one Elo-gaining patch included, Vondele's castling extension change (around +1 Elo), so that the simplifications to it would apply. Considering the margin of error, we have to treat this as essentially zero gain/loss in Elo from a fair number of simplifications, which seems like a good result. Given that, I'm much more comfortable with simplifications now, at least in them not having a negative effect on SF's Elo rating. I do think it was a mistake to simplify Vondele's castling patch, but what's done is done. I tried a test to reinstate it, but it was nowhere near passing.
Me too! |
I was playing around with the notebook shared by @vondele (great work btw!). I got interesting results with the following: I kept vondele's bounds for LTC [0,3], and STC[1,4], but for STC, I changed the confidence to 0.10 (from 0.05), while keeping LTC at 0.05. Here are the summary statistics from the bottom of the page:
It's not strictly better, since it doesn't reduce 0 ELO patches as much (although it still does well compared to current values), but the 1 ELO pass ratio goes up quite a bit. In terms of resource use and ELO gain ratio, it does well enough. I think there's something to be said about tweaking confidence levels for STC (since LTC is at 0.05 anyway). Anyway, maybe this will encourage others to play around with it some more and see if there's a better way by varying the STC confidence jointly with the bounds. I see that others have already proposed other bounds so maybe later on I'll start playing with this again. |
@miguel-l Technically, the degree of confidence and the bounds are interchangeable. That is, you can achieve a result mathematically equivalent to increasing/lowering confidence by increasing/lowering the bound, because bound+confidence is used to determine a single value. For the lower SPRT threshold, it is a value A under which the SPRT test is considered failed. For the upper SPRT threshold, it is a value B over which the SPRT test is considered successful. A must be smaller than B. Then if x < A, the test failed. If x > B, the test is successful. If A < x < B, the result is still uncertain: keep testing. All that to say that your confidence tweak is not fundamentally different from my tweaks on bounds a few messages above 🙂

I think at this point, tweaking for the best possible bounds is limited by assumptions about the distribution of patch Elo. Also, something I had in mind lately: how many of the +1 elo patches are duplicates? If this is a significant amount (including minor variations), then maybe the elo benefit of increasing +1 elo passing will be smaller than expected, which changes the optimal bounds.

@snicolet It is nice that you appreciate the discussion here, but can you tell us if there are plans to test new bounds based on vondele's work and additional tweaking? If you have some reservations, what are they? 🙂
Ahh, interesting. I should probably review statistics. Now that I think about it more, I guess it makes sense how the tweaks could be interchangeable and only tweaking the bounds might be easier. Thanks for taking the time to explain. Regarding the elo distribution, I can see how it affects what kind of weight we give to patches, but at the same time, I can't really think of a better way than the current estimate using normal distribution. But also we can hope that the duplicates will be lessened/discouraged by new bounds. After all, Fishtest is dynamic and people are encouraged/discouraged to test more depending on how empty/full it is. New bounds which utilize more resources may actually affect this behavior as well. |
The precise result appears to be that if an SPRT(elo0, elo1) test with confidence levels (alpha, beta) has its Elo interval shrunk around its midpoint by a ratio r, then in first order the shrunk test, run with modified confidence levels (alpha', beta'), has the same pass probabilities as the original. The relation between (alpha, beta) and (alpha', beta') is

    log((1 - beta') / alpha') = r * log((1 - beta) / alpha)
    log((1 - alpha') / beta') = r * log((1 - alpha) / beta)

Then, for a symmetric test (alpha = beta), this reduces to the single relation log((1 - alpha') / alpha') = r * log((1 - alpha) / alpha).

Note that these are approximations but they should be quite accurate for the type of SPRTs that are run on fishtest. Here is a graph that gives the equivalence between bounds and confidence (I am assuming as reference alpha=beta=0.05 so that confidence=alpha'=beta'). http://hardy.uhasselt.be/Toga/ratio_vs_confidence.png E.g. going from SPRT(0,5) to SPRT(0.5,4.5) (ratio=0.8) entails a reduction in confidence from 0.05 to 0.087 to maintain the same pass probabilities.

EDIT: I don't think there are more equivalences between tests than those indicated by the above formulas. If this is true then it shows that there is a limit to the interchangeability of confidence levels and bounds: a symmetric test (alpha=beta) can not be equivalent to an asymmetric one (alpha<>beta).

EDIT2: The graph also shows that increasing confidence is actually quite cheap. To go from alpha=beta=0.05 to alpha=beta=0.01 one may equivalently decrease the length of the elo interval by a factor 1.561 (and keep alpha=beta=0.05). So resource usage increases by a factor 1.561^2=2.43. This is much lower than I had expected.
The randomness of STC is because of the time management code interacting with eval: different patches push time management to allocate very different time slices, which influences searched depth. LTC allows some minimum depth and plenty of hash space, so it's naturally more stable.

10+0.1: ",22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70"
1+0.15: ",5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59,62,65,68,71,74,77"
1+0.2: ",6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102"

I've made a test here to check if the 10+0.1 time control has Elo effects (silly, but it would be interesting to see if random move times lead to Elo changes).
Thanks people! I have opened a specific issue #1859 Please keep moving on there. Thanks. |
Pawn and Piece Values Tuned at LTC
Failed STC
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 27487 W: 5846 L: 5903 D: 15738
http://tests.stockfishchess.org/tests/view/5be1d3190ebc595e0ae2e5b8
Passed 1st LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 38503 W: 6270 L: 5999 D: 26234
http://tests.stockfishchess.org/tests/view/5be1f5ef0ebc595e0ae2e750
Passed 2nd LTC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 34016 W: 5584 L: 5326 D: 23106
http://tests.stockfishchess.org/tests/view/5be2a1970ebc595e0ae2f1b4
Bench: 3460731