
wrong evaluation of draw position #4103

Closed
mendi80 opened this issue Jul 6, 2022 · 20 comments

Comments

@mendi80

mendi80 commented Jul 6, 2022

Position: q7/8/2p5/B2p2pp/5pp1/2N3k1/6P1/7K w - - 0 1
The latest version (5/07/2022) misses the draw and evaluates the position at -4.8, compared to an older version (14/05/2022) that sees the draw and evaluates it at 0.00.
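For anyone trying to reproduce this, the FEN can be pasted into any UCI build of Stockfish; a minimal session (the depth limit here is just an illustrative choice, not what the reporter used) looks like:

```text
uci
position fen q7/8/2p5/B2p2pp/5pp1/2N3k1/6P1/7K w - - 0 1
go depth 40
```

The engine then prints `info ... score cp ...` lines; a build that sees the fortress should eventually settle at `score cp 0`.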

@ErdoganSeref

The issue is similar to #3894. @vondele closed it by saying the following:

There will always be positions that are wrongly evaluated, especially constructed ones. Not much we can do to fix particular positions, but fortunately, the number of them gets smaller over time.

@vondele
Member

vondele commented Sep 4, 2022

I think this is an interesting, hard puzzle for engines, in that it basically needs to resolve to the 50-move rule to see the draw evaluation. For this, one needs to extend the search very deep, which causes something like #3911... on the other hand, the right move is found almost instantly.
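To make the 50-move-rule point concrete: the fifth field of a FEN is the halfmove clock, and a draw can be claimed once it reaches 100 half-moves. In the issue's position the clock is 0, so the search has to look roughly 100 plies ahead to prove the fortress. A minimal stdlib sketch (the function name is my own):

```python
def plies_until_fifty_move_claim(fen: str) -> int:
    """Plies left before a 50-move-rule draw can be claimed.

    The halfmove clock (fifth FEN field) counts plies since the last
    capture or pawn move; a claim becomes possible at 100 half-moves.
    """
    halfmove_clock = int(fen.split()[4])
    return max(0, 100 - halfmove_clock)

# Position from this issue: clock is 0, so the draw only resolves
# about 100 plies deep, far beyond typical search horizons.
fen = "q7/8/2p5/B2p2pp/5pp1/2N3k1/6P1/7K w - - 0 1"
print(plies_until_fifty_move_claim(fen))  # 100
```

This is why the eval only collapses to 0.00 once the search effectively proves that no progress is possible before the clock runs out.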

@Craftyawesome

There is also the reverse of this problem. Position: rnb1q2k/2ppnr1p/p4p1B/3PpP1N/1P1bP1R1/2NB4/1PP3PP/5R1K w - - 5 22. SF15 doesn't see the loss and evaluates 0.00 (draw), compared to an older version, SF14, which sees the loss and evaluates -2.00.

Do you have the full solution? SF14 also eventually drops to near 0.

@dav1312
Contributor

dav1312 commented Sep 9, 2022

This is the line I got from some quick analysis:

[FEN "rnb1q2k/2ppnr1p/p4p1B/3PpP1N/1P1bP1R1/2NB4/1PP3PP/5R1K w - - 5 22"]

22.Nxf6 Qd8 23.d6 cxd6 24.Bc4 d5 25.Ncxd5 Nxd5 26.Nxd5 Ra7 27.c3 d6
28.Rf3 Nc6 29.Rg5 Rab7 30.h4 Ba7 31.Kh2 Ne7 32.Nxe7 Qxe7 33.f6 Qe8 34.Bxf7 Qxf7
35.Rg7 Be6 36.g4 *

But then the latest dev refuted the whole thing with 24.Rf3, which is winning for White.

@Craftyawesome

SF14 at depth 40 evaluates it at -1.50

Keep in mind that Lichess's SF14 uses a smaller net that is a few Elo worse on average. It looks like it missed 24.Rf3, as mentioned above.

@dav1312
Contributor

dav1312 commented Sep 9, 2022

White is winning! Therefore, the position is not a draw.

That is the wrong conclusion. It's a draw because Black can play 22...Rxf6.

@peregrineshahin
Contributor

peregrineshahin commented Sep 11, 2022

@MaiaChess I will assume you're asking in good faith.
First, the concept of mistakes and inaccuracies is not embedded in Stockfish; it is just Lichess's way of interpreting the size of an eval shift. And yes, you can lose on chess.com or Lichess without playing any mistakes or inaccuracies, by being outplayed bit by bit with no large jumps in the eval. That's one thing.

Second, Lichess uses a different net, and possibly a modified SF binary, to run smoothly in the browser.
Third, it is not SF 15; it shows as Stockfish 14+ on my end.
Fourth, you cannot compare engines running in a browser to find the critical eval shift between two engines; do it on good hardware.
Fifth, this is unrelated to the issue.
Sixth, it's not a wrong eval of the whole game.
I could go on and on.

@peregrineshahin
Contributor

Simply put, Black is lost because it got outplayed bit by bit. There are millions or billions of possibilities, and some variations differ so little in eval that choosing between them is the battleground between engines.
Download SF 15 yourself, run it with good parameters, and analyze the game properly; you will certainly find moments with larger eval shifts.
When two engines play against each other, this is the expected behavior: outplaying the other engine can happen in lines that differ by 0.04 or even less, and that makes sense because those differences accumulate.
In any case, the goal of SF 15 was never to be a chess evaluator; it is an engine that needs to make good moves and win. If recent versions evaluate better, that is a byproduct of the recent Elo gains (the product of getting better results than the previous version), not the main goal of Stockfish development. And even if it were the goal, it would be unobtainable anyway, because nobody knows the real truth about a position.
I suggest you think of SF 15 as an engine that wouldn't make the mistakes SF 8 did, and not primarily as a better evaluator (which it nonetheless already is).

@peregrineshahin
Contributor

For god's sake, what do you mean by a Lichess server analysis? The requested analysis takes about one second to finish for the whole game; how is that remotely trustworthy?
Don't get me wrong, I love Lichess, but even running the analysis in your browser is better than the request-server-analysis button. When you want to analyze games between two engines, you can't get a full report with one click in one second; maybe you can in 2090, but not in 2022.
Look at the Lichess source code, or ask somebody at Lichess, and they will completely agree that this feature can't and shouldn't give reports about engine-vs-engine games.
You are repeating yourself again and again; just go see how many nodes this feature calculates per move. It is not a good indicator. It works for human games, though not even for the top players out there.

@bftjoe
Contributor

bftjoe commented Sep 11, 2022

Lichess uses fixed-nodes analysis at a low hash size; it means nothing for engine-vs-engine play.

@dav1312
Contributor

dav1312 commented Sep 11, 2022

@MaiaChess Please don't use Lichess's server analysis for engine-vs-engine games.
Their analysis is limited to 1.5M nodes and is intended for games between humans.
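In UCI terms, a fixed-node budget like the one described is just a `go nodes` search rather than a depth- or time-limited one; assuming the 1.5M figure above, the equivalent commands would be:

```text
position fen rnb1q2k/2ppnr1p/p4p1B/3PpP1N/1P1bP1R1/2NB4/1PP3PP/5R1K w - - 5 22
go nodes 1500000
```

1.5M nodes is a fraction of a second of search for modern Stockfish on desktop hardware, which is why such evals say little about engine-vs-engine games.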

@Craftyawesome

Which one is correct, SF15 or SF11?

The position is completely lost; which move is the best try to save the game is subjective. Neither is really right or wrong.

And if you want to argue the subjective point, Leela also prefers e4 (-5 to -11).

@jxu

jxu commented Sep 16, 2022

@MaiaChess Are you impersonating the creators of https://github.com/CSSLab/maia-chess?

@shorome2

Just one inaccuracy can't lose a game; an inaccuracy is defined by the delta function used to evaluate mistakes.
You will see this game analyzed in the future by an SF20 that can find the blunders and mistakes in it.
Please note that SF8 did not have NNUE when it played against AlphaZero.

I should have grabbed popcorn for this...

Your view of computer chess is a bit simplistic. SF's NN is tuned on ~5k CPUs. The tests you gave weren't even performed on TCEC equipment, much less all of Fishtest. The accuracy of a specific move in the middlegame isn't objective; it's subjective to the ideas being calculated versus what the opponent is actually doing. If your opponent has already played this game, won it, and remembers it perfectly, do you expect an engine with very limited computing power to think far enough ahead to draw these games?

I expect SF, using max threads and hash and a 7-man TB, to correctly evaluate the positions you gave. The only problem is that you're expecting moves calculated by a supercomputer to be refuted by your PC at low depth, low hash, and low thread count, without an EGTB.

Should there be better scalability between what SF can do on a supercomputer vs. your PC? Yes.
But, as was mentioned earlier, you should probably use a supercomputer to analyze a game from another supercomputer. Depth 40 is nothing; I have a third-gen Ryzen 7 with only 16 GB of RAM, and it can reach depth 70 (selective depth 100) in 3-5 minutes on these game positions.

@RogerThiede

@MaiaChess, what is it that you're trying to convey? Do you believe you have found an actual issue in the source code? Do you believe you have found a systemic issue in the latest neural network?
We can find many positions that are drawn by Syzygy tablebase lookup but for which the latest neural network gives vastly different evaluations.

@jxu

jxu commented Sep 17, 2022

Should there be better scalability between what SF can do on a supercomputer vs your PC?

Yes, that's right. But remember Kasparov's loss to an amateur opponent: Kasparov thought his opponent had an idea behind each move, but the opponent was just playing his simple game.

SF15 is stronger than SF11, but this strength will cause its own weaknesses, because it is not perfect.

#4103 (comment)

What does this even mean? It's too vague to tell, and there's no point in comparing engine calculations to human calculations. Maybe Kasparov lost a simul or blitz game to an amateur once, but it sounds more like a made-up urban myth.

@jxu

jxu commented Sep 17, 2022

I mean that sometimes simple moves are more powerful than each moves

?????????????

@nathan-lc0

The original position that started this issue no longer seems to be a problem. The latest Stockfish finds Ne4 and evaluates it at 0.00 almost instantly on a single thread. Perhaps this can be closed.

@Craftyawesome

I am getting -2.08 on a single thread. Multithreaded search seems to flatline at some random eval, sometimes 0. Either way, the issue tracker might not be the best place for positions SF gets wrong.

@vondele
Copy link
Member

vondele commented Nov 21, 2022

I'm closing this in light of the last two comments.

@vondele vondele closed this as completed Nov 21, 2022