How does the graph differ from Lichess's? #242
Comments
I guess this line? e here is some number between 0 and 1 |
Well that's fetched from My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen? |
It's been a while, but as far as I remember, Leela outputs winrates which Nibbler just uses directly. Stockfish doesn't but does output WDL stats, which can be easily converted to winrates - which I think is what Nibbler does. |
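For readers who want the WDL conversion spelled out, here is a minimal Python sketch. It is not Nibbler's actual code; the function names are hypothetical. It assumes the engine prints a `wdl W D L` triple in per-mille inside its UCI `info` line (as Stockfish does when `UCI_ShowWDL` is enabled) and turns it into an expected score between 0 and 1.

```python
from typing import Optional, Tuple


def parse_wdl(info_line: str) -> Optional[Tuple[int, int, int]]:
    """Extract the (win, draw, loss) triple from a UCI info line, if present."""
    tokens = info_line.split()
    if "wdl" not in tokens:
        return None
    i = tokens.index("wdl")
    try:
        win, draw, loss = (int(t) for t in tokens[i + 1:i + 4])
    except ValueError:
        # Fewer than three values, or non-numeric values, after "wdl".
        return None
    return win, draw, loss


def expected_score_from_wdl(win: int, draw: int, loss: int) -> float:
    """Expected score in [0, 1]: a win counts 1, a draw counts 0.5."""
    total = win + draw + loss
    if total <= 0:
        raise ValueError("WDL values must sum to a positive number")
    return (win + 0.5 * draw) / total


line = "info depth 20 score cp 35 wdl 120 800 80 nodes 1000"
wdl = parse_wdl(line)
if wdl is not None:
    print(expected_score_from_wdl(*wdl))
```

The WDL triple is reported from the side to move's point of view, so a graph drawn from White's perspective would need to flip the value on Black's turns.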
As of official-stockfish/Stockfish#4216, Stockfish now normalizes its centipawn values to an exact winrate (shipped in Stockfish 15.1). Since Leela operates in winrates directly, I think the way Nibbler does it is actually where the future is headed. Winrates also have a well-defined range (0.0 to 1.0), and the blunder/mistake/inaccuracy detection is in winrates in the first place (so if you want the graph to reflect the detections, we'd want to graph in winrates anyway). It's likely that in the future Lichess changes their chart to look like Nibbler's, actually. |
@OverLordGoldDragon do you have a screenshot of Lichess and a screenshot of Nibbler showing the same game? (And if you could attach the PGN it would be easy to reproduce where the biggest differences are) |
Granularity is my main issue. Note how Nibbler is mostly "ez" or "u ded". I don't think it's tied to the compute limit per move, but I could be wrong (tested a few configs). PGN -- https://lichess.org/VjoUufjl/black |
Where The rounding in Stockfish's
So replacing
with something like
should do what you're looking for. |
Thanks for the investigation @yuzisee !
lol. Yep. Indeed I've not tested very compute-limited settings, I expected Lichess isn't being that cheap but it makes sense.
So, is it "already winrate" from Stockfish or not?
I presume that's from your local run? Then before/after pics would be helpful - though I can just test myself later with previous point clarified |
According to official-stockfish/Stockfish#4216 it's "already winrate" on move 32 of each game. There's about a 7% reduction starting at move 0, up to a 15% increase by move 128. Lichess is drawing centipawns though, even though Stockfish is calibrating to winrate internally. Discussion:
Yes, the screenshot from |
I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that. Does your proposed code insertion handle it? Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread? |
Yes. The linked thread answers all that.

One way to think about it is the quote "Centipawns are great for developing chess engines, which is their main use. But not so much for human comprehension." from (2) which you linked above.

What happened is that this same "human comprehension" problem also caused a headache for the Stockfish team because, as you saw in 1 that you linked above, both Lichess & Chess.com (and now Nibbler) look at

So in October 2022 the Stockfish team set out to try to reduce this headache by getting very detailed about the winrate↔centipawns relationship inside the engine. That's what this thread talks about: official-stockfish/Stockfish#4216. At this point the winrate model had been thoroughly tested for 28 months in development, 24 months in production, and three full Stockfish versions (12, 13 & 14).

Comparing the… latest Stockfish formula
Stockfish's current value for
with the… latest Lichess formula
Lichess's current value for we can get "win_scale" evals corresponding to e.g. −2.5, −1.0, +1.0, and +2.5 "centipawn_scale" evals
Yes, the proposed code insertion handles it, because Nibbler reads the direct If you make the proposed code changes, you'll get the same "line graph" as Lichess (as long as you're running at the same depth as Lichess; Lichess seems to run at very low depth).
To verify what is out-of-date and what is up-to-date, we have to trace the following:
Summary:

Lichess calculates winrate based on actual player performance during rapid time controls, but draws their graph using Stockfish centipawns. Stockfish calculates centipawns using NNUE and uses a formula to keep them in sync with winrates, with a process that is stable between releases.

Given that Lc0 now calculates winrates directly, and converts to a centipawn value purely to work with the Stockfish definition and keep them in sync, it seems like graphing winrate is probably the better choice when performing engine analysis. Note that Nibbler ideally works well with both Stockfish type engines and Leela type engines, so showing winrate helps achieve that goal since it's a stable output reference point for both engine types.

For now, the Stockfish centipawn↔winrate equivalence and the Lichess centipawn↔winrate equivalence are both based on the https://en.wikipedia.org/wiki/Logit function, so it's very simple (5 lines of code) to match them up visually as we saw here #242 (comment)

TL;DR:

The (2) you linked was first published in 2018. Stockfish did not provide winrates back then. Stockfish started showing winrates in June 2020 and redefined centipawns to match winrates in Nov 2022. Meanwhile, Lichess keeps a separate winrate formula because it's based on player performance in rapid time controls whereas Stockfish is giving you the "winrate of the position itself" so to speak. Nobody is relying on centipawns as a reference point anymore. |
Thanks for the highly detailed response! So TL;DR: Lichess graphs centipawns, but labels blunders etc. via winrate, and centipawns != winrate. Stockfish outputs winrate, and your code converts it back to centipawns. Correct?

Also, I don't know whether "winrate" is useful at all, if it's based on, as you suggested earlier, what the engine thinks its odds of winning the game are. To a perfect player, the graph only ever takes on three values (1, -1, 0), and the current engines are around 90% of the way there, if not 99% with enough time. This isn't an issue early on since they can't yet quickly see "mate in 64", but with enough pieces off the board that's often enough what my graphs look like.

It'd be more informative to output "human winrate", e.g. if I lose an unimportant pawn in the opening, a perfect player's eval is 1, but mine is still around 0. Hence it'd also be tailored to rating and time control. You do say "Lichess calculates winrate based on actual player performance during rapid time controls", so is this actually what's happening? I guess, minus the rating part, instead they calibrate it to ~2800 I suppose.
Geez, you searched and typed up all that? I mean, not complaining! PS, Lc0 Discord often discusses interesting stuff, in case you're not there already. |
Also I take it your labeling code is unaffected by changing the graph code? |
The same happens in centipawns though. At high depth, the graph will only ever take on three values: (+200 a.k.a. Mate for white, −200 a.k.a. Mate for black, 0.0.) The problem isn't centipawns vs winrate, it's that the depth of your engine is too high.
Lc0 specifically added this feature just last month! You give it your Elo and it calculates a "human winrate" at that level for any/each position. https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation/ When I get back home maybe I'll see about a pull request for Nibbler to support this new setting.
Oh neat, I'll check it out. Thanks!
Yes. They are separate and don't rely on each other. |
Just saw that you referenced LeelaChessZero/lc0#1791 and wanted to add my two cents on eval definition/normalization and win rate, despite the issue already being closed. I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games, and requires some translation to be of use, which is mostly what Lc0 PR1791 is about. After working on that quite some time, I came to the conclusion that what is now the I'd personally strongly prefer Nibbler to show eval (SF centipawn or Lc0 WDL_mu) on a [-2.5, +2.5] y-axis with dashed lines marking the +1 and -1 boundaries for 50% W/L over displaying the expected score; maybe I can convince @rooklift why that would be a good idea :) |
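The proposed display is easy to picture in code. The following Python sketch is only an illustration with hypothetical names, not Nibbler's implementation: it clamps an eval (in pawns, from White's perspective) to [-2.5, +2.5], maps it to a vertical pixel position, and computes where the dashed +1 and -1 guide lines would go.

```python
from typing import Tuple


def eval_to_graph_y(eval_pawns: float, height: float, limit: float = 2.5) -> float:
    """Map an eval to a y pixel; y = 0 is the top of the graph (best for White)."""
    clamped = max(-limit, min(limit, eval_pawns))
    return height * (limit - clamped) / (2.0 * limit)


def guide_line_positions(height: float, limit: float = 2.5) -> Tuple[float, float]:
    """Y positions of the dashed lines at +1.0 and -1.0."""
    return eval_to_graph_y(1.0, height, limit), eval_to_graph_y(-1.0, height, limit)


height = 100.0
for ev in (-4.0, -1.0, 0.0, 0.3, 1.0, 6.0):
    print(ev, eval_to_graph_y(ev, height))
print(guide_line_positions(height))
```

Everything beyond the limit collapses onto the top or bottom edge, which is the clamping behaviour discussed later in the thread.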
Your and yuzisee's comments are excellent Wiki material and shouldn't just get lost in a closed issue. Do you have a summary somewhere that explains your approach? @rooklift Consider enabling repository Wiki, where you can say "Not verified by core maintainers" to absolve review responsibility. I still don't know what some of the buttons do, maybe some will volunteer to explain. |
A summary for the approach taken in Lc0 can be found in the linked PR1791, with original parametrization choice, explanation of the WDL_mu eval definition and comparison of pawn-odds startpos between SF and Lc0 probably being the most useful comments. |
The trouble with defaulting to I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing. |
And since we have @Naphthalin here, I'll ask a question:
|
Not sure where you took that from; if the user is using defaults (and not setting any contempt related parameters), there won't be any WDL adjustment. However, the beauty of the
Seems like his opinion is similar, given that #244 makes my proposed "display centipawn from -2.5 to +2.5 eval, with dashed lines depicting +1.0 and -1.0" the default.
Due to Nibbler being inherently a GUI for Lc0, this question is very closely related to how the To throw around some numbers in the |
Although I like the sort of "critical strip" in the new graph, the clamping of it to the [-2.5, 2.5] centipawn range is a bit counterintuitive... |
What about a simple alternative like displaying the |
I'd prefer whatever makes the interpretation approximately linear, minus the clamping part (and maybe that's what you also intend). |
The |
Then certainly there should be a conversion first (to a common eval)? If the method can't check which engine is running, then the best solution is to make it. If that's a problem, I guess just document it. |
It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine |
Isn't that just "doing evaluation"? |
Hence the notion of "evals above +2.5 are imaginary numbers anyway". This kind of (self-)disagreement only really seems to happen above that value. |
(comment deleted, centipawn analysis moved to #242 (comment) and inaccuracy/mistake/blunder discussion moved to #237 (comment)) |
What do you mean by self-disagreement? Don't more evaluated positions just mean an updated (and presumably more accurate) evaluation? |
I find Nibbler's evaluation graph to be less granular and more prone to picking extremes (totally winning or totally losing) than Lichess's, while both analyze with Stockfish 16 (though the amount of compute does differ).
Is there a line of code I can edit to mimic Lichess's metric?