
How does the graph differ from Lichess's? #242

Closed
OverLordGoldDragon opened this issue Aug 22, 2023 · 33 comments

@OverLordGoldDragon
Contributor

I find Nibbler's evaluations graph to be less granular and more prone to picking extremes (totally winning or totally losing) than Lichess's, while both analyze with Stockfish 16 (though amount of compute does differ).

Is there a line of code I can edit to mimic Lichess's metric?

@rooklift
Owner

I guess this line?

https://github.com/rooklift/nibbler/blob/master/files/src/renderer/55_winrate_graph.js#L96C5-L96C30

e here is some number between 0 and 1

@OverLordGoldDragon
Contributor Author

Well that's fetched from eval_list[n] and eval_list = node.future_eval_history();, so it's really in future_eval_history(), but that's defined as this.get_end().eval_history();?

My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen?

@rooklift
Owner

rooklift commented Aug 22, 2023

It's been a while, but as far as I remember, Leela outputs winrates which Nibbler just uses directly. Stockfish doesn't, but it does output WDL stats, which can be easily converted to winrates - which I think is what Nibbler does.

@yuzisee

yuzisee commented Aug 25, 2023

As of official-stockfish/Stockfish#4216, Stockfish now normalizes its centipawn values to an exact winrate. (Shipped in Stockfish 15.1)

Since Leela operates in winrates directly, I think the way Nibbler does it is actually where the future is headed. Winrates also have a well defined range (0.0 to 1.0) and the blunder/mistake/inaccuracy detection is in winrates in the first place (so if you want the graph to reflect the detections, we'd want to graph in winrates anyway).

It's likely that in the future Lichess changes their chart to look like Nibbler's, actually.

@yuzisee

yuzisee commented Aug 25, 2023

I find Nibbler's evaluations graph to be less granular and more prone to picking extremes (totally winning or totally losing) than Lichess's

@OverLordGoldDragon do you have a screenshot of Lichess and a screenshot of Nibbler showing the same game? (And if you could attach the PGN it would be easy to reproduce where the biggest differences are)

@OverLordGoldDragon
Contributor Author

Granularity is my main issue. Note how Nibbler is mostly "ez" or "u ded". I don't think it's tied to the compute limit per move, but I could be wrong (tested a few configs).

PGN -- https://lichess.org/VjoUufjl/black

@yuzisee

yuzisee commented Aug 25, 2023

My understanding is, engines output centipawns, then there's some conversion formula for win %. So where's this calculation happen?

https://github.com/vondele/Stockfish/blob/ad2aa8c06f438de8b8bb7b7c8726430e3f2a5685/src/uci.cpp#L223

  • win_rate_per_mille = 1000 / (1 + std::exp((a - x) / b))
  • win_rate = 1.0 / (1 + std::exp((a - x) / b))
  • win_rate = σ((x − a) / b) = σ(centipawn_scale_value)
  • logit(win_rate) = centipawn_scale_value

Where σ is the standard logistic function, and logit is the logit function: https://en.wikipedia.org/wiki/Logit
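In code, that logistic/logit pair looks like this (a minimal sketch; `a` and `b` here are placeholder coefficients for illustration, not Stockfish's real win_rate_model values):

```javascript
// σ and logit are inverses of each other: logit(σ(t)) === t (up to float error).
const sigma = (t) => 1 / (1 + Math.exp(-t));
const logit = (p) => Math.log(p / (1 - p));

const a = 300, b = 60;        // placeholder coefficients (NOT Stockfish's)
const x = 150;                // some internal engine eval
const scaled = (x - a) / b;   // centipawn_scale_value
const win_rate = sigma(scaled);

// logit inverts the mapping, recovering the scaled eval:
console.log(Math.abs(logit(win_rate) - scaled) < 1e-12); // true
```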

The rounding in Stockfish's win_rate_model means resolution of win_rate is capped at

  • 0.9995 = 99.9% ≅ centipawn_scale_value of +7.6
  • 0.0005 = 0.1% ≅ centipawn_scale_value of −7.6

So replacing

let y = (1 - e) * height;

with something like

let clamped_e = Math.min(Math.max(e, 0.0005), 0.9995);  // values from 0.0005 to 0.9995
let logit_e = 2.0 * Math.atanh(2.0 * clamped_e - 1.0);  // values from −7.6 to +7.6
let logit_e_scaled = (logit_e / 7.6) / 2.0;  // values from −0.5 to +0.5
let logit_e_scaled_01 = logit_e_scaled + 0.5;  // values from 0.0 to 1.0
let y = (1 - logit_e_scaled_01) * height;

should do what you're looking for.
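As a quick sanity check of that snippet (same constants, just wrapped in a function): 0.5 stays at mid-graph and the clamp endpoints land essentially at the top and bottom:

```javascript
// Same transform as the snippet above, wrapped in a helper for testing.
const toGraphFraction = (e) => {
  const clamped = Math.min(Math.max(e, 0.0005), 0.9995);
  const logit_e = 2.0 * Math.atanh(2.0 * clamped - 1.0); // ≈ −7.6 … +7.6
  return logit_e / 7.6 / 2.0 + 0.5;                      // ≈ 0.0 … 1.0
};

console.log(toGraphFraction(0.5));    // 0.5
console.log(toGraphFraction(0.9995)); // ≈ 1.0
console.log(toGraphFraction(2.0));    // out-of-range input is clamped, also ≈ 1.0
```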

Here's what that would look like
[screenshot: graph with the proposed scaling applied]

@yuzisee

yuzisee commented Aug 25, 2023

Note how Nibbler is mostly "ez" or "u ded".

Observe that the specific move you're referring to, 46. Re4, is in fact "u ded" at higher Stockfish depth. It's only with a lower depth that you can see the granularity you're looking for.

[screenshots: Stockfish evaluations of the same position at several depths]

The longer you run Stockfish, the closer to the Nibbler chart it's going to get. So, technically, Nibbler's current "picking extremes" is actually the more correct behavior.

I don't think it's tied to the compute limit per move

Put another way, if you had a much slower computer your chart would look like the Lichess one. On the other extreme, if you run a supercomputer long enough it eventually finds "Mate in 40" after 46. Re4 meaning the game is completely lost.

@OverLordGoldDragon
Contributor Author

OverLordGoldDragon commented Aug 26, 2023

Thanks for the investigation @yuzisee !

46. Re4 is, in fact, "u ded" at higher Stockfish depth

lol. Yep. Indeed I've not tested very compute-limited settings, I expected Lichess isn't being that cheap but it makes sense.

So replacing

So, is it "already winrate" from Stockfish or not?

Here's what that would look like

I presume that's from your local run? Then before/after pics would be helpful - though I can just test myself later with previous point clarified

@yuzisee

yuzisee commented Aug 26, 2023

So, is it "already winrate" from Stockfish or not?

According to official-stockfish/Stockfish#4216 it's "already winrate" on move 32 of each game.

There's about a 7% reduction starting at move 0, up to a 15% increase by move 128.

Lichess is drawing centipawns though, even though Stockfish is calibrating to winrate internally.


Discussion:

  • For most evals, Stockfish's formula is somewhat more extreme with winrates early in the game and more lenient with winrates later in the game. However, Lichess (human data) is far more lenient than anything Stockfish produces.
    • It turns out that humans are expected to make more mistakes than engines; who knew?
  • If the eval is exactly 0.00 it's always considered to be "50% B : 50% W" by every formula.
  • SF v16 @ move 32 with evals of ±1.0 are the calibration points for Stockfish. They will always be 75.0% B : 25.0% W & 25.0% B : 75.0% W respectively (i.e. a 50% advantage) in the Stockfish formula

I presume that's from your local run?

Yes, the screenshot from
#242 (comment)
is drawing centipawns using Stockfish 16 at 10M nodes. You can think of it like

[comparison chart]

@OverLordGoldDragon
Contributor Author

OverLordGoldDragon commented Aug 29, 2023

I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that. Does your proposed code insertion handle it?

Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?

@yuzisee

yuzisee commented Aug 29, 2023

I really don't follow. Why is it something on move X that it's not on other moves? And it changes over time? Well, your linked thread probably answers all that.

Yes. The linked thread answers all that.

One way to think about it is the quote "Centipawns are great for developing chess engines, which is their main use. But not so much for human comprehension." from (2) which you linked above.

What happened is that this same "human comprehension" problem also caused a headache for the Stockfish team because, as you saw in 1 that you linked above, both Lichess & Chess.com (and now Nibbler) look at Win% to classify inaccuracy/mistake/blunder and not centipawns.

So in October 2022 the Stockfish team set out to reduce this headache by getting very detailed about the winrate↔centipawns relationship inside the engine. That's what this thread discusses: official-stockfish/Stockfish#4216. By that point the winrate model had been thoroughly tested: 28 months in development, 24 months in production, and three full Stockfish versions (12, 13 & 14).

Comparing the…

latest Stockfish formula

winrate_scale = 0.5 + 0.5 / (1 + std::exp((a - centipawn*MULTIPLIER) / b)) - 0.5 / (1 + std::exp((a + centipawn*MULTIPLIER) / b))

Stockfish's current values for MULTIPLIER, m, a and b are:

  • MULTIPLIER = 3.28
  • m = ply / 64.0
  • a = 0.38036525 m³ − 2.82015070 m² + 23.17882135 m + 307.36768407
  • b = −2.29434733 m³ + 13.27689788 m² − 14.26828904 m + 63.45318330
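As a sketch, my transcription of the quoted formula (with `centipawn` in centipawn units, so a +1.00 eval is 100; the function name is mine):

```javascript
// Expected score for white (0.0–1.0) per the quoted Stockfish formula.
const MULTIPLIER = 3.28;

function sfWinrateScale(centipawn, ply) {
  const m = ply / 64.0;
  const a =  0.38036525 * m ** 3 -  2.82015070 * m ** 2 + 23.17882135 * m + 307.36768407;
  const b = -2.29434733 * m ** 3 + 13.27689788 * m ** 2 - 14.26828904 * m +  63.45318330;
  const x = centipawn * MULTIPLIER;
  return 0.5
       + 0.5 / (1 + Math.exp((a - x) / b))   // + P(win) / 2
       - 0.5 / (1 + Math.exp((a + x) / b));  // − P(loss) / 2
}

console.log(sfWinrateScale(0, 64));    // 0.5 for an equal eval
console.log(sfWinrateScale(100, 64));  // ≈ 0.75 at move 32 (the calibration point)
```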

with the…

latest Lichess formula

winrate_scale = 1/(1+Math.exp(MULTIPLIER * centipawn_scale))

Lichess's current value for MULTIPLIER is -0.368208
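…and a sketch of the Lichess one (eval in pawns; the helper name is mine):

```javascript
// White's winrate per the quoted Lichess formula (eval in pawns).
const LICHESS_MULTIPLIER = -0.368208;
const lichessWinrate = (evalPawns) =>
  1 / (1 + Math.exp(LICHESS_MULTIPLIER * evalPawns));

console.log(lichessWinrate(1.0));  // ≈ 0.591 (59.1% W)
console.log(lichessWinrate(-2.5)); // ≈ 0.285 (28.5% W)
```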

we can get "winrate_scale" evals corresponding to e.g. −2.5, −1.0, +1.0, and +2.5 "centipawn_scale" evals:

| formula | eval −2.5 | eval −1.0 | eval +1.0 | eval +2.5 |
| --- | --- | --- | --- | --- |
| SF v16 @ move 0 | ~100% B | 79.0% B : 21.0% W | 21.0% B : 79.0% W | ~100% W |
| SF v16 @ move 32 | ~100% B | 75.0% B : 25.0% W | 25.0% B : 75.0% W | ~100% W |
| SF v16 @ move ≥120 | 99.9% B : 0.1% W | 67.5% B : 32.5% W | 32.5% B : 67.5% W | 0.1% B : 99.9% W |
| Lichess 2023 | 71.5% B : 28.5% W | 59.1% B : 40.9% W | 40.9% B : 59.1% W | 28.5% B : 71.5% W |

Does your proposed code insertion handle it?

Yes, the proposed code insertion handles it, because Nibbler reads the direct wdl output of Stockfish which already has the corrections applied.

If you make the proposed code changes, you'll get the same "line graph" as Lichess (as long as you're running at the same depth as Lichess; Lichess seems to run at very low depth).


Regardless, we have 1, then 2 - and from (2) we have "Why not just use Stockfish centipawns?", suggesting it's not using direct Stockfish outputs. I've only skimmed (2). Is (2) outdated, or did I misread?

To verify what is out-of-date and what is up-to-date, we have to trace the following:

Summary

Lichess calculates winrate based on actual player performance during rapid time controls, but draws their graph using Stockfish centipawns. Stockfish calculates centipawns using NNUE and uses a formula to keep them in sync with winrates, via a process that is stable between releases.

Given that Lc0 now calculates winrates directly, and converts to a centipawn value purely to work with the Stockfish definition and keep them in sync, it seems like graphing winrate is probably the better choice when performing engine analysis.

Note that Nibbler ideally works well with both Stockfish-type engines and Leela-type engines, so showing winrate helps achieve that goal since it's a stable output reference point for both engine types. For now, the Stockfish centipawn↔winrate equivalence and the Lichess centipawn↔winrate equivalence are both based on the https://en.wikipedia.org/wiki/Logit function, so it's very simple (5 lines of code) to match them up visually, as we saw here: #242 (comment)

TL;DR:

The (2) you linked was first published in 2018. Stockfish did not provide winrates back then.

In 2020, Stockfish started showing winrates (June 2020) and redefined centipawns to match winrates (Nov 2022). Meanwhile, Lichess keeps a separate winrate formula because it's based on player performance in rapid time controls whereas Stockfish is giving you the "winrate of the position itself" so to speak.

Nobody is relying on centipawns as a reference point anymore.

@OverLordGoldDragon
Contributor Author

OverLordGoldDragon commented Aug 30, 2023

Thanks for the highly detailed response!

So TL;DR: Lichess graphs centipawns, but labels blunders-etc via winrate, and centipawns != winrate. Stockfish outputs winrate, and your code converts it back to centipawns. Correct?

Also, I don't know whether "winrate" is useful at all, if it's based on, as you suggested earlier, what the engine thinks its odds of winning the game are. To a perfect player, the graph only ever takes on three values (1, -1, 0), and the current engines are around 90% of the way there, if not 99% with enough time. This isn't an issue early on since they can't yet quickly see "mate in 64", but with enough pieces off the board that's often enough what my graphs look like.

It'd be more informative to output "human winrate", e.g. if I lose an unimportant pawn in opening, a perfect player's eval is 1, but mine is still around 0. Hence it'd be also tailored to rating and time control. You do say "Lichess calculates winrate based on actual player performance during rapid time controls", so is this actually what's happening? I guess, minus the rating part, instead they calibrate it to ~2800 I suppose.

we have to trace the following

Geez, you searched and typed up all that? I mean, not complaining!

PS, Lc0 Discord often discusses interesting stuff, in case you're not there already.

@OverLordGoldDragon
Contributor Author

Also I take it your labeling code is unaffected by changing the graph code?

@yuzisee

yuzisee commented Aug 30, 2023

To a perfect player, the graph only ever takes on three values (1, -1, 0)

The same happens in centipawns though. At high depth, the graph will only ever take on three values: (+200 a.k.a. Mate for white, −200 a.k.a. Mate for black, 0.0.)

The problem isn't centipawns vs winrate, it's that the depth of your engine is too high.

It'd be more informative to output "human winrate"

Lc0 specifically added this feature just last month! You give it your Elo and it calculates a "human winrate" at that level for any/each position.

https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/contempt-implementation/

When I get back home maybe I'll see about a pull request for Nibbler to support this new setting.

Lc0 Discord often discusses interesting stuff

Oh neat, I'll check it out. Thanks!

Your labeling code is unaffected by changing the graph code?

Yes. They are separate and don't rely on each other.

@Naphthalin

Just saw that you referenced LeelaChessZero/lc0#1791 and wanted to add my two cents on eval definition/normalization and win rate, despite the issue already being closed.

I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games, and requires some translation to be of use, which is mostly what Lc0 PR1791 is about. After working on that quite some time, I came to the conclusion that what is now the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.

I'd personally strongly prefer Nibbler to show eval (SF centipawn or Lc0 WDL_mu) on a [-2.5, +2.5] y-axis with dashed lines marking the +1 and -1 boundaries for 50% W/L over displaying the expected score; maybe I can convince @rooklift why that would be a good idea :)

@OverLordGoldDragon
Contributor Author

Your and yuzisee's comments are excellent Wiki material and shouldn't just get lost in a closed issue. Do you have a summary somewhere that explains your approach?

@rooklift Consider enabling the repository Wiki, where you can say "Not verified by core maintainers" to absolve review responsibility. I still don't know what some of the buttons do; maybe someone will volunteer to explain.

@Naphthalin

A summary for the approach taken in Lc0 can be found in the linked PR1791, with original parametrization choice, explanation of the WDL_mu eval definition and comparison of pawn-odds startpos between SF and Lc0 probably being the most useful comments.

@yuzisee

yuzisee commented Sep 12, 2023

The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference since right now Nibbler doesn't care what network you use as long as you provide one.

I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.

@yuzisee

yuzisee commented Sep 12, 2023

I personally don't think that engine WDL (and the derived expected score) is particularly useful for analyzing human games.... the WDL_mu score type is the most (or possibly only) natural way of assigning a single number to a position, and the normalized SF centipawn eval does a very good job at approximating that up to +2.5 or so, at which point evals become irrelevant. Within this range, the main problem isn't that centipawn evals are a bad reference, it's that engines are too strong.

And since we have @Naphthalin here, I'll ask a question:

@Naphthalin

@yuzisee

The trouble with defaulting to WDL_mu in Nibbler is making sure the user gets the correct WDLDrawRateReference since right now Nibbler doesn't care what network you use as long as you provide one.

Not sure where you took that from; if the user is using defaults (and not setting any contempt related parameters), there won't be any WDL adjustment. However, the beauty of the WDL_mu eval is that it is almost completely invariant under the applied WDL transformation (and to some extent, even network selection!), unlike any other score type like Q or centipawn. Feel free to test various values of WDLCalibrationElo with either Contempt: "0" and/or WDLEvalObjectivity: 1.0 to see how it affects WDL_mu and Q evals :)

I can't speak for @rooklift here, but I presume the idea that a new person can download Nibbler and it "works out of the box" is appealing.

Seems like his opinion is similar, given that #244 makes my proposed "display centipawn from -2.5 to +2.5 eval, with dashed lines depicting +1.0 and -1.0" the default.

Blunder/Mistake/Inaccuracy checks by convention still rely on some sort of WDL threshold rather than a centipawn threshold. Given this issue of "engines are too strong, WDL needs translation or it will be too extreme", would you like to suggest a path forward for #237?

Due to Nibbler being inherently a GUI for Lc0, this question is very closely related to how the --temp-value-cutoff parameter is applied in Lc0, which currently also is based on deltaQ. However, my WDL adjustment/rescale/contempt implementation from the mentioned PR1791 is based on the central assumption that there is some scale where the numerical size of inaccuracies isn't depending on the evaluation (contrary to centipawn or winrate, which is more resp. less sensitive in decisive positions), and the evaluation on this scale is exactly what the WDL_mu score estimates. I didn't do any of the actual data science groundwork and check how well that translates to human games, but I presume that judging inaccuracies on the WDL_mu scale should work significantly better than both (native engine) winrate and (old A/B engine) centipawn. Since the introduction of normalizing evals to +1.00 = 50% W however, the SF centipawn is doing a very good job at approximating WDL_mu (only has some minor issue due to game progress dependence) up to +2.5 or so, from where things become increasingly meaningless.

To throw around some numbers in the WDL_mu scale: Since +0.25 is roughly the initial white advantage (half a tempo?) and 1.0 the difference between equal and 50% W, it might make sense to start with thresholds of 0.25, 0.5 and 1.0, basically meaning "giving up the initiative" (inaccuracy), "handing over the initiative to the opponent" (mistake) and "enough to change the most likely outcome" (blunder), with treating evals above +2.5 as virtually the same.

@rooklift
Owner

Although I like the sort of "critical strip" in the new graph, the clamping of it to the [-2.5, 2.5] centipawn range is a bit counterintuitive...

@Naphthalin

What about a simple alternative like displaying the [-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] into the [+2.5, +3.0] range, and clamping above +5.0?
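That mapping could be sketched like this (a hypothetical helper, assuming linear compression of the outer band; the function name is mine):

```javascript
// Piecewise y-axis mapping: pass-through inside ±2.5,
// squeeze ±[2.5, 5.0] into ±[2.5, 3.0], clamp beyond ±5.0.
function compressEval(ev) {
  const s = Math.sign(ev);
  const x = Math.abs(ev);
  if (x <= 2.5) return ev;
  if (x <= 5.0) return s * (2.5 + (x - 2.5) * 0.2); // 2.5-wide band → 0.5-wide
  return s * 3.0;
}

console.log(compressEval(1.3));  // 1.3 (unchanged)
console.log(compressEval(5.0));  // 3 (end of the compressed band)
console.log(compressEval(-10));  // -3 (clamped)
```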

@OverLordGoldDragon
Contributor Author

[-2.5, +2.5] range as it is now, compressing the [+2.5, +5.0] into the [+2.5, +3.0] range

I'd prefer whatever makes the interpretation approximately linear, minus the clamping part (and maybe that's what you also intend).

@Naphthalin

The [-2.5,+2.5] is meant that way: approximately linear inside, and basically meaningless outside. The main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.

@OverLordGoldDragon
Contributor Author

Then certainly there should be a conversion first (to a common eval)? If the method can't check which engine is running, then the best solution is to make it. If that's a problem, I guess just document it.

@yuzisee

yuzisee commented Sep 13, 2023

The main question for me is how to accommodate the fact that SF/Lc0 show evals like +5 or +10, but a +4 from one of them could be a +20 from the other.

It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine

@OverLordGoldDragon
Contributor Author

Isn't that just "doing evaluation"?

@Naphthalin

It's actually worse than that: a "+4" from low depth/low nodes could be "+20" from high depth/high nodes with the same engine

Hence the notion of "evals above +2.5 are imaginary numbers anyway". This kind of (self-)disagreement only really seems to happen above that value.

@yuzisee

yuzisee commented Sep 15, 2023

(comment deleted, centipawn analysis moved to #242 (comment) and inaccuracy/mistake/blunder discussion moved to #237 (comment))

@OverLordGoldDragon
Contributor Author

This kind of (self-)disagreement only really seems to happen above that value

What do you mean by self-disagreement? Doesn't more evaluated positions just mean updated evaluation (and presumably more accurate)?

@OverLordGoldDragon
Contributor Author

OverLordGoldDragon commented Sep 15, 2023

The original topic's been answered; now the content's being mixed, which isn't great for future readers. I think a separate issue is best - if this makes sense, I've opened #245, consider continuing there.

It'd also be good if @yuzisee moved their latest comment to there (and deleted it here)
