
BUG - Zero Inaccuracies evaluation, despite blunder

To be clear, let us assume we are discussing the current implementation of this feature, rather than potential future versions of it.

It might help the discussion if you can clarify your definition of "inaccuracy", @Toadofsky or others. I interpret it to mean that a lesser move was chosen when a [possibly significantly] better move was available. I may be fundamentally misinterpreting what is meant by that term, as I'm extremely surprised by the perspective shared by @qkxwsm. In my view, the ten-point drop should stand out as a prominent example of a "mistake" or "inaccuracy." Just because Black still has the advantage doesn't mean this move was accurate.

I used the phrase "blundered my queen" in the common, casual vernacular, which perhaps does not strictly adhere to the engine's technical definition (which, for instance, reserves the term "blunder" for going from winning to losing). We can agree that, in that sense, it is not a true blunder, but that's a distraction from the heart of the matter...

The true issue is that the Computer Analysis feature and its inaccuracies/mistakes/blunders tally completely ignore my bad move, as though I had played correctly! Only when double-checking this suspicious tally with the local engine is the bad move revealed.

Put differently, if it is not deemed an "inaccuracy", the play would then be considered "accurate"?!

It's not quite that simple...

Years ago I recommended to staff (and submitted a code patch because I felt so strongly about it) that more inaccuracies be counted. The counterpoint I failed to consider is that many games exist which lack stored analyses (for better moves) for those positions. Well, it's not quite that simple either: I had considered that possibility and didn't see a problem, although having "inaccuracies" without supporting data I guess adds some complexity.

Anyway... definitions, per: github.com/ornicar/lila/blob/9e5311d72225925fa31d4fabf342299768b24121/modules/analyse/src/main/Advice.scala#L63
github.com/ornicar/lila/issues/1494
github.com/ornicar/lila/pull/5337

Winning odds: 1 / ( 1 + e^( - eval / 2 ) ) with eval in pawns, approximating mate=100
"Inaccuracy": reduction in win% (winning odds) by 10%
"Mistake": reduction in win% by 20%
"Blunder": reduction in win% by 30%
Thank you very much @Toadofsky for the very detailed and quick reply. I'll get back to you when I have a chance to absorb it more fully.
@Toadofsky

I've taken some time to absorb how it works, drawing on my computer science background. The GitHub links are quite helpful and get right to the core of it, which is awesome. Note that I am not directly familiar with the internal architecture of Lichess, or even with Scala, but this is still enough to continue the discussion. I have great respect for the Lichess developers and don't want to sound like a Monday morning quarterback.

I am frankly very surprised that this is how inaccuracies, mistakes, and blunders are determined -- that it is based on the change in winning percentage. My initial impression is that this is an incorrect and flawed premise. However, I admit in advance that I may be ignorant of essential architectural context.

Simply put, the determination should be made relative to *the other moves available*. Right now it seems that these other, potentially much better, moves are not even considered in the "judgment" determination; the winning-percentage delta is the only metric.
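
Here is a minimal sketch of the alternative I have in mind, judging the played move against the best move the engine found rather than against the previous evaluation alone (the thresholds are purely illustrative, not anything Lichess uses):

def judge_relative(best_eval, played_eval):
    # Both evaluations in pawns, from the mover's point of view.
    # The move is graded by how much worse it is than the best available move,
    # regardless of whether the mover is still winning afterwards.
    loss = best_eval - played_eval
    if loss >= 3.0:
        return "Blunder"
    if loss >= 1.0:
        return "Mistake"
    if loss >= 0.3:
        return "Inaccuracy"
    return None

Under a rule like this, a move that gives away most of a decisive advantage is flagged even though the winning percentage barely moves.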

That winning-percentage-only metric explains, and is consistent with, the game scenario in this post. The analysis never notices the other moves that preserve the larger advantage. Instead it determines and declares, "you're still winning, nothing to see here, move along".

For illustration, revisiting the top 5 engine moves for Black's move 24:

24...Qxd2 == -16.5
24...Qe2 == -11
24...Qc5 == -10.5
24...Re1 == -10.5
24...Qh4 == -10.4
...excluded...
24...Re2 == -7.7

Those top engine moves are completely ignored by the "judgment" source code that was shared. That code does not consider whether better moves were available than the one actually played.

In summary, the Inaccuracies/Mistakes/Blunders tally is not based on whether you made the best moves -- only on how your moves affected your chances of winning! This is evidently the essence of the matter. The entire point of requesting a computer analysis is to illuminate whether better moves were available.
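
To put numbers on it, here is a quick check using the winning-odds formula quoted earlier and two of the evaluations listed above (a sketch; values are rounded):

import math

def winning_chances(eval_pawns):
    return 1 / (1 + math.exp(-eval_pawns / 2))

# Evaluations from the list above, sign-flipped to Black's point of view.
best = winning_chances(16.5)       # 24...Qxd2 -> roughly 0.9997
weakest = winning_chances(7.7)     # 24...Re2  -> roughly 0.979

print(round(best - weakest, 3))    # roughly 0.021, i.e. about a 2% drop

Even the weakest move in that list costs only about 2% of winning chances, nowhere near the 10% inaccuracy threshold, which is exactly why the tally stays at zero.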

Also, @Toadofsky I agree in principle with your suggestion that more inaccuracies should be included. I think the field "inaccuracy" is intended to reflect imprecision and missed opportunities, but the 10% delta threshold will hide many such instances of suboptimal moves.

Not making the best moves is precisely what affects your chances of winning!

If you're up a queen but miss a mate in 20, do you want that to be called out as a blunder?

If you're up a rook but miss a move that would lead to being a queen up through a complicated sequence of moves (one that just happens to be short enough for the engine to see), do you want that to be called out as a blunder?

In completely winning positions, there are many ways to win. The only things that matter are making some kind of progress and not blundering too much of it away. For humans, the simplest win tends to be the best. The engine will prefer the fastest win, and it's a lot better than us at finding fast ways to win even if they involve walking on unnecessary tightropes.
Exactly, #16. Even the best players do not go calculating a complicated line that could be winning by force; the engine does. The graph is there to point out tactics you might have missed but should be able to identify quickly to gain winning chances (e.g. 3 moves deep).
Indeed, there's almost a duality (a change in evaluation indicates that an inferior move was played), although evaluations are highly volatile, and somewhat less volatile when transformed via the sigmoid 1 / ( 1 + e^( - x / 2 ) ).

#15 describes an XY Problem http://xyproblem.info/ -- that's a great solution, but it doesn't match the problem, as #16 points out. The actual problem is just that the threshold for an inaccuracy needs to decrease, not that there's some fundamental flaw in my methodology. In theory, evaluation volatility might be a flaw, but it can be mitigated by increasing the time the engine spends analyzing the position until a mate score (or mate impossible against best play) is determined, or until evaluations become less volatile, whichever comes first.
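
As a rough illustration of that mitigation, here is a sketch using the python-chess library and a local UCI engine (the engine path is an assumption) that keeps deepening the search until a mate score appears or successive evaluations agree within a tolerance:

import chess
import chess.engine

def settled_eval(board, engine, max_depth=30, step=4, tolerance_cp=30):
    # Deepen the search until a forced mate is found or the evaluation stops
    # swinging by more than tolerance_cp centipawns between iterations.
    previous = None
    score = None
    for depth in range(10, max_depth + 1, step):
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        score = info["score"].pov(board.turn)
        if score.is_mate():
            return score                 # mate score found: no volatility left
        if previous is not None and abs(score.score() - previous) <= tolerance_cp:
            return score                 # evaluation has settled
        previous = score.score()
    return score                         # best effort at max_depth

# Example usage (the engine binary path is an assumption):
# with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
#     print(settled_eval(chess.Board(), engine))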

Also, if you read the server-side analysis code (which decides which variations to store in the PGN - I forget where to find this code), your suggestion is already implemented, but it amounts to the same thing. If I remember correctly (and perhaps I don't -- this code has many authors), moves that would be flagged "inaccurate" due to evaluation swings are discarded when there isn't a stored variation indicating the better move.
@Toadofsky I was curious whether, across many games, volatility shows some dependence on game ply or turn number: little volatility in the opening, then an increase, perhaps a peak before the endgame, or maybe no trend at all. Your text seems to indicate that volatility goes down eventually, although there might be trade-offs.

The total amount of material still on the board, and the various mate-evaluation contributions as the game phase changes, would presumably have some effect on volatility, if I understand the definition of it. These are just hypotheses.

By the swings you mention, do you mean that, for a given fixed number of variations (move lines) considered and evaluated, the variability between the different lines changes? (Bad luck that "variation" has a double meaning in chess versus statistics.)

Are there references to studies or statistics of evaluations across variation width (number of lines), and across game phases, for large chess databases, or for any database big enough for those statistics to make sense? You mention "in theory", which seems to indicate some foundation, or is it practical knowledge you've accumulated?
@dboing My practical knowledge here mainly consists of:
1. Research referenced in github.com/ornicar/lila/issues/1494#issue-127725150
2. The Stockfish Evaluation Guide hxim.github.io/Stockfish-Evaluation-Guide/
3. The sum of my experience analyzing games with various engines, plus some focused research I did years ago when first developing this feature, plus whatever knowledge I've gained from watching PRs and talking with developers about them, such as github.com/ornicar/lila/pull/5337. Everything I've seen favors a "winning chances" based approach to annotating game scores, although I'm glad if contrary evidence can be discovered. I'll even give everyone a head-start: here's a free game annotator which does all the hard work of parsing PGN and communicating with any UCI engine... go forth and experiment!
github.com/rpdelaney/python-chess-annotator
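
For anyone who would rather start from scratch, here is a bare-bones sketch of the same idea using the python-chess library directly (the engine path and depth are assumptions; the thresholds follow the definitions quoted earlier in the thread):

import math
import chess.pgn
import chess.engine

def winning_chances(cp):
    # Lichess-style winning odds from a centipawn score, mate approximated as 100 pawns.
    return 1 / (1 + math.exp(-(cp / 100.0) / 2))

def annotate(pgn_path, engine_path="stockfish", depth=18):
    with open(pgn_path) as pgn:
        game = chess.pgn.read_game(pgn)
    board = game.board()
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        for move in game.mainline_moves():
            mover = board.turn
            move_number = board.fullmove_number
            before = engine.analyse(board, chess.engine.Limit(depth=depth))
            cp_before = before["score"].pov(mover).score(mate_score=10000)
            board.push(move)
            after = engine.analyse(board, chess.engine.Limit(depth=depth))
            cp_after = after["score"].pov(mover).score(mate_score=10000)
            drop = winning_chances(cp_before) - winning_chances(cp_after)
            if drop >= 0.30:
                print(move_number, move.uci(), "Blunder")
            elif drop >= 0.20:
                print(move_number, move.uci(), "Mistake")
            elif drop >= 0.10:
                print(move_number, move.uci(), "Inaccuracy")

Each position is analysed twice here for clarity; a real tool would cache the evaluation between iterations.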

I suggest that the sigmoid function mitigates volatility, and you ask a good question about whether some game phases might be more volatile than others. I can make some assumptions:
* Openings may be volatile because the Official-Stockfish team always uses opening books and so the opening phase goes untested.
* Endgame evaluations done without 7-man tablebases are known to be volatile, and the remedy is to use tablebases (even if currently there are more than 7 pieces on the board, if ends of variations hit the tablebase, evaluations are not volatile).
* Positions which Stockfish is unlikely to encounter in self-play (positions from human-human games) may be volatile since Stockfish tests and trains against itself.
* Positions referenced in github.com/official-stockfish/Stockfish/issues where there are known bugs may be volatile.

Regarding total material on the board, etc., you're probably referring to SF's "scale factor", which multiplies endgame scores:
hxim.github.io/Stockfish-Evaluation-Guide/
