[BUG] - Invalid snapshot DiskSnapshot, Replaying ledger from genesis #5908

Open
gnomed opened this issue Jul 16, 2024 · 9 comments
Labels
needs triage, Stale

Comments

gnomed commented Jul 16, 2024

Internal/External
External

Area
Governance Related to CIP1694 implementation
Other

Summary
I am trying to test 9.0.0 on the preview network. A clean deployment of 9.0.0 from version 8.12.2 produces the errors below and causes the ledger to be replayed from genesis.

[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:32:58.89 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433676, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017483 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:33:28.45 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433086, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017841 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
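For illustration only (a minimal sketch using the cborg package, not cardano-node source code): this is the kind of failure a fixed-width CBOR record decoder produces when the snapshot on disk was written with a different number of fields than the decoder expects, as in the "Expected 4, but found 5" trace above. All names below are made up.

```haskell
-- Minimal sketch (not cardano-node code) of a fixed-width CBOR record decoder
-- failing when the on-disk record was written with a different field count.
import qualified Codec.CBOR.Decoding as D
import qualified Codec.CBOR.Encoding as E
import qualified Codec.CBOR.Read     as R
import qualified Codec.CBOR.Write    as W

-- Writer side: a hypothetical snapshot record with 5 fields.
encodeRecordV5 :: E.Encoding
encodeRecordV5 =
  E.encodeListLen 5
    <> E.encodeWord 1 <> E.encodeWord 2 <> E.encodeWord 3
    <> E.encodeWord 4 <> E.encodeWord 5

-- Reader side: a decoder that insists on exactly 4 fields.
decodeRecordV4 :: D.Decoder s (Word, Word, Word, Word)
decodeRecordV4 = do
  D.decodeListLenOf 4   -- the stream says 5, so this raises a DeserialiseFailure
  (,,,) <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord <*> D.decodeWord

main :: IO ()
main =
  -- Prints a Left DeserialiseFailure, analogous to the
  -- "Expected 4, but found 5" message in the node logs.
  print (R.deserialiseFromBytes decodeRecordV4 (W.toLazyByteString encodeRecordV5))
```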

Steps to reproduce
Steps to reproduce the behavior:

  1. Run a relay with cardano-node version 8.12.2
  2. Allow it to sync and run stably for some time
  3. Gracefully shut down the 8.12.2 node
  4. Launch the 9.0.0 version of Cardano Node
  5. Read the logs and note whether the ledger is replayed from genesis

I also tested this on the preprod network with the same results. Before trying on preprod, I verified that 8.12.2 was able to restart without any issues at all. The issue seems specific to the upgrade.

Expected behavior
I expect the node to detect the clean shutdown and resume from where the ledger left off, as it does when I restart the 8.12.2 version.

System info (please complete the following information):

  • OS Name: Ubuntu
  • OS Version: 22.04
  • Node version: 9.0.0 (upgraded from 8.12.2)
  • Running the official cardano-node docker image

Additional context
I did change a config value to disable PeerSharing, but I would not expect that to cause this.
This is my first bug report, so apologies if I missed something.

gnomed added the needs triage label on Jul 16, 2024
jasagredo (Contributor) commented

I suspect this behaviour is expected. Oftentimes when upgrading the node, the on-disk format of snapshots changes, thus requiring a replay from genesis. I will let @IntersectMBO/cardano-ledger confirm whether it is expected here, though.

DanTup commented Jul 26, 2024

I got this too. If it is intended, it would be nice if it were handled better, with a more suitable message, so it's clearer whether this is a bug: for example, by storing a version number and reporting that the version has changed and the ledger must be replayed.

Relying on the read failing because of a size difference also feels odd: what if the format changes but the size does not? Would it blindly read potentially incorrect data? 🤔
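A rough sketch of what that could look like (purely illustrative, using the cborg package; none of these names exist in cardano-node): write an explicit version word in front of the snapshot payload and check it before decoding, so a mismatch is reported as a format change rather than as a generic deserialisation failure, and same-sized but reinterpreted fields cannot be read silently.

```haskell
-- Hypothetical version-tagged snapshot codec (illustrative only; these names
-- are not cardano-node APIs).
import qualified Codec.CBOR.Decoding as D
import qualified Codec.CBOR.Encoding as E
import Control.Monad (unless)

currentSnapshotVersion :: Word
currentSnapshotVersion = 2

-- Writer: prefix the payload with an explicit format version.
encodeVersioned :: E.Encoding -> E.Encoding
encodeVersioned payload =
  E.encodeListLen 2 <> E.encodeWord currentSnapshotVersion <> payload

-- Reader: check the version first and turn a mismatch into a clear message,
-- instead of letting the payload decoder fail (or silently misread) later.
decodeVersioned :: D.Decoder s a -> D.Decoder s a
decodeVersioned decodePayload = do
  D.decodeListLenOf 2
  version <- D.decodeWord
  unless (version == currentSnapshotVersion) $
    fail $ "snapshot was written with format version " <> show version
        <> ", but this node reads version " <> show currentSnapshotVersion
        <> "; a replay from an older ledger state is expected"
  decodePayload
```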

gnomed (Author) commented Jul 26, 2024

In some sense the ship has already sailed, so perhaps the point is moot, but it would be nice to see this type of functionality extended in a more backward-compatible way.

For example, the node could fall back to the old format if it encounters an error during reading, but write the new format going forward; the backward compatibility would only need to be kept around for the next two minor versions.

I might be speaking nonsense since I have little knowledge of the underlying implementation logic, in which case I apologize.
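A sketch of the fallback idea described above, under illustrative types (these are stand-ins, not actual cardano-node code or ledger snapshot types): try the current decoder first and, if it fails, re-read the bytes with the previous release's decoder and migrate the result.

```haskell
-- Hypothetical fallback reader (illustrative only).
import qualified Codec.CBOR.Decoding  as D
import qualified Codec.CBOR.Read      as R
import qualified Data.ByteString.Lazy as LBS

data SnapshotOld = SnapshotOld Word Word Word Word Word  -- 5 fields, as the old release wrote
data SnapshotNew = SnapshotNew Word Word Word Word       -- 4 fields, as the new release reads

decodeOld :: D.Decoder s SnapshotOld
decodeOld = do
  D.decodeListLenOf 5
  SnapshotOld <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord
              <*> D.decodeWord <*> D.decodeWord

decodeNew :: D.Decoder s SnapshotNew
decodeNew = do
  D.decodeListLenOf 4
  SnapshotNew <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord <*> D.decodeWord

-- The dropped field is simply discarded when migrating an old snapshot.
migrateOldToNew :: SnapshotOld -> SnapshotNew
migrateOldToNew (SnapshotOld a b c d _e) = SnapshotNew a b c d

-- Try the current format first; on failure, fall back to the previous format
-- and migrate, so the next snapshot written uses the new format.
readSnapshot :: LBS.ByteString -> Either R.DeserialiseFailure SnapshotNew
readSnapshot bytes =
  case R.deserialiseFromBytes decodeNew bytes of
    Right (_, s) -> Right s
    Left  _      -> migrateOldToNew . snd <$> R.deserialiseFromBytes decodeOld bytes
```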

DanTup commented Jul 26, 2024

In some sense the ship has already sailed

It has for the current version, but surely the next time the data format changes, a version number could be added, and then subsequent versions could check it?

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward.

I know nothing about how this works, but my guess is that reading the old format would involve keeping around two sets of code/data structures that they currently do not have, and the old data would need to be converted into the new format as it's read at startup anyway (so the node can operate with the new structures). I also know nothing about this though 😄

I'm not against rebuilding the data now and then; it would just be nice if it were clear from the logs whether this is expected or whether my store got corrupted (in which case I might have a problem I want to investigate!).

disassembler (Contributor) commented

Two things:

  1. This is not expected from 8.12.2 -> 9.0/9.1; there were no ledger changes requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quickly enough.
  2. Yes, we've been talking about a version number for the ledger state for a while now. It's in the backlog, but it isn't quite as simple as just adding a version number; it requires some communication between ledger and consensus to negotiate the version and feed it back to the user in a trace.

disassembler (Contributor) commented

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward. Would only need to keep the backward compatibility around for the next 2 minor versions.

The new version of the node doesn't know the old format; that serialization lives in a previous version of the code. It can only tell that the serialization is broken, not why it's broken. I do agree that a better error message based on a version number would make this nicer, though.

DanTup commented Jul 27, 2024

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

Thanks, I was worried that would be the answer. I'm running with microk8s and I shut the nodes down by scaling to 0 instances. My understanding was that this would gracefully shut down the container (it's using the official Docker image).

What's odd, though, is that of all the times I recall seeing it do this, it's always been when I upgrade to a new version of the node. I don't recall ever seeing it when I do the same thing for k8s config changes or when rebooting the host for OS updates 🤔

Is it possible the error could include more information (for example, what file it was reading and what data it read) to help understand what might be wrong when it occurs?

gnomed (Author) commented Jul 28, 2024

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

I strongly disagree with this suggestion and statement. I use the same rollout process for all node changes, and that process has worked correctly for other updates I have made in the weeks prior to and since this bug report (the upgrade from 9.0 to 9.1 was seamless on preview and preprod). Also, the error message for a dirty shutdown is completely different from this snapshot error message. The only variable that reliably triggers this behaviour is going from 8.x to 9.x.

new version of the node doesn't know the old format

Yes, I was suggesting keeping both implementations in the same version for a couple of transitional releases as only one possible strategy (out of many) for backward compatibility, inspired slightly by the combinator approach used by cardano-node itself for its own protocol migration.

github-actions bot commented Aug 28, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions bot added the Stale label on Aug 28, 2024