[BUG] - Invalid snapshot DiskSnapshot, Replaying ledger from genesis #5908

Open
gnomed opened this issue Jul 16, 2024 · 9 comments
Labels
needs triage, Stale

Comments

gnomed commented Jul 16, 2024

Internal/External
External

Area
Governance Related to CIP1694 implementation
Other

Summary
I am trying to test 9.0.0 on the preview network. A clean deployment of 9.0.0 from version 8.12.2 produces the errors below and causes the ledger to be replayed from genesis.

[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:32:58.89 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433676, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017483 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:33:28.45 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433086, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017841 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
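For illustration only (a minimal sketch using the cborg package, not cardano-node source code): this is the kind of failure a fixed-width CBOR record decoder produces when the snapshot on disk was written with a different number of fields than the decoder expects, as in the "Expected 4, but found 5" trace above. All names below are made up.

```haskell
-- Minimal sketch (not cardano-node code) of a fixed-width CBOR record decoder
-- failing when the on-disk record was written with a different field count.
import qualified Codec.CBOR.Decoding as D
import qualified Codec.CBOR.Encoding as E
import qualified Codec.CBOR.Read     as R
import qualified Codec.CBOR.Write    as W

-- Writer side: a hypothetical snapshot record with 5 fields.
encodeRecordV5 :: E.Encoding
encodeRecordV5 =
  E.encodeListLen 5
    <> E.encodeWord 1 <> E.encodeWord 2 <> E.encodeWord 3
    <> E.encodeWord 4 <> E.encodeWord 5

-- Reader side: a decoder that insists on exactly 4 fields.
decodeRecordV4 :: D.Decoder s (Word, Word, Word, Word)
decodeRecordV4 = do
  D.decodeListLenOf 4   -- the stream says 5, so this raises a DeserialiseFailure
  (,,,) <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord <*> D.decodeWord

main :: IO ()
main =
  -- Prints a Left DeserialiseFailure, analogous to the
  -- "Expected 4, but found 5" message in the node logs.
  print (R.deserialiseFromBytes decodeRecordV4 (W.toLazyByteString encodeRecordV5))
```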

Steps to reproduce
Steps to reproduce the behavior:

  1. Run a relay with cardano-node version 8.12.2
  2. Allow it to sync and run stably for some time
  3. Gracefully shut down the 8.12.2 node
  4. Launch the 9.0.0 version of Cardano Node
  5. Read the logs and note whether the ledger is replayed from genesis

I also tested this on the preprod network with the same results. Before trying on preprod, I verified that 8.12.2 was able to restart without any issues at all. The issue seems specific to the upgrade.

Expected behavior
I expect the node to detect the clean shutdown and resume from where the ledger left off, as it does when I restart the 8.12.2 version.

System info (please complete the following information):

  • OS Name: Ubuntu
  • OS Version: 22.04
  • Node version: 9.0.0 (upgraded from 8.12.2)
  • Running the official cardano-node docker image

Additional context
I did change a config value to disable PeerSharing, but I would not expect that to cause this.
This is my first bug report, so apologies if I missed something.

gnomed added the needs triage label on Jul 16, 2024
jasagredo (Contributor) commented

I suspect this behaviour is expected. Oftentimes when upgrading the node, the on-disk format of snapshots changes, thus requiring a replay from genesis. I will let @IntersectMBO/cardano-ledger confirm whether it is expected here, though.

DanTup commented Jul 26, 2024

I got this too. If it is intended, it would be nice if it were handled better, with a more suitable message, so it's clearer whether this is a bug: for example, by storing a version number and reporting that the version has changed and the ledger must be replayed.

Relying on the read failing because of a size difference also feels odd: what if the format changes but the size does not? Would it blindly read potentially incorrect data? 🤔
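A rough sketch of what that could look like (purely illustrative, using the cborg package; none of these names exist in cardano-node): write an explicit version word in front of the snapshot payload and check it before decoding, so a mismatch is reported as a format change rather than as a generic deserialisation failure, and same-sized but reinterpreted fields cannot be read silently.

```haskell
-- Hypothetical version-tagged snapshot codec (illustrative only; these names
-- are not cardano-node APIs).
import qualified Codec.CBOR.Decoding as D
import qualified Codec.CBOR.Encoding as E
import Control.Monad (unless)

currentSnapshotVersion :: Word
currentSnapshotVersion = 2

-- Writer: prefix the payload with an explicit format version.
encodeVersioned :: E.Encoding -> E.Encoding
encodeVersioned payload =
  E.encodeListLen 2 <> E.encodeWord currentSnapshotVersion <> payload

-- Reader: check the version first and turn a mismatch into a clear message,
-- instead of letting the payload decoder fail (or silently misread) later.
decodeVersioned :: D.Decoder s a -> D.Decoder s a
decodeVersioned decodePayload = do
  D.decodeListLenOf 2
  version <- D.decodeWord
  unless (version == currentSnapshotVersion) $
    fail $ "snapshot was written with format version " <> show version
        <> ", but this node reads version " <> show currentSnapshotVersion
        <> "; a replay from an older ledger state is expected"
  decodePayload
```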

gnomed (Author) commented Jul 26, 2024

In some sense the ship has already sailed, so perhaps the point is moot, but it would be nice to see this type of functionality extended in a more backward-compatible way.

For example, the node could fall back to the old format if it encounters an error during reading, but write the new format going forward; the backward compatibility would only need to be kept around for the next two minor versions.

I might be speaking nonsense since I have little knowledge of the underlying implementation logic, in which case I apologize.
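A sketch of the fallback idea described above, under illustrative types (these are stand-ins, not actual cardano-node code or ledger snapshot types): try the current decoder first and, if it fails, re-read the bytes with the previous release's decoder and migrate the result.

```haskell
-- Hypothetical fallback reader (illustrative only).
import qualified Codec.CBOR.Decoding  as D
import qualified Codec.CBOR.Read      as R
import qualified Data.ByteString.Lazy as LBS

data SnapshotOld = SnapshotOld Word Word Word Word Word  -- 5 fields, as the old release wrote
data SnapshotNew = SnapshotNew Word Word Word Word       -- 4 fields, as the new release reads

decodeOld :: D.Decoder s SnapshotOld
decodeOld = do
  D.decodeListLenOf 5
  SnapshotOld <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord
              <*> D.decodeWord <*> D.decodeWord

decodeNew :: D.Decoder s SnapshotNew
decodeNew = do
  D.decodeListLenOf 4
  SnapshotNew <$> D.decodeWord <*> D.decodeWord <*> D.decodeWord <*> D.decodeWord

-- The dropped field is simply discarded when migrating an old snapshot.
migrateOldToNew :: SnapshotOld -> SnapshotNew
migrateOldToNew (SnapshotOld a b c d _e) = SnapshotNew a b c d

-- Try the current format first; on failure, fall back to the previous format
-- and migrate, so the next snapshot written uses the new format.
readSnapshot :: LBS.ByteString -> Either R.DeserialiseFailure SnapshotNew
readSnapshot bytes =
  case R.deserialiseFromBytes decodeNew bytes of
    Right (_, s) -> Right s
    Left  _      -> migrateOldToNew . snd <$> R.deserialiseFromBytes decodeOld bytes
```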

DanTup commented Jul 26, 2024

In some sense the ship has already sailed

It has for the current version, but surely the next time the data format changes, a version number could be added, and then subsequent versions could check it?

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward.

I know nothing about how this works, but my guess is that reading the old format would involve keeping around two sets of code/data structures that they currently do not have, and the old data would need to be converted into the new format as it's read at startup anyway (so the node can operate with the new structures). I also know nothing about this though 😄

I'm not against rebuilding the data now and then; it would just be nice if it were clear from the logs whether this is expected or whether my store got corrupted (in which case I might have a problem I want to investigate!).

disassembler (Contributor) commented

Two things:

  1. This is not expected from 8.12.2 -> 9.0/9.1; there were no ledger changes requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quickly enough.
  2. Yes, we've been talking about a version number for the ledger state for a while now. It's in the backlog, but it isn't quite as simple as just adding a version number; it requires some communication between ledger and consensus to negotiate the version and feed it back to the user in a trace.

disassembler (Contributor) commented

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward. Would only need to keep the backward compatibility around for the next 2 minor versions.

The new version of the node doesn't know the old format; that serialization lives in a previous version of the code. It can only tell that the serialization is broken, not why it's broken. I do agree that a better error message based on a version number would make this nicer, though.

DanTup commented Jul 27, 2024

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

Thanks, I was worried that would be the answer. I'm running with microk8s and I shut the nodes down by scaling to 0 instances. My understanding was that this would gracefully shut down the container (it's using the official Docker image).

What's odd, though, is that of all the times I recall seeing it do this, it's always been when I upgrade to a new version of the node. I don't recall ever seeing it when I do the same thing for k8s config changes or when rebooting the host for OS updates 🤔

Is it possible the error could include more information (for example, what file it was reading and what data it read) to help understand what might be wrong when it occurs?

gnomed (Author) commented Jul 28, 2024

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

I strongly disagree with this suggestion and statement. I use the same rollout process for all node changes, and that process has worked correctly for other updates I have made in the weeks prior to and since this bug report (the upgrade from 9.0 to 9.1 was seamless on preview and preprod). Also, the error message for a dirty shutdown is completely different from this snapshot error message. The only variable that reliably triggers this behaviour is going from 8.x to 9.x.

new version of the node doesn't know the old format

Yes, I was suggesting keeping both implementations in the same version for a couple of transitional releases as only one possible strategy (out of many) for backward compatibility, inspired slightly by the combinator approach used by cardano-node itself for its own protocol migration.

github-actions bot commented Aug 28, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions bot added the Stale label on Aug 28, 2024