Detect and handle corrupt or missing blocks or indexes #537

jmjatlanta · 2022-04-04T18:06:20Z

Situation:

When a node encounters a problem, the block files or index files can become corrupted or have incomplete data.

Weapons:

Corruption and missing data can be detected. Other nodes can provide the information the corrupted node lacks.

Objective:

Detect the problem and repair it before allowing the node to report that it is fully synchronized.

Tactics:

For the case of corrupted block files, the chain must be downloaded from the network starting from the point the corruption was detected.
For missing block files, the chain must be downloaded and the node re-indexed.
For corrupted or missing indexes, the problem must be accurately detected and the user must be prompted to restart the node with the -reindex option.
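
The three tactics above amount to a small decision table. A minimal sketch of that mapping follows; the `Damage` and `Recovery` enums, the `Plan` struct and `PlanRecovery()` are hypothetical names made up for illustration, not existing functions in the code base.

```cpp
#include <string>

// Hypothetical classification of what a startup check found.
enum class Damage { None, CorruptBlockFile, MissingBlockFile, CorruptIndex, MissingIndex };

// Hypothetical recovery actions, one per tactic listed above.
enum class Recovery { Continue, RedownloadFromHeight, RedownloadAndReindex, PromptReindex };

struct Plan {
    Recovery action;
    int fromHeight;       // only meaningful for RedownloadFromHeight, otherwise -1
    std::string message;  // text that could be logged or shown to the operator
};

// Map a detected problem to the tactic described in the issue.
Plan PlanRecovery(Damage damage, int detectedAtHeight) {
    switch (damage) {
    case Damage::CorruptBlockFile:
        // Re-download starting at the point where corruption was detected.
        return {Recovery::RedownloadFromHeight, detectedAtHeight,
                "Corrupt block file: re-downloading from height " +
                    std::to_string(detectedAtHeight)};
    case Damage::MissingBlockFile:
        return {Recovery::RedownloadAndReindex, -1,
                "Missing block file: re-downloading the chain and re-indexing"};
    case Damage::CorruptIndex:
    case Damage::MissingIndex:
        // The node does not repair this itself; the operator is asked to act.
        return {Recovery::PromptReindex, -1,
                "Index is damaged or missing: restart the node with -reindex"};
    case Damage::None:
        break;
    }
    return {Recovery::Continue, -1, ""};
}
```

For example, `PlanRecovery(Damage::CorruptBlockFile, 120)` yields a plan to re-download from height 120, while either index problem yields the `-reindex` prompt.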

Where we are now:

Some cases of block or index corruption are detected; others are not.

As an example, having the node crash after writing the block to disk but before the index file is written will lead to missing transactions when the node restarts. The node does restart, but the data is inaccurate. Looking for the block returns nullptr and looking for a transaction that was within that block returns that the transaction does not exist.

Additional Information:

The function LoadIndexDB() does some checks to verify the integrity of the block files. This may be a good place to add additional checks to verify that the blocks and the index are synchronized.
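
As a rough illustration of what such an additional check might look like, here is a sketch that cross-checks an in-memory model of the block store against an in-memory model of the index. Everything in it (`StoredBlock`, `IndexEntry`, `CheckBlocksAgainstIndex()`) is an assumption made for the example; it does not reflect how LoadIndexDB() or the on-disk formats actually work.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Simplified stand-ins for the real structures (assumptions for this sketch).
struct StoredBlock {
    std::string hash;
    bool readable;  // false when the block cannot be de-serialized
};
struct IndexEntry {
    std::string hash;
    std::size_t position;  // where the index claims the block is stored
};

enum class Issue { BlockNotIndexed, IndexedBlockMissing, BlockUnreadable };
struct Finding {
    Issue issue;
    std::string hash;
};

// Report every disagreement between the block store and the index.
// A block whose previous block is absent is deliberately not reported,
// since that can be a valid temporary situation.
std::vector<Finding> CheckBlocksAgainstIndex(const std::vector<StoredBlock>& blocks,
                                             const std::vector<IndexEntry>& index) {
    std::vector<Finding> findings;
    std::set<std::string> indexed;
    for (const IndexEntry& entry : index) {
        indexed.insert(entry.hash);
        if (entry.position >= blocks.size()) {
            findings.push_back({Issue::IndexedBlockMissing, entry.hash});
        } else if (!blocks[entry.position].readable) {
            findings.push_back({Issue::BlockUnreadable, entry.hash});
        } else if (blocks[entry.position].hash != entry.hash) {
            findings.push_back({Issue::IndexedBlockMissing, entry.hash});
        }
    }
    for (const StoredBlock& block : blocks) {
        // Unreadable blocks have no trustworthy hash, so skip them here.
        if (block.readable && indexed.count(block.hash) == 0) {
            findings.push_back({Issue::BlockNotIndexed, block.hash});
        }
    }
    return findings;
}
```

With blocks `a`, `b`, `c` on disk and only `a` and `b` in the index, the function returns a single `BlockNotIndexed` finding for `c`, which is the crash scenario described under "Where we are now".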

Note: The aim is to assist node operators when hardware or software issues make a mess of the persisted data on disk. Detecting malicious modification of the data is not considered here.

Note: Having a block whose previous block does not exist may not be an indication of corruption. It is a valid (temporary) situation that must be planned for.

In my testing:

Having an incomplete index file (an entire entry about a block does not exist) is not detected, and the node starts with incomplete data.
Having a corrupted index file (data truncated off the end) is not detected, and the node starts with a shortened chain. I have yet to test to see if it re-syncs correctly.
Having an incomplete block file (an entire block does not exist but does exist in the index) is not detected, but would probably be a very rare occurrence. We could test for it, but we may not want to concentrate heavily on detecting/fixing it.
Having a corrupted block file (block cannot be de-serialized) is detected. I have yet to test what options are available for a node operator beyond a full re-sync.

ToDo:

Verify the findings above are accurate for different combinations of corruption / missing data.
Run tests on a multi-node chain to determine current abilities for recovering from corruption.

dimxy · 2022-04-05

What I think about this issue:
Basically, blocks and indexes are two databases that need to be updated atomically.
In other systems this is done via a two-phase resource coordinator, which ensures that either both databases are updated or neither is.
Our code has no such coordinator, so a crash can leave only one of the block and index databases updated.
When that happens, the databases are not necessarily corrupted; each may be in a good state, just out of sync with the other.
So it is good to enhance the detection of failures in the chain databases, but maybe we should also detect abnormal ends and tell the user on startup that they need to reindex or restart from the bootstrap, perhaps asking for y/n confirmation before continuing.
(I know this could be a problem for auto-maintained nodes that restart automatically after a crash, but it is better than nothing.)
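
One common way to detect an abnormal end is a marker file that exists only while the node is running. The sketch below assumes a hypothetical `RunMarker` class and marker path; it ignores durability concerns such as syncing the file to disk, and it leaves the decision about what to do (prompt, log, or reindex) to the caller.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: a marker file that is present only while the node runs.
class RunMarker {
public:
    explicit RunMarker(const std::string& markerPath) : path_(markerPath) {}

    // Call once at startup. Returns true when the marker from a previous run
    // is still present, i.e. that run did not shut down cleanly.
    bool BeginRun() {
        const bool unclean = std::ifstream(path_).good();
        std::ofstream out(path_);  // create (or truncate) the marker
        out << "running\n";
        return unclean;
    }

    // Call at the very end of a clean shutdown.
    void EndRunCleanly() { std::remove(path_.c_str()); }

private:
    std::string path_;
};
```

At startup the node would call `BeginRun()`; a `true` result means the previous run ended abnormally, so the blocks and the index deserve a consistency check before the node reports itself as synchronized.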

TheComputerGenie · 2022-04-05T13:18:49Z

If it's known where the block is, wouldn't it be a better option to invalidate it and then reconsider it after connecting to peers, rather than having the user down for 12-20 hours with a reindex?

jmjatlanta · 2022-04-05T13:50:49Z

Based on my tests, one common problem is when they are out of sync at the end, i.e. a block is written to disk but the daemon dies before writing the index. I believe that could be solved without re-indexing everything, if we choose to do so.
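
A minimal sketch of that tail repair, assuming the block store and the index can each be viewed as an ordered list of block hashes. `RepairIndexTail()` is a hypothetical name, and a real node would still need to validate the re-indexed blocks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append index entries for blocks that were written after the last indexed one.
// Returns the number of entries added, or -1 when the index is not a prefix of
// the block list (damage in the middle, which this shortcut cannot fix).
int RepairIndexTail(const std::vector<std::string>& blockHashes,
                    std::vector<std::string>& indexHashes) {
    if (indexHashes.size() > blockHashes.size()) {
        return -1;  // the index knows about blocks that are not on disk
    }
    for (std::size_t i = 0; i < indexHashes.size(); ++i) {
        if (indexHashes[i] != blockHashes[i]) {
            return -1;  // the two diverge before the end
        }
    }
    int added = 0;
    for (std::size_t i = indexHashes.size(); i < blockHashes.size(); ++i) {
        indexHashes.push_back(blockHashes[i]);
        ++added;
    }
    return added;
}
```

With blocks `a`, `b`, `c` on disk and `a`, `b` in the index, the call adds one entry for `c`. If the lists disagree anywhere before the end, it returns -1 and leaves the index untouched.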

Another problem is when the files get damaged in the middle. It then becomes very difficult to trust anything after the damage, and a download across the wire plus a re-index is probably inevitable.

As for two-phase commit, that is a possible solution. There are journaling filesystem libraries that can help with that. I am unsure whether any open-source ones exist, but I imagine so. Or we could roll our own. The "costs" of such a solution (e.g. maintenance) must be weighed.
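
To give a feel for the "roll our own" option, here is a sketch of a simple intent journal. It is a lighter technique than a full two-phase commit: the node records which block it is about to write, and on restart it uses that record to finish or discard the half-done update. The `Store` struct, `CrashPoint` enum, `AppendBlock()` and `Recover()` are all hypothetical, and the in-memory vectors stand in for the on-disk block and index databases.

```cpp
#include <string>
#include <vector>

// In-memory stand-ins for the two databases plus a one-entry journal.
struct Store {
    std::vector<std::string> blocks;
    std::vector<std::string> index;
    std::string pending;  // hash of the block being written; empty when idle
};

// Used only to simulate the daemon dying partway through an append.
enum class CrashPoint { None, AfterJournal, AfterBlock };

// Returns false when the simulated crash interrupts the append.
bool AppendBlock(Store& s, const std::string& hash, CrashPoint crash = CrashPoint::None) {
    s.pending = hash;                                     // 1. record the intent
    if (crash == CrashPoint::AfterJournal) return false;
    s.blocks.push_back(hash);                             // 2. write the block
    if (crash == CrashPoint::AfterBlock) return false;
    s.index.push_back(hash);                              // 3. write the index
    s.pending.clear();                                    // 4. mark as committed
    return true;
}

// Run at startup: finish or discard whatever the journal says was in flight.
void Recover(Store& s) {
    if (s.pending.empty()) return;  // last append completed normally
    const bool blockWritten = !s.blocks.empty() && s.blocks.back() == s.pending;
    const bool indexWritten = !s.index.empty() && s.index.back() == s.pending;
    if (blockWritten && !indexWritten) {
        s.index.push_back(s.pending);  // roll forward: add the missing index entry
    }
    s.pending.clear();
}
```

If the daemon dies after step 2, `Recover()` finds the pending hash at the end of the block list but not in the index, and adds the missing entry. If it dies after step 1, nothing was written and the journal entry is simply cleared.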

dimxy · 2022-04-05T14:42:29Z

I think we should try and test the revalidation proposal.

Release v1.0.12-6

who-biz pushed a commit to who-biz/komodo that referenced this issue Jul 29, 2024

Merge pull request KomodoPlatform#537 from Asherda/release-v1.0.12-6

2386ecd
