BAD BLOCK errors #20478
That repro isn't accurate, is it? You also had a non-empty ancients folder but otherwise wiped chaindata, right?
Oh, sorry, you're correct. I updated it. The attached log file is from my retry of the import. Upon the initial import my chaindata and ancients were empty. While importing, the ancients data was filling up. There was a crash ( #20477 ) which seemed to recover just fine... hours later, this happened. I tried to restart it again, same error. I can't proceed with the import any further.
I received the same error on a different block. This time it was from the network, not DB import. This is like the 4th time I've been unable to perform a full sync.
Again. Block number: 3793414. The difference between this run and the other runs is that I moved the freezer folder to a locally attached USB drive. It didn't help. I'm going to buy a new SSD and try again after it arrives...
I suspect your disk is toast...
Alright, I tried to perform a full sync on a 100% different workstation. Windows this time.
The error is different...? When exiting, I received this:
After restarting the node I got this:
I purchased brand new SSDs yesterday, re-installed Ubuntu 18.04.3 LTS fresh. I then ran a full sync and it failed, again.
Wow, that is strange indeed! I've done four or five full syncs over the last couple of weeks and never hit any problems like these... makes me wonder. Grasping at straws here, could you do a BIOS memtest?
Theoretically, could a single 'static-peer' provide bad blocks and cause this behavior? I have to assume that even one connected peer providing good blocks would be enough, but I'm curious whether that scenario could cause this. Would receiving invalid or corrupt blocks cause this behavior? I would assume it would NOT, given that the node processes each block itself, but I'm also grasping at straws 😄
Hm, I'm a bit confused; I thought this ticket and #20485 were the same, but they're not. So, is everything in this ticket about
No, block 2952165 does indeed have hash
The receipts you posted seem to match http://mon04.ethdevops.io/block/2952165 , at least concerning the accumulated gas, though I suspect there's a mismatch in one of the receipts; I will need to investigate it further.
All of this (both tickets) seems to come from data corruption. Things like this typically happen when there's disk corruption which causes leveldb to be unable to provide e.g. correct state (which makes geth miss a state entry, which could make it think a nonce is
Did you compile the code yourself? And if so, what Go compiler version did you use?
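For illustration, here is a minimal sketch (not geth's actual code path) of how a missing or unreadable leveldb entry surfaces to a caller, assuming the goleveldb library; the database path and key are hypothetical:

```go
// Illustrative sketch only: if disk corruption makes a state entry unreadable,
// the caller simply sees "not found" or an error, and whatever it derives from
// that (e.g. an account nonce) will be wrong downstream.
package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("example-db", nil) // hypothetical database path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Look up a state entry that should exist but may have been lost to corruption.
	val, err := db.Get([]byte("state-entry"), nil) // hypothetical key
	switch {
	case errors.Is(err, leveldb.ErrNotFound):
		fmt.Println("entry missing: downstream logic now works from incomplete state")
	case err != nil:
		fmt.Println("read error (possible corruption):", err)
	default:
		fmt.Printf("entry found: %x\n", val)
	}
}
```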
Sorry for the confusion. I didn't want to open a new issue every time. Basically, I only tried to import the DB twice, then you suggested that the export was corrupt, so I switched to just syncing from the network. I tried that a few times but it failed, so I switched to the PPA release on Ubuntu (stable). That also failed multiple times. So I switched to the binary release for Windows on a different workstation; that failed too. I bought a new SSD and tried again on Ubuntu stable from the PPA after a fresh OS install.
The only thing common to all of these runs (2 different workstations, 2 different OSes) is that they're all on the same local network, which is where my last question came from: I thought it might be possible that I have a neighbor peer node with corrupt data poisoning the sync process somehow. To fuel my theory, I had 2 Geth nodes:
Yesterday I thought, screw it, I'll try to do a full sync on Geth02 also. So I wiped BOTH nodes and started them at the same time. Geth01 is now on block 4,731,858 and still going. Granted, it could still fail; it's only half way to the tip. But so far it's going better than previous runs.
Nope (unless there's some deep flaw being exploited by a really, really advanced attacker). A full sync fetches blocks from the network. It verifies the block content against the header and executes every transaction. There's no way a peer can influence that, really.
A fast sync is a different matter; in theory at least, there are more avenues for a peer to do something bad. However, all state downloads are verified against the trie, so in practice I don't think a peer can do anything that actually causes corrupt data on the victim.
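For illustration, a minimal sketch of the kind of check a full sync relies on: a payload from a peer is only accepted if its hash matches a commitment the node already trusts from the header chain. This is not geth's actual verification code; the keccak-based check below just shows the principle, using golang.org/x/crypto/sha3:

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/crypto/sha3" // NewLegacyKeccak256 is the hash Ethereum uses
)

// keccak256 hashes arbitrary bytes the way Ethereum identifies data.
func keccak256(data []byte) []byte {
	h := sha3.NewLegacyKeccak256()
	h.Write(data)
	return h.Sum(nil)
}

// verifyBody pretends "expected" came from an already-validated header
// (e.g. a transactions-root style commitment) and "payload" came from a peer.
func verifyBody(expected, payload []byte) error {
	if !bytes.Equal(keccak256(payload), expected) {
		return fmt.Errorf("peer sent a body that does not match the header commitment")
	}
	return nil
}

func main() {
	body := []byte("block body received from a peer")
	commitment := keccak256(body) // what the trusted header would commit to

	if err := verifyBody(commitment, body); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted: a peer cannot substitute a different body unnoticed")
}
```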
On your Windows machine, this message indicates that you got data corruption in leveldb:
specifically: So is
No, Geth01 is the Ubuntu machine that all of these errors came from, except for the single run that I did on Windows. Geth01 is the system where I replaced the SSD. FWIW, the Windows machine is also brand new...
Well, Geth01 failed again. I got all the way to 5,013,983 this time.
Someone else on Twitter says they experienced similar issues: https://twitter.com/MrBobGilbert/status/1217513656323321858?s=20 Unfortunately they didn't open an issue about it. Replacing the CPU/motherboard seems extreme. I mean, it's not like my systems exhibit any other issues / crashes / blue screens / etc. My systems have been stable otherwise. I have to assume that all of the Geth benchmark nodes run on virtual machines and/or in Docker, which could be abstracting away some underlying driver/hardware issue or bug. All of my machines are bare metal...
Hm,
The
If you restart this one (shut it down cleanly and restart it with the same command line), does it then hit the same error?
I suspect that this error is ephemeral, but if we're really, really lucky and it's reproducible, this PR: #20567 would provide more details about the internals.
Unfortunately, I did not expect you to come back with a potential debugging patch, and I had already wiped it in an attempt to run it inside Docker to see if it was still reproducible. Terribly sorry. I'll try to reproduce. I have 3 runs going right now. I'll keep at it.
So, I ran Geth in Docker this time... and I still got the error...
I mounted drives into the container so that it would be easy to restart geth outside of Docker with the same chaindata. After Geth crashed in the container with that error, I stopped it. I then re-ran geth outside of Docker and reproduced it.
If you can give me some basic instructions for how to test your patch, I can get you the output. Still on 1.9.9.
Interesting! So you'll need to build from source, and specifically check out that patch. I've never done this on Windows, but the Linux steps are basically
That would make the
No worries, it's Ubuntu. I didn't have Go since I had reinstalled Ubuntu, so I grabbed that:
But it says it can't find your branch?
Never mind, I figured it out. It should be:
Boom:
It's great that it's reproducible! Now, please do a
And the resulting
The only thing on my entire network that could be of interest is my pfSense firewall running Snort. I don't believe Snort would actually modify any data in transit, but it could drop packets or block hosts. |
Same
Not sure if it helps, but I removed Go via apt remove and did a make with
Appears to be the same. |
Note for ourselves: the error is a bit misleading. The header with the invalid mix digest is in fact the
@MysticRyuujin Interesting! So it appears that when run in isolation (the testcase) it behaves correctly, but it still fails when run within geth. I've now pushed a change which does the same xor-check of the ethash cache contents.
My hunch is that the generated DAG data is corrupt, which would have happened when the cache was generated (and in that case it doesn't matter if you switched compilers now, since it was generated a while ago). If I am correct, then we can also adapt the testcase to use the same ethash directories as your geth installation does.
Anyway, please
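For illustration, a minimal sketch of what an xor-check of cache contents means: fold every word of the file together with XOR so that any flipped bit changes the checksum. This is not the code from the referenced PR, and the file name is hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// xorChecksum folds a byte slice into a single 64-bit value.
// A single flipped bit anywhere in the input flips a bit in the result.
func xorChecksum(data []byte) uint64 {
	var sum uint64
	// Process full 8-byte words...
	for len(data) >= 8 {
		sum ^= binary.LittleEndian.Uint64(data[:8])
		data = data[8:]
	}
	// ...and fold in any trailing bytes.
	for i, b := range data {
		sum ^= uint64(b) << (8 * uint(i))
	}
	return sum
}

func main() {
	// Hypothetical path; a real ethash cache lives under the node's ethash directory.
	data, err := os.ReadFile("cache-example.bin")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	fmt.Printf("xor checksum: %#016x\n", xorChecksum(data))
}
```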
Here you go.
I noticed you're dumping directory paths; in case you want the full Geth command I'm running:
I notice that it says
Wow, that's really curious! So what we can see is:
This means that the call that produces the error is this one: We have verified that
I pushed another commit on my branch. The commit produces a detailed printout of the internals in
Perfect.
This is yours, which has wrong values on lookup index
Those invalid lookup values later cause the wrong indices to be used.
Do I recall correctly that you offered SSH access -- and if so, does that offer still stand? I think that would be easier for both of us :)
Yeah, I'm at work right now, but when I get home I'll set it up and send you credentials via Twitter DM or Discord or whatever.
There appear to be two ethash directories:
These caches correspond to
and
These correspond to
So these latter ones are the ones that are interesting for us. These are the
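For orientation, ethash caches are generated per epoch. Here is a tiny sketch of the block-number-to-epoch mapping, assuming the standard 30,000-block ethash epoch length (the block numbers below are ones mentioned in this thread; the cache file naming itself is not reproduced here):

```go
package main

import "fmt"

const epochLength = 30000 // assumed standard ethash epoch length

func main() {
	// Block numbers taken from earlier in this thread.
	for _, block := range []uint64{2952165, 3793414, 5013983} {
		fmt.Printf("block %d is in ethash epoch %d\n", block, block/epochLength)
	}
}
```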
On my local machine, verifying the correctness:
Correct
Correct.
Mismatch. ( So the ethash cache for epoch
This one generates the correct cache ( Trying instead with the Docker version:
This one is also correct. So both the stock
The two ethash files are seemingly pretty similar; there's only one byte that differs:
Indeed, a
Here's what I get from one of our AWS machines:
So the source of corruption can be either the disk or non-error-correcting RAM. From the info site above, it says
I really have no idea how common that is, but at this point that is my primary suspect.
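For illustration, a minimal sketch of how one could locate a single differing byte between two cache files and see exactly which bits flipped; the file names are hypothetical and this is not the tooling used above:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical file names; substitute the two cache files being compared.
	a, err := os.ReadFile("cache-good")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	b, err := os.ReadFile("cache-bad")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if len(a) != len(b) {
		fmt.Printf("sizes differ: %d vs %d bytes\n", len(a), len(b))
		return
	}
	for i := range a {
		if a[i] != b[i] {
			// XOR of the two bytes exposes exactly which bits differ.
			fmt.Printf("offset %d: 0x%02x vs 0x%02x (bits flipped: %08b)\n",
				i, a[i], b[i], a[i]^b[i])
		}
	}
}
```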
To follow up, deleting the file
I also installed
The short self-test completed without errors, but it does report some type of error. Perhaps it would be worth running a longer self-test with
Could you dump out the SMART stats of your SSD? Just wondering what the counters look like.
and for
So I haven't found time to do the extended memtest again, but I deleted the cache files like you suggested. It then ran fine for about 3 or 4 days, then last night I got the mix digest errors again. I deleted the cache again and it picked up where it left off...
Closing this; it's getting pretty stale, and even though we never narrowed down what caused the corruption, we did at least discover the exact bit that flipped :)
I am facing a similar problem, as follows.
Please open a new ticket, and provide full logs. Otherwise it's very difficult to analyse anything.
System information
Geth version: 1.9.10-unstable-f51cf573-20191212
OS & Version: Ubuntu 18.04.3
Expected behaviour
Block is imported successfully
Actual behaviour
BAD BLOCK
Steps to reproduce the behaviour
1. From a fully sync'd node (fast sync'd > 3 months ago): geth --datadir /data/ethereum --datadir.ancient /archive/ancient export <file>
2. Wiped chaindata, wiped the ancient folder
3. geth --datadir /data/ethereum --datadir.ancient /archive/ancient --syncmode full import <file>
4. Crash ( #20477 ). Restart: geth --datadir /data/ethereum --datadir.ancient /archive/ancient --syncmode full import <file>
5. This error. Restart (same command). Same error.
Backtrace
error.log