geth sync (full, not --fast) stalls: head header timeout
#2621
Comments
head header timeout
Observation 3: I got this. However, when I let A connect to peers in the network at large, and after it finished syncing, I again had B connect to A; the error did not recur, and syncing on B launched fine.

To reproduce the error, I tried to recreate the scenario of an unsynced/syncing node A and a less-synced node B by creating a brand new geth datadir on both node A and node B, and having node A sync from the network while node B connects. The error did not reproduce: syncing on B proceeded as expected.

Aside: at one point I tried having a fresh node on A sync from node B. A did connect to B and started syncing, however very soon A aborted with some timeout (sorry, missing details) and disconnected from B. Then B got a 'broken pipe' and closed the connection (as expected). Note that B is on a very slow ARM board, while A is on a fast x86_64 VPS. This problem is probably distinct, but could either or both of these bugs be due to overly optimistic timeout values?
Localized this finally: the blocking happens on the node that generates the timeout. I don't know the root cause of the blocking, but there are likely some blocking blockchain operations on the path between the arrival of a header message and its processing. I assume the blocking happens somewhere in the fetcher, due to a synchronous I/O call or due to synchronization. (I trust the Go runtime not to block other goroutines while some goroutines are blocked on I/O, so I assume the blocking is fixable at the level of the Go code.)

Workaround: sync on a fast system, and copy chaindata to the slow system. If the error occurs during the (short) sync after the transfer, just wait; eventually a reply will come during a break in the I/O workload, and the sync will make progress. Once synced, the node generates negligibly little I/O work, so it runs smoothly on a system with severely slow I/O (like an ARM board with a USB 2.0 flash drive).
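A minimal, self-contained sketch (my own illustration, not geth code) of the failure mode described above: the peer's reply is effectively available in time, but the goroutine that would deliver it is stuck behind a blocking "disk" operation on the same path, so the requester's short timeout fires anyway and the request is retried, adding yet more I/O.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	reply := make(chan string, 1)

	// Simulated handling path: the header message has arrived, but delivering
	// it is blocked behind a synchronous disk operation (stand-in: a sleep
	// longer than the request timeout).
	go func() {
		time.Sleep(5 * time.Second) // blocking blockchain/disk work
		reply <- "head header"
	}()

	// Simulated requesting side: the timeout keeps ticking while the disk is
	// busy, so the request is aborted even though a reply is on its way.
	select {
	case msg := <-reply:
		fmt.Println("got reply:", msg)
	case <-time.After(3 * time.Second): // same order of magnitude as the short downloader TTLs
		fmt.Println("head header timeout") // a retry now adds even more I/O load
	}
}
```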
This is very helpful. Thank you for your analysis.
Workaround: increase all timeouts (*TTL, *RTT) at the top of the downloader code.

The problem: these timeouts create a "positive feedback" loop of excessive I/O. When the node is attempting to sync, it sends out a request for the remote chain height (for example). It appears that each such request (or its associated actions) has a significant cost in disk I/O (I don't know exactly why). When the system is heavily loaded on I/O, this required I/O work doesn't complete before the timeout (3 s). That makes things worse: when the timeout triggers, the request is aborted and a new request is initiated, but the new request also costs disk I/O, which is added to an already loaded system. So the node buries itself in I/O work without seeing any I/O request return: the process is perpetually in D state. On very rare occasions I've seen the node recover from this state: it synced, and once synced, the steady-state I/O load is much lower. But all it takes is a spike in I/O load on the system for the node to no longer keep up with incoming blocks, enter sync mode, and trigger the above-described death spiral.

The real fix would be to decouple communication with the peer from local disk operations. Timeout timers should not be ticking while the disk is accessed. Any disk I/O should be async and/or careful about which goroutines it blocks. The hacky workaround is to increase the timeouts. Here is a log with the timeouts increased from 3 s to 60 s (see the interval between ***, it's 36 sec). With this change the node works well on armv7h with slow USB 2.0 storage: the process is in D state while syncing, but it does complete the sync shortly, after which it is only ever in D state for short intervals. Occasionally the node fails to keep up, but when this happens it successfully re-syncs quickly on the first attempt, which is not the case with the short timeouts.
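For reference, the hacky workaround amounts to raising the per-request timeout constants. A sketch of the kind of change, with placeholder names: the actual *TTL/*RTT constants sit at the top of the downloader source (presumably eth/downloader/downloader.go) and their exact names differ between releases, so this is not a verbatim patch.

```go
package downloader

import "time"

// Placeholder constant names; the real ones vary by geth release. The point
// is simply to raise every per-request timeout from a few seconds to a value
// that a machine with very slow I/O can still meet.
const (
	headerTTL  = 60 * time.Second // per-request timeout for header fetches (was ~3s)
	bodyTTL    = 60 * time.Second // block body fetches
	receiptTTL = 60 * time.Second // receipt fetches (fast sync only)
)
```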
Please try with our latest Geth 1.4.6 release. It should address all sync issues we were aware of.
I'm going to close this issue as part of cleaning up sync-failure reports that should be addressed by 1.4.6. Feel free to comment and have it reopened if you still see something wrong!
I think I'm seeing this, or at least the visible symptoms are as described. geth 1.4.10 is running on Linux. It's an older machine with a 250G drive and 2G of RAM. Note that it's syncing the non-hard-fork chain, and there currently seems to be quite a mess on that chain, which may or may not be related.
No, sorry. I've seen it again, but it was just sitting there without emitting log messages or maxing out disk I/O.
If you catch it doing nothing, we'd appreciate a stack dump. To get the dump, run |
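For anyone else hitting this: a Go process will in general print the stacks of all goroutines and exit when it receives SIGQUIT, and the same information can be produced from inside the process via the runtime/pprof goroutine profile. A generic (not geth-specific) sketch of the latter:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stack trace of every live goroutine to stderr;
// debug level 2 produces the full, panic-style stack format.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```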
log output:
Backtraces are attached, because the output is long.
Thanks. @karalabe you should look at this trace. It looks like a downloader hang, all peers are stuck on
This one indicates that the queue might be at fault:
OK, looking closer, this seems to be a problem with the queue. Instead,
Possibly related issues
#2570 #2533 #2467 #2569
System information
Geth version: Geth/v1.4.5-stable-a269a713/linux/go1.6.2
OS & Version: Linux x86_64, VPS 6 cores, 8GB RAM
(binary from GitHub release)
Expected behaviour
Sync makes progress: eth.syncing shows increasing block number, log shows 'imported ... blocks'.
NOTE: --fast is irrelevant to this issue. In my use case, I do not use it and do not want to use it.
Actual behaviour
Sync stops making progress after some time (sometimes immediately, sometimes after minutes, sometimes after hours). A log snippet with -vmodule=downloader=6 is attached below. The node keeps trying to contact peers, but these attempts appear to fail ('head header timeout'), and they fail for different peers. While this is happening, iftop does show a fair amount of traffic to peers (~200 KB/s).

Steps to reproduce the behaviour
./geth --cache 4096
./geth attach
eth.syncing
...wait...
eth.syncing
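A hypothetical way to watch for the stall from outside the console: poll eth_syncing over JSON-RPC and check whether currentBlock keeps increasing. This assumes geth was started with its HTTP RPC interface enabled (e.g. the --rpc flag in this era's CLI) on the default port 8545, which the command above does not do.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Standard JSON-RPC request for eth_syncing; the reply contains
	// currentBlock/highestBlock while syncing, or false when fully synced.
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}`)

	for {
		resp, err := http.Post("http://127.0.0.1:8545", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("rpc error:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// If currentBlock stops increasing between polls, the sync has stalled.
			fmt.Println(time.Now().Format(time.RFC3339), string(body))
		}
		time.Sleep(30 * time.Second)
	}
}
```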
Log
Stuck state
Snippet with same error, but recovered (not stuck)
This snippet is the first thing that happened after a geth startup.
Note that the attempt with the first peer failed with 'head header timeout' but a subsequent attempt with another peer succeeded. So, this problem might be triggered by something about the remote peer.