geth sync (full, not --fast) stalls: head header timeout
#2621
Comments
head header timeout
Observation 3: I got this. However, when I let A connect to peers in the network at large, and after it finished syncing, I again had B connect to A; the error did not recur, and syncing on B launched fine.

To reproduce the error, I tried to recreate the scenario of an unsynced/syncing node A and a less-synced node B by creating a brand new geth datadir on both node A and node B, and having node A sync from the network while node B connects. The error did not reproduce: syncing on B proceeded as expected.

Aside: at one point I tried having a fresh node on A sync from node B. A did connect to B and started syncing, however very soon A aborted with some timeout (sorry, missing details) and disconnected from B. Then B got a 'broken pipe' and closed the connection (as expected). Note that B is on a very slow ARM board, while A is on a fast x86_64 VPS. This problem is probably distinct, but could either or both of these bugs be due to overly optimistic timeout values?
Localized this finally: the blocking happens on the node that generates the timeout. I don't know the root cause of the blocking, but there are likely some blocking blockchain operations on the path between the arrival of a header message and its processing. I assume the blocking happens somewhere in the fetcher, due to a synchronous I/O call or due to synchronization. (I trust the Go runtime not to block other goroutines while some goroutines are blocked on I/O, so I assume the blocking is fixable at the level of the Go code.)

Workaround: sync on a fast system, and copy chaindata to the slow system. If the error occurs during the (short) sync after the transfer, just wait; eventually a reply will come during a break in the I/O workload, and the sync will make progress. Once synced, the node generates negligibly little I/O work, so it runs smoothly on a system with severely slow I/O (like an ARM board with a USB 2.0 flash drive).
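A minimal, self-contained sketch (my own illustration, not geth code) of the failure mode described above: the peer's reply is effectively available in time, but the goroutine that would deliver it is stuck behind a blocking "disk" operation on the same path, so the requester's short timeout fires anyway and the request is retried, adding yet more I/O.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	reply := make(chan string, 1)

	// Simulated handling path: the header message has arrived, but delivering
	// it is blocked behind a synchronous disk operation (stand-in: a sleep
	// longer than the request timeout).
	go func() {
		time.Sleep(5 * time.Second) // blocking blockchain/disk work
		reply <- "head header"
	}()

	// Simulated requesting side: the timeout keeps ticking while the disk is
	// busy, so the request is aborted even though a reply is on its way.
	select {
	case msg := <-reply:
		fmt.Println("got reply:", msg)
	case <-time.After(3 * time.Second): // same order of magnitude as the short downloader TTLs
		fmt.Println("head header timeout") // a retry now adds even more I/O load
	}
}
```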
This is very helpful. Thank you for your analysis.
Workaround: increase all timeouts (*TTL, *RTT) at the top of the downloader code.

The problem: these timeouts create a "positive feedback" loop of excessive I/O. When the node is attempting to sync, it sends out a request for the remote chain height (for example). It appears that each such request (or its associated actions) has a significant cost in disk I/O (I don't know exactly why). When the system is heavily loaded on I/O, this required I/O work doesn't complete before the timeout (3 s). That makes things worse: when the timeout triggers, the request is aborted and a new request is initiated, but the new request also costs disk I/O, which is added to an already loaded system. So the node buries itself in I/O work without seeing any I/O request return: the process is perpetually in D state. On very rare occasions I've seen the node recover from this state: it synced, and once synced, the steady-state I/O load is much lower. But all it takes is a spike in I/O load on the system for the node to no longer keep up with incoming blocks, enter sync mode, and trigger the above-described death spiral.

The real fix would be to decouple communication with the peer from local disk operations. Timeout timers should not be ticking while the disk is accessed. Any disk I/O should be async and/or careful about which goroutines it blocks. The hacky workaround is to increase the timeouts. Here is a log with the timeouts increased from 3 s to 60 s (see the interval between ***, it's 36 sec). With this change the node works well on armv7h with slow USB 2.0 storage: the process is in D state while syncing, but it does complete the sync shortly, after which it is only ever in D state for short intervals. Occasionally the node fails to keep up, but when this happens it successfully re-syncs quickly on the first attempt, which is not the case with the short timeouts.
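For reference, the hacky workaround amounts to raising the per-request timeout constants. A sketch of the kind of change, with placeholder names: the actual *TTL/*RTT constants sit at the top of the downloader source (presumably eth/downloader/downloader.go) and their exact names differ between releases, so this is not a verbatim patch.

```go
package downloader

import "time"

// Placeholder constant names; the real ones vary by geth release. The point
// is simply to raise every per-request timeout from a few seconds to a value
// that a machine with very slow I/O can still meet.
const (
	headerTTL  = 60 * time.Second // per-request timeout for header fetches (was ~3s)
	bodyTTL    = 60 * time.Second // block body fetches
	receiptTTL = 60 * time.Second // receipt fetches (fast sync only)
)
```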
Please try with our latest Geth 1.4.6 release. It should address all sync issues we were aware of.
I'm going to close this issue as part of cleaning up sync-failure reports that should be addressed by 1.4.6. Feel free to comment and have it reopened if you still see something wrong!
I think I'm seeing this, or at least the visible symptoms are as described. geth 1.4.10 is running on Linux. It's an older machine with a 250G drive and 2G of RAM. Note that it's syncing the non-hard-fork chain, and there currently seems to be quite a mess on that chain, which may or may not be related.
No, sorry. I've seen it again, but it was just sitting there without emitting log messages or maxing out disk I/O.
If you catch it doing nothing, we'd appreciate a stack dump. To get the dump, run |
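For anyone else hitting this: a Go process will in general print the stacks of all goroutines and exit when it receives SIGQUIT, and the same information can be produced from inside the process via the runtime/pprof goroutine profile. A generic (not geth-specific) sketch of the latter:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stack trace of every live goroutine to stderr;
// debug level 2 produces the full, panic-style stack format.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```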
log output:
Backtraces are attached, because the output is long.
Thanks. @karalabe you should look at this trace. It looks like a downloader hang, all peers are stuck on
This one indicates that the queue might be at fault:
OK, looking closer, this seems to be a problem with the queue. Instead,
Possibly related issues
#2570 #2533 #2467 #2569
System information
Geth version: Geth/v1.4.5-stable-a269a713/linux/go1.6.2
OS & Version: Linux x86_64, VPS 6 cores, 8GB RAM
(binary from GitHub release)
Expected behaviour
Sync makes progress: eth.syncing shows increasing block number, log shows 'imported ... blocks'.
NOTE: --fast is irrelevant to this issue. In my use case, I do not use it and do not want to use it.
Actual behaviour
Sync stops making progress after some time (sometimes immediately, sometimes after minutes, sometimes after hours). A log snippet with -vmodule=downloader=6 is attached below. The node keeps trying to contact peers, but these attempts appear to fail ('head header timeout'), and they fail for different peers. While this is happening, iftop does show a fair amount of traffic to peers (~200 KB/s).

Steps to reproduce the behaviour
./geth --cache 4096
./geth attach
eth.syncing
...wait...
eth.syncing
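A hypothetical way to watch for the stall from outside the console: poll eth_syncing over JSON-RPC and check whether currentBlock keeps increasing. This assumes geth was started with its HTTP RPC interface enabled (e.g. the --rpc flag in this era's CLI) on the default port 8545, which the command above does not do.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Standard JSON-RPC request for eth_syncing; the reply contains
	// currentBlock/highestBlock while syncing, or false when fully synced.
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}`)

	for {
		resp, err := http.Post("http://127.0.0.1:8545", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("rpc error:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// If currentBlock stops increasing between polls, the sync has stalled.
			fmt.Println(time.Now().Format(time.RFC3339), string(body))
		}
		time.Sleep(30 * time.Second)
	}
}
```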
Log
Stuck state
Snippet with same error, but recovered (not stuck)
This snippet is the first thing that happened after a geth startup.
Note that the attempt with the first peer failed with 'head header timeout' but a subsequent attempt with another peer succeeded. So, this problem might be triggered by something about the remote peer.