eth/downloader: adaptive quality of service tuning #2630
Conversation
Updated: Mon Jun 6 11:16:04 UTC 2016
Force-pushed 5e9071c to e4a0d09
Force-pushed e8aedf0 to a808e84
Hey @JasonCoombs, I've pushed out a fresh attempt at the adaptive QoS tuning. It has quite different tuning logic compared to the old one, and I can get a stable data stream on basically every kind of network connectivity I threw at it. Would appreciate if you could try it with your satellite uplink too :) PS: I've also written up the PR description with the logic behind the changes, it might interest you :)
Force-pushed ba7a47d to 9b469a1
My sync on my HP Pavilion G7 running Windows 7 has pretty much been in slow motion since May 9th, and I have not been able to achieve a full sync on either Test or Main since then, although I have not experienced a block sync restart. The sync still manages to keep track of what block I am on for both Test and Main. Also, I have not been seeing the word "peers" next to the nodes icon on either Test or Main. I am currently at ~block 875K on Test and ~block 1.3M on Main, and the best sync I can get on Main is about 256 blocks in 30 secs, with the usual being 2 to 4 minutes.
^^^ I am usually wired in on Ethernet on a DSL line. Wallet 7.4 with Geth 1.4.5
@agnewpickens That is probably a HDD bottleneck, which will hopefully be addressed in the 1.5 release cycle. These PRs mostly address networking issues where latency and/or low bandwidth causes lost peers and a general failure to download stuff.
Thanks @karalabe, I was wondering if it might have been because of my ISP. I am in a really small town and have had to reset my router a lot because of iffy service where I am at.
fmt.Sprintf("hs %3.2f/s, ", p.headerThroughput)+
	fmt.Sprintf("receipts %3.2f/s, ", p.receiptThroughput)+
	fmt.Sprintf("states %3.2f/s, ", p.stateThroughput)+
	fmt.Sprintf("lacking %4d", len(p.lacking)),
I know this was here before, but I suggest that you change this to just one call to fmt.Sprintf. I don't think this makes it either more readable or cleaner (perhaps it makes it easier to read).
Perhaps we should have a better formatter? ;-)
An idea maybe would be to do:

return fmt.Sprintf("Peer %s [%s]", p.id, strings.Join([]string{
	fmt.Sprintf("hs %3.2f/s", p.headerThroughput),
	fmt.Sprintf( ... ),
}, ", "))
Join's indeed better, I'll do that :)
I've tried many different network settings with this PR and none of them stalled the downloader in any way. The PR's description should be moved to the wiki; it serves as a good starting point for anyone who'd like to know more about the design decisions.
@obscuren Fixed the formatter. PTAL |
@@ -585,13 +589,26 @@ func (ps *peerSet) RTTs() []time.Duration {
 		rtts = append(rtts, p.rtt)
 		p.lock.RUnlock()
 	}
-	// Sort into ascending order and return them
+	// Sort into ascending order and retrieve the median
+	for i := 0; i < len(rtts); i++ {
+		for j := i + 1; j < len(rtts); j++ {
+			if rtts[i] > rtts[j] {
+				rtts[i], rtts[j] = rtts[j], rtts[i]
+			}
+		}
+	}
If the slice is either []int or []float64 you can use sort.Ints or sort.Float64s. Making the type of rtt float64 would also reduce type conversions in other places.
Switched the `rtts` array to float64 and used `sort`, will push in the next commit. I'd rather keep `peer.rtt` as a time, as I don't think a single conversion once every few seconds matters much, and it's less error prone and more descriptive to have a meaningful `time.Duration` type for an RTT value.
In general I don't like the heavy use of atomic operations here, but I know that doing it another way is hard. Possibly you could do it without atomics by pushing the new value into the respective fetch loops via a channel. It's your choice whether you want to address this or not.
Force-pushed 33a91fb to 990e010
	default:
		close(d.quitCh)
	}
	d.quitLock.Unlock()
What is the difference between the cancel channel and the quit channel?
Answering my own question: quit is permanent, cancel isn't. But can't we use cancel to exit the QoS adaptation goroutine?
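The distinction can be sketched roughly as follows (illustrative names; the real downloader additionally guards `quitCh` with `quitLock`, hinted at by the quoted diff, which is omitted here for brevity): `quitCh` is closed exactly once when the downloader terminates for good, while `cancelCh` is replaced on every sync so an individual sync can be aborted without shutting the whole downloader down.

```go
package main

import "fmt"

type downloader struct {
	quitCh   chan struct{} // permanent: closed on terminate, never recreated
	cancelCh chan struct{} // per-sync: closed on cancel, recreated on the next sync
}

func newDownloader() *downloader {
	return &downloader{
		quitCh:   make(chan struct{}),
		cancelCh: make(chan struct{}),
	}
}

// cancel aborts only the current sync.
func (d *downloader) cancel() { close(d.cancelCh) }

// terminate shuts the downloader down permanently; the non-blocking
// select mirrors the quoted diff and makes repeated calls safe.
func (d *downloader) terminate() {
	select {
	case <-d.quitCh: // already closed
	default:
		close(d.quitCh)
	}
}

func main() {
	d := newDownloader()
	d.cancel()
	d.cancelCh = make(chan struct{}) // a new sync gets a fresh cancel channel
	d.terminate()
	d.terminate() // safe to call twice thanks to the select

	select {
	case <-d.quitCh:
		fmt.Println("quit closed")
	default:
		fmt.Println("quit open")
	}
}
```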
Travis now failing in another test.
Force-pushed a21becd to 05c626d
Current coverage is 56.93%

@@           develop     #2630    diff @@
==========================================
  Files          216       216
  Lines        24566     24638     +72
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
+ Hits         13964     14026     +62
- Misses       10601     10612     +11
+ Partials         1         0      -1
Force-pushed 05c626d to c10e1a3
👍
Force-pushed c10e1a3 to 88f174a
Currently Geth is tuned for low latency, high bandwidth network scenarios, which the sync mechanism enforces by dropping peers who do not reply fast enough to data queries. The rationale was that by homing in on faster peers (i.e. geographically closer to us), we can achieve much faster sync times.
Even though we are aggressive in peer selection, the downloader tries to be smart about the queries it issues by estimating the throughput of all connected peers and requesting data in proportion with the time it should take the peer to respond, attempting to stabilize both the network (by not assigning huge tasks that would time out) as well as the data stream (by having all peers respond with the same RTT).
This algorithm can ensure that independent of what quality peers we have, we will have a stable download stream... as long as we have enough bandwidth to cover all our peers as well as a negligible latency compared to the employed timeouts. However if I have low bandwidth myself, packet RTTs increase proportionally to the download tasks, which can easily hit the timeout thresholds, causing peers to be dropped. Even worse, if I have a high latency connection (EDGE, satellite), the timeouts will almost immediately drop all peers, making sync impossible.
The fundamental flaw is that our current codebase has fixed values as the target RTT in which peers are expected to respond to queries (1-1.5 seconds) and request TTLs after which peers are dropped (3-4 seconds); but for low bandwidth or high latency connections, these are just too low to work out. Simply raising them would allow users with bad connectivity to sync, but would at the same time slow down sync for well-connected users to the speed of their weakest peer, bottlenecking the entire algorithm. Instead, this PR attempts to dynamically adapt the target RTT and request TTL values in such a way that we can still home in on good peers, but do so based on our active peer set rather than a blind baseline of expected performance.
Note, hard RTT and TTL values were needed prior to concurrent headers because the master peer which we used to pull headers from was already a bottleneck and we couldn't afford to stall sync with arbitrary performance. With concurrent headers, even if the master peer is slow-ish we can still sync fast, so we can be much more lax with regard to RTT and TTL.
Tuning peers
To dynamically adapt the RTT and TTL values, we need to make various measurements on our peers' performance and aggregate them into these thresholds. However if we have a lot of peers (e.g. 10, 25, 100), some of them will be high quality peers, and some will be low quality ones. Aggregating all of them would result in too lax RTT/TTL values, delaying sync due to unnecessary bottlenecks.
Instead of adapting to all connected peers, we would like to home in on our true performance and keep only peers that are reasonably close performance-wise. To do this, we will tune the parameters based only on our best N peers. Peers beyond the best N will be required to stay within some performance threshold of our best peers, or be dropped during sync. This ensures that although we adapt our RTT/TTL values to our peers and network connectivity, we still saturate our own connection without introducing bottlenecks. If a better peer joins, it will automatically result in tighter thresholds, whereas if a top peer drops off, the thresholds will be automatically relaxed.
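A minimal sketch of the best-N selection, assuming per-peer RTT estimates in seconds (names and constants are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// tuningPeers is the N from the text: only the N fastest peers feed tuning.
const tuningPeers = 5

// tuningRTTs sorts the per-peer RTT estimates ascending and keeps only the
// fastest tuningPeers of them, so slow stragglers cannot relax the thresholds.
func tuningRTTs(rtts []float64) []float64 {
	sort.Float64s(rtts) // ascending: best (lowest RTT) peers first
	if len(rtts) > tuningPeers {
		rtts = rtts[:tuningPeers]
	}
	return rtts
}

func main() {
	// Seven peers: two fast, three medium, two very slow (seconds).
	rtts := []float64{0.4, 9.0, 0.5, 1.0, 1.2, 12.0, 1.1}
	fmt.Println(tuningRTTs(rtts)) // only the 5 fastest are used for tuning
}
```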
The PR chooses `N = 5`. The rationale is that by default geth has `--maxpeers=25`, however only half of that is reserved for outbound connections, so on a NATed/firewalled system the peer count is capped at around 12. With `N == 5` we retain enough peers to do meaningful tuning, but also leave enough slots for bad peers to die off and make room for better ones. Lastly, it's important to mention that the BitTorrent protocol also uses 4 (stable) + 1 (optimistic) peers for downloading data at any given time, as higher values cause suboptimal TCP congestion control. Although BitTorrent furthers this logic by choking the remaining peers, that's a complexity we need to decide whether to introduce or not (@fjl suggested we do; maybe in a followup PR we can do an N-tuning, M-optimistic, K-spare peer split).

RTT vs. TTL
As discussed, sync performance depends on tuning two parameters of the downloader: the target round trip time (RTT) that ensures we have a stable stream of data to process, and the request time to live (TTL) which ensures some baseline quality of peers. Thinking about it though, the two values are somewhat related: if we can sustain a stable RTT over multiple peers, then we can set TTL to be a constant multiple `c` of this RTT value. As long as RTT doesn't fibrillate, `TTL = c * RTT` is a good filter to weed out under-performing peers.

Fibrillation however can be an issue, especially during the beginning of sync. If we have only a couple peers (e.g. 2) that we tuned the RTT to, and suddenly experience a surge in connections (e.g. 4 new ones), then limited bandwidth availability would cause actual round trip times to grow proportionally with the additional download requests, not giving our estimated RTT (and hence the related TTL) an opportunity to adapt fast enough. Hitting the TTL limit would result in both new as well as potentially old, perfectly good peers being dropped. Reducing the peer count would drive the RTT back down, and reconnecting peers would fibrillate it back up.
The moral of the story is that `TTL = c * RTT` only makes sense if the estimated RTT is relatively stable and approximates the true RTT closely enough. As we cannot guarantee RTT stability during peer joins, we introduce a confidence factor into our estimation. The more peers join (proportionally to our existing ones), the less confident we can be in our estimated RTT value; and the less confident we are, the higher the TTL threshold needs to be to cater for splitting our bandwidth among more peers.

Aggregating the above rationale, we can set `TTL = c * RTT / conf`, where `conf ∈ (0, 1]`. If we assume a stable RTT (i.e. `conf == 1`), we can pick a meaningful number for the `c` scaling factor. It's really arbitrary, but `c = 3` gives enough wiggle room for occasional hiccups while not being lax enough to hinder performance too much. As for the confidence multiplier, the only idea we can use is that limited bandwidth is split between peers, so more peers could (not would) result in proportionally larger RTTs. Whenever a peer joins, we reduce `conf` proportionally to cater for bandwidth splits: `conf = conf * (1 - 1/(peers+1)) = conf * peers / (peers + 1)`.

E.g. if we had a stable RTT (`conf == 1`) with 2 peers, adding three new peers one by one would yield:

- `conf = 1 * 2 / 3 = 2/3`
- `conf = 2/3 * 3 / 4 = 2/4`
- `conf = 2/4 * 4 / 5 = 2/5`
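The join-time decay can be checked with a small sketch (function names are made up; the `0.1` floor and `c = 3` scaling factor come from the text, while the 2-second target RTT is an assumed example value):

```go
package main

import "fmt"

// joinPeer scales confidence by peers/(peers+1), where peers is the
// count before the join, and floors the result at the 0.1 cap.
func joinPeer(conf float64, peers int) float64 {
	conf *= float64(peers) / float64(peers+1)
	if conf < 0.1 {
		conf = 0.1 // cap preventing runaway TTL values
	}
	return conf
}

func main() {
	conf, peers := 1.0, 2 // stable RTT with 2 peers
	for i := 0; i < 3; i++ {
		conf = joinPeer(conf, peers)
		peers++
		fmt.Printf("peers=%d conf=%.4f\n", peers, conf)
	}
	// TTL derived from an assumed 2s target RTT with c = 3:
	rtt, c := 2.0, 3.0
	fmt.Printf("TTL=%.2fs\n", c*rtt/conf)
}
```

Note how the lowered confidence (2/5 after three joins) stretches the TTL well beyond `c * RTT`, buying the new peers time to prove themselves.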
Per the above, the confidence factor correctly follows the number of joining peers, even when processed one by one (+1+1+1 results in the expected value that +3 would produce). Note, we'll cap the minimum of the confidence factor at `0.1` to prevent too wild TTL values. The confidence factor is increased towards `1` after every RTT.

Adaptive RTT
Based on the above discussion, if we can estimate a reasonable target round-trip time and can keep it relatively up to date with peers' capacities but at the same time stable as peers join/leave and requests come/go, we can derive a good enough TTL value. The remaining challenge is how to select a good RTT to target for requests. The requirements are that it should be high enough to allow a meaningful batch of data to be retrieved, but low enough so that import can progress without stuttering too much.
We will start out with the maximum allowed RTT value if we have no connections (PR sets it to 20s). This ensures that irrelevant of our local connectivity, we can do some meaningful communication.
Further, for each peer we will maintain a very rough RTT value (initially these are maxed too), which we'll update whenever the said peer responds to one of our data requests (weighed update). Note, this RTT will vary a ton based on our network saturation, the peers network saturation, latencies, congestion, etc. This is fine, it's just meant as a means to compare the performance of different peers at the current workload (whatever that might be).
Using these constantly changing RTT values, every once in a while (more on this later) we select the N best peers to tune with (as mentioned above, 5 currently) and updated our target RTT to the median performance of these best peers (again, weighed update to smooth hiccups). With the target RTT updated, we can calculate a fresh TTL value too.
Converging towards the optimum
Adaptive algorithms have a tendency to converge to some stable state, which may or may not be the optimal solution (it depends on the exact scenario). In our case, if we have a single peer, two stable scenarios could be: a) download 256 headers in 2 seconds; or b) download 128 headers in 1 second. Both are perfectly valid in the long run, but as block processing is a streaming operation with a lot of steps to go through, it is usually more advisable to prefer shorter cycles over longer ones. To this end, we need to force the algorithm out of sub-optimal stable points.
One such scenario is the selection of a stable, but larger RTT (e.g. 5 sec vs. 2 sec). Having a larger than essential RTT means that streaming the blockchain is more bursty and less flowing; but it also means that we could end up keeping weaker peers that would have been weeded out by the smaller RTT. As long as the smaller RTT is equally stable, performance-wise it will always be preferable. To prevent locking in to a higher than needed RTT, whenever we fetch some data we will not use our estimated best RTT, but rather only 90% of it. If the smaller RTT was a bad guess (e.g. our latency is the culprit), we'll just estimate the original RTT the next time too and remain unchanged. However if our guessed RTT turns out to be valid, we've just shaved off a bit of our sync time. Repeating this will converge on the minimum RTT that is stable for our actual workload.
The second interesting scenario is throughput estimation. The downloader tries to (and has since forever) estimate how many blocks, receipts, states, etc. a peer can deliver in a given amount of time, and uses that to scale the size of data retrievals to the target RTT. This works well in low latency environments (e.g. 10 blocks in 1 sec => 20 blocks in 2 secs), but performs poorly when latency comes into play (e.g. with a 250ms latency, 10 blocks in 1 sec => 20 blocks in 1.5 secs). In high latency environments the algorithm will constantly under-estimate the true throughput and retrieve less, wasting time because of the latency, not because of actual throughput. To solve this, the PR makes a very subtle change to the calculation of download task sizes: instead of using `RTT / throughput`, we actually request `+1` data item beyond what the correct estimate would be. If the limiting factor was bandwidth, we'll get back the original throughput and be done with it. If however the limiting factor was latency, we'll get the same RTT, but at a much higher throughput.
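The task-sizing tweak can be sketched as follows, assuming throughput is measured in items per second so the baseline estimate is `targetRTT * throughput` (names are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// requestSize asks for one item beyond the throughput-based estimate.
// Bandwidth-bound peers return the estimated amount in the target RTT,
// leaving the estimate unchanged; latency-bound peers return the extra
// item within the same RTT, raising the measured throughput next round.
func requestSize(targetRTT, throughput float64) int {
	return int(math.Ceil(targetRTT*throughput)) + 1 // the "+1" probe
}

func main() {
	// A peer estimated at 10 items/s with a 1s target RTT: we probe
	// with one extra item on top of the 10 the estimate would suggest.
	fmt.Println(requestSize(1.0, 10.0))
}
```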