eth/downloader: adaptive quality of service tuning #2630
Conversation
Updated: Mon Jun 6 11:16:04 UTC 2016
Force-pushed 5e9071c to e4a0d09
Force-pushed e8aedf0 to a808e84
Hey @JasonCoombs, I've pushed out a fresh attempt at the adaptive QoS tuning. It has quite different tuning logic compared to the old one, and I can get a stable data stream on basically every kind of network connectivity I threw at it. Would appreciate if you could try it with your satellite uplink too :) PS: I've also written up the PR description with the logic behind the changes, it might interest you :)
Force-pushed ba7a47d to 9b469a1
My sync on my HP Pavilion G7 running Windows 7 has pretty much been in slow motion since May 9th, and I have not been able to achieve a full sync on either Test or Main since then, although I have not experienced a block sync restart. The sync still manages to keep track of what block I am on for both Test and Main. Also, I have not been seeing the word "peers" next to the nodes icon on either Test or Main. I am currently at ~block 875K on Test and ~block 1.3M on Main, and the best sync I can get on Main is about 256 blocks in 30 secs, with the usual being 2 to 4 minutes.
^^^ I am usually wired in on Ethernet on a DSL line. Wallet 7.4 with Geth 1.4.5
@agnewpickens That is probably a HDD bottleneck, which will hopefully be addressed in the 1.5 release cycle. These PRs mostly address networking issues where latency and/or low bandwidth causes lost peers and a general failure to download stuff.
Thanks @karalabe, I was wondering if it might have been because of my ISP. I am in a really small town and have had to reset my router a lot because of iffy service where I am at.
fmt.Sprintf("hs %3.2f/s, ", p.headerThroughput)+
	fmt.Sprintf("receipts %3.2f/s, ", p.receiptThroughput)+
	fmt.Sprintf("states %3.2f/s, ", p.stateThroughput)+
	fmt.Sprintf("lacking %4d", len(p.lacking)),
I know this was here before, but I suggest that you change this to just one call to fmt.Sprintf. I don't think this makes it either more readable or cleaner (perhaps it makes it easier to read).
Perhaps we should have a better formatter? ;-)
An idea maybe would be to do:

return fmt.Sprintf("Peer %s [%s]", p.id, strings.Join([]string{
	fmt.Sprintf("hs %3.2f/s", p.headerThroughput),
	fmt.Sprintf( ... ),
}, ", "))
Join's indeed better, I'll do that :)
I've tried many different network settings with this PR and none of them stalled the downloader in any way. The PR's description should be moved to the wiki; it serves as a good starting point for anyone who'd like to know more about the design decisions.
@obscuren Fixed the formatter. PTAL |
@@ -585,13 +589,26 @@ func (ps *peerSet) RTTs() []time.Duration {
 		rtts = append(rtts, p.rtt)
 		p.lock.RUnlock()
 	}
-	// Sort into ascending order and return them
+	// Sort into ascending order and retrieve the median
+	for i := 0; i < len(rtts); i++ {
+		for j := i + 1; j < len(rtts); j++ {
+			if rtts[i] > rtts[j] {
+				rtts[i], rtts[j] = rtts[j], rtts[i]
+			}
+		}
+	}
If the slice is either []int or []float64 you can use sort.Ints or sort.Float64s. Making the type of rtt float64 would also reduce type conversions in other places.
Switched the `rtts` array to float64 and used `sort`, will push in the next commit. I'd rather keep `peer.rtt` as a time, as I don't think a single conversion once every few seconds matters much, and it's less error prone and more descriptive to have a meaningful `time.Duration` type for an RTT value.
In general I don't like the heavy use of atomic operations here, but I know that doing it another way is hard. Possibly you could do it without atomics by pushing the new value into the respective fetch loops via a channel. It's your choice whether you want to address this or not.
Force-pushed 33a91fb to 990e010
	default:
		close(d.quitCh)
	}
	d.quitLock.Unlock()
What is the difference between the cancel channel and the quit channel?
Answering my own question: quit is permanent, cancel isn't. But can't we use cancel to exit the QoS adaptation goroutine?
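The distinction can be sketched roughly as follows (illustrative names; the real downloader additionally guards `quitCh` with `quitLock`, hinted at by the quoted diff, which is omitted here for brevity): `quitCh` is closed exactly once when the downloader terminates for good, while `cancelCh` is replaced on every sync so an individual sync can be aborted without shutting the whole downloader down.

```go
package main

import "fmt"

type downloader struct {
	quitCh   chan struct{} // permanent: closed on terminate, never recreated
	cancelCh chan struct{} // per-sync: closed on cancel, recreated on the next sync
}

func newDownloader() *downloader {
	return &downloader{
		quitCh:   make(chan struct{}),
		cancelCh: make(chan struct{}),
	}
}

// cancel aborts only the current sync.
func (d *downloader) cancel() { close(d.cancelCh) }

// terminate shuts the downloader down permanently; the non-blocking
// select mirrors the quoted diff and makes repeated calls safe.
func (d *downloader) terminate() {
	select {
	case <-d.quitCh: // already closed
	default:
		close(d.quitCh)
	}
}

func main() {
	d := newDownloader()
	d.cancel()
	d.cancelCh = make(chan struct{}) // a new sync gets a fresh cancel channel
	d.terminate()
	d.terminate() // safe to call twice thanks to the select

	select {
	case <-d.quitCh:
		fmt.Println("quit closed")
	default:
		fmt.Println("quit open")
	}
}
```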
Travis now failing in another test.
Force-pushed a21becd to 05c626d
Current coverage is 56.93%

@@           develop     #2630    diff @@
==========================================
  Files          216       216
  Lines        24566     24638     +72
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
+ Hits         13964     14026     +62
- Misses       10601     10612     +11
+ Partials         1         0      -1
Force-pushed 05c626d to c10e1a3
👍
Force-pushed c10e1a3 to 88f174a
Currently Geth is tuned for low latency, high bandwidth network scenarios, which the sync mechanism enforces by dropping peers who do not reply fast enough to data queries. The rationale was that by homing in on faster peers (i.e. geographically closer to us), we can achieve much faster sync times.
Even though we are aggressive in peer selection, the downloader tries to be smart about the queries it issues by estimating the throughput of all connected peers and requesting data in proportion with the time it should take the peer to respond, attempting to stabilize both the network (by not assigning huge tasks that would time out) as well as the data stream (by having all peers respond with the same RTT).
This algorithm can ensure that independent of what quality peers we have, we will have a stable download stream... as long as we have enough bandwidth to cover all our peers as well as a negligible latency compared to the employed timeouts. However if I have low bandwidth myself, packet RTTs increase proportionally to the download tasks, which can easily hit the timeout thresholds, causing peers to be dropped. Even worse, if I have a high latency connection (EDGE, satellite), the timeouts will almost immediately drop all peers, making sync impossible.
The fundamental flaw is that our current codebase has fixed values as the target RTT in which peers are expected to respond to queries (1-1.5 seconds) and request TTLs after which peers are dropped (3-4 seconds); but for low bandwidth or high latency connections, these are just too low to work out. Simply raising them would allow users with bad connectivity to sync, but would at the same time slow down sync for well-connected users to the speed of their weakest peer, bottlenecking the entire algorithm. Instead, this PR attempts to dynamically adapt the target RTT and request TTL values in such a way that we can still home in on good peers, but do so based on our active peer set rather than a blind baseline of expected performance.
Note, hard RTT and TTL values were needed prior to concurrent headers because the master peer which we used to pull headers from was already a bottleneck and we couldn't afford to stall sync with arbitrary performance. With concurrent headers, even if the master peer is slow-ish we can still sync fast, so we can be much more lax with regard to RTT and TTL.
Tuning peers
To dynamically adapt the RTT and TTL values, we need to make various measurements on our peers' performance and aggregate them into these thresholds. However if we have a lot of peers (e.g. 10, 25, 100), some of them will be high quality peers, and some will be low quality ones. Aggregating all of them would result in too lax RTT/TTL values, delaying sync due to unnecessary bottlenecks.
Instead of adapting to all connected peers, we would like to home in on our true performance and keep only peers that are reasonably close performance-wise. To do this, we will tune the parameters based only on our best N peers. Peers beyond the best N will be required to stay within some performance threshold of our best peers, or be dropped during sync. This ensures that although we adapt our RTT/TTL values to our peers and network connectivity, we still saturate our own connection without introducing bottlenecks. If a better peer joins, it will automatically result in tighter thresholds, whereas if a top peer drops off, the thresholds will be automatically relaxed.
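A minimal sketch of the best-N selection, assuming per-peer RTT estimates in seconds (names and constants are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// tuningPeers is the N from the text: only the N fastest peers feed tuning.
const tuningPeers = 5

// tuningRTTs sorts the per-peer RTT estimates ascending and keeps only the
// fastest tuningPeers of them, so slow stragglers cannot relax the thresholds.
func tuningRTTs(rtts []float64) []float64 {
	sort.Float64s(rtts) // ascending: best (lowest RTT) peers first
	if len(rtts) > tuningPeers {
		rtts = rtts[:tuningPeers]
	}
	return rtts
}

func main() {
	// Seven peers: two fast, three medium, two very slow (seconds).
	rtts := []float64{0.4, 9.0, 0.5, 1.0, 1.2, 12.0, 1.1}
	fmt.Println(tuningRTTs(rtts)) // only the 5 fastest are used for tuning
}
```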
The PR chooses `N = 5`. The rationale is that by default geth has `--maxpeers=25`, however only half of that is reserved for outbound connections, so on a NATed/firewalled system the peer count is capped at around 12. With `N == 5` we retain enough peers to do meaningful tuning, but also leave enough slots for bad peers to die off and make room for better ones. Lastly, it's important to mention that the BitTorrent protocol also uses 4 (stable) + 1 (optimistic) peers for downloading data at any given time, as higher values cause suboptimal TCP congestion control. Although BitTorrent furthers this logic by choking the remaining peers, that's a complexity we need to decide whether to introduce or not (@fjl suggested we do; maybe in a followup PR we can do an N-tuning, M-optimistic, K-spare peer split).

RTT vs. TTL
As discussed, sync performance depends on tuning two parameters of the downloader: the target round trip time (RTT) that ensures we have a stable stream of data to process, and the request time to live (TTL) which ensures some baseline quality of peers. Thinking about it though, the two values are somewhat related: if we can sustain a stable RTT over multiple peers, then we can set TTL to be a constant multiple `c` of this RTT value. As long as RTT doesn't fibrillate, `TTL = c * RTT` is a good filter to weed out under-performing peers.

Fibrillation however can be an issue, especially during the beginning of sync. If we have only a couple peers (e.g. 2) that we tuned the RTT to, and suddenly experience a surge in connections (e.g. 4 new ones), then limited bandwidth availability would cause actual round trip times to grow proportionally with the additional download requests, not giving our estimated RTT (and hence the related TTL) an opportunity to adapt fast enough. Hitting the TTL limit would result in both new as well as potentially old, perfectly good peers being dropped. Reducing the peer count would drive the RTT back down, and reconnecting peers would fibrillate it back up.
The moral of the story is that `TTL = c * RTT` only makes sense if the estimated RTT is relatively stable and approximates the true RTT closely enough. As we cannot guarantee RTT stability during peer joins, we introduce a confidence factor into our estimation. The more peers join (proportionally to our existing ones), the less confident we can be in our estimated RTT value; and the less confident we are, the higher the TTL threshold needs to be to cater for splitting our bandwidth among more peers.

Aggregating the above rationale, we can set `TTL = c * RTT / conf`, where `conf ∈ (0, 1]`. If we assume a stable RTT (i.e. `conf == 1`), we can pick a meaningful number for the `c` scaling factor. It's really arbitrary, but `c = 3` gives enough wiggle room for occasional hiccups while not being lax enough to hinder performance too much. As for the confidence multiplier, the only idea we can use is that limited bandwidth is split between peers, so more peers could (not would) result in proportionally larger RTTs. Whenever a peer joins, we reduce `conf` proportionally to cater for bandwidth splits: `conf = conf * (1 - 1/(peers+1)) = conf * peers / (peers + 1)`.

E.g. if we had a stable RTT (`conf == 1`) with 2 peers, adding three new peers one by one would yield:

- `conf = 1 * 2 / 3 = 2/3`
- `conf = 2/3 * 3 / 4 = 2/4`
- `conf = 2/4 * 4 / 5 = 2/5`
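The join-time decay can be checked with a small sketch (function names are made up; the `0.1` floor and `c = 3` scaling factor come from the text, while the 2-second target RTT is an assumed example value):

```go
package main

import "fmt"

// joinPeer scales confidence by peers/(peers+1), where peers is the
// count before the join, and floors the result at the 0.1 cap.
func joinPeer(conf float64, peers int) float64 {
	conf *= float64(peers) / float64(peers+1)
	if conf < 0.1 {
		conf = 0.1 // cap preventing runaway TTL values
	}
	return conf
}

func main() {
	conf, peers := 1.0, 2 // stable RTT with 2 peers
	for i := 0; i < 3; i++ {
		conf = joinPeer(conf, peers)
		peers++
		fmt.Printf("peers=%d conf=%.4f\n", peers, conf)
	}
	// TTL derived from an assumed 2s target RTT with c = 3:
	rtt, c := 2.0, 3.0
	fmt.Printf("TTL=%.2fs\n", c*rtt/conf)
}
```

Note how the lowered confidence (2/5 after three joins) stretches the TTL well beyond `c * RTT`, buying the new peers time to prove themselves.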
Per the above, the confidence factor correctly follows the number of joining peers, even when processed one by one (+1+1+1 results in the expected value that +3 would produce). Note, we'll cap the minimum of the confidence factor at `0.1` to prevent too wild TTL values. The confidence factor is increased towards `1` after every RTT.

Adaptive RTT
Based on the above discussion, if we can estimate a reasonable target round-trip time and can keep it relatively up to date with peers' capacities but at the same time stable as peers join/leave and requests come/go, we can derive a good enough TTL value. The remaining challenge is how to select a good RTT to target for requests. The requirements are that it should be high enough to allow a meaningful batch of data to be retrieved, but low enough so that import can progress without stuttering too much.
We will start out with the maximum allowed RTT value if we have no connections (PR sets it to 20s). This ensures that irrelevant of our local connectivity, we can do some meaningful communication.
Further, for each peer we will maintain a very rough RTT value (initially these are maxed too), which we'll update whenever the said peer responds to one of our data requests (weighed update). Note, this RTT will vary a ton based on our network saturation, the peers network saturation, latencies, congestion, etc. This is fine, it's just meant as a means to compare the performance of different peers at the current workload (whatever that might be).
Using these constantly changing RTT values, every once in a while (more on this later) we select the N best peers to tune with (as mentioned above, 5 currently) and updated our target RTT to the median performance of these best peers (again, weighed update to smooth hiccups). With the target RTT updated, we can calculate a fresh TTL value too.
Converging towards the optimum
Adaptive algorithms have a tendency to converge to some stable state, which may or may not be the optimal solution (it depends on the exact scenario). In our case, if we have a single peer, two stable scenarios could be: a) download 256 headers in 2 seconds; or b) download 128 headers in 1 second. Both are perfectly valid in the long run, but as block processing is a streaming operation with a lot of steps to go through, it is usually more advisable to prefer shorter cycles over longer ones. To this end, we need to force the algorithm out of sub-optimal stable points.
One such scenario is the selection of a stable, but larger RTT (e.g. 5 sec vs. 2 sec). Having a larger than essential RTT means that streaming the blockchain is more bursty and less flowing; but it also means that we could end up keeping weaker peers that would have been weeded out by the smaller RTT. As long as the smaller RTT is equally stable, performance-wise it will always be preferable. To prevent locking in to a higher than needed RTT, whenever we fetch some data we will not use our estimated best RTT, but rather only 90% of it. If the smaller RTT was a bad guess (e.g. our latency is the culprit), we'll just estimate the original RTT the next time too and remain unchanged. However if our guessed RTT turns out to be valid, we've just shaved off a bit of our sync time. Repeating this will converge on the minimum RTT that is stable for our actual workload.
The second interesting scenario is throughput estimation. The downloader tries to (and has since forever) estimate how many blocks, receipts, states, etc. a peer can deliver in a given amount of time, and uses that to scale the size of data retrievals to the target RTT. This works well in low latency environments (e.g. 10 blocks in 1 sec => 20 blocks in 2 secs), but performs poorly when latency comes into play (e.g. with a 250ms latency, 10 blocks in 1 sec => 20 blocks in 1.5 secs). In high latency environments the algorithm will constantly under-estimate the true throughput and retrieve less, wasting time because of the latency, not because of actual throughput. To solve this, the PR makes a very subtle change to the calculation of download task sizes: instead of using `RTT / throughput`, we actually request `+1` data item beyond what the correct estimate would be. If the limiting factor was bandwidth, we'll get back the original throughput and be done with it. If however the limiting factor was latency, we'll get the same RTT, but at a much higher throughput.
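The task-sizing tweak can be sketched as follows, assuming throughput is measured in items per second so the baseline estimate is `targetRTT * throughput` (names are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// requestSize asks for one item beyond the throughput-based estimate.
// Bandwidth-bound peers return the estimated amount in the target RTT,
// leaving the estimate unchanged; latency-bound peers return the extra
// item within the same RTT, raising the measured throughput next round.
func requestSize(targetRTT, throughput float64) int {
	return int(math.Ceil(targetRTT*throughput)) + 1 // the "+1" probe
}

func main() {
	// A peer estimated at 10 items/s with a 1s target RTT: we probe
	// with one extra item on top of the 10 the estimate would suggest.
	fmt.Println(requestSize(1.0, 10.0))
}
```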