Actually we could just move the embedded conns and addrs in kademlia to a capabilityIndex named, e.g., "default"? If it makes it easier, I mean.
Jesus, never mind, that's exactly what you did.
Hmm, I like "default" more than "global".
I like this approach.
I am curious if pubsub ordering is important. This test shows that the ordering of published messages is not preserved on the subscriber's end:

func TestMessagesOrder(t *testing.T) {
	ps := pubsubchannel.New()
	defer ps.Close()
	s := ps.Subscribe()
	for i := 0; i < 1000; i++ {
		ps.Publish(i)
	}
	c := s.ReceiveChannel()
	var n int
	timeout := time.After(2 * time.Second)
loop:
	for {
		select {
		case i, ok := <-c:
			// check ok before comparing: after the channel is closed,
			// i is the zero value and would trigger a spurious error
			if !ok {
				break loop
			}
			if i != n {
				t.Errorf("got %v, want %v", i, n)
			}
			n++
		case <-timeout:
			break loop
		}
	}
}

By using two independent subscriptions for adding and removing peers in stats, without preserving order, is it possible that in a highly dynamic environment there is a data race? That first a message about a peer being removed is received, resulting in a removal from ResourceUseStats which silently passes, and then a message arrives from the other channel about that same, already removed peer being added, adding it back to ResourceUseStats. Is that possible and can it create problems? (previously posted on the wrong PR)
I have looked more into the problem with the order of removals in ResourceUseStats. It appears that it can be quite drastic. This test just adds and removes peers one by one; one subtest has a delay between operations, and one does not. The subtest without delays fails with an unexpected leftover length of 5-9 out of 10 peers. A higher number of peers produces proportional results. It can be easily seen that the problem is in ResourceUseStats.RemoveResource when it removes a not-yet-added key.

func TestResourceUseStats(t *testing.T) {
	testResourceUseStats := func(t *testing.T, delay time.Duration) {
		k := NewKademlia(make([]byte, 32), NewKadParams())
		lb := NewKademliaLoadBalancer(k, false)
		for i := uint64(0); i < 10; i++ {
			a := make([]byte, 8)
			binary.BigEndian.PutUint64(a, i)
			p := NewPeer(&BzzPeer{BzzAddr: NewBzzAddr(a, a)}, nil)
			k.On(p)
			if delay > 0 {
				time.Sleep(delay)
			}
			k.Off(p)
			if delay > 0 {
				time.Sleep(delay)
			}
		}
		lb.Stop()
		count := lb.resourceUseStats.Len()
		if count > 0 {
			t.Errorf("got resourceUseStats %v, want 0, uses: %v", count, lb.resourceUseStats.DumpAllUses())
		}
	}
	t.Run("no delay", func(t *testing.T) {
		testResourceUseStats(t, 0)
	})
	t.Run("1ms delay", func(t *testing.T) {
		testResourceUseStats(t, time.Millisecond)
	})
}
I do not think that we should have such data races. Message delivery time and order are not guaranteed by pubsub, and it is possible to create an invalid ResourceUseStats state at high load in production, depending on goroutine scheduling, which is not deterministic.
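One defensive option, independent of message ordering, would be to make removals of unknown keys fail loudly instead of silently passing, so ordering bugs surface immediately. A minimal sketch, assuming hypothetical ResourceUseStats internals (a uses map guarded by a mutex; the real fields may differ; requires "fmt" and "sync"):

func (s *ResourceUseStats) RemoveResource(key string) error {
	s.lock.Lock()
	defer s.lock.Unlock()
	if _, ok := s.uses[key]; !ok {
		// removing a key that was never added signals an ordering bug
		return fmt.Errorf("remove of unknown resource %q", key)
	}
	delete(s.uses, key)
	return nil
}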
You are right, I also exposed this from the start. I just implemented it and used it in pss. I didn't use this package in retrieval because I don't have the knowledge to test that part, so if anyone could jump in and help or direct me a bit, that would be great.
Cool. Thanks for letting me know. Let's try to get this merged and I'll try to jump on this |
…a subscription inbox before blocking. This will ensure ordered message delivery if the inbox is never full
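A hedged sketch of the buffered-inbox idea from the commit above, with illustrative names rather than the PR's actual types: Publish writes synchronously into each subscription's buffered inbox, so the relative order of messages is fixed at publish time, and Go's channel FIFO semantics preserve that order for the single reader on the other end (requires "sync"):

type pubSub struct {
	mu   sync.Mutex
	subs []chan interface{} // one buffered inbox per subscription
}

func (ps *pubSub) Subscribe(buf int) <-chan interface{} {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	c := make(chan interface{}, buf)
	ps.subs = append(ps.subs, c)
	return c
}

func (ps *pubSub) Publish(msg interface{}) {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	for _, c := range ps.subs {
		c <- msg // blocks only when an inbox is full; order is preserved
	}
}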
Although I think that we don't need to care about message ordering, I have reimplemented both.
Thanks @kortatu. I am curious why you think that the ordering is not important if it creates a data race?
log.Warn("listenOnOffPeers received message is not a on/off peer signal!") | ||
continue | ||
} | ||
//log.Warn("OnOff peer", "key", signal.peer.Key(), "on", signal.on) |
Remove commented code.
// publishing goroutine. It closes the signal channel whenever it receives the quitC signal
go func(sub *Subscription) {
	for {
		select {
This select should have case <-psc.quitC: with return.
case msg := <-sub.inbox:
	log.Debug("Retrieved inbox message", "msg", msg)
	select {
	case <-psc.quitC:
Should this case return and terminate the goroutine?
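Combining both suggestions, the publishing goroutine could look roughly like this sketch; psc.quitC and sub.inbox follow the names visible in the diff, while sub.signal stands in for whatever channel the goroutine actually delivers to, and the surrounding types are assumed:

go func(sub *Subscription) {
	defer close(sub.signal) // close the signal channel on shutdown
	for {
		select {
		case <-psc.quitC:
			return // terminate instead of leaking the goroutine
		case msg := <-sub.inbox:
			select {
			case <-psc.quitC:
				return
			case sub.signal <- msg:
			}
		}
	}
}(sub)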
}
// we need to sleep to allow all messages to be received by lb
time.Sleep(100 * time.Millisecond)
If this is causing flakiness on Travis, we should replace it with a retry at least.
Repeated sleep in test.
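Such a retry could look like this hedged sketch; waitFor is a hypothetical helper, not part of the PR (requires "testing" and "time"):

// waitFor polls cond until it returns true or the timeout elapses,
// failing the test in the latter case.
func waitFor(t *testing.T, timeout time.Duration, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return // condition met, no fixed sleep needed
		}
		time.Sleep(5 * time.Millisecond)
	}
	t.Fatalf("condition not met within %v", timeout)
}

The test above could then poll, e.g. waitFor(t, time.Second, func() bool { return lb.resourceUseStats.Len() == expected }) with a suitable expected value, instead of sleeping a fixed 100ms.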
Thank you @kortatu for your hard work and addressing my comments. I appreciate it. LGTM
@kortatu please fix the panic that shows up in appveyor:
The idea is to honor SWIP ethersphere/SWIPs#28. We will distribute load among peers by returning a different peer to the kademlia API users each time one is needed.
Load balancing only makes sense when the resources to choose from are similar for the requester.
We can't balance among all peers in every request. Instead we must select, in each case, the peers that are equidistant to the target address, and sort them by least used first.
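As an illustration of that selection rule, here is a minimal sketch, under assumed types, of picking the least-used peer among candidates that are equidistant from the target; useCount is a hypothetical lookup, not the PR's API (requires "sort"):

// pickLeastUsed orders equidistant candidates by use count, least used
// first, and returns the winner.
func pickLeastUsed(candidates []*Peer, useCount func(*Peer) int) *Peer {
	if len(candidates) == 0 {
		return nil
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return useCount(candidates[i]) < useCount(candidates[j])
	})
	return candidates[0]
}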
To account for the number of uses we have created KademliaLoadBalancer, which proxies calls to Kademlia, sorting the peers by use.
New peers can't be initialized with 0 uses, because in that case, if the system has been running for a long time, that peer would be overused until it reaches the level of the rest of the peers. Instead we take the number of uses of the least used peer in the same bin as the new peer.
There could be other alternatives (such as resetting everyone to 0 on each new peer addition, selecting the least used peer globally, etc.).
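A hedged sketch of that initialization rule, with illustrative names (binPeers and uses are assumptions, not the PR's fields):

// initialUses returns the starting use count for a freshly added peer:
// the minimum use count among existing peers in the same bin, or 0 if
// the bin was empty.
func initialUses(binPeers []string, uses map[string]int) int {
	min, first := 0, true
	for _, p := range binPeers {
		if u, ok := uses[p]; ok && (first || u < min) {
			min, first = u, false
		}
	}
	return min
}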
For the moment, the only part of the system using this Load Balancer is PSS forwarding as a proof of concept.
fixes #1757