-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
race: simulatenous NewStream
blocks on one side despite successful connection
#2589
Comments
Just out of interest can you reproduce it when every host is given a unique and deterministic port, e.g. by using the loop count in newHost |
Just tried it in this branch: https://github.com/dennis-tra/newstream-block/tree/port No change - it still blocks sometimes. |
I'm having trouble reproducing this locally. Could you try using an older libp2p version? We made some changes to these code paths recently. cc @sukunrt |
By skipping connectivity checks we reduce the chances of simultaneously opening a stream that will block connection establishment. Context: libp2p/go-libp2p#2589
I tried it with v0.30.0, and it's also happening with that version. Haven't tried older versions yet. Note again, that I also cannot reproduce it on my Mac. I can only semi-reliably reproduce it on my linux box. |
Some racy behaviour with tcp simultaneous connect and reuseport with an Accept on the same socket as the reused port. In the event that the dial times out the sequence of tcp conn establishment packets is:
The connection from A to B is established and is received in the Accept loop after step 4. Here the failure case happens for addresses I'll debug more how mac behaviour defers here. This is on go-libp2p v0.26, before all the swarm changes. |
NewStream
blocks on one side despite successful connectionNewStream
blocks on one side despite successful connection
I wish I could tell you to use quic. But looks like there's a problem with simultaneous udp messages on linux :(
If I add a 10ms delay to one of the goroutines this error goes away. Tried this on two different linux machines. @marten-seemann any idea why this happens? The patch if you want to test this:
|
This seems to be a standard library bug. I opened golang/go#63322 with a minimal example. |
In the TCP case, is there anything we can do in the B->A case when B waits for SYN packets from A? I would have expected that B will stop waiting for SYNs if the connection from A->B succeeds. Btw, the Snippetpackage main
import (
"context"
"fmt"
"github.com/libp2p/go-libp2p"
"github.com/libp2p/go-libp2p/core/host"
"github.com/libp2p/go-libp2p/core/network"
"github.com/libp2p/go-libp2p/core/peerstore"
"github.com/stretchr/testify/require"
"sync"
"testing"
)
func newHost(t *testing.T) host.Host {
listenAddr := libp2p.ListenAddrStrings(fmt.Sprintf("/ip4/127.0.0.1/tcp/0"))
h, err := libp2p.New(listenAddr)
require.NoError(t, err)
t.Cleanup(func() {
if err = h.Close(); err != nil {
t.Logf("unexpected error when closing host: %s", err)
}
})
return h
}
func TestBlock(t *testing.T) {
for i := 0; i < 100; i++ {
t.Run(fmt.Sprintf("[%d]", i), func(t *testing.T) {
h1 := newHost(t)
h2 := newHost(t)
h1.Peerstore().AddAddrs(h2.ID(), h2.Addrs(), peerstore.PermanentAddrTTL)
h2.Peerstore().AddAddrs(h1.ID(), h1.Addrs(), peerstore.PermanentAddrTTL)
h1.SetStreamHandler("/any/protocol", func(stream network.Stream) {})
h2.SetStreamHandler("/any/protocol", func(stream network.Stream) {})
var wg sync.WaitGroup
wg.Add(2)
go func() {
defer wg.Done()
_, err := h1.NewStream(context.Background(), h2.ID(), "/any/protocol")
require.NoError(t, err)
}()
go func() {
defer wg.Done()
_, err := h2.NewStream(context.Background(), h1.ID(), "/any/protocol")
require.NoError(t, err)
}()
wg.Wait()
})
}
} |
Same. I'm not sure what's happening in the kernel here.
The call succeeds because of this line: https://github.com/libp2p/go-libp2p/blob/master/p2p/net/swarm/dial_worker.go#L376 Here the dial_worker_loop before exiting checks if somehow a connection has appeared, which in this case it has. This check exists because a concurrent dial might add different addresses to the peerstore and trigger the dial, this helps check if any of the newer addresses have succeeded. For the solution, we need to ensure that a dial call also considers any incoming connections that have been established in the meantime. I'd like to implement some variant of this logic: #1603 (comment)
either this or to setup a notification of incoming connections here: https://github.com/libp2p/go-libp2p/blob/master/p2p/net/swarm/dial_sync.go#L61
|
Simultaneously calling
NewStream
blocks on one side until the dial timeout was reached. Minimum repro example:I cannot reproduce it on my Mac, but on my Linux server. I need to run the test a few times because it's only sometimes happening. This is an example output:
Here's a repository with the minimum repro example: https://github.com/dennis-tra/newstream-block
If you go back a few commits, you can see my debugging attempts.
In the blocking case I saw that both hosts are listening on a TCP port but one host stalls at
maDial
. At the same time that stalled host accepts the dial attempt from the other host. So, there is a successful connection, but theNewStream
/Connect
call blocks until the dial attempt has timed out. When the dial has timed out,NewStream
seems to find the successful connection that the other host established, open a stream, and return it as everything worked fine.The text was updated successfully, but these errors were encountered: