NAT Connectivity and Discoverability Issues #2509

Open
whyrusleeping opened this issue Mar 28, 2016 · 25 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@whyrusleeping
Member

People have been noticing issues with connectivity through NATs lately. Let's use this issue to track those issues and collect debugging information, tips, and tricks.

@whyrusleeping
Member Author

Some tips I've posted in a different issue before:

First, note down the peer IDs of all the nodes involved (run ipfs id).

To check which peers a given node is connected to, run ipfs swarm peers and look for the peer IDs at the end of the addresses for the ones you're interested in.

To check connectivity to a given node, I normally start from an ipfs node that I know has good connectivity (normally my VPS) and run ipfs dht findpeer <PEERID> for the peer you're investigating. This should list all the addresses that the peer is advertising. If the public address is in that list (and you aren't already connected to them), you can run ipfs swarm connect <ADDR>, where ADDR is the entire /ip4/...../ipfs/QmPeerID multiaddr.

If you can successfully connect a node to the node with the data, you should be able to run an ipfs get to grab the data you're interested in.

If you connect and aren't able to get the data, I would check ipfs dht findprovs <CONTENT HASH> and see if the network returns any records indicating who has that content. If the peer that has the data doesn't show up there, then something interesting is wrong (the data was likely added while the node was not connected to the DHT). In that case, I would try re-adding the data on the node that already has it (this triggers a rebroadcast of the provider records). After that completes, wait a little bit (for the records to propagate) and try running the ipfs get again from the other (non-data-holding) node.

If you can't make a connection from an outside node to your node with the data, the next thing I would try is making a connection from the data node out to other peers, then try fetching the data on those other peers. If that works, then the issue lies entirely with NAT traversal not working. ipfs does require some amount of port forwarding to work on NAT'ed networks (whether manual forwarding, NAT-PMP, or UPnP).
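
Putting those steps together, a debugging session might look something like this (a sketch only; the peer ID, multiaddr, and content hash below are placeholders):

```sh
# On each node involved, note down its peer ID.
ipfs id

# From a well-connected node, look up the addresses the problem peer is advertising.
ipfs dht findpeer QmYourPeerID

# If a public address shows up, try dialing it directly.
ipfs swarm connect /ip4/203.0.113.7/tcp/4001/ipfs/QmYourPeerID

# Confirm the connection, then try to fetch the content.
ipfs swarm peers | grep QmYourPeerID
ipfs get QmYourContentHash

# If the fetch still hangs, ask the DHT who it thinks has the content.
ipfs dht findprovs QmYourContentHash
```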

@whyrusleeping
Member Author

I have thought about designing a NAT test lab; notes here: https://gist.github.com/whyrusleeping/a0ab8df68d1020df32c6

@slothbag

I keep getting "too many open files" errors from my ipfs daemon... not sure if this is related.

@whyrusleeping
Member Author

@slothbag hrm... getting the 'too many open files' error will definitely cause issues with DHT connectivity.

@whyrusleeping
Member Author

An IRC user noted issues after seeing the mDNS 'failed to bind unicast' error. There's likely some correlation here.

@whyrusleeping
Member Author

whyrusleeping commented Mar 29, 2016

The issue appears to be a file descriptor leak in the utp codebase (thanks for the tip, @slothbag, it really helped!)

A temporary workaround (while I'm working on an official fix) is to add a utp swarm address to your swarm address config.

In ~/.ipfs/config (or $IPFS_PATH/config) locate the Addresses.Swarm list and add something like "/ip4/0.0.0.0/udp/4002/utp" to it.
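
If it's easier than editing the file by hand, the same change can be made with the ipfs config command - a sketch only, assuming your Swarm list otherwise still contains the defaults:

```sh
# Sketch: set the Swarm list to the default TCP listeners plus a utp listener.
# Adjust the entries to match whatever is already in your config.
ipfs config --json Addresses.Swarm \
  '["/ip4/0.0.0.0/tcp/4001", "/ip6/::/tcp/4001", "/ip4/0.0.0.0/udp/4002/utp"]'
```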

After that value is set, restart your daemon and things should be better. If you continue to experience the same problems, please let me know ASAP.

UPDATE: the utp code is disabled by default in recent versions of go-ipfs. This suggestion is no longer valid.

@guruvan

guruvan commented Mar 29, 2016

So, picking up from the other thread where this started, a couple of quick notes:

  • one thing that is definitely happening: go-ipfs is having difficulty determining the correct interface addresses on my AWS machines - it is very inconsistent about how it picks up the AWS "public" IPv4 addresses on this interface.

DHT:

  • so far it looks like a problem with finding the correct route to the data from the "client" side; the ipfs add works fine AFAICT, but ipfs get from another node lags out.
    • it seems to be an issue when I do ipfs get or ipfs pin, but NOT when I simply curl the same data from the gateway interface

I was able to retrieve a 2.04GB data set (1.4GB file, 700MB file, a few others) with the following procedure:

  1. remove all bootstrap swarm hosts
  2. enter in only the hosts I KNOW possess the data (i.e. the host that ADDED it)
  3. start ipfs
  4. run ipfs get
  5. wait for it to hang
  6. kill ipfs
  7. restart
  8. ipfs get
  9. we get to the spot where it hung previously
  10. PAGES of tcp errors (from the WRONG IP addresses!)
  11. the get resumes, and gets some more data
  12. ipfs get hangs
  13. kill ipfs and repeat the restart procedure

On the host running ipfs get, I see dial attempts to bad addresses like:
- 127.0.0.1
- docker bridge interfaces
- other known "bad" IP addresses

Some of these addresses are known to be from my own hosts; some clearly are not.
dht findprovs shows bad dial attempts trying to reach peer IDs that are NOT mine, but that have a bad address - usually localhost.
Without changing the swarm bootstrap, each restart would hang at the same point and never continue.

I'll have a little more time tomorrow to investigate further & maybe get some packet traces.

@mitar

mitar commented Mar 29, 2016

Is this a regression? Maybe worth going back through versions to see when it started?

@slothbag

I did the utp config change and it appears to have fixed the issue... nice find!

@whyrusleeping
Member Author

Fixes to the utp lib have been merged into master, so pull the latest down and run make install (there are new gx deps).
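
For anyone unsure of the exact steps, something along these lines should work, assuming a standard go-ipfs source checkout under your GOPATH:

```sh
# Sketch: update an existing go-ipfs checkout and reinstall.
cd "$GOPATH/src/github.com/ipfs/go-ipfs"   # assumes the usual GOPATH layout
git pull origin master
make install                               # pulls the updated gx deps and installs the binary
```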

Please let me know how things go.

@guruvan

guruvan commented Mar 31, 2016

Added the utp change, and added the port to my security groups.
This node was set up with a single known peer of my own as its only bootstrap peer (I didn't notice this at first, or I'd have reset it to the default).

  • running in docker with --net host (on AWS)
  • hosts are rancheros - 2 docker daemons present
  • these examples are hosts in an AWS VPC, on a publicly exposed network, not subject to my NAT
  • curling a file the node doesn't have yet (my own production push to ipfs) seems "better" - still stalling out, but I've not reconfigured all my nodes with the utp fix yet
    • 4 stops & restarts to get a 700MB file

Restarting this same node and running ipfs get, it stalled out immediately :)
Restarting this same node with the bootstrap peers reset to the default + 1 known node of my own, ipfs get stalls repeatedly (on what appear to be problem blocks).

With repeated restarts, and slightly different configurations for API, Gateway, and Swarm (but always including the appropriate utp line), I got many different results regarding whether the AWS PUBLIC_IPV4 address was included or not.

  • at one point I saw this come up in the swarm addresses output from the daemon as it came up:
  • /ip4/PUBLIC_IPV4/tcp/14527
    ?? no idea where it could have gotten this port number
    updated as I test
  • I noted this while running findpeer from the "fresher" node (below) to look up the more problematic node above - the one that ran out of space and was reset, as described below:

[rancher@rancher ~]$ docker exec r-ipfs_gw_1 ipfs dht findpeer QmYB8t7H2Z1xrwZ6fxAhfUg3UTFPGyq5Sg2WUkJZdkChZe
00:23:03.545: <peer.ID QmYB8t> /ip4/private_ipv4/tcp/4001 **/ip4/public_ipv4/tcp/14528** /ip4/public_ipv4/tcp/4001 /ip4/127.0.0.1/udp/4002/utp /ip4/127.0.0.1/tcp/4001 /ip4/system-docker/udp/4002/utp /ip4/system-docker/tcp/4001 /ip4/user-docker/udp/4002/utp /ip4/user-docker/tcp/4001 /ip4/private_ipv4/udp/4002/utp

  • wherever it is getting the port number noted above looks very troublesome to me - if that's expected to be an inbound-capable port....
  • really it looks to me as if I should be able to blacklist local IP addresses - i.e. the docker addresses, or any other addresses I'd prefer it not to listen on or publish - no other node should ever talk to me on my docker addresses, localhost, etc. If I could simply restrict this to, at most, my public_ipv4 and private_ipv4, it seems like that would work better (without knowing too much about the internals of ipfs) - see the sketch just below.
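
For what it's worth, here is a rough sketch of how that kind of blacklist could be expressed, assuming a go-ipfs version that supports the Addresses.NoAnnounce and Swarm.AddrFilters config options; the CIDR ranges are just examples for loopback and a default docker bridge:

```sh
# Sketch only: stop announcing loopback/docker-bridge style addresses,
# and skip dialing out to them as well. Adjust the ranges to your network.
ipfs config --json Addresses.NoAnnounce \
  '["/ip4/127.0.0.0/ipcidr/8", "/ip4/172.17.0.0/ipcidr/16"]'
ipfs config --json Swarm.AddrFilters \
  '["/ip4/127.0.0.0/ipcidr/8", "/ip4/172.17.0.0/ipcidr/16"]'
```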

I finally ran this machine out of space (I'm presuming this is related to docker's handling of volumes, rather than ipfs's handling of data) - and blew up the ipfs dir. :)

Starting with a fresh config on the same host, and updating my other nodes to include a utp swarm address, I ran ipfs get on this data and let it run while I moved on to a "fresher" node (fresh IP addresses, fresh config file, no data downloaded):

  • one quick test without the utp change seems to have stalled out
  • a restart with the utp change was instantaneously better
  • this node was able to download almost the entire 2GB dataset in one try

There's no perceptible pattern to the log messages when the get operation appears to "stall out", other than what I've noted above.

  • by stall out, I really mean wait several minutes (up to 15-20) for the get operation to show any progress
  • as I've progressed with the testing, I've added more specific timeouts (up to 30min)

I'll be adding more nodes shortly, all with fresh IP addresses. I'll whip up an updated docker image in the morning from master.
@whyrusleeping does my new image need to add the utp line to the config, or is this also updated in master?

@slothbag

slothbag commented Apr 1, 2016

Updated local and remote IPFS node with latest UTP fixes. Local node is behind NAT but has port forwarding for IPFS.

I don't seem to get the "too many open files" error anymore; however, discoverability is still not working. I have been trying to pin an object for an hour and it can't find it.

Problem still exists.

@whyrusleeping
Member Author

@guruvan the 'stalling out' is 'no data at all received for a long time', right? Not 'received some data and then hung'?

If that's the case, then it's an issue with discoverability/connectivity (which I think is the problem).

@slothbag in this case, can you discover valid addresses for the NAT'ed node from a node outside the NAT? (ipfs dht findpeer <peer id of NATed node>)

@slothbag

slothbag commented Apr 1, 2016

ipfs dht findpeer returns a list of IP addresses... a mixture of my LAN IP and my external IP, but the correct incoming port is on the LAN IP, and all the external IPs have incorrect ports.

@whyrusleeping
Member Author

@slothbag that's awesome information for me to have, thank you!

@em-ly added the kind/bug label on Aug 25, 2016
@mikhail-manuilov

Why can't ipfs use the same method for determining the external IP as, for example, parity's --nat extip: option?
https://github.com/paritytech/parity/wiki/Configuring-Parity
The current version simply ignores the IP in the "Addresses" array.

@ghost

ghost commented Oct 26, 2017

> Why can't ipfs use the same method for determining the external IP as, for example, parity's --nat extip: option?
> https://github.com/paritytech/parity/wiki/Configuring-Parity
> The current version simply ignores the IP in the "Addresses" array.

@mikhail-manuilov Do you mean you specified your external IP in Addresses.Swarm? That'd currently only work if there's a network interface on your local machine that has that IP address.

The Addresses.Swarm setting is only for the addresses to listen on - instead you can explicitly set addresses to be announced to the network in Addresses.Announce.
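
For example, a minimal sketch (the address below is a placeholder; substitute your actual external IP and swarm port):

```sh
# Sketch: announce an explicit external address to the network while still
# listening on the usual local addresses from Addresses.Swarm.
ipfs config --json Addresses.Announce '["/ip4/203.0.113.7/tcp/4001"]'
```

Restart the daemon after changing the config for it to take effect.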

@Kubuxu
Member

Kubuxu commented Oct 26, 2017

Here is another connectivity issue I have observed at my house.
I am behind a carrier-grade NAT and then my own local NAT.
After starting go-ipfs it connects to one bootstrap node and that is it. Somewhat by accident I found that disabling reuseport (IPFS_REUSEPORT=false) "fixes" it - fixes as in: now I can dial out; people still can't dial in to me (the NAT is too strong).

So if someone has problems with dialing out, disabling reuseport might help.
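
Concretely, that just means setting the environment variable before starting the daemon (a sketch, assuming the variable is spelled IPFS_REUSEPORT as above):

```sh
# Disable the reuseport behaviour for this daemon run.
IPFS_REUSEPORT=false ipfs daemon
```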

@whyrusleeping
Member Author

whyrusleeping commented Oct 26, 2017 via email

@Kubuxu
Member

Kubuxu commented Oct 26, 2017

I tried to use this tool just now; it seems quite broken.

@whyrusleeping
Member Author

@Kubuxu I fixed the issue you reported, thanks! Mind trying again?

@Kubuxu
Member

Kubuxu commented Nov 1, 2017

{
  "OutboundHTTP": {
    "OddPortConnection": "",
    "Port443Connection": ""
  },
  "Nat": {
    "Error": null,
    "MappedAddr": "/ip4/0.0.0.0/tcp/38044"
  },
  "HavePublicIP": false,
  "Response": {
    "SeenAddr": "/ip4/87.239.222.9/tcp/6812",
    "ConnectBackSuccess": false,
    "ConnectBackMsg": "dial attempt failed: \u003cpeer.ID Pah1CN\u003e --\u003e \u003cpeer.ID TwQRCH\u003e dial attempt failed: connection refused",
    "ConnectBackAddr": "",
    "TriedAddrs": [
      "/ip4/127.0.0.1/tcp/40941",
      "/ip4/0.0.0.0/tcp/38044",
      "/ip4/87.239.222.9/tcp/40941"
    ]
  },
  "Request": {
    "PeerID": "QmTwQRCHoF34HamrcfAQx9rti3AM127hKr6MGrzvBnxBoM",
    "SeenGateway": "",
    "PortMapped": "/ip4/0.0.0.0/tcp/38044",
    "ListenAddr": "/ip4/127.0.0.1/tcp/40941"
  },
  "TcpReuseportWorking": false
}

@whyrusleeping
Member Author

whyrusleeping commented Nov 1, 2017 via email

@Kubuxu
Member

Kubuxu commented Nov 1, 2017

Yup, the interesting thing is that with REUSEPORT I think I might be able to dial out only once from that port.

@Kubuxu
Member

Kubuxu commented Nov 1, 2017

I have the option to buy an external IP from my ISP, but I am deliberately not doing it until we can successfully recreate a setup like this elsewhere.
