
Gateway unable to find content from swarm connected node #6628

Closed
obo20 opened this issue Sep 5, 2019 · 17 comments
Labels
kind/bug (A bug in existing code, including security flaws), topic/gateway (Topic gateway)

Comments


obo20 commented Sep 5, 2019

I have a gateway node that was unable to retrieve content from a host node it was directly swarm connected to. The issue only resolved after restarting the host node; restarting the gateway node did not help.

Here are the details and bug-report logs.

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmTL1wEFDQVy26GcEgnTSzFfvB3bVh5ojJBLQL9VbiXHdT

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmV1jX8t4eRbstqkmCXdkHbjsFuDUXRDxUc2sf3vS3f8nP
QmWFRnNMQpckpiqnYFyXeQPELy6C3Sphz8CUQkfqAf72nc
Qmb7ACGzY2amgEqCPBu91aNrzpEgCHKJAiKxeA5GY7Dp6V
Qmecdo1tqDexJbKQRDsCZ4DsihoYvGXStjwFkuQm3sBuVC

Actual debugging logs on each of the nodes:
logs1.zip
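For reference, the wantlists above were gathered with commands along these lines (the peer ID is a placeholder for the gateway's actual ID):

# On the gateway: blocks the gateway is currently asking the network for
ipfs bitswap wantlist

# On the host: blocks the host believes the gateway wants
ipfs bitswap wantlist --peer=<gatewayID>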

@obo20 obo20 added the kind/bug A bug in existing code (including security flaws) label Sep 5, 2019
@lanzafame lanzafame added the topic/gateway Topic gateway label Sep 10, 2019

obo20 commented Sep 25, 2019

I hit this problem again today. Here are details and logs to compare with the previous ones:

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmTob3x6f2Ey7VLV2JfHg2d6xVf7CtkwvHgjjFPcyfjyEv

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmTT4XKzH7uyAP4ShGxnf5HgoxYcsDuZZj2V7MP2VyNfXV
QmVu1LpAkA1yz3tA3ENcvJimLNmfr6883Ffqotwgv7aH7S
QmWzLqWBkSrGfxNyYBzWMfte6mN9YSNfWGTFoTVRRM5svc

Actual debugging logs on each of the nodes:
logs2.zip


obo20 commented Oct 22, 2019

These are logs from a little ways back. Unfortunately I can't remember the bitswap wantlists, so these may not be as helpful. I can confirm that we're still hitting this frequently though. I'll be sure to update with new logs as soon as we run into this again.

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

logs3.zip


obo20 commented Oct 24, 2019

Another instance we just hit:

This time I was attempting to pull the CID: QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY

Actual Logs of the event:
logs4.zip

Interestingly enough, the bitswap wantlist matches up this time. However, the CID is not the root CID that I was actually searching for; instead, the transfer seems to be stuck on the child CID QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY (the file was wrapped in a directory to preserve the name).
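One way to double-check which child block a transfer is stuck on, assuming the root block is already local, is something like this:

# List the direct child links of the root CID (run on the gateway)
ipfs refs QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A

# Any CID printed here that also appears in `ipfs bitswap wantlist`
# is the block the transfer is waiting on.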

Stebalien (Member) commented:

Ok, this means that the issue is clearly on the host node.

  • We rebroadcast wants every few minutes, so missing wants should get fixed.
  • If the host knows that the gateway wants something but isn't sending it, something is very wrong.

@lanzafame when adding better logging to bitswap, could you keep this issue in mind? Specifically, it would be awesome to be able to log the lifetime of a want: received want for X from peer Y, sent want for X to peer Y, added X to the wantlist, and so on.

Stebalien (Member) commented:

I've looked through the latest logs and nothing looks obviously stuck. However, I have two observations:

  • All of our message sending queues are constantly running. However, I'm not seeing a ton of CPU usage. I'm wondering if we keep handling the rebroadcast timer and never get around to actually sending messages.
  • We have a lot of inbound streams stuck negotiating the protocol. They're sitting there reading for hours. This is odd because peers shouldn't open streams unless they have something to send us. If they send us something, we should negotiate the protocol.

Stebalien (Member) commented:

According to @obo20, the machine is using <10% CPU, so we shouldn't be spinning in the message queue.

Stebalien (Member) commented:

I'm also pretty sure that we aren't blocking on the "task worker" (sending blocks). I was concerned that one of the task workers could have been blocked sending the block in question.


Stebalien commented Oct 25, 2019

Known facts:

  1. When we get into the stuck state, the gateway can't fetch anything from the host.
  2. Restarting the gateway doesn't help but restarting the host does.
  3. Other nodes can (we think) fetch content from the host.
  4. When another node fetches content from the host, the gateway can find the content from the infura node.


obo20 commented Oct 25, 2019

Known facts:

  1. When we get into the stuck state, the gateway can't fetch anything from the host.
  2. Restarting the gateway doesn't help but restarting the host does.
  3. Other nodes can (we think) fetch content from the host.
  4. When another node fetches content from the host, the gateway can find the content from the infura node.

For 3), I'm 99% sure this statement is accurate. I've uploaded new content and still gotten this issue.

To clarify 4): when this bug hits, a different gateway (such as Infura) is able to fetch the content that our gateway cannot (even though our gateway is directly connected to all of our host nodes). After this happens, our gateway is then able to retrieve the content when it is requested, presumably by locating it on the non-Pinata gateway that now has the content.

Stebalien (Member) commented:

@obo20 Have you ever tried running ipfs swarm disconnect GatewayPID on the host when stuck in this state? I'm wondering if the host is trying to use some kind of dead connection.
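For example, something along these lines on the host, where the multiaddr is a placeholder for the gateway's actual address:

# Drop the existing connection to the gateway
ipfs swarm disconnect /ip4/203.0.113.10/tcp/4001/ipfs/<gatewayPeerID>

# Re-establish it
ipfs swarm connect /ip4/203.0.113.10/tcp/4001/ipfs/<gatewayPeerID>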


obo20 commented Oct 25, 2019

@Stebalien I have not. I can test that as soon as I run into this again.

Stebalien (Member) commented:

Thanks! Before you do that, could you run the following on both nodes (after installing jq):

ipfs swarm peers -v --enc=json | jq '.Peers[] | select(.Peer=="ID_OF_OTHER_PEER")'

Where ID_OF_OTHER_PEER is the ID of the other peer (host/gateway).

That will tell me whether or not they appear to be connected, the known latency, which streams are open, and so on.


obo20 commented Nov 18, 2019

To make things more stable for our users, we've started automating the following process every 30 minutes (a rough sketch of the check is included below):

  1. add a small text file, consisting of a random string of characters, to IPFS on one of our host nodes
  2. check our gateway to see if it's retrievable within 1 minute (they're swarm connected, so failure here indicates we've hit the error)

If we fail the check:

  3. run the collect-profiles script on each node
  4. restart the nodes and reconnect them to each other
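A rough sketch of that check (the gateway URL and file path are placeholders, not our exact setup):

# runs from cron every 30 minutes
RAND=$(head -c 16 /dev/urandom | xxd -p)
echo "$RAND" > /tmp/healthcheck.txt

# 1) add the file on one of the host nodes
CID=$(ipfs add -q /tmp/healthcheck.txt)

# 2) try to fetch it through the gateway within 1 minute
if ! curl -sf --max-time 60 "https://gateway.example.com/ipfs/$CID" > /dev/null; then
    echo "health check failed for $CID"
    # 3) run the collect-profiles script on each node
    # 4) restart the nodes and reconnect them to each other
fi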

Over the past week we've collected around 70 instances of this happening. Hopefully these logs help a bit. If they need to be reformatted or if there's any logging procedure that needs to be changed let me know.

https://drive.google.com/drive/u/0/folders/1UK0A0uQjE8U0mAAouejJBJuaqsiqAN6W


obo20 commented Dec 4, 2019

@Stebalien you may find this interesting:

I recently spun up a new gateway for a customer, and in my initial stages of testing I noticed I wasn't receiving content from the same host node that's been running into these issues with our main gateway, even though I had directly connected the new gateway to the host node.

My guess is that the node was in the error state and (luckily / unluckily) our automated systems hadn't caught it yet. After I rebooted the host node, the problem went away for both gateways.

Based on this, the issue seems to be purely on the host node and not on any of the gateway nodes. For whatever reason, the host gets into a state where it's unable to serve content correctly.

Stebalien (Member) commented:

When debugging this, we found that there appeared to be two gateway nodes with the same peer ID. I'm leaving this open as we haven't confirmed that that's the cause of this bug, but it would explain these symptoms (we'd group the connections with the same ID together and send blocks to the wrong peer).
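As a sanity check, one quick way to look for duplicate peer IDs across a set of nodes would be something like this (hostnames are placeholders):

# Print each node's peer ID, one per line, and report any duplicates
for h in host1 host2 gateway1 gateway2; do
    echo "$(ssh "$h" "ipfs id -f='<id>'")"
done | sort | uniq -d
# Any ID printed by `uniq -d` is shared by more than one node.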

Stebalien (Member) commented:

Closing as this does, indeed, appear to have been fixed by removing the duplicate nodes from the network.
