
Gateway unable to find content from swarm connected node #6628

Closed
obo20 opened this issue Sep 5, 2019 · 17 comments
Labels
kind/bug (A bug in existing code, including security flaws), topic/gateway (Topic gateway)

Comments


obo20 commented Sep 5, 2019

I have a gateway node that was unable to retrieve content from a host node it was directly swarm connected to. The issue only resolved after restarting the host node; restarting the gateway node did not help.

Here are the details and bug-report logs.

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmTL1wEFDQVy26GcEgnTSzFfvB3bVh5ojJBLQL9VbiXHdT

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmV1jX8t4eRbstqkmCXdkHbjsFuDUXRDxUc2sf3vS3f8nP
QmWFRnNMQpckpiqnYFyXeQPELy6C3Sphz8CUQkfqAf72nc
Qmb7ACGzY2amgEqCPBu91aNrzpEgCHKJAiKxeA5GY7Dp6V
Qmecdo1tqDexJbKQRDsCZ4DsihoYvGXStjwFkuQm3sBuVC

Actual debugging logs on each of the nodes:
logs1.zip
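For reference, the wantlists above were gathered with commands along these lines (the peer ID is a placeholder for the gateway's actual ID):

# On the gateway: blocks the gateway is currently asking the network for
ipfs bitswap wantlist

# On the host: blocks the host believes the gateway wants
ipfs bitswap wantlist --peer=<gatewayID>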

@obo20 obo20 added the kind/bug A bug in existing code (including security flaws) label Sep 5, 2019
@lanzafame lanzafame added the topic/gateway Topic gateway label Sep 10, 2019

obo20 commented Sep 25, 2019

I hit this problem again today. Here are details and logs to compare with the previous ones:

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmTob3x6f2Ey7VLV2JfHg2d6xVf7CtkwvHgjjFPcyfjyEv

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmTT4XKzH7uyAP4ShGxnf5HgoxYcsDuZZj2V7MP2VyNfXV
QmVu1LpAkA1yz3tA3ENcvJimLNmfr6883Ffqotwgv7aH7S
QmWzLqWBkSrGfxNyYBzWMfte6mN9YSNfWGTFoTVRRM5svc

Actual debugging logs on each of the nodes:
logs2.zip


obo20 commented Oct 22, 2019

These are logs from a little ways back. Unfortunately I can't remember the bitswap wantlists, so these may not be as helpful. I can confirm that we're still hitting this frequently though. I'll be sure to update with new logs as soon as we run into this again.

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

logs3.zip


obo20 commented Oct 24, 2019

Another instance we just hit:

This time I was attempting to pull the CID: QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A

Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6

Bitswap wantlist from gateway during attempt:
QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY

Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY

Actual Logs of the event:
logs4.zip

Interestingly enough, the bitswap wantlist matches up this time. However, the CID is not the root CID that I was actually searching for; instead, the transfer seems to be stuck on the child CID QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY (the file was wrapped in a directory to preserve the name).
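One way to double-check which child block a transfer is stuck on, assuming the root block is already local, is something like this:

# List the direct child links of the root CID (run on the gateway)
ipfs refs QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A

# Any CID printed here that also appears in `ipfs bitswap wantlist`
# is the block the transfer is waiting on.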

Stebalien (Member) commented:

Ok, this means that the issue is clearly on the host node.

  • We rebroadcast wants every few minutes, so missing wants should get fixed.
  • If the host knows that the gateway wants something but isn't sending it, something is very wrong.

@lanzafame when adding better logging to bitswap, could you keep this issue in mind? Specifically, it would be awesome to be able to log the lifetime of a want: received want for X from peer Y, sent want for X to peer Y, added X to the wantlist, and so on.

Stebalien (Member) commented:

I've looked through the latest logs and nothing looks obviously stuck. However, I have two observations:

  • All of our message sending queues are constantly running. However, I'm not seeing a ton of CPU usage. I'm wondering if we keep handling the rebroadcast timer and never get around to actually sending messages.
  • We have a lot of inbound streams stuck negotiating the protocol. They're sitting there reading for hours. This is odd because peers shouldn't open streams unless they have something to send us. If they send us something, we should negotiate the protocol.

Stebalien (Member) commented:

According to @obo20, the machine is using <10% CPU, so we shouldn't be spinning in the message queue.

Stebalien (Member) commented:

I'm also pretty sure that we aren't blocking on the "task worker" (sending blocks). I was concerned that one of the task workers could have been blocked sending the block in question.


Stebalien commented Oct 25, 2019

Known facts:

  1. When we get into the stuck state, the gateway can't fetch anything from the host.
  2. Restarting the gateway doesn't help but restarting the host does.
  3. Other nodes can (we think) fetch content from the host.
  4. When another node fetches content from the host, the gateway can find the content from the infura node.


obo20 commented Oct 25, 2019

Known facts:

  1. When we get into the stuck state, the gateway can't fetch anything from the host.
  2. Restarting the gateway doesn't help but restarting the host does.
  3. Other nodes can (we think) fetch content from the host.
  4. When another node fetches content from the host, the gateway can find the content from the infura node.

For 3), I'm 99% sure this statement is accurate. I've uploaded new content and still gotten this issue.

To clarify 4): when this bug hits, a different gateway (such as Infura) is able to fetch the content that our gateway cannot (even though our gateway is directly connected to all of our host nodes). After this happens, our gateway is then able to retrieve the content when it is requested, presumably by locating it on the non-Pinata gateway that now has the content.

Stebalien (Member) commented:

@obo20 Have you ever tried running ipfs swarm disconnect GatewayPID on the host when stuck in this state? I'm wondering if the host is trying to use some kind of dead connection.
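For example, something along these lines on the host, where the multiaddr is a placeholder for the gateway's actual address:

# Drop the existing connection to the gateway
ipfs swarm disconnect /ip4/203.0.113.10/tcp/4001/ipfs/<gatewayPeerID>

# Re-establish it
ipfs swarm connect /ip4/203.0.113.10/tcp/4001/ipfs/<gatewayPeerID>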


obo20 commented Oct 25, 2019

@Stebalien I have not. I can test that as soon as I run into this again.

Stebalien (Member) commented:

Thanks! Before you do that, could you run the following on both nodes (after installing jq):

ipfs swarm peers -v --enc=json | jq '.Peers[] | select(.Peer=="ID_OF_OTHER_PEER")'

Where ID_OF_OTHER_PEER is the ID of the other peer (host/gateway).

That will tell me whether or not they appear to be connected, the known latency, which streams are open, and so on.


obo20 commented Nov 18, 2019

To make things more stable for our users, we've started automating the following process every 30 minutes (a rough sketch of the check is included below):

  1. add a small text file, consisting of a random string of characters, to IPFS on one of our host nodes
  2. check our gateway to see if it's retrievable within 1 minute (they're swarm connected, so failure here indicates we've hit the error)

If we fail the check:

  3. run the collect-profiles script on each node
  4. restart the nodes and reconnect them to each other
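A rough sketch of that check (the gateway URL and file path are placeholders, not our exact setup):

# runs from cron every 30 minutes
RAND=$(head -c 16 /dev/urandom | xxd -p)
echo "$RAND" > /tmp/healthcheck.txt

# 1) add the file on one of the host nodes
CID=$(ipfs add -q /tmp/healthcheck.txt)

# 2) try to fetch it through the gateway within 1 minute
if ! curl -sf --max-time 60 "https://gateway.example.com/ipfs/$CID" > /dev/null; then
    echo "health check failed for $CID"
    # 3) run the collect-profiles script on each node
    # 4) restart the nodes and reconnect them to each other
fi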

Over the past week we've collected around 70 instances of this happening. Hopefully these logs help a bit. If they need to be reformatted or if there's any logging procedure that needs to be changed let me know.

https://drive.google.com/drive/u/0/folders/1UK0A0uQjE8U0mAAouejJBJuaqsiqAN6W


obo20 commented Dec 4, 2019

@Stebalien you may find this interesting:

I recently spun up a new gateway for a customer, and in my initial stages of testing I noticed I wasn't receiving content from the same host node that's been running into these issues with our main gateway, even though I had directly connected the new gateway to the host node.

My guess is that the node was in the error state and (luckily / unluckily) our automated systems hadn't caught it yet. After I rebooted the host node, the problem went away for both gateways.

Based on this, the issue seems to be purely on the host node and not on any of the gateway nodes. For whatever reason, the host gets into a state where it's unable to serve content correctly.

Stebalien (Member) commented:

When debugging this, we found that there appeared to be two gateway nodes with the same peer ID. I'm leaving this open as we haven't confirmed that that's the cause of this bug, but it would explain these symptoms (we'd group the connections with the same ID together and send blocks to the wrong peer).
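As a sanity check, one quick way to look for duplicate peer IDs across a set of nodes would be something like this (hostnames are placeholders):

# Print each node's peer ID, one per line, and report any duplicates
for h in host1 host2 gateway1 gateway2; do
    echo "$(ssh "$h" "ipfs id -f='<id>'")"
done | sort | uniq -d
# Any ID printed by `uniq -d` is shared by more than one node.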

Stebalien (Member) commented:

Closing as this does, indeed, appear to have been fixed by removing the duplicate nodes from the network.
