Gateway unable to find content from swarm connected node #6628
I hit this problem again today. Here are the details / logs to compare with the previous ones:
Gateway Details:
Host Node Details:
Bitswap wantlist from gateway during attempt:
Result of running bitswap wantlist --peer="gatewayID" on the host node:
Actual debugging logs on each of the nodes:
These are logs from a little while back. Unfortunately I can't remember the bitswap wantlists, so these may not be as helpful. I can confirm that we're still hitting this frequently, though. I'll be sure to update with new logs as soon as we run into this again.
Gateway Details:
Host Node Details:
Another instance we just hit. This time I was attempting to pull the CID QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A.
Gateway Details:
Host Node Details:
Bitswap wantlist from gateway during attempt:
Result of running bitswap wantlist --peer="gatewayID" on the host node:
Actual logs of the event:
Interestingly enough, the bitswap wantlist matches up this time. However, the CID in it is not the root CID I was actually searching for. Instead it seems to be stuck on the child CID QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY (the file was wrapped in a directory to preserve the name).
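Since the file was wrapped in a directory, the stuck CID should appear as a link under the root. A minimal way to confirm that parent/child relationship, assuming it is run on a node that already has the root block locally (a sketch, not part of the original report):

# List the links of the wrapping directory; the wrapped file should show up
# as QmXgoZbttfDHE4nd5rLHE7fXJoT5rqq3T339PPBZPaggPY in the output.
ipfs ls QmQ4BjyyhVw4AoFpLYx9Tmn48UQmxQN69eX3ak6rYDDC4A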
Ok, this means that the issue is clearly on the host node.
@lanzafame when adding better logging to bitswap, could you keep this issue in mind? Specifically, it would be awesome to be able to log the lifetime of a want: received want for X from Y, sent want for X to peer Y, added to wantlist, etc.
I've looked through the latest logs and nothing looks obviously stuck. However, I have two observations:
According to @obo20, the machine is using <10% CPU, so we shouldn't be spinning in the message queue.
I'm also pretty sure that we aren't blocking on the "task worker" (sending blocks). I was concerned that one of the task workers could have been blocked sending the block in question.
Known facts:
For 3), I'm 99% sure this statement is accurate. I've uploaded new content and still hit this issue. To clarify 4): when this bug hits, a different gateway (such as Infura) is able to fetch the content that our gateway cannot (even though our gateway is directly connected to all our host nodes). After that happens, our gateway is then able to retrieve the content when it is requested, presumably by locating it on the non-Pinata gateway that now has the content.
@obo20 Have you ever tried running |
@Stebalien I have not. I can test that as soon as I run into this again |
Thanks! Before you do that, could you run the following on both nodes (after installing jq):
That will tell me whether or not the nodes appear to be connected, the known latency, which streams are open, etc.
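The exact snippet didn't survive in this copy of the thread; one plausible way to pull that information with jq (an assumption, not necessarily the command being referred to) is:

# Dump verbose peer info (addresses, latency, open streams) as JSON and pick
# out the entry for the other node's peer ID; <peerID> is a placeholder.
ipfs swarm peers -v --enc=json | jq '.Peers[] | select(.Peer == "<peerID>")'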
To make things more stable for our users, we've started automating a process of doing this every 30min:
If we fail the check:
Over the past week we've collected around 70 instances of this happening. Hopefully these logs help a bit. If they need to be reformatted, or if any part of the logging procedure needs to be changed, let me know. https://drive.google.com/drive/u/0/folders/1UK0A0uQjE8U0mAAouejJBJuaqsiqAN6W
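The specific check and failure steps were collapsed out of this comment; a hypothetical sketch of such a 30-minute health check (the gateway host and test CID are placeholders, not values from this thread) could look like:

# Hypothetical health check run every 30 minutes (e.g. cron: */30 * * * *).
# <TEST_CID> should be a CID known to be pinned on the host node.
if ! curl -sf --max-time 60 "https://<gateway-host>/ipfs/<TEST_CID>" > /dev/null; then
    # On failure, capture diagnostics from both nodes before recovering.
    ipfs bitswap wantlist --enc=json > wantlist.json
    ipfs swarm peers -v --enc=json > peers.json
fi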
@Stebalien you may find this interesting: I recently spun up a new gateway for a customer and in my initial stages of testing I noticed I wasn't receiving content from the same host node that's been running into these issues with our main gateway even though I had directly connected the new gateway to the host node. My guess is that the node was in the error state and (luckily / unluckily) our automated systems hadn't caught it yet. After I rebooted the host node, the problem went away for both gateways. From this information, this seems to purely be an issue on the host node and not any of the gateway nodes. For whatever reason it's in a state where it's unable to serve content correctly. |
When debugging this, we found that we appeared to have two gateway nodes with the same peer ID. I'm leaving this open as we haven't confirmed that this is the same bug, but it would cause these symptoms (we'd group the connections with the same IDs together and send blocks to the wrong peer).
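A quick way to spot duplicate identities across a fleet (a sketch; the hostnames are placeholders, not Pinata's actual machines) is to print each node's peer ID and look for repeats:

# Print every node's peer ID and report any that occur more than once.
for h in node1.example.com node2.example.com node3.example.com; do
    ssh "$h" ipfs config Identity.PeerID
done | sort | uniq -d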
Closing as this does, indeed, appear to have been fixed by removing the duplicate nodes from the network. |
I have a gateway node that was unable to retrieve / find content on a host node that it was directly swarm connected to. This only resolved upon restarting the host node. Restarting the gateway node did not help.
Here are the details / bug-reporting logs.
Gateway Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6
Host Node Details:
go-ipfs version: 0.4.22-rc1-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.6
Bitswap wantlist from gateway during attempt:
QmTL1wEFDQVy26GcEgnTSzFfvB3bVh5ojJBLQL9VbiXHdT
Result of running bitswap wantlist --peer="gatewayID" on the host node:
QmV1jX8t4eRbstqkmCXdkHbjsFuDUXRDxUc2sf3vS3f8nP
QmWFRnNMQpckpiqnYFyXeQPELy6C3Sphz8CUQkfqAf72nc
Qmb7ACGzY2amgEqCPBu91aNrzpEgCHKJAiKxeA5GY7Dp6V
Qmecdo1tqDexJbKQRDsCZ4DsihoYvGXStjwFkuQm3sBuVC
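For reference, the two lists above come from the bitswap wantlist command on each node (a sketch; gatewayID stands in for the gateway's actual peer ID):

# On the gateway: blocks the gateway itself is currently looking for.
ipfs bitswap wantlist
# On the host node: blocks the gateway has asked this node for.
ipfs bitswap wantlist --peer="gatewayID"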
Actual debugging logs on each of the nodes:
logs1.zip