Client initiates transfer but transfer doesn't begin with StorageDealStartDataTransfer #4946

Closed
dineshshenoy opened this issue Nov 20, 2020 · 16 comments · Fixed by #4975 or #4980

@dineshshenoy

Previously running: fe448e
Upgraded to 1.2.0

Tried against the following miners:
f09848
f014394
f015927
f02665

Textile was successful with deal transfers on fe448e but is running into problems starting data transfers on 1.2.0.

Nov 20 02:03:21 bafyreifeu4k2lgizsdqmmqrbuyfow57qqihaeqgtaqnvk4cxfajtjnjuom 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.1045888 FIL 522944 false
Nov 20 02:03:21 bafyreicbcbjvwpimr2xgjov7t5fk4kvvobxfqb6oufozt7hqpoovtarz4y 0 f014394 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.1045976 FIL 522988 false
Nov 20 02:03:21 bafyreifcmj2bhm3cxw4rw22ns3wkwvjxk72njllvty2i36sumvwgeydrqq 0 f015927 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.0103161366 FIL 521017 false
Nov 20 02:06:21 bafyreigsjfa5ynj3v67uxnv3q5m4otvhprtka2hphiqxdj4teqak6sdf6m 0 f02665 StorageDealStartDataTransfer N N baga6ea4seaqd3w2d6namioztl4cegxnhbnvwjivzn5unxmdiegm5cuhgnu5biky 7.938 GiB 0.002090608 FIL 522652 false
Nov 20 03:03:17 bafyreibkoc54gbijomv6poceji6xrxsypy3w2ayz2vgnqwoysbyoirodam 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.1045648 FIL 522824 false
Nov 20 03:03:17 bafyreieys5gvo5kxa3vv2mq6fzslncfgvj637fisipelcn4to422nqqp2y 0 f014394 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.1045736 FIL 522868 false
Nov 20 03:03:18 bafyreignxfktgmaa66ypnqr75mreh6jv7wmgy35bcz7acmnbas4jj4kyda 0 f015927 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.0103137606 FIL 520897 false
Nov 20 03:06:21 bafyreicpnmoc2iee3zb5523rolfthkkw5xvwkwuyjvv4ubog3ckxla5roe 0 f02665 StorageDealStartDataTransfer N N baga6ea4seaqnhd5xfeuwcrndmeb5xhuum6cqfhprlxzdcetgqqjnbqrgey6mqca 7.938 GiB 0.002090128 FIL 522532 false
Nov 20 04:03:22 bafyreigqr6ifsqwycag3yn2xzicojnnvbt57w2xnwtzidgkmomqmkowv5m 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqnh3pbmml4o44nmuzb4tp3t4gv7kve2sugf4cb5yrsatshzmn2eki 1.984 GiB 0.1045408 FIL 522704 false

@dineshshenoy dineshshenoy added this to the 💹Storage Deal Success milestone Nov 20, 2020

dirkmc commented Nov 23, 2020

When this issue occurs, there is a single Open data transfer event and then no more events

2020-11-23T14:06:18.887Z        DEBUG        markets        loggers/loggers.go:35        data transfer event        {"name": "Open", "status": "Requested", "transfer ID": 923, "channel ID": "12D3KooWHAN67HgLoqrygQnouFF5uVhwM4kP7zurpR65Powtfgsa-12D3KooWMTh8Cz9dbUik9nQGKE5aVkUhVLfGLm48vu3gaJXQQBYQ-923", "sent": 0, "received": 0, "queued": 0, "received count": 0, "total size": 0, "remote peer": "12D3KooWMTh8Cz9dbUik9nQGKE5aVkUhVLfGLm48vu3gaJXQQBYQ", "event message": "", "channel message": ""}

@dineshshenoy dineshshenoy assigned dirkmc and unassigned hannahhoward Nov 23, 2020
This was linked to pull requests Nov 23, 2020
@dineshshenoy dineshshenoy reopened this Nov 23, 2020
@kernelogic

@magik6k I don't think adding logging will solve this issue. I have a lot of deals stuck in the StorageDealStartDataTransfer state:

root@li824-220:~# lotus client deal-stats
Total: 791 deals, 1.516 TiB
StorageDealActive: 6 deals, 11.91 GiB
StorageDealProposalAccepted: 1 deals, 1.984 GiB
StorageDealSealing: 383 deals, 760 GiB
StorageDealFundsReserved: 21 deals, 41.67 GiB
StorageDealCheckForAcceptance: 5 deals, 9.922 GiB
StorageDealStartDataTransfer: 125 deals, 237.6 GiB
StorageDealTransferring: 26 deals, 51.59 GiB
StorageDealClientFunding: 1 deals, 1.984 GiB
StorageDealError: 223 deals, 435.6 GiB

@kernelogic

Another node of mine has even more deals stuck in this state:

Total: 1157 deals, 2.201 TiB
StorageDealActive: 104 deals, 199.4 GiB
StorageDealSealing: 125 deals, 241.1 GiB
StorageDealFundsReserved: 19 deals, 34.48 GiB
StorageDealStartDataTransfer: 381 deals, 740.9 GiB
StorageDealTransferring: 9 deals, 17.86 GiB
StorageDealError: 519 deals, 1020 GiB


dirkmc commented Nov 24, 2020

@kernelogic could you please let us know which version of lotus you are running?


dirkmc commented Nov 24, 2020

It would also be helpful if you could copy a few of the most common error messages you are seeing.


dirkmc commented Nov 24, 2020

It appears that the issue is caused in part by a limit (maximum of 6) on the number of in-progress queries in go-graphsync:
https://github.com/ipfs/go-graphsync/blob/785e0b3bbdd6b578e3274fb91e5963c219d372a2/responsemanager/responsemanager.go#L25

When the client wants to send files to the miner:

  1. The client opens a "push" data-transfer request to the miner
  2. The miner sends a graphsync query to the client, asking for the file
  3. The client responds to the graphsync request with the file

If the client is already responding to the maximum number of requests, the new request will be queued until there is an available request worker. The deal will remain in the StorageDealStartDataTransfer state.
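
For intuition, here is a minimal, self-contained sketch of the bounded-worker pattern described above. Only the constant value 6 comes from the linked go-graphsync source; the package, type, and function names below are made up for illustration and are not the actual responsemanager code.

package graphsyncillustration

import "fmt"

// maxInProcessRequests mirrors the limit referenced above (6 at the linked
// go-graphsync commit). Everything else in this sketch is illustrative only.
const maxInProcessRequests = 6

// request stands in for an incoming graphsync query from a miner.
type request struct {
	id int
}

// serveRequests starts a fixed number of workers. Any query beyond that number
// sits in the queue until a worker frees up, which is why a deal can linger in
// StorageDealStartDataTransfer without any visible error on either side.
func serveRequests(queue <-chan request, done chan<- int) {
	for i := 0; i < maxInProcessRequests; i++ {
		go func(worker int) {
			for req := range queue {
				fmt.Printf("worker %d responding to query %d\n", worker, req.id)
				// ... stream the requested file blocks back to the miner ...
				done <- req.id
			}
		}(i)
	}
}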

@kernelogic

I'm using the latest lotus 1.2.1

@dineshshenoy

Suggestions from standup, in three pieces:

  1. Create a parameter on the client side so that clients can increase the number of simultaneous deal transfers from the current value of 6 to something larger (see the sketch after this list).
  2. Log a warning (WARN) so that the issue is discoverable when the queue is full. It could prompt the client to increase the parameter from (1), invest in more hardware, or something else.
  3. Double-check that deals that appear "stuck" actually become unstuck when the number of in-progress transfers drops below the limit. These shouldn't error out; they should wait for a transfer slot to become available. It might take more time, but it shouldn't really error.
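
A rough sketch of what (1) and (2) might look like on the client side, assuming a single knob for simultaneous transfers. The SimultaneousTransfers field name, its default, and the warning wording are all assumptions for illustration, not lotus's actual configuration surface.

package dealclientcfg

import "log"

// ClientConfig sketches a client-side knob for (1). The field name
// SimultaneousTransfers is an assumption for illustration.
type ClientConfig struct {
	// Maximum number of deal data transfers the client will respond to at once.
	SimultaneousTransfers int
}

// DefaultClientConfig keeps the previous hard-coded limit of 6 as the default.
func DefaultClientConfig() ClientConfig {
	return ClientConfig{SimultaneousTransfers: 6}
}

// warnIfQueueFull sketches (2): emit a WARN-level message when every transfer
// slot is busy, so the operator knows to raise the limit or add capacity.
func warnIfQueueFull(cfg ClientConfig, inProgress, queued int) {
	if queued > 0 && inProgress >= cfg.SimultaneousTransfers {
		log.Printf("WARN: all %d transfer slots are busy and %d deals are waiting in StorageDealStartDataTransfer; consider raising SimultaneousTransfers", cfg.SimultaneousTransfers, queued)
	}
}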

@kernelogic

I hope more deals can be transferring at the same time, because some miners can be really slow due to geographical distance. For example, my node's connection can upload at 50 MB/s, but a lot of the time it is only doing 500 KB/s while hundreds of deals sit in StorageDealStartDataTransfer. I think the whole thing is clogged up by a few slow deals.

@kernelogic

What could be the result if I blindly increase it to a very large number like 100?

What can I do right now to increase the limit without waiting for an official release? I don't see this file in the lotus repo; it's on the IPFS side.

kernelogic commented Nov 25, 2020

These are the common error messages:

  • data transfer appears to be stalled. attempt restart
  • failed to initiate data transfer: deal data transfer failed: channel removed due to inactivity
  • unexpected deal status while waiting for data request: 11 (StorageDealFailing). Provider message: deal rejected: miner is not considering online storage deals
  • failed to initiate data transfer: failed to open push data channel: Unable to send request: exhausted %!d(float64=5) attempts but failed to open stream, err: dial backoff


dirkmc commented Nov 30, 2020

@kernelogic

These messages mean there was a problem with the network connection (this can happen, for example, with miners who are behind firewalls):

  • data transfer appears to be stalled. attempt restart
  • failed to initiate data transfer: deal data transfer failed: channel removed due to inactivity
  • failed to initiate data transfer: failed to open push data channel: Unable to send request: exhausted %!d(float64=5) attempts but failed to open stream, err: dial backoff

This message means the miner rejected the deal:

  • unexpected deal status while waiting for data request: 11 (StorageDealFailing). Provider message: deal rejected: miner is not considering online storage deals

@kernelogic

Thanks for the explanation. So it's all normal then, just congestion.


dirkmc commented Dec 10, 2020

When the client creates a deal with the provider:

  1. The go-data-transfer client sends a "push" request
  2. The go-data-transfer provider receives the "push" request and opens a go-graphsync query to the client
  3. The go-graphsync client receives the query and responds with the data for the file

I believe the root cause of this issue is that the provider does not open a graphsync query to the client in step 2. This could be because:

  • there was a network error
  • the provider went down

Proposed fix:
In go-data-transfer, when opening a push request, set a timeout by which the provider must begin the data transfer, and fail the transfer if the timeout expires.
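
A minimal sketch of that proposed timeout, assuming a per-channel event stream. The transferEvent type, the event names, and the cancelTransfer callback are hypothetical stand-ins rather than go-data-transfer's real API; the point is only "fail the push if the provider has not started the transfer before the deadline".

package transferstarttimeout

import (
	"context"
	"errors"
	"time"
)

// transferEvent is a hypothetical stand-in for a data-transfer event; the real
// go-data-transfer event types differ.
type transferEvent struct {
	Name string
}

var errStartTimeout = errors.New("provider did not begin the data transfer before the deadline")

// watchTransferStart fails the push channel if no event signalling that the
// provider has begun the transfer arrives before startTimeout expires.
func watchTransferStart(
	ctx context.Context,
	events <-chan transferEvent, // hypothetical per-channel event stream
	startTimeout time.Duration,
	cancelTransfer func(reason error), // hypothetical: closes the push channel with an error
) error {
	timer := time.NewTimer(startTimeout)
	defer timer.Stop()

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return nil // channel finished or was cleaned up elsewhere
			}
			// Any sign of progress from the provider means the transfer started;
			// stop watching and let the normal deal state machine take over.
			if ev.Name == "Accept" || ev.Name == "DataQueued" || ev.Name == "DataSent" {
				return nil
			}
		case <-timer.C:
			cancelTransfer(errStartTimeout)
			return errStartTimeout
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}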

@kernelogic

@dineshshenoy is this solved? Where?

@dineshshenoy

I should have linked to #5211, which has the proposed fix and will be worked on shortly. I forked that off into a separate issue since this one had gotten unwieldy and a lot of the earlier work has been completed.
