False positive successes #141

Closed
ghost opened this issue May 25, 2021 · 4 comments
Labels
P2 (P2 Issue), workstream/general (Issues that are not tied to any specific use case)
Milestone
P2 Issues Completed

Comments

ghost commented May 25, 2021

I'm seeing some false successes reported by the dealbot. I'm pretty sure I know why this is happening: if the retrieval fails instantly, during DealStatusNew, Dealbot reports it as Status 3 (success).

Example 1

Here is a success (Status==3) reported by dealbot:

$ curl http://localhost:8764/tasks 2>/dev/null | jq '.[] | select(.Status == 3)'
{
  "UUID": "92f966e0-d7d7-4468-893c-6f41d2d440b5",
  "Status": 3,
  "WorkedBy": "38d9a260",
  "Stage": "ProposeRetrieval",
  "CurrentStageDetails": {
    "Description": "Send retrieval to miner",
    "ExpectedDuration": "",
    "Logs": [
      {
        "Log": "deal sent to miner",
        "UpdatedAt": 1621883018552199000
      }
    ],
    "UpdatedAt": 1621883018552199000
  },
  "PastStageDetails": [
    {
      "Description": "Miner is online",
      "ExpectedDuration": "a few seconds",
      "Logs": [
        {
          "Log": "Tipset successfully fetched",
          "UpdatedAt": 1621883017861807900
        },
        {
          "Log": "Miner Info successfully fetched",
          "UpdatedAt": 1621883017916036900
        },
        {
          "Log": "Miner address validated",
          "UpdatedAt": 1621883017945603800
        },
        {
          "Log": "Connected to miner",
          "UpdatedAt": 1621883017983161300
        }
      ],
      "UpdatedAt": 1621883017983161300
    },
    {
      "Description": "Miner responds to query ask",
      "ExpectedDuration": "a few seconds",
      "Logs": [
        {
          "Log": "Miner Offer Received",
          "UpdatedAt": 1621883018538549800
        }
      ],
      "UpdatedAt": 1621883018538549800
    }
  ],
  "StartedAt": 1621883017837216500,
  "RunCount": 1,
  "RetrievalTask": {
    "Miner": "f0157535",
    "PayloadCID": "bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc",
    "CARExport": false
  }
}

Here is what happens when you run the same retrieval on the Lotus CLI:

$ lotus client retrieve --miner f0157535 bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWAypLydzLVAWD9aURXSvgxGdfAEuq7cUGVLkbmxK3cMLC for payload CID bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc (retrieval deal ID 1621630583961700674, state DealStatusOngoing) - existing deal must be cancelled before starting a new retrieval deal

Example 2
Miner f0157535 with CID bafykbzaceahe7dt6szdh23mekkuxhangmjd3sr26rqmpyxiupte6utocyc7m6 produces exactly the same result: dealbot reports Status==3, but the Lotus CLI shows the same error as above.

Short-Term Fix
We should error out on ClientEventOpen errors, producing Status==4 and an ErrorMessage of: ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWAypLydzLVAWD9aURXSvgxGdfAEuq7cUGVLkbmxK3cMLC for payload CID bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc (retrieval deal ID 1621630583961700674, state DealStatusOngoing) - existing deal must be cancelled before starting a new retrieval deal
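
For illustration, here is a minimal Go sketch of that short-term behaviour; the Task struct, status constants, and handleRetrievalEvent hook are hypothetical stand-ins, not the actual dealbot types:

package main

import "fmt"

// Hypothetical status codes mirroring the numeric values seen in the /tasks output.
const (
	StatusSuccess = 3
	StatusFailed  = 4
)

// Task is a stand-in for the dealbot task record; field names are illustrative.
type Task struct {
	Status       int
	ErrorMessage string
}

// handleRetrievalEvent is a hypothetical hook called once per retrieval client
// event. If an error is reported while the deal is still at ClientEventOpen /
// DealStatusNew, the task is marked failed instead of falling through as a success.
func handleRetrievalEvent(task *Task, event, dealStatus, errMsg string) {
	if event == "ClientEventOpen" && errMsg != "" {
		task.Status = StatusFailed
		task.ErrorMessage = errMsg
	}
}

func main() {
	t := &Task{Status: StatusSuccess}
	handleRetrievalEvent(t, "ClientEventOpen", "DealStatusNew",
		"retrieval failed: Retrieve failed: there is an active retrieval deal with peer ...")
	fmt.Println(t.Status, t.ErrorMessage) // 4, followed by the error text
}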

Long-Term Fix
We need to detect errors of the form "there is an active retrieval deal with peer...", extract the deal ID (1621630583961700674 in the case above), and run the equivalent of lotus client cancel-retrieval --deal-id 1621630583961700674.
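
A rough sketch of the detection and extraction step (the regular expression and helper names here are mine, not dealbot's; the cancel call itself is only indicated in a comment):

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Matches the "(retrieval deal ID <n>, state ...)" fragment of the error text.
var activeDealRe = regexp.MustCompile(
	`there is an active retrieval deal with peer \S+ for payload CID \S+ \(retrieval deal ID (\d+),`)

// extractStuckDealID pulls the deal ID out of an "active retrieval deal" error message.
func extractStuckDealID(errMsg string) (uint64, bool) {
	m := activeDealRe.FindStringSubmatch(errMsg)
	if m == nil {
		return 0, false
	}
	id, err := strconv.ParseUint(m[1], 10, 64)
	return id, err == nil
}

func main() {
	errMsg := "retrieval failed: Retrieve failed: there is an active retrieval deal with peer " +
		"12D3KooWAypLydzLVAWD9aURXSvgxGdfAEuq7cUGVLkbmxK3cMLC for payload CID " +
		"bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc " +
		"(retrieval deal ID 1621630583961700674, state DealStatusOngoing) - " +
		"existing deal must be cancelled before starting a new retrieval deal"

	if id, ok := extractStuckDealID(errMsg); ok {
		// Here the bot would call the API equivalent of
		// `lotus client cancel-retrieval --deal-id <id>` before retrying.
		fmt.Println("cancel retrieval deal", id)
	}
}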

ghost added the workstream/general label (Issues that are not tied to any specific use case) on May 25, 2021
@willscott (Collaborator)

If a bot pops a task and finds out it is already running the same task, it may be worth either unclaiming it or failing it. I don't think it should cancel the previously running instance.

cc @gammazero

ghost (Author) commented May 26, 2021

If a bot pops a task and finds out it is already running the same task, it may be worth either unclaiming it or failing it. I don't think it should cancel the previously running instance.

I don't think this is sufficient. Let me show why.

For Example 1 above, I ran lotus client cancel-retrieval and then re-attempted the retrieval:

ubuntu@dealbot-mainnet-cn:~$ lotus client retrieve --miner f0157535 bafykbzacecnq5elkwi4xstanm4fttydiexkojqkhjwz3ptevqkzjj2ssljjgc /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 0 B, Paid 0 FIL, ClientEventUnsealPaymentRequested (DealStatusAccepted)
> Recv: 0 B, Paid 0 FIL, ClientEventPaymentChannelAddingFunds (DealStatusPaymentChannelAddingInitialFunds)
> Recv: 0 B, Paid 0 FIL, ClientEventPaymentChannelReady (DealStatusPaymentChannelAllocatingLane)
> Recv: 0 B, Paid 0 FIL, ClientEventLaneAllocated (DealStatusOngoing)
> Recv: 0 B, Paid 0 FIL, ClientEventPaymentRequested (DealStatusFundsNeeded)
> Recv: 0 B, Paid 0 FIL, ClientEventSendFunds (DealStatusSendFunds)
> Recv: 0 B, Paid 0.000000000000000002 FIL, ClientEventPaymentSent (DealStatusOngoing)
[hangs forever]

In Dealbot, an infinite hang like this will eventually be caught by one of the timeouts. I think when that happens, we should try to cancel the retrieval (with the API equivalent of lotus client cancel-retrieval) and then fail with a final task ErrorMessage string of "Timed out after X hours of inactivity," or however we want to phrase it. Otherwise, every subsequent retry is going to be falsely reported as Status==3, as in the issue description.
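
Roughly something like the following sketch, where doRetrieval and cancelRetrieval are placeholders for the real dealbot/Lotus calls rather than actual APIs:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doRetrieval is a placeholder for the dealbot retrieval call; here it just
// blocks until the context expires, mimicking the hang shown above.
func doRetrieval(ctx context.Context, dealID uint64) error {
	<-ctx.Done()
	return ctx.Err()
}

// cancelRetrieval is a placeholder for the API equivalent of
// `lotus client cancel-retrieval --deal-id <id>`.
func cancelRetrieval(dealID uint64) {
	fmt.Println("cancelling retrieval deal", dealID)
}

// runRetrievalWithTimeout cancels the stuck deal and returns a task-level
// error message when the retrieval exceeds the timeout.
func runRetrievalWithTimeout(dealID uint64, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := doRetrieval(ctx, dealID)
	if errors.Is(err, context.DeadlineExceeded) {
		cancelRetrieval(dealID)
		return fmt.Errorf("Timed out after %s of inactivity", timeout)
	}
	return err
}

func main() {
	// Toy timeout so the example finishes quickly.
	fmt.Println(runRetrievalWithTimeout(1621630583961700674, 2*time.Second))
}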

Wdyt @willscott @gammazero?

ghost (Author) commented Jun 2, 2021

This should be re-tested since we now have timeouts.

Also, fixing #156 would fix this, though @willscott's comment above still applies.

ghost added the P2 (P2 Issue) label and removed the P2/general label on Jun 2, 2021
ghost added this to the P2 Issues Completed milestone on Jun 3, 2021
@willscott (Collaborator)

I believe we've closed this issue / have canceling in place at this point.
