sealing: Fix RecoverDealIDs loop with changed PieceCID #7117
Conversation
"rather unlikely edge-case" -- I have 7 sectors and 63 deals stuck with RecoverDealIDs on f019551 (1.11.1-m1.3.5+mainnet+git.7be207bc5.dirty+api1.2.0). |
id consider to get this in a lotus v1.11.2-rc if possible |
I'm not super familiar with this part of the codebase. I think @aarshkshah1992 can provide a better review here.
If what we want to do is remove the sector if we find a piece CID mismatch, then this looks good.
Would the deals originally packed into that sector be reassigned a new sector automatically? If so, should we write a test for that?
extern/storage-sealing/sealing.go (Outdated)

```diff
@@ -114,7 +116,7 @@ type Sealing struct {
 	commiter *CommitBatcher

 	getConfig GetSealingConfigFunc
-	dealInfo  *CurrentDealInfoManager
+	DealInfo  *CurrentDealInfoManager
```
ubernit: move public fields to the top.
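For reference, a sketch of how the struct fields might be ordered with that nit applied (field set abbreviated; illustrative only, not the final diff):

```go
type Sealing struct {
	// exported fields first, per the review nit
	DealInfo *CurrentDealInfoManager

	// unexported fields follow
	commiter  *CommitBatcher
	getConfig GetSealingConfigFunc
	// ... remaining fields unchanged
}
```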
@jennijuju @magik6k This should close #7103.
Deals which fail because of being put into that sector should auto-retry adding to a new sector. I see we are completely dropping the sector with the bad pieces, but where is the code that does the above, i.e. puts all the deals from the dropped sector into a new sector?
@aarshkshah1992, from the PR description:
^^ @magik6k do we have a test for that behaviour? Or can we write one, since that's the most useful outcome that folks are going to expect?
(misclicked the close button)
My assumption is that we do the retrying in markets.
No, I'm assuming that markets are going to retry putting deals into a sector when they see that sealing failed (I'm not 100% sure if this is the current behavior, but I'm pretty sure it was at some point).
@magik6k Discussed this offline but putting it here for the GitHub record. There is no such code in Markets that detects when a deal is dropped from a sector and re-attempts adding it to another sector. In fact, I am not even sure if there's a mechanism by which Markets can detect that.
SGWM, but what actually happens to these deals now?
I think this is just a first step, but it's not a complete solution as stated in comments above.
IIUC, by the time we get here, markets has already handed off the deal to the sealing subsystem, and responsibility has been transferred.
AFAIK, there is no notification back to markets that the deal has been excluded from a sector a posteriori, or that the sector has been deleted altogether (as in this PR).
Even if there were, all that markets could do is call SectorAddPieceToAny again.
In the case we're addressing immediately, it seems like the piece transfer was interrupted, and the miner node ended up with a partial piece which, of course, doesn't hash to the expected PieceCID.
In other words, the handoff failed but nobody noticed until later. IMO we need to focus on making sure that failures on handoff are detected immediately.
The thing that complicates it is the io.Reader JSON-RPC encoder: it is incapable of knowing whether the transfer was sent in full or not!

Possible solution (see the sketch below):

- allow the caller of the JSON-RPC client to pass special readers with a Len() int64 method.
- encoder: type-assert the io.Reader; if it has a Len() int64 method, call it to obtain the length of the transfer and set it in the Content-Length header of the POST/PUT.
- decoder: get the content length and assert that the io.Copy copies exactly that many bytes. If not, declare the transfer failed, which should fail the original JSON-RPC invocation (so the client -- markets -- would notice and retry immediately).
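A rough sketch of how that length check could look on both ends of the HTTP transport. Sizer, setContentLength, and copyChecked are hypothetical names used for illustration, not existing go-jsonrpc APIs:

```go
package rpcsize

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
)

// Sizer is the optional interface a caller could implement on the reader it
// hands to the JSON-RPC client.
type Sizer interface {
	Len() int64
}

// Encoder side: if the reader can report its length, advertise it in the
// request so the receiver knows how many bytes to expect.
func setContentLength(req *http.Request, r io.Reader) {
	if s, ok := r.(Sizer); ok {
		req.ContentLength = s.Len()
		req.Header.Set("Content-Length", strconv.FormatInt(s.Len(), 10))
	}
}

// Decoder side: copy the body and verify that the advertised length arrived
// in full. A short read fails the call, so the caller (markets) notices and
// can retry immediately.
func copyChecked(dst io.Writer, req *http.Request) error {
	n, err := io.Copy(dst, req.Body)
	if err != nil {
		return err
	}
	if req.ContentLength >= 0 && n != req.ContentLength {
		return fmt.Errorf("short piece transfer: expected %d bytes, got %d", req.ContentLength, n)
	}
	return nil
}
```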
(Note: 95% of the diff is testing; the bugfix is the few lines changed at the bottom of states_failed.go.)

I'm not exactly sure what has to happen to make the FSM put a wrong piece into a sector, but this happened on my miner after it crashed with OOM while processing a whole bunch of deals, so it's probably a rather unlikely edge-case.
One likely explanation is that a call to AddPiece didn't add the correct data (e.g. the reader stream dropped early), which got the wrong PieceCID into SectorInfo.Pieces[].Piece.PieceCID. If that's the case (which I pretty strongly suspect it is), we can only do two things:

- Drop the piece from the sector (but that's rather complicated, requires new storage methods, and would need non-trivial shuffling of data around in sectors to fix padding if the removed piece was not at the end of the sector)
- Remove the sector (the approach taken here; see the sketch after this list). Since this happens early in the sector lifecycle (checkPieces is the first step), we don't lose much work (only putting deals into the sector).
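For illustration, a minimal sketch of the mismatch check behind the second option, using simplified stand-in types rather than the real SectorInfo (the actual change is the few lines at the bottom of states_failed.go):

```go
package sealing

import "github.com/ipfs/go-cid"

// pieceDeal pairs the PieceCID recorded for a piece in the sector with the
// PieceCID the corresponding on-chain deal proposal expects.
type pieceDeal struct {
	StoredPieceCID  cid.Cid // what AddPiece actually produced
	OnChainPieceCID cid.Cid // what the deal proposal expects
}

// sectorUnrecoverable reports whether any piece in the sector disagrees with
// its on-chain PieceCID. In that case retrying RecoverDealIDs can never
// succeed, so the sector should be removed instead of looping forever.
func sectorUnrecoverable(pieces []pieceDeal) bool {
	for _, p := range pieces {
		if !p.StoredPieceCID.Equals(p.OnChainPieceCID) {
			return true
		}
	}
	return false
}
```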