keep retrying the proof until we run out of sectors to skip #4633

Merged
merged 2 commits into from
Oct 30, 2020
37 changes: 23 additions & 14 deletions storage/wdpost_run.go
@@ -510,10 +510,10 @@ func (s *WindowPoStScheduler) runPost(ctx context.Context, di dline.Info, ts *ty

skipCount := uint64(0)
postSkipped := bitfield.New()
- var postOut []proof2.PoStProof
- somethingToProve := true
+ somethingToProve := false

- for retries := 0; retries < 5; retries++ {
+ // Retry until we run out of sectors to prove.
+ for retries := 0; ; retries++ {
var partitions []miner.PoStPartition
var sinfos []proof2.SectorInfo
for partIdx, partition := range batch {
@@ -567,7 +567,6 @@ func (s *WindowPoStScheduler) runPost(ctx context.Context, di dline.Info, ts *ty

if len(sinfos) == 0 {
// nothing to prove for this batch
- somethingToProve = false
break
}

@@ -585,27 +584,43 @@ func (s *WindowPoStScheduler) runPost(ctx context.Context, di dline.Info, ts *ty
return nil, err
}

- var ps []abi.SectorID
- postOut, ps, err = s.prover.GenerateWindowPoSt(ctx, abi.ActorID(mid), sinfos, abi.PoStRandomness(rand))
+ postOut, ps, err := s.prover.GenerateWindowPoSt(ctx, abi.ActorID(mid), sinfos, abi.PoStRandomness(rand))
elapsed := time.Since(tsStart)

log.Infow("computing window post", "batch", batchIdx, "elapsed", elapsed)

if err == nil {
- // Proof generation successful, stop retrying
- params.Partitions = append(params.Partitions, partitions...)
+ if len(postOut) == 0 {
+ return nil, xerrors.Errorf("received no proofs back from generate window post")
+ }
+
+ // Proof generation successful, stop retrying
+ somethingToProve = true
+ params.Partitions = partitions
+ params.Proofs = postOut
break
}

// Proof generation failed, so retry

if len(ps) == 0 {
+ // If we didn't skip any new sectors, we failed
+ // for some other reason and we need to abort.
return nil, xerrors.Errorf("running window post failed: %w", err)
}
// TODO: maybe mark these as faulty somewhere?

log.Warnw("generate window post skipped sectors", "sectors", ps, "error", err, "try", retries)

+ // Explicitly make sure we haven't aborted this PoSt
+ // (GenerateWindowPoSt may or may not check this).
+ // Otherwise, we could try to continue proving a
+ // deadline after the deadline has ended.
+ if ctx.Err() != nil {
+ log.Warnw("aborting PoSt due to context cancellation", "error", ctx.Err(), "deadline", di.Index)
+ return nil, ctx.Err()
+ }
Member Author


This will explicitly check the context. We should cancel in

// Replace the aborted postWindow with a new one so that we can
// submit again at any time without the state getting clobbered
// when the abort completes
abort := pw.abort
if abort != nil {
pw = &postWindow{
di: pw.di,
ts: advance,
submitState: SubmitStateStart,
}
s.postWindows[pw.di.Open] = pw
// Abort the current submit
abort()
}
.

Member Author


I could also stop retrying once we get, e.g., two thirds of the way through the proof time, but I'm not sure that really makes sense. I guess sectors assigned to a single partition are somewhat correlated in time, so their failures may be correlated? But I don't want to:

  1. Spend a lot of time trying to prove one partition.
  2. Give up on that partition because we're running out of time.
  3. Spend a little time trying to prove all other partitions in the deadline and fail because we have a lot of faulty sectors.

All of this when we could have eventually submitted a valid proof for the first partition, if we had simply stuck with it.


skipCount += uint64(len(ps))
for _, sector := range ps {
postSkipped.Set(uint64(sector.Number))
@@ -617,12 +632,6 @@ func (s *WindowPoStScheduler) runPost(ctx context.Context, di dline.Info, ts *ty
continue
}

- if len(postOut) == 0 {
- return nil, xerrors.Errorf("received no proofs back from generate window post")
- }
-
- params.Proofs = postOut

posts = append(posts, params)
}
