Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Channel participation API: make joins crash fault tolerant #1608

Merged
merged 1 commit into from
Sep 15, 2020

Conversation

wlahti
Copy link
Contributor

@wlahti wlahti commented Jul 15, 2020

Type of change

  • New feature

Description

First commit:

  • save joinblock on JoinChannel

Second commit:

  • process joinblocks in filerepo during registrar init

Related issues

FAB-18008

@wlahti wlahti changed the title Channel participation: make joins crash fault tolerant Channel participation API: make joins crash fault tolerant Jul 15, 2020
orderer/common/multichannel/registrar.go Outdated Show resolved Hide resolved
Comment on lines +593 to +903
func (r *Registrar) removeJoinBlock(channelID string) error {
if err := r.joinBlockFileRepo.Remove(channelID); err != nil {
return errors.Wrapf(err, "failed removing joinblock for channel %s", channelID)
}

return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a version of this exported and protected with a mutex, so that when a follower finishes on boarding it will be able to call and remove the join block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason the follower should handle removing the join block and not the registrar itself? For now, I've added the remove logic into SwitchFollowerToChain.

Copy link
Contributor

@tock-ibm tock-ibm Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the ledger progresses past the join block, there is no reason to keep it around... there is no harm in that, but it still feels a bit weird. Note that the follower can live for quite some time past the join-block...

Nevertheless, this is acceptable.

Another thing we could do in support of lazy removal is to use loadJoinBlocks() and remove a join block if the ledger's height > join-block.Number.

orderer/common/multichannel/registrar.go Outdated Show resolved Hide resolved
orderer/common/multichannel/registrar.go Outdated Show resolved Hide resolved
orderer/common/multichannel/registrar.go Outdated Show resolved Hide resolved
orderer/common/multichannel/registrar.go Outdated Show resolved Hide resolved
Comment on lines 561 to 622
err = errors.Wrap(err, "failed creating chain support for join")

if err2 := r.removeJoinBlock(channelID); err2 != nil {
// Stack file repo error on top of chain support creation error
err = errors.WithStack(err2)
}

return types.ChannelInfo{}, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point we have to clean the ledger as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@wlahti
Copy link
Contributor Author

wlahti commented Jul 22, 2020

I haven't addressed any comments yet. Just rebased this onto master and adjusted the code as necessary.

@wlahti
Copy link
Contributor Author

wlahti commented Aug 12, 2020

Just another rebase to spread out the effort of keeping up with master.

@wlahti wlahti force-pushed the fab-18008 branch 2 times, most recently from 7136b04 to 9ae563b Compare August 12, 2020 19:32
@wlahti wlahti force-pushed the fab-18008 branch 5 times, most recently from 4c1544d to f7e2c61 Compare August 25, 2020 21:55
@wlahti wlahti marked this pull request as ready for review August 25, 2020 22:12
@wlahti wlahti requested a review from a team as a code owner August 25, 2020 22:12
// TODO if follower is onboarding, remove the joinblock from file repo
_, status := follower.StatusReport()
if status == types.StatusOnBoarding {
if err := r.removeJoinBlock(channelID); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not going to work if you have a follower that passed the join block (i.e. active) but did not yet remove the join block. That is especially true with the lazy removal that you implemented in this commit, but may also be true regardless because those things happen concurrently.

So a better approach IMO would be to either (1) remove it unconditionally and ignore the error if it is not found (best to make a special error type to identify that) or (2) list join blocks and remove it only if it exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to always attempt removing the join block.

Comment on lines +628 to +692
if err := r.joinBlockFileRepo.Save(channelID, blockBytes); err != nil {
return types.ChannelInfo{}, errors.Wrapf(err, "failed saving joinblock to file repo for channel %s", channelID)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this isMember, err := clusterConsenter.IsChannelMember(configBlock) (one line down) returns an error, the ledger and join block are not removed.

So, instead of having the "cleanup code" repeated 3 times, my original idea was to have named return arguments, i.e. ... (info types.ChannelInfo, err error), and right here do a defer that looks at the returned err, and if it is not nil, do the cleanup.

An alternative is to save the join-block after the call to clusterConsenter.IsChannelMember(configBlock).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +623 to +688
blockBytes, err := proto.Marshal(configBlock)
if err != nil {
return types.ChannelInfo{}, errors.Wrap(err, "failed marshaling joinblock")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this goes bad, you have to remove the empty ledger created above. Same if the joinBlockFileRepo.Save below fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 675 to 678
if err2 := r.ledgerFactory.Remove(channelID); err2 != nil {
// Stack ledger removal error on top of chain support creation error
err = errors.WithStack(err2)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can put this in a defer that checks the return err right after the ledger is created,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if err2 := r.removeJoinBlock(channelID); err2 != nil {
// Stack file repo error on top of ledger removal + chain support creation error
err = errors.WithStack(err2)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can put this in a defer that checks the return err right after the join-block is saved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@wlahti wlahti force-pushed the fab-18008 branch 2 times, most recently from f56230d to 109b5e6 Compare September 3, 2020 23:07
@tock-ibm
Copy link
Contributor

tock-ibm commented Sep 4, 2020

Looks good to me.

@@ -608,19 +619,41 @@ func (r *Registrar) JoinChannel(channelID string, configBlock *cb.Block, isAppCh
if err != nil {
return types.ChannelInfo{}, err
}
defer func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tock-ibm What about the case where the initLedgerResourcesClusterConsenter fails? That will leave around the ledger for the channel. Do you think newLedgerResources should be updated to remove the ledger if the call to ledgerFactory.GetOrCreate() fails?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. GetOrCreate is the last thing in newLedgerResources so I am not too worried about that. Hopefully GetOrCreate does not leave an empty ledger behind if it fails. initLedgerResourcesClusterConsenter is used in other places as well, in which the ledger is not empty, so we cant effort to cleanup within it as well.
However if initLedgerResourcesClusterConsenter fails after newLedgerResources, that requires cleanup in this method, that is correct.

Here is an idea: how about putting the defer func() from L622 just before initLedgerResourcesClusterConsenter (to L617)?
It might try to remove a non-existing ledger, but that is not an error according to the blockstore_provider.go#L131

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@jyellick jyellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good to me, one nit noted below. Will wait to approve until other conversations have resolved.


func (r *Registrar) removeJoinBlock(channelID string) error {
if err := r.joinBlockFileRepo.Remove(channelID); err != nil {
return errors.Wrapf(err, "failed removing joinblock for channel %s", channelID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we wrapping an error we generate? Why not simply 'WithMessage', looking up the tree, it looks like these errors are already generated with a stack.

Copy link
Contributor Author

@wlahti wlahti Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I've just been a little lazy with differentiating between Wrap and WithMessage lately... plus it's what the rest of the file (and maybe package?) does with the errors. I think it would be worth a single pass through to go with the "standard" we set forth for fabric: use Wrap for errors that come from an external library and use WithMessage for errors that fabric already generated using pkg/errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened FAB-18229

Remove joinblock upon completion of JoinChannel (succcess
or failure)..

FAB-18008 #done

Signed-off-by: Danny Cao <dcao@us.ibm.com>
Signed-off-by: Will Lahti <wtlahti@us.ibm.com>
@jyellick jyellick merged commit ca9b032 into hyperledger:master Sep 15, 2020
@wlahti wlahti deleted the fab-18008 branch September 25, 2020 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants