This repository has been archived by the owner on Nov 7, 2023. It is now read-only.

Introduce prepare timer #127

Merged
merged 5 commits into master on Dec 13, 2019

Conversation

nhoriguchi
Contributor

This pull request proposes introducing a prepare timer, based on the discussion in #5. The outline is described mainly in the first patch. I admit that this is still a work in progress, and there are some points to discuss or solve:

  • Using generatedMessageConsumer to forward REQUEST messages: I just reused existing code for this purpose, but it does not seem to comply with the current abstraction rules. I guess we need a prepareTimeoutHandler, but that is not done in this version.
  • Adding an "implement" method on type messages.Request: this is to make the type implement the messages.ReplicaMessage interface. I feel that this change has some impact and might need more adjustment in messages/messages.go.
  • Fault simulator: I did some testing by modifying the sample application to behave in a faulty way. I think it might be helpful to merge this into the mainline code as a "faulty mode", but that probably needs more consideration of the design.
  • Refactoring the code duplicated between the request timer and the prepare timer: I copied most of the prepare timer from the request timer, so there are many duplicated lines. I tried to define a common type/interface and to define the two timers in a hierarchical manner, but I have not managed it yet (see the sketch after this list for the rough idea).
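
A minimal sketch of the kind of shared timer wrapper I have in mind, assuming a simple Start/Stop API; timerState and its methods here are hypothetical, not the code in this PR:

```go
package clientstate

import (
	"sync"
	"time"
)

// timerState wraps a single one-shot timer so that both the request
// timer and the prepare timer can reuse the same start/stop logic.
type timerState struct {
	lock    sync.Mutex
	timeout time.Duration
	timer   *time.Timer
}

func newTimerState(timeout time.Duration) *timerState {
	return &timerState{timeout: timeout}
}

// Start (re)arms the timer; handleTimeout fires if Stop is not called
// within the configured timeout.
func (s *timerState) Start(handleTimeout func()) {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.timer != nil {
		s.timer.Stop()
	}
	if s.timeout <= 0 {
		return // timeout disabled
	}
	s.timer = time.AfterFunc(s.timeout, handleTimeout)
}

// Stop disarms the timer, if it is armed.
func (s *timerState) Stop() {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.timer != nil {
		s.timer.Stop()
		s.timer = nil
	}
}
```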

I would be glad to get some feedback.

CircleCI status is here.

@sergefdrv (Contributor) left a comment

Thanks for taking the initiative to solve the issue. This doesn't look like an easy one. Perhaps the most difficult part is the actual mechanism to relay the REQUEST message. Therefore, I would suggest splitting the work: maybe first focus on handling the timeout itself and just report the timeout event in the diagnostic message log. Then we could tackle message relaying as a separate issue.

I was not sure about changing the request timeout semantics. Originally, it was the maximum expected time for a received request to become ready for execution. This PR delays starting the request timer until the PREPARE message is received. However, when a correct replica has accepted a PREPARE message, it will effectively multicast it to the other replicas in a COMMIT message. This ensures eventual request execution unless a view change is triggered. In other words, once a single correct replica has accepted a PREPARE message, reaching consensus no longer depends on the primary. Thus, a request timer started at that point would not really serve to detect a faulty primary.

I suggest we start both the prepare and the request timer upon reception of a new REQUEST message. If the replica does not accept the corresponding PREPARE message for longer than the prepare timeout, it would first assume the client was faulty. It would then relay the REQUEST message to the primary and restart the request timer. I am currently discussing this scheme with experts, but I am not sure how soon I can get their feedback.
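
Roughly, the scheme could be sketched like this; the types and handlers below are a hypothetical illustration, not the actual implementation:

```go
package core

import "time"

// clientTimers holds the two per-client timers discussed above.
type clientTimers struct {
	prepareTimeout, requestTimeout time.Duration
	prepareTimer, requestTimer     *time.Timer
}

// onRequestAccepted starts both timers when a new REQUEST is accepted.
// forwardToPrimary relays the REQUEST; onRequestTimeout is whatever the
// replica does to suspect the primary (e.g. log the event).
func (t *clientTimers) onRequestAccepted(forwardToPrimary, onRequestTimeout func()) {
	t.requestTimer = time.AfterFunc(t.requestTimeout, onRequestTimeout)
	t.prepareTimer = time.AfterFunc(t.prepareTimeout, func() {
		// No PREPARE in time: first blame the client, forward the
		// REQUEST to the primary ourselves, and give the primary a
		// fresh request timeout to order it.
		forwardToPrimary()
		t.requestTimer.Stop()
		t.requestTimer = time.AfterFunc(t.requestTimeout, onRequestTimeout)
	})
}

// onPrepareAccepted stops the prepare timer once the corresponding
// PREPARE from the primary has been accepted; the request timer keeps
// running until the request becomes ready for execution.
func (t *clientTimers) onPrepareAccepted() {
	if t.prepareTimer != nil {
		t.prepareTimer.Stop()
	}
}
```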

  • Using generatedMessageConsumer to forward REQUEST messages: I just reused existing code for this purpose, but it does not seem to comply with the current abstraction rules. I guess we need a prepareTimeoutHandler, but that is not done in this version.

  • Adding an "implement" method on type messages.Request: this is to make the type implement the messages.ReplicaMessage interface. I feel that this change has some impact and might need more adjustment in messages/messages.go.

Those points need more consideration, and there is clearly something to be changed. It would not be appropriate to treat REQUEST as a replica message. When a prepare timeout expires, the REQUEST message should only be forwarded to the primary; there is no need to multicast it to all replicas. Maybe we should open a dedicated message stream to the primary for REQUEST message forwarding.

Another consideration is cancelling of the transmission. If a new REQUEST is accepted from the same client, forwarding of the old REQUEST should be cancelled. At least, there should be a clear limit for REQUEST messages being forwarded, they should not pile up indefinitely.
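
A hedged sketch of how such bounded, cancellable forwarding could look; the forwarder type, its fields, and sendToPrimary are assumptions, not part of this repository:

```go
package core

import "sync"

type request struct {
	clientID uint32
	seq      uint64
	payload  []byte
}

// forwarder keeps at most one pending REQUEST per client, so forwarded
// messages cannot pile up indefinitely.
type forwarder struct {
	lock    sync.Mutex
	pending map[uint32]request // keyed by client ID
}

// Schedule replaces any previously pending REQUEST from the same
// client, effectively cancelling the older transmission.
func (f *forwarder) Schedule(req request) {
	f.lock.Lock()
	defer f.lock.Unlock()
	if f.pending == nil {
		f.pending = make(map[uint32]request)
	}
	old, ok := f.pending[req.clientID]
	if !ok || old.seq < req.seq {
		f.pending[req.clientID] = req
	}
}

// Flush sends the pending REQUESTs to the primary over a dedicated
// stream and clears the queue.
func (f *forwarder) Flush(sendToPrimary func(request)) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for id, req := range f.pending {
		sendToPrimary(req)
		delete(f.pending, id)
	}
}
```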

  • Fault simulator: I did some testing by modifying the sample application to behave in a faulty way. I think it might be helpful to merge this into the mainline code as a "faulty mode", but that probably needs more consideration of the design.

As opposed to incorporating faulty behavior into normal code, we can create some kind of tool for testing. The tool would act as a client or replica following some predefined scenario, including Byzantine behavior. What do you think?

I think we could continue reviewing and merge the commit 2006f59 in a separate PR.

Review threads:
  • sample/peer/cmd/request.go (outdated)
  • core/internal/clientstate/client-state.go (outdated)
  • core/internal/clientstate/prepare-timeout.go (outdated)
  • core/request.go
@nhoriguchi
Contributor Author

... Therefore, I would suggest splitting the work: maybe first focus on handling the timeout itself and just report the timeout event in the diagnostic message log. Then we could tackle message relaying as a separate issue.

OK.

I was not sure about changing the request timeout semantics. Originally, it was the maximum expected time for a received request to become ready for execution. This PR delays starting the request timer until the PREPARE message is received. However, when a correct replica has accepted a PREPARE message, it will effectively multicast it to the other replicas in a COMMIT message. This ensures eventual request execution unless a view change is triggered. In other words, once a single correct replica has accepted a PREPARE message, reaching consensus no longer depends on the primary. Thus, a request timer started at that point would not really serve to detect a faulty primary.

I suggest we start both the prepare and the request timer upon reception of a new REQUEST message. If the replica does not accept the corresponding PREPARE message for longer than the prepare timeout, it would first assume the client was faulty. It would then relay the REQUEST message to the primary and restart the request timer. I am currently discussing this scheme with experts, but I am not sure how soon I can get their feedback.

Thanks for the detailed explanation; this note is very important for me. I'll change the request timer behavior as you stated above.

...

Another consideration is cancelling of the transmission. If a new REQUEST is accepted from the same client, forwarding of the old REQUEST should be cancelled. At least, there should be a clear limit for REQUEST messages being forwarded, they should not pile up indefinitely.

In the current version, a backup replica "forwards" (i.e. multicasts) a REQUEST message only once, when the prepare timer expires, so there may be no need to worry about requests piling up. And if a replica receives an old REQUEST message after a new REQUEST message has been accepted, the old one should be properly rejected by the CaptureRequestSeq mechanism.

  • Fault simulator: I did some testing by modifying the sample application to behave in a faulty way. I think it might be helpful to merge this into the mainline code as a "faulty mode", but that probably needs more consideration of the design.

As opposed to incorporating faulty behavior into normal code, we can create some kind of tool for testing. The tool would act as a client or replica following some predefined scenario, including Byzantine behavior. What do you think?

Any faulty behavior on the client side can be mimicked by testing tools. As for the replica side, some external attacks (like a replay attack) can be simulated by tools, but we might need some internal module for testing inside attacks. The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.
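
For example, such a compile-time switch could be a Go build tag; this is only a hedged sketch, and the file names, package, and maybeCorruptMessage hook are hypothetical, not part of MinBFT:

```go
// faulty_sim.go — compiled only with `go build -tags faulty`
//go:build faulty
// +build faulty

package replica

// maybeCorruptMessage lets a test scenario tamper with outgoing
// messages; it exists only in builds made with the "faulty" tag.
func maybeCorruptMessage(msg []byte) []byte {
	// scenario-driven Byzantine behavior would go here
	return msg
}
```

```go
// faulty_sim_stub.go — the default used for production builds
//go:build !faulty
// +build !faulty

package replica

// In normal builds the hook is a no-op, so production binaries never
// contain any faulty logic.
func maybeCorruptMessage(msg []byte) []byte { return msg }
```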

I think we could continue reviewing and merge the commit 2006f59 in a separate PR.

OK, I'll separate out the suggestion for this commit as well.

@sergefdrv
Contributor

In the current version, a backup replica "forwards" (i.e. multicasts) a REQUEST message only once, when the prepare timer expires, so there may be no need to worry about requests piling up.

If I understood correctly, the prepare timer is started for each new REQUEST message. When it expires, the message is appended to the message log. My concern was that there is no limit for the number of REQUEST messages from the same client that can be appended to the message log.

And if a replica receives an old REQUEST message after a new REQUEST message has been accepted, the old one should be properly rejected by the CaptureRequestSeq mechanism.

That is correct, I was just thinking of wasted network bandwidth.

[...] we might need some internal module for testing inside attacks. The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.

I was not quite sure what you mean by "inside attacks". Do you mean simulating intrusion attacks on a replica's untrusted code? Such a compromised replica can simply be considered faulty, and the fraction of faulty replicas is assumed to be limited. What is your perspective on this?

@sergefdrv
Contributor

The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.

My biggest concern with this approach is code clarity. I'm afraid embedding faulty logic for testing can make the code hard to follow.

@nhoriguchi
Contributor Author

[...] we might need some internal module for testing inside attacks. The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.

I was not quite sure what you mean by "inside attacks". Do you mean simulating intrusion attacks on a replica's untrusted code?

Yes, I have that in mind.

Such a compromised replica can simply be considered faulty, and the fraction of faulty replicas is assumed to be limited. What is your perspective on this?

What I thought was that the MinBFT code is public, so attackers (note that on a trustless network some legitimate member could be an attacker) can study it and easily embed any compromised behavior (which might be hard for correct replicas to detect). So I thought it would be better to provide working demonstration code of typical scenarios with Byzantine faulty replicas, so that anyone can verify them.

The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.

My biggest concern with this approach is code clarity. I'm afraid embedding faulty logic for testing can make the code hard to follow.

Yes, I understand that, so I'm OK with starting with separate code or a separate branch.

Anyway, thank you for the idea.

@sergefdrv
Contributor

What I thought was that the MinBFT code is public, so attackers (note that on a trustless network some legitimate member could be an attacker) can study it and easily embed any compromised behavior (which might be hard for correct replicas to detect). So I thought it would be better to provide working demonstration code of typical scenarios with Byzantine faulty replicas, so that anyone can verify them.

So you consider simulating a tolerated fraction of Byzantine replicas? If so, I thought it might be better to have some dummy replica nodes that behave in a Byzantine way, following some predefined scenario.

What you were probably concerned about is that the replica behavior we would like to test might be quite complicated. In that case, we might need some smarter tool, which behaves almost the same as a correct replica.

I think we can start exploring this issue with a scenario-based tool and see how it goes.

I can think of two different approaches with the tool: (1) the tool mimics an individual replica's behavior, and (2) the tool mimics the behavior of all replicas in the network, except the tested one. In (1), it would be easier to define the Byzantine scenario for the simulated faulty replica. In (2), it would be easier to define the scenario, taking asynchronous message exchange into account. Maybe there are more possibilities, or it might make sense to use a hybrid approach.
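
A rough sketch of what such a scenario-driven test double could look like in Go; the Scenario interface and dummyReplica below are hypothetical illustrations, not code from this repository:

```go
package bfttest

// Scenario decides, for every message the tool receives, which
// messages to emit in response. A Byzantine scenario may drop,
// duplicate, delay or mutate messages.
type Scenario interface {
	React(from uint32, msg []byte) [][]byte
}

// dummyReplica can impersonate a single faulty replica (approach 1)
// or every replica except the one under test (approach 2), depending
// on how many replica IDs it is given.
type dummyReplica struct {
	ids      []uint32
	scenario Scenario
	send     func(to uint32, msg []byte)
}

// onMessage feeds an incoming message to the scenario and sends back
// whatever the scenario prescribes.
func (d *dummyReplica) onMessage(from uint32, msg []byte) {
	for _, reply := range d.scenario.React(from, msg) {
		d.send(from, reply)
	}
}
```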

@nhoriguchi
Contributor Author

So you consider simulating a tolerated fraction of Byzantine replicas?

Yes; in the most severe case, f replicas conspire to disrupt the consensus process.

If so, I thought it might be better to have some dummy replica nodes that behave in a Byzantine way, following some predefined scenario.

What you were probably concerned about is that the replica behavior we would like to test might be quite complicated. In that case, we might need some smarter tool, which behaves almost the same as a correct replica.

I think we can start exploring this issue with a scenario-based tool and see how it goes.

I can think of two different approaches with the tool: (1) the tool mimics an individual replica's behavior, and (2) the tool mimics the behavior of all replicas in the network, except the tested one. In (1), it would be easier to define the Byzantine scenario for the simulated faulty replica. In (2), it would be easier to define the scenario, taking asynchronous message exchange into account. Maybe there are more possibilities, or it might make sense to use a hybrid approach.

The hybrid approach sounds nice to me, because the two approaches complement each other: (1) suits integration testing, which should define the expected behavior of the system for given scenarios, so that is close to what I imagined, but (2) is also important because it tests whether a correct replica behaves as expected for given scenarios.

@ynamiki (Contributor) left a comment

As Sergey wrote, I think it is better to have a timerState struct. Sorry for my late response.

@nhoriguchi
Contributor Author

Summary of changes in the latest version:

@nhoriguchi
Contributor Author

I updated the branch again. In this version I tried two things:

  • the timeout duration is passed to the client-state provider as a mandatory parameter, and
  • more code generalization, so that there is no longer a dedicated requestTimerState (a rough sketch follows this list).
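
A minimal sketch of how these two changes could fit together, reusing the hypothetical timerState wrapper sketched earlier; NewProvider and its signature are assumptions, not the actual API:

```go
package clientstate

import (
	"sync"
	"time"
)

// State holds per-client state; it embeds one generalized timerState
// per timeout kind instead of a dedicated requestTimerState type.
type State struct {
	requestTimer *timerState
	prepareTimer *timerState
}

// Provider returns the state associated with a client ID.
type Provider func(clientID uint32) *State

// NewProvider requires both timeout durations up front, rather than
// accepting them as optional options.
func NewProvider(requestTimeout, prepareTimeout time.Duration) Provider {
	var (
		lock   sync.Mutex
		states = make(map[uint32]*State)
	)
	return func(clientID uint32) *State {
		lock.Lock()
		defer lock.Unlock()
		if s, ok := states[clientID]; ok {
			return s
		}
		s := &State{
			requestTimer: newTimerState(requestTimeout),
			prepareTimer: newTimerState(prepareTimeout),
		}
		states[clientID] = s
		return s
	}
}
```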

@sergefdrv (Contributor) left a comment

Thanks for your effort, Naoya, but I think the code needs some more adjustment.

Review threads:
  • core/internal/clientstate/request-timeout.go (outdated)
  • core/request.go
  • core/internal/clientstate/client-state.go (outdated)
  • core/prepare.go (outdated)
The timeout duration used to be an optional parameter for client state;
now it is a mandatory parameter.

Closes hyperledger-labs#139

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
We'll add another timer to clientstate later, so as preparation this
patch factors out the common part of the timeout handling code by
generalizing requestTimerState into timerState.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Any client application could behave improperly, whether intentionally or
not, so a MinBFT cluster has to handle faulty client behavior. But
currently the replica set processes a request only when the primary
replica receives it, so any client can easily trigger a view change by
sending request messages only to the backup replicas.

This patch tries to address this by introducing a prepare timer, which
expires when a replica does not receive a PREPARE message from the
primary within a specified timeout duration. This timer allows us to
handle occasional transmission errors between clients and replicas, for
example by forwarding request messages to the primary replica, which
will be implemented later.

The semantics of the other existing timer, the request timer, stay unchanged.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
The comment on makeReplicaMessageApplier() is out of sync with the code,
so let's fix it.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
@sergefdrv (Contributor) left a comment

Maybe also extend core/internal/clientstate/timeout_test.go so that it tests prepare timer methods, or just test timerState and assume its methods are invoked correctly from the methods of clientState?
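
For instance, a unit test for the generalized timer could look roughly like this, assuming the hypothetical timerState with Start/Stop sketched earlier (not the actual test code in this PR):

```go
package clientstate

import (
	"testing"
	"time"
)

func TestTimerStateFires(t *testing.T) {
	fired := make(chan struct{})
	s := newTimerState(10 * time.Millisecond)
	s.Start(func() { close(fired) })

	select {
	case <-fired:
		// expected: the handler ran after the timeout expired
	case <-time.After(time.Second):
		t.Fatal("timeout handler was not invoked")
	}
}

func TestTimerStateStopped(t *testing.T) {
	fired := make(chan struct{})
	s := newTimerState(100 * time.Millisecond)
	s.Start(func() { close(fired) })
	s.Stop()

	select {
	case <-fired:
		t.Fatal("timeout handler ran after Stop")
	case <-time.After(200 * time.Millisecond):
		// expected: the handler never ran
	}
}
```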

Client state has multiple timers in it, so this patch proposes
to expand the unit test to cover that.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
@nhoriguchi
Contributor Author

Maybe also extend core/internal/clientstate/timeout_test.go so that it tests prepare timer methods, or just test timerState and assume its methods are invoked correctly from the methods of clientState?

Pushed one additional commit, thank you.

@sergefdrv
Contributor

@Naoya-Horiguchi Looks good. Thanks!

@sergefdrv sergefdrv merged commit 5acfc24 into hyperledger-labs:master Dec 13, 2019