Introduce prepare timer #127
Thanks for taking the initiative to solve the issue. This doesn't look like an easy one. Perhaps the most difficult part is the actual mechanism to relay the REQUEST message. Therefore, I would suggest splitting the work: maybe first focus on handling the timeout itself and just report the timeout event in the diagnostic message log, then we could tackle message relaying as a separate issue.
I was not sure about changing the request timeout semantics. Originally, it was the maximum expected time for a received request to become ready for execution. This PR delays starting the request timer until the PREPARE message is received. However, when a correct replica has accepted a PREPARE message, it will effectively multicast it to the other replicas in a COMMIT message. This ensures eventual request execution unless a view change is triggered. In other words, once a single correct replica has accepted a PREPARE message, reaching consensus does not depend on the primary anymore. Thus, a request timer started at that point would not really serve to detect a faulty primary.
I suggest we start both the prepare and request timers upon reception of a new REQUEST message. If the replica does not accept the corresponding PREPARE message for longer than the prepare timeout, it would first assume the client was faulty, relay the REQUEST message to the primary, and restart the request timer. I am currently discussing this scheme with experts, but I am not sure how soon I can get their feedback.
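To make the proposed scheme concrete, here is a minimal Go sketch of the two-timer idea. The names (`clientTimers`, `relayToPrimary`, `startViewChange`, and the timeout parameters) are illustrative assumptions, not the actual MinBFT API:

```go
// Hypothetical sketch of the proposed timer scheme; the types and
// function names are illustrative, not the actual MinBFT code.
package sketch

import "time"

type clientTimers struct {
	prepareTimer *time.Timer
	requestTimer *time.Timer
}

// onRequest is invoked when a new REQUEST message is accepted.
// Both timers are started: the prepare timer detects a potentially
// faulty client, the request timer a potentially faulty primary.
func onRequest(t *clientTimers, prepareTimeout, requestTimeout time.Duration,
	relayToPrimary func(), startViewChange func()) {

	t.prepareTimer = time.AfterFunc(prepareTimeout, func() {
		// No PREPARE received in time: assume the client did not send the
		// REQUEST to the primary, relay it ourselves, and restart the
		// request timer before suspecting the primary.
		relayToPrimary()
		t.requestTimer.Reset(requestTimeout)
	})
	t.requestTimer = time.AfterFunc(requestTimeout, startViewChange)
}

// onPrepare is invoked when the corresponding PREPARE is accepted;
// the prepare timer is no longer needed at that point.
func onPrepare(t *clientTimers) {
	t.prepareTimer.Stop()
}
```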
- Using `generatedMessageConsumer` to forward REQUEST messages: I just used existing code for the current purpose, but it does not seem to comply with the current abstraction rule. I guess we need to have a `prepareTimeoutHandler`, but that's not done in this version.
- Adding an "implement" method on type `messages.Request`: this is to make the type implement the interface `messages.ReplicaMessage`. I feel that this change has some impact and might need more adjustment in `messages/messages.go`.
Those points need more consideration, and there is clearly something to be changed. It would not be appropriate to treat REQUEST as a replica message. When a prepare timeout expires, the REQUEST message should only be forwarded to the primary; there is no need to multicast it to all replicas. Maybe we should open a dedicated message stream to the primary for REQUEST message forwarding.
Another consideration is cancellation of the transmission. If a new REQUEST is accepted from the same client, forwarding of the old REQUEST should be cancelled. At the very least, there should be a clear limit on the number of REQUEST messages being forwarded; they should not pile up indefinitely (see the sketch below).
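A minimal Go sketch of what cancellable forwarding over such a dedicated stream could look like; `requestForwarder`, the per-client keying, and `sendToPrimary` are assumptions for illustration, not existing MinBFT code:

```go
// Hypothetical sketch: at most one pending REQUEST forward per client,
// cancelled as soon as a newer REQUEST from the same client arrives.
package sketch

import (
	"context"
	"sync"
)

type requestForwarder struct {
	lock    sync.Mutex
	cancels map[string]context.CancelFunc // pending forward per client ID
}

func newRequestForwarder() *requestForwarder {
	return &requestForwarder{cancels: make(map[string]context.CancelFunc)}
}

// Forward sends req to the primary on a dedicated stream. Any forwarding
// still pending for the same client is cancelled first, so stale REQUEST
// messages never pile up.
func (f *requestForwarder) Forward(clientID string, req []byte,
	sendToPrimary func(ctx context.Context, msg []byte)) {

	ctx, cancel := context.WithCancel(context.Background())

	f.lock.Lock()
	if old, ok := f.cancels[clientID]; ok {
		old() // a newer REQUEST supersedes the old one
	}
	f.cancels[clientID] = cancel
	f.lock.Unlock()

	go sendToPrimary(ctx, req)
}
```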
- Fault simulator: I do some testing by modifying the sample application to exhibit faulty behavior. I think it might be helpful to merge this into the mainline code as a "faulty mode", but maybe that needs more consideration on the design side.
As opposed to incorporating faulty behavior into normal code, we can create some kind of tool for testing. The tool would act as a client or replica following some predefined scenario, including Byzantine behavior. What do you think?
I think we could continue reviewing and merge the commit 2006f59 in a separate PR.
OK.
Thanks for the detailed explanation; this note is very important for me. I'll change the request timer behavior as you stated above. ...
In the current version, a backup replica "forwards" (i.e. multicasts) a REQUEST message only once, when the prepare timer expires, so there may be no worry about requests piling up. And if a replica receives an old REQUEST message after a new REQUEST message is accepted, the old one should be properly rejected by
Any faulty behavior on the client side can be mimicked by testing tools. As for the replica side, some external attacks (like a replay attack) can be simulated by tools, but we might need an internal module for testing inside attacks. The internal module should be switched on or off at compile time to ensure that it is never used in a production setting.
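One way to get such a compile-time switch in Go is build tags. This is only an illustrative sketch (the `faultsim` package and `DropPrepare` hook are made up), showing how fault-injection code could be excluded from production builds entirely:

```go
// faulty.go — compiled only with `go build -tags faulty`.
// Package and function names here are hypothetical.

//go:build faulty

package faultsim

// DropPrepare reports whether the simulated faulty replica should
// silently drop an outgoing PREPARE message in this scenario.
func DropPrepare() bool { return true }
```

And the stub that replaces it in normal builds, so the hook compiles down to a constant no-op:

```go
// faulty_stub.go — default build without the "faulty" tag.

//go:build !faulty

package faultsim

func DropPrepare() bool { return false }
```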
OK, I'll separate the suggestion for this commit too.
If I understood correctly, the prepare timer is started for each new REQUEST message. When it expires, the message is appended to the message log. My concern was that there is no limit on the number of REQUEST messages from the same client that can be appended to the message log.
That is correct, I was just thinking of wasted network bandwidth.
I was not quite clear on what you mean by "inside attacks". Do you mean simulating intrusion attacks on a replica's untrusted code? Such a compromised replica can simply be considered faulty, and the fraction of faulty replicas is assumed to be limited. What is your perspective on this?
My biggest concern with this approach is code clarity. I'm afraid embedding faulty logic for testing can make the code hard to follow.
Yes, I have that in mind.
What I thought was that the MinBFT code is public, so attackers (note that some legitimate member could be an attacker on a trustless network) can learn it and easily embed any compromised behavior (which might be hard for correct replicas to detect). So I thought it would be better to provide working demonstration code of typical scenarios with Byzantine faulty replicas for anyone to verify.
Yes, I understand that, so I'm OK to start with a separate code/branch. Anyway, thank you for the idea.
So you consider simulating a tolerated fraction of Byzantine replicas? If so, I thought we might be better off having some dummy replica nodes that behave in a Byzantine way, following a predefined scenario. What you were probably concerned with was that the replica behavior we would like to test might be quite complicated. In that case, we might need a smarter tool, which behaves almost the same as a correct replica. I think we can start exploring this issue with a scenario-based tool and see how it goes. I can think of two different approaches: (1) the tool mimics an individual replica's behavior, and (2) the tool mimics the behavior of all replicas in the network except the tested one. In (1), it would be easier to define the Byzantine scenario for the simulated faulty replica. In (2), it would be easier to define the scenario taking asynchronous message exchange into account. Maybe there are more possibilities, or it might make sense to use a hybrid approach.
Yes, in most severe cases,
The hybrid approach sounds nice to me, because these approaches complement each other: (1) is for integration testing, which should define the expected behavior of the system for given scenarios, so that's close to what I imagined, but (2) is also important because it tests whether a correct replica behaves as expected for given scenarios.
As Sergey wrote, I think it is better to have a `timerState` struct. Sorry for my late response.
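For illustration, a rough sketch of such a generalized timer state; the field and method names below are guesses for the sake of the example, not the actual `clientstate` code:

```go
// Hypothetical sketch of a generic timerState that both the request and
// prepare timers could reuse.
package clientstate

import (
	"sync"
	"time"
)

type timerState struct {
	lock    sync.Mutex
	timeout time.Duration
	timer   *time.Timer
}

func newTimerState(timeout time.Duration) *timerState {
	return &timerState{timeout: timeout}
}

// Start (re)arms the timer; handleTimeout runs if it expires.
func (s *timerState) Start(handleTimeout func()) {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.timer != nil {
		s.timer.Stop()
	}
	s.timer = time.AfterFunc(s.timeout, handleTimeout)
}

// Stop disarms the timer, if it is armed.
func (s *timerState) Stop() {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.timer != nil {
		s.timer.Stop()
	}
}
```

The existing request timer and the new prepare timer could then each be an instance of this struct inside the client state.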
Force-pushed from 2006f59 to 0f841c4
Summary of changes in the latest version:
Force-pushed from 0f841c4 to dc97ab6
I updated the branch again. In this version I tried two things:
Thanks for your effort, Naoya, but I think the code needs some more adjustment.
Until now, the timeout duration was an optional parameter for client state; now it is a mandatory parameter. Closes hyperledger-labs#139 Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
We'll add another timer to clientstate later, so as a preparation, this patch factors out the common part of the timeout handling code by generalizing requestTimerState into timerState. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Any client application could behave improperly, whether intended or not, so a MinBFT cluster has to handle faulty client behavior. Currently, the replica set processes a request only when the primary replica receives it, so any client can easily trigger a view change by sending request messages only to the backup replicas. This patch addresses that situation by introducing a prepare timer, which expires when a replica does not receive a PREPARE message from the primary within a specified timeout duration. This timer allows us to handle occasional transmission errors between clients and replicas, for example by forwarding request messages to the primary replica, which will be implemented later. The semantics of the other existing timer, the request timer, stay unchanged. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
The comment on makeReplicaMessageApplier() is out of sync with the code, so let's fix it. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Force-pushed from dc97ab6 to 73877fb
Maybe also extend `core/internal/clientstate/timeout_test.go` so that it tests the prepare timer methods, or just test `timerState` and assume its methods are invoked correctly from the methods of `clientState`?
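As a rough idea of what a test for the generalized timer could look like, here is a stdlib-only sketch against the hypothetical `timerState` from the earlier sketch; the real `timeout_test.go` may well use different helpers or a mock timer provider:

```go
package clientstate

import (
	"testing"
	"time"
)

// TestTimerStateExpires checks that the timeout handler fires after Start.
func TestTimerStateExpires(t *testing.T) {
	fired := make(chan struct{}, 1)
	s := newTimerState(10 * time.Millisecond)
	s.Start(func() { fired <- struct{}{} })

	select {
	case <-fired:
	case <-time.After(time.Second):
		t.Fatal("timeout handler was not invoked")
	}
}

// TestTimerStateStop checks that the handler does not fire once stopped.
func TestTimerStateStop(t *testing.T) {
	fired := make(chan struct{}, 1)
	s := newTimerState(100 * time.Millisecond)
	s.Start(func() { fired <- struct{}{} })
	s.Stop()

	select {
	case <-fired:
		t.Fatal("timeout handler fired after Stop")
	case <-time.After(200 * time.Millisecond):
	}
}
```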
Client state has multiple timers in it, so this patch proposes to expand the unit test to cover that. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Pushed one additional commit, thank you.
@Naoya-Horiguchi Looks good. Thanks!
This pull request suggests introducing a prepare timer based on the discussion in #5. The outline is described mainly in the 1st patch. But I admit that I'm still on the way, and there are some points to discuss/solve:
- Using `generatedMessageConsumer` to forward REQUEST messages: I just used existing code for the current purpose, but it does not seem to comply with the current abstraction rule. I guess we need to have a `prepareTimeoutHandler`, but that's not done in this version.
- Adding an "implement" method on type `messages.Request`: this is to make the type implement the interface `messages.ReplicaMessage`. I feel that this change has some impact and might need more adjustment in `messages/messages.go`.
I'd be glad to get some feedback.
CircleCI status is here.