Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd test write timeout #10

Closed
MasseGuillaume opened this issue Apr 20, 2022 · 4 comments
Closed

etcd test write timeout #10

MasseGuillaume opened this issue Apr 20, 2022 · 4 comments

Comments

@MasseGuillaume
Copy link

I think the etcd specification is incorrect:

porcupine/porcupine_test.go

Lines 151 to 152 in 7cbc834

// write
return true, inp.arg1

A write can timeout

INFO jepsen.util - 4 :info :write :timed-out

In this case, the Step function should not apply the write

@anishathalye
Copy link
Owner

The specification is correct. When the server receives a write operation, it performs the write.

From the client's perspective, an operation can time out. At that point, there are two possibilities, because the client can't tell whether the request was dropped by the network before reaching the server, or the request was received and processed but the response was dropped by the network before reaching the client. After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Here is where we handle all the write/CAS operations that time out, by appending their return (with unknown value) to the end of the history: https://github.com/anishathalye/porcupine/blob/master/porcupine_test.go#L285-L287

@MasseGuillaume
Copy link
Author

MasseGuillaume commented Apr 21, 2022

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

Maybe the state could be marked as unknown after a write timeout. This way, the next time we have a successful write we can catch up. Read and CAS should be dropped until the next successful write.

@MasseGuillaume
Copy link
Author

I found some counterexamples:

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

INFO jepsen.util - 3 :fail :cas [4 3]
INFO jepsen.util - 0 :invoke :cas [1 2]
INFO jepsen.util - 0 :fail :cas [1 2]
INFO jepsen.util - 1 :invoke :read nil
INFO jepsen.util - 1 :ok :read 2
INFO jepsen.util - 2 :invoke :read nil
INFO jepsen.util - 2 :ok :read 2
INFO jepsen.util - 4 :invoke :cas [2 2]
INFO jepsen.util - 3 :invoke :read nil
INFO jepsen.util - 4 :ok :cas [2 2]
INFO jepsen.util - 0 :invoke :write 3
INFO jepsen.util - 3 :ok :read 2

also in etcd_101.log and etcd_102.log

@MasseGuillaume
Copy link
Author

MasseGuillaume commented Apr 21, 2022

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Ok I understand now, setting a realy large return time does make sense since it would model both cases: a) the write was performed, b) the write was not performed and it's too far in the future to invalidate it with a read. Thanks a lot!

Ah another clarification, you only parse for timeout read

porcupine/porcupine_test.go

Lines 203 to 209 in 7cbc834

invokeRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:read\s+nil$`)
invokeWrite, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:write\s+(\d+)$`)
invokeCas, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:cas\s+\[(\d+)\s+(\d+)\]$`)
returnRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:ok\s+:read\s+(nil|\d+)$`)
returnWrite, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:ok\s+:write\s+(\d+)$`)
returnCas, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:(ok|fail)\s+:cas\s+\[(\d+)\s+(\d+)\]$`)
timeoutRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:fail\s+:read\s+:timed-out$`)

Therefore, all timeout write and timeout cas are appended at the end, after all read that are actually parsed and appended at the correct position.

porcupine/porcupine_test.go

Lines 285 to 287 in 7cbc834

for proc, matchId := range procIdMap {
events = append(events, Event{proc, ReturnEvent, etcdOutput{unknown: true}, matchId})
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants