etcd test write timeout #10

MasseGuillaume · 2022-04-20T04:17:58Z

I think the etcd specification is incorrect:

Lines 151 to 152 in 7cbc834

    
           // write 
        
           return true, inp.arg1

A write can timeout

porcupine/test_data/jepsen/etcd_000.log

Line 61 in 7cbc834

INFO jepsen.util - 4 :info :write :timed-out

In this case, the Step function should not apply the write

anishathalye · 2022-04-20T19:29:17Z

The specification is correct. When the server receives a write operation, it performs the write.

From the client's perspective, an operation can time out. At that point, there are two possibilities, because the client can't tell whether the request was dropped by the network before reaching the server, or the request was received and processed but the response was dropped by the network before reaching the client. After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Here is where we handle all the write/CAS operations that time out, by appending their return (with unknown value) to the end of the history: https://github.com/anishathalye/porcupine/blob/master/porcupine_test.go#L285-L287

MasseGuillaume · 2022-04-21T02:26:22Z

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

~~Maybe the state could be marked as unknown after a write timeout. This way, the next time we have a successful write we can catch up. Read and CAS should be dropped until the next successful write.~~

MasseGuillaume · 2022-04-21T02:34:40Z

I found some counterexamples:

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

porcupine/test_data/jepsen/etcd_100.log

Lines 20 to 31 in 7cbc834

    
           INFO  jepsen.util - 3   :fail   :cas    [4 3] 
        
           INFO  jepsen.util - 0   :invoke :cas    [1 2] 
        
           INFO  jepsen.util - 0   :fail   :cas    [1 2] 
        
           INFO  jepsen.util - 1   :invoke :read   nil 
        
           INFO  jepsen.util - 1   :ok     :read   2 
        
           INFO  jepsen.util - 2   :invoke :read   nil 
        
           INFO  jepsen.util - 2   :ok     :read   2 
        
           INFO  jepsen.util - 4   :invoke :cas    [2 2] 
        
           INFO  jepsen.util - 3   :invoke :read   nil 
        
           INFO  jepsen.util - 4   :ok     :cas    [2 2] 
        
           INFO  jepsen.util - 0   :invoke :write  3 
        
           INFO  jepsen.util - 3   :ok     :read   2

also in etcd_101.log and etcd_102.log

MasseGuillaume · 2022-04-21T02:44:13Z

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Ok I understand now, setting a realy large return time does make sense since it would model both cases: a) the write was performed, b) the write was not performed and it's too far in the future to invalidate it with a read. Thanks a lot!

Ah another clarification, you only parse for timeout read

porcupine/porcupine_test.go

Lines 203 to 209 in 7cbc834

    
           invokeRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:read\s+nil$`) 
        
           invokeWrite, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:write\s+(\d+)$`) 
        
           invokeCas, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:invoke\s+:cas\s+\[(\d+)\s+(\d+)\]$`) 
        
           returnRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:ok\s+:read\s+(nil|\d+)$`) 
        
           returnWrite, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:ok\s+:write\s+(\d+)$`) 
        
           returnCas, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:(ok|fail)\s+:cas\s+\[(\d+)\s+(\d+)\]$`) 
        
           timeoutRead, _ := regexp.Compile(`^INFO\s+jepsen\.util\s+-\s+(\d+)\s+:fail\s+:read\s+:timed-out$`)

Therefore, all timeout write and timeout cas are appended at the end, after all read that are actually parsed and appended at the correct position.

porcupine/porcupine_test.go

Lines 285 to 287 in 7cbc834

    
           for proc, matchId := range procIdMap { 
        
           	events = append(events, Event{proc, ReturnEvent, etcdOutput{unknown: true}, matchId}) 
        
           }

MasseGuillaume closed this as completed Apr 21, 2022

ahrtr mentioned this issue Dec 6, 2022

Improve support for failed requests in linearizability tests etcd-io/etcd#14880

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd test write timeout #10

etcd test write timeout #10

MasseGuillaume commented Apr 20, 2022

anishathalye commented Apr 20, 2022

MasseGuillaume commented Apr 21, 2022 •

edited

Loading

MasseGuillaume commented Apr 21, 2022

MasseGuillaume commented Apr 21, 2022 •

edited

Loading

etcd test write timeout #10

etcd test write timeout #10

Comments

MasseGuillaume commented Apr 20, 2022

anishathalye commented Apr 20, 2022

MasseGuillaume commented Apr 21, 2022 • edited Loading

MasseGuillaume commented Apr 21, 2022

MasseGuillaume commented Apr 21, 2022 • edited Loading

MasseGuillaume commented Apr 21, 2022 •

edited

Loading

MasseGuillaume commented Apr 21, 2022 •

edited

Loading