Consistency anomaly detected by a Jepsen bank transfer test #4687

Closed
knz opened this issue Feb 26, 2016 · 28 comments
@knz (Contributor) commented Feb 26, 2016

The bank test is a transfer test that should preserve the total sum. The anomaly is that the sum is not preserved.

The test scenario, as encoded in https://github.com/cockroachdb/jepsen/blob/master/cockroachdb/src/jepsen/cockroach.clj#L839:

create table accounts (id int not null primary key, balance bigint not null)
insert into accounts values (0, 10)
insert into accounts values (1, 10)
insert into accounts values (2, 10)
insert into accounts values (3, 10)
-- in a loop: transfer(rand, rand, rand) where:
   transfer(From, To, Amount) =
      begin
         b1 := (select balance from accounts where id = From) - Amount
         b2 := (select balance from accounts where id = To) + Amount
         if b1 >= 0 and b2 >= 0 then
            update accounts set balance = b1 where id = From
            update accounts set balance = b2 where id = To
         endif
      commit

The test also runs select balance from accounts from time to time to
check the intermediate states of the table; that is how the inconsistency is detected.
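
For reference, here is a rough clojure.java.jdbc sketch of a single transfer as described above. This is an illustration only, not the actual test code (see the cockroach.clj link above); db stands for a connection spec.

(require '[clojure.java.jdbc :as j])

;; Illustrative sketch: move `amount` from account `from` to account `to`,
;; skipping both updates if either resulting balance would be negative.
(defn transfer! [db from to amount]
  (j/with-db-transaction [t db]
    (let [bal (fn [id] (:balance (first (j/query t ["select balance from accounts where id = ?" id]))))
          b1  (- (bal from) amount)
          b2  (+ (bal to) amount)]
      (when (and (>= b1 0) (>= b2 0))
        (j/update! t :accounts {:balance b1} ["id = ?" from])
        (j/update! t :accounts {:balance b2} ["id = ?" to])))))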

The errors/traces:
http://52.91.194.28/cgi-bin/display.py#20160226T122420
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt

http://52.91.194.28/cgi-bin/display.py#20160226T121719
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121719.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T121719.000Z/history.txt

http://52.91.194.28/cgi-bin/display.py#20160226T121608
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121608.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T121608.000Z/history.txt

In all 3 cases the state becomes inconsistent at about the time a network partition is resolved (:nemesis :info :stop "fully connected")

Nothing special in the server logs (all available from the UI in the rightmost column at http://52.91.194.28/)

Perhaps not relevant, but I also noticed the following oddity on one node:

W160226 12:24:45.486123 storage/replica.go:1774 could not GC completed transaction: context deadline exceeded
W160226 12:24:47.699457 storage/replica.go:1873 unable to resolve local intents; range 0: replica <nil> not leader; leader is <nil>

Source: http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fn1l%2Fcockroach.stderr

@tbg (Member) commented Feb 26, 2016

Against which revision?

@tbg (Member) commented Feb 26, 2016

alpha.v1-455-ga05cb9c

@tbg (Member) commented Feb 26, 2016

http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt

4   :ok :transfer   {:from 3, :to 1, :amount 2}
0   :invoke :read   nil
0   :ok :read   [33 6 1 0]
2   :invoke :transfer   {:from 2, :to 1, :amount 1}
2   :ok :transfer   {:from 2, :to 1, :amount 1}
21  :ok :read   [33 6 1 0]
0   :invoke :transfer   {:from 2, :to 1, :amount 1}
0   :fail   :transfer   [:negative 2 -1]
23  :invoke :read   nil
2   :invoke :read   nil
2   :ok :read   [32 7 0 0]
0   :invoke :read   nil
0   :ok :read   [32 7 0 0]
:nemesis    :info   :stop   "fully connected"
4   :invoke :read   nil
4   :ok :read   [32 7 0 0]

This is the part of the log where things go bad - 0 and 2 both subtract one from the third account (which initially has only one) and erroneously don't push it into the negative.
What does the error 0 :fail :transfer [:negative 2 -1] tell me?

@bdarnell (Contributor) commented:

That error means that the transfer did not occur because it would make one of the accounts negative. (format is [:negative $account $new_balance]). So that part of the log looks fine to me. The problem is that account 0 is decreasing from 33 to 32 even though it's not involved in any of the transfers visible in this snippet.

@bdarnell (Contributor) commented:

These two errors are very suspicious (the first is the conclusion of the 23 :invoke :read nil line quoted above). Where do we change isolation levels in this test?

23 :fail :read nil PSQLException: Cannot change transaction isolation level in the middle of a transaction.

43 :fail :transfer {:from 2, :to 0, :amount 1} PSQLException: Cannot change transaction isolation level in the middle of a transaction.

@knz (Contributor, Author) commented Feb 26, 2016

@bdarnell the client code doesn't (I checked with network traces). The JDBC wrapper does emit prepare/bind/execute for SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE before most of its uses of BEGIN, but this ought to make no difference.

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Feb 26, 2016
@knz knz added this to the Beta milestone Feb 26, 2016
@tbg (Member) commented Feb 26, 2016

This one is interesting (from the original log; I will look into similar occurrences in the others):

4   :ok :read   [24 8 5 3]
0   :invoke :transfer   {:from 1, :to 0, :amount 0}
0   :ok :transfer   {:from 1, :to 0, :amount 0}
11  :invoke :transfer   {:from 0, :to 1, :amount 1}
4   :invoke :transfer   {:from 0, :to 1, :amount 1}
4   :ok :transfer   {:from 0, :to 1, :amount 1}
0   :ok :read   [23 9 5 3]
0   :ok :read   [23 5 9 3]
2   :ok :read   [23 5 9 3]

This is well before the actual failure, but client 11's transfer of 1 never gets acknowledged, nor does it appear to execute, yet it matches the deduction from account 0 which causes the anomaly.

@knz (Contributor, Author) commented Feb 26, 2016

Are you saying you see neither :ok, :fail nor :timeout for it in the log? That is very strange. Perhaps a fault in the test logic too, then. (But the inconsistency remains.)


@tbg (Member) commented Feb 26, 2016

I only see a single entry with actor 11 in the log. Maybe that guy just got stuck on his first transfer indefinitely somehow? Or maybe I'm misreading the log.

@tbg (Member) commented Feb 26, 2016

Just going to work my way through these. Each snippet runs from the last "coherent" read to the next broken one.

http://52.91.194.28/cgi-bin/display.py?entry=20160226T152821#20160226T152821:

2   :ok :read   [24 0 13 3]
1   :invoke :transfer   {:from 2, :to 3, :amount 3}
1   :ok :transfer   {:from 2, :to 3, :amount 3}
1   :invoke :transfer   {:from 0, :to 2, :amount 0}
1   :ok :transfer   {:from 0, :to 2, :amount 0}
2   :invoke :transfer   {:from 2, :to 0, :amount 4}
2   :ok :transfer   {:from 2, :to 0, :amount 4}
24  :ok :transfer   {:from 3, :to 1, :amount 1}
24  :invoke :read   nil
24  :ok :read   [28 1 6 2] // off by [0, 0, 2, -4]


pending:

13  1  // client 13 has a write which never returned: 4 from 3 to 1 -> the deduction could match -4 above
4   1  // client 4 has a write which never returned: 1 from 1 to 2 -> the deposit could match +2 above

@tbg (Member) commented Feb 26, 2016

Next one: http://52.91.194.28/cgi-bin/display.py?entry=20160226T161022#20160226T161022

0       :ok     :read   [10 2 9 19]
24      :ok     :transfer       {:from 3, :to 1, :amount 1}
22      :invoke :transfer       {:from 0, :to 2, :amount 1}
22      :ok     :transfer       {:from 0, :to 2, :amount 1}
1       :invoke :transfer       {:from 1, :to 2, :amount 4}
1       :fail   :transfer       [:negative 1 -1]
3       :invoke :transfer       {:from 1, :to 2, :amount 4}
3       :fail   :transfer       [:negative 1 -1]
0       :invoke :transfer       {:from 1, :to 3, :amount 0}
0       :ok     :transfer       {:from 1, :to 3, :amount 0}
1       :invoke :transfer       {:from 1, :to 3, :amount 3}
1       :ok     :transfer       {:from 1, :to 3, :amount 3}
24      :invoke :transfer       {:from 0, :to 2, :amount 4}
24      :ok     :transfer       {:from 0, :to 2, :amount 4}
3       :invoke :transfer       {:from 0, :to 3, :amount 1}
3       :ok     :transfer       {:from 0, :to 3, :amount 1}
22      :invoke :read   nil
22      :ok     :read   [6 0 12 24]


pending:

Actor cur_actor:        1 ops
* 19    :info   :transfer       :timeout
* 19    :invoke :transfer       {:from 3, :to 0, :amount 2}
Actor cur_actor:        1 ops
* 4     :info   :transfer       :timeout
* 4     :invoke :transfer       {:from 2, :to 3, :amount 4}
Actor cur_actor:        1 ops
* 9     :info   :transfer       :timeout
* 9     :invoke :transfer       {:from 2, :to 3, :amount 2}

Here we would have expected to read [4, 0, 14, 22] but we got [6, 0, 12, 24]. Taking into account the never-acked (i.e. timed-out) ops, the first one likely executed (giving us the 6 on the first account). So we have [6,0,12,22] but see [6,0,12,24]. This could have been caused by the second lost write never happening, and the third one erroneously only depositing, but not deducting.

Do you happen to know whether both the updates and the final commit are sent individually? That would make a big difference for figuring out whether this is a replay-related issue or something with stale reads.

@bdarnell (Contributor) commented:

As far as I can tell there's no batching (j/update! returns the number of rows affected, even though it's not used here. The only way these statements could be transparently batched would be if there was something clever enough to see that the return value is discarded). The BEGIN might be batched with the first read, but the second read and both updates (if they occur) are each sent one at a time (waiting for each response), and then the COMMIT happens on its own.

I suspect that the problem has to do with mismanagement of the transaction status. We see both "Cannot change transaction isolation level in the middle of a transaction" and "ERROR: there is no transaction in progress". Each failure has at least one of these messages in (very) rough proximity to the first faulty read. (although this is far from conclusive because these messages occur in many passing runs as well). It looks like the "cannot change isolation" message comes from the JDBC driver while the "no transaction in progress" message comes from our server (the serverMsgReady message we send includes the transaction status, which is how the JDBC driver could believe a transaction was in progress).

@tbg (Member) commented Feb 26, 2016

Hmm. Are you suggesting that the transaction switches to SNAPSHOT or something like that? Seems weird. Maybe we should just write down everything that goes through Executor to a file?
Or maybe we're accidentally nesting the transactions in some funny way?

@bdarnell (Contributor) commented:

No, I don't think it's switching to SNAPSHOT; there's nothing I can see that could do that. I think that in some cases we're failing to clear the session.Txn field so a future operation may inherit the ID of a transaction that never committed.

@tbg (Member) commented Feb 26, 2016

Interesting. Some of the code in the executor about terminating transactions indeed did seem sketchy when I looked at it with @andreimatei the other day. Should be able to whip up a patch that removes some of the sketchiness.

@andreimatei (Contributor) commented:

Maybe we already removed the sketchiness? R, could you please try with #4576 patched in?


@knz (Contributor, Author) commented Feb 27, 2016

@andreimatei I don't get a clean merge on master, can you rebase?

@knz (Contributor, Author) commented Feb 27, 2016

Now with complete network traces, starting with http://52.91.194.28/cgi-bin/display.py?entry=20160227T110036

(Rightmost column)

@knz (Contributor, Author) commented Feb 27, 2016

@tschottdorf as you requested, a result with no zero transactions: http://52.91.194.28/cgi-bin/display.py?entry=20160227T112601

@bdarnell (Contributor) commented:

Thanks for the packet captures! Looking at this one, I've found some strange behavior. Looking at the tag field in the CommandComplete message is a good way to get an overview of what's going on. Here's the tshark invocation to show all the tags:

tshark -r Downloads/trace.pcap -d tcp.port==26257,pgsql -T fields -e frame.number -e tcp.stream -e pgsql.tag  -Y pgsql.tag

This prints three columns: the frame number (for identification), the stream number (to make sure we're not seeing a mix of two TCP connections), and the command tag. Normal traffic has a few different transaction patterns:

Successful update:

212     1       BEGIN
219     1       SELECT 1
227     1       SELECT 1
235     1       UPDATE 1
243     1       UPDATE 1
248     1       COMMIT

Cancelled update (insufficient balance):

272     1       BEGIN
277     1       SELECT 1
282     1       SELECT 1
287     1       COMMIT

Aborted update (other errors; rollback shows up with an empty tag):

2683    1       BEGIN
2686    1       SELECT 1
2691    1       SELECT 1
2698    1       UPDATE 1
2710    1       

Read everything:

311     1       BEGIN
316     1       SELECT 4
321     1       COMMIT

And a surprising number of SET and SHOW commands dealing with the isolation level.

After an ordinary transaction commits at frame 4263, something goes off the rails. We see a bunch of consecutive SHOWs and SETs, then one transaction that does five selects and four updates:

4263    1       COMMIT
4269    1       SET
4275    1       SHOW 1
4282    1       SET
4286    1       BEGIN
4294    1       
4300    1       SHOW 1
4305    1       SHOW 1
4310    1       SHOW 1
4315    1       SHOW 1
4320    1       SET
4325    1       SET
4329    1       SET
4334    1       SET
4338    1       BEGIN
4348    1       
4353    1       BEGIN
4359    1       SELECT 1
4364    1       SELECT 1
4370    1       SET
4374    1       SELECT 1
4381    1       SELECT 1
4387    1       SET
4394    1       UPDATE 1
4401    1       UPDATE 1
4408    1       SELECT 1
4415    1       UPDATE 1
4422    1       UPDATE 1
4427    1       COMMIT

There is a seven-second gap between the BEGIN at frame 4286 and the rollback at frame 4294 (in between, the transaction attempted to do the consistent read and failed after 7s with an "encountered future timestamp" error). The rollback at 4348 is another future-timestamp error on a consistent read, although the time gap here is less than 1 second.

By looking at the timestamps in the error message, we can see that this corresponds to this segment of history.txt, which is also where the accounts go from 19+2+19+0=40 to 21+0+19+2=42.

0   :ok :read   [19 2 19 0]
1   :invoke :transfer   {:from 2, :to 0, :amount 0}
1   :ok :transfer   {:from 2, :to 0, :amount 0}
0   :invoke :transfer   {:from 0, :to 1, :amount 0}
24  :invoke :transfer   {:from 1, :to 3, :amount 2}
0   :fail   :transfer   {:from 0, :to 1, :amount 0} BatchUpdateException: Batch entry 0 UPDATE accounts SET balance = 2 WHERE id = 1 was aborted.  Call getNextException to see the cause.
ERROR: retry txn "sql/executor.go:453 sql" id=b17c022e key=/Table/51/1/0/2/1 rw=true pri=0.00539324 iso=SERIALIZABLE stat=PENDING epo=1 ts=1456570874.194177909,1 orig=1456570874.194177909,1 max=1456570874.443595548,0
24  :ok :transfer   {:from 1, :to 3, :amount 2}
42  :fail   :read   nil PSQLException: ERROR: read at time 1456570873.575008201,0 encountered previous write with future timestamp 1456570874.250578728,0 within uncertainty interval
23  :invoke :read   nil
24  :invoke :transfer   {:from 2, :to 1, :amount 3}
23  :ok :read   [21 0 19 2]

Clojure uses a mix of prepared and non-prepared statements so I don't have a slick way to dump that messed-up transaction, but here's a by-hand reconstruction:

4351 BEGIN; select balance from accounts where id = 2
4359 result: 19
4362 select balance from accounts where id = 3
4364 result: 2
4372 select balance from accounts where id = 2
4374 result: 19
4377 select balance from accounts where id = 0
4381 result: 19
4389 update accounts set balance = 17 where id = 2
4397 update accounts set balance = 2 where id = 3
4404 select balance from accounts where id = 1
4408 result: 0
4411 update accounts set balance = 19 where id = 2
4418 update accounts set balance = 21 where id = 0
4425 commit

So we have two transfers (actually 3, but the third didn't complete) executing simultaneously in the same transaction. One is moving 2 from account 2 to 0, and the other is "moving" 0 from 3 to 2, which has the effect of overwriting the previous change with its original value. Those transfers don't exist in the quoted snippet of history.txt, but attempted transfers do appear elsewhere in the file (by actors 32 and 37) which later show up as timeouts.

This is looking to me like a client bug (I've seen just such a bug in sqlalchemy's connection pool in the past): a connection is being returned to the pool after a timeout but the original thread is not properly interrupted, and eventually will continue while a new thread is using the same connection.
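
For illustration, here is a minimal, hypothetical sketch of that failure mode (the names are made up and this is not the actual test code): the caller stops waiting after the timeout and the connection goes back into rotation, while the JDBC call keeps running on its own thread and later issues its remaining statements on a connection that another operation is now using.

;; Hypothetical sketch of the suspected race, not the actual test code.
(defn run-with-timeout [conn op timeout-ms]
  (let [f (future (op conn))]              ; op keeps running even after we stop waiting
    (let [res (deref f timeout-ms ::timeout)]
      (when (= res ::timeout)
        (future-cancel f))                 ; only requests interruption; the JDBC thread
                                           ; may not notice it until much later
      res)))

;; If ::timeout is returned and conn is handed to the next operation, the stale
;; thread's remaining statements interleave with the new operation's transaction
;; on the same connection, producing a mixed transaction like the one above.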

@bdarnell (Contributor) commented:

In this test, timeouts are implemented using this macro, which is sketchy. In the JVM, threads only check their interrupt status during certain operations, so it is possible for a thread to continue for some time after being interrupted. Whenever there is a timeout, we need to close the connection and open a new one. (It would be possible, if the underlying JDBC driver checks the interruption status in all the right places, for the interrupted thread to roll back its transaction and then acknowledge the interrupt so the connection can be used again, but that's tricky to get right.)
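
A minimal sketch of the workaround suggested here, assuming the client keeps its current connection in an atom (the names are illustrative, not the actual Jepsen client code):

(require '[clojure.java.jdbc :as j])

;; Sketch only: run op against the current connection; on timeout, discard that
;; connection and open a fresh one so a stuck thread can never share a
;; connection with a later operation.
(defn invoke-with-timeout! [conn spec timeout-ms op]
  (let [res (deref (future (op @conn)) timeout-ms ::timeout)]
    (if (= res ::timeout)
      (do (try (.close ^java.sql.Connection (:connection @conn))
               (catch Exception _))        ; best effort; the stuck thread may still hold it
          (reset! conn {:connection (j/get-connection spec)})
          ::timeout)
      res)))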

@tbg (Member) commented Feb 27, 2016

👍 great analysis @bdarnell. I sat down anyway and am porting the nemesis stuff to Docker, so we'll have this test in acceptance soon enough (and we'll also be able to run all the other tests with network partitions).

@knz (Contributor, Author) commented Feb 28, 2016

OK thanks for the tip. I'll force opening a new connection on timeouts and see what happens.

@knz (Contributor, Author) commented Feb 28, 2016

OK, it looks like after forcing reconnections this issue disappears. @bdarnell, many thanks for the weekend analysis!

@tbg (Member) commented Feb 28, 2016

Great!


@knz (Contributor, Author) commented Feb 28, 2016

Closed via cockroachdb/jepsen#11.

@knz knz closed this as completed Feb 28, 2016
@knz (Contributor, Author) commented Feb 29, 2016

( #4036 )
