Consistency anomaly detected by a Jepsen bank transfer test #4687
Against which revision? |
alpha.v1-455-ga05cb9c |
http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt
This is the part of the log where things go bad - |
That error means that the transfer did not occur because it would make one of the accounts negative. (format is |
These two errors are very suspicious (The first is the conclusion of the
|
@bdarnell the client code doesn't (I checked with network traces). The JDBC wrapper does emit prepare/bind/execute for |
This one here is interesting (from the original log, but will look into similar occurrences in the others)
this is way up from the actual failure, but client |
Are you saying you see none of :ok, :fail, or :timeout for it in the log? That is very strange. Perhaps a fault in the test logic too then. (But the inconsistency remains.) |
I only see a single entry with actor |
Just going to work my way through these. Snippets are the last "coherent" read to the next broken one.
|
Next one: http://52.91.194.28/cgi-bin/display.py?entry=20160226T161022#20160226T161022
Here we would have expected to read Do you happen to know whether both the updates and the final commit are sent individually? That would make a big difference for figuring out whether this is a replay-related issue or something with stale reads. |
As far as I can tell there's no batching ( I suspect that the problem has to do with mismanagement of the transaction status. We see both "Cannot change transaction isolation level in the middle of a transaction" and "ERROR: there is no transaction in progress". Each failure has at least one of these messages in (very) rough proximity to the first faulty read. (although this is far from conclusive because these messages occur in many passing runs as well). It looks like the "cannot change isolation" message comes from the JDBC driver while the "no transaction in progress" message comes from our server (the |
Hmm. Are you suggesting that the transaction switches to SNAPSHOT or something like that? Seems weird. Maybe we should just write down everything that goes through |
No, I don't think it's switching to SNAPSHOT; there's nothing I can see that could do that. I think that in some cases we're failing to clear the |
Interesting. Some of the code in executor about terminating transactions did indeed seem sketchy when I looked at it with @andreimatei the other day. Should be able to whip up a patch that removes some of the sketchiness. |
Maybe we already removed the sketchiness? |
@andreimatei I don't get a clean merge on master, can you rebase? |
Now with complete network traces, starting with http://52.91.194.28/cgi-bin/display.py?entry=20160227T110036 (Rightmost column) |
@tschottdorf as you requested, a result with no zero transactions: http://52.91.194.28/cgi-bin/display.py?entry=20160227T112601 |
Thanks for the packet captures! Looking at this one, I've found some strange behavior. I've found looking at the tag field in the
This prints three columns: the frame number (for identification), the stream number (to make sure that we're not seeing a mixing of two TCP connections), and the command tag. Normal traffic has a few different transaction patterns: Successful update:
Cancelled update (insufficient balance):
Aborted update (other errors; rollback shows up with an empty tag):
Read everything:
And a surprising number of SET and SHOW commands dealing with the isolation level. After an ordinary transaction commits at frame 4263, something goes off the rails. We see a bunch of consecutive SHOWs and SETs, then one transaction that does five selects and four updates:
There is a seven-second gap between the BEGIN at frame 4286 and the rollback at frame 4294 (in between the transaction attempted to do the consistent read and failed after 7s with an "encountered future timestamp" error). The rollback at 4348 is another future-timestamp on a consistent read, although the time gap here is less than 1 second. By looking at the timestamps in the error message, we can see that this corresponds to this segment of history.txt, which is also where the accounts go from 19+2+19+0=40 to 21+0+19+2=42.
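The three-column dump described above (frame number, stream number, command tag) can be grouped into transactions mechanically. The following is only a rough sketch: `summarize_transactions` is a hypothetical helper, the sample rows are invented for illustration and are not taken from the capture, and the tag strings assume PostgreSQL-style command tags such as `SELECT 1`. The one rule taken from the thread is that a rollback shows up with an empty tag.

```python
from collections import Counter


def summarize_transactions(rows):
    """Group (frame, stream, tag) rows into transactions, per TCP stream.

    Returns one tuple per finished transaction:
    (stream, first_frame, last_frame, outcome, statement_counts).
    """
    open_txns = {}  # stream -> (first_frame, Counter of statement verbs)
    finished = []
    for frame, stream, tag in rows:
        # An empty tag is treated as a rollback, as noted in the thread.
        verb = tag.split()[0] if tag else "ROLLBACK"
        if verb == "BEGIN":
            open_txns[stream] = (frame, Counter())
        elif verb in ("COMMIT", "ROLLBACK"):
            if stream in open_txns:
                first_frame, counts = open_txns.pop(stream)
                finished.append((stream, first_frame, frame, verb, dict(counts)))
        elif stream in open_txns:
            open_txns[stream][1][verb] += 1
    return finished


# Hypothetical rows, NOT from the real packet capture.
sample_rows = [
    (1, 0, "BEGIN"), (2, 0, "SELECT 1"), (3, 0, "SELECT 1"),
    (4, 0, "UPDATE 1"), (5, 0, "UPDATE 1"), (6, 0, "COMMIT"),
    (7, 0, "BEGIN"), (8, 0, "SELECT 1"), (9, 0, ""),
]
summary = summarize_transactions(sample_rows)
for txn in summary:
    print(txn)
```

A transaction whose counts look unlike the others on the same stream (for example five selects and four updates) is then easy to spot by eye or with a simple filter.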
Clojure uses a mix of prepared and non-prepared statements so I don't have a slick way to dump that messed-up transaction, but here's a by-hand reconstruction:
So we have two transfers (actually 3, but the third didn't complete) executing simultaneously in the same transaction. One is moving 2 from account 2 to 0, and the other is "moving" 0 from 3 to 2, which has the effect of overwriting the previous change with its original value. Those transfers don't exist in the quoted snippet of history.txt, but attempted transfers do appear elsewhere in the file (by actors 32 and 37) which later show up as timeouts. This is looking to me like a client bug (I've seen just such a bug in sqlalchemy's connection pool in the past): a connection is being returned to the pool after a timeout but the original thread is not properly interrupted, and eventually will continue while a new thread is using the same connection. |
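To make the overwrite concrete, here is a small simulation of two read-then-write transfers whose statements interleave on one connection. It is a sketch under assumptions: the starting balances are the 19+2+19+0 read quoted earlier, but the exact statement order is assumed (the by-hand reconstruction is not reproduced here), and `run_interleaved` is a hypothetical name. It shows only how the total grows by 2; it does not reproduce the exact per-account balances of the bad read.

```python
def run_interleaved(balances):
    """Replay two transfers that each read both accounts, then write both."""
    state = dict(balances)

    # Transfer A reads: move 2 from account 2 to account 0.
    a_from = state[2]
    a_to = state[0]

    # Transfer B reads: "move" 0 from account 3 to account 2.
    # It sees account 2 before A has written to it.
    b_from = state[3]
    b_to = state[2]

    # Transfer A writes its new balances.
    state[2] = a_from - 2
    state[0] = a_to + 2

    # Transfer B writes from its stale reads, restoring account 2
    # to its original value and discarding A's debit.
    state[3] = b_from - 0
    state[2] = b_to + 0
    return state


before = {0: 19, 1: 2, 2: 19, 3: 0}
after = run_interleaved(before)
print(sum(before.values()), sum(after.values()))
```

Account 0 keeps the credit of 2 while account 2 loses its debit, so money is created even though each transfer on its own would preserve the sum.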
In this test, timeouts are implemented using this macro, which is sketchy. In the JVM, threads only check their interrupt status during certain operations, so it is possible for a thread to continue for some time after being interrupted. Whenever there is a timeout, we need to close the connection and open a new one. (It would be possible, if the underlying JDBC driver checks the interruption status in all the right places, for the interrupted thread to roll back its transaction and then acknowledge the interrupt so the connection can be used again, but that's tricky to get right). |
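The "close and reopen on timeout" rule can be sketched as follows. This is an illustration in Python rather than the test's Clojure, and it is not the actual fix: `FakeConnection` and `Client` are hypothetical stand-ins, and a real client would open a real database connection where the sketch creates a new `FakeConnection`.

```python
import concurrent.futures
import itertools
import threading


class FakeConnection:
    """Hypothetical stand-in for a database connection."""

    _ids = itertools.count(1)

    def __init__(self):
        self.id = next(FakeConnection._ids)
        self.closed = False

    def close(self):
        self.closed = True


class Client:
    """Owns one connection and replaces it whenever an operation times out."""

    def __init__(self):
        self.conn = FakeConnection()
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def call(self, op, timeout_s):
        conn = self.conn
        future = self._executor.submit(op, conn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # The worker may still be issuing statements on conn, so it
            # must never be reused: close it and open a fresh connection.
            conn.close()
            self.conn = FakeConnection()
            return "timeout"


release = threading.Event()
client = Client()
stale = client.conn
# The operation blocks until released, so the call times out first.
result = client.call(lambda conn: release.wait(), timeout_s=0.05)
release.set()  # let the abandoned worker finish
print(result, stale.closed, client.conn is not stale)
```

The design choice is that the caller never tries to find out whether the timed-out worker has stopped. It simply makes the old connection unusable, so a late-running worker cannot interleave its statements with the next operation's transaction.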
👍 great analysis @bdarnell. I still sat down and am porting the nemesis stuff to Docker, so we'll have this test in |
OK thanks for the tip. I'll force opening a new connection on timeouts and see what happens. |
OK, it looks like after forcing reconnections, this issue disappears. @bdarnell many thanks for the weekend analysis! |
Great! |
Closed via cockroachdb/jepsen#11. |
( #4036 ) |
The bank test is a transfer test that should preserve the total sum. The anomaly is that the sum is not preserved.
The test scenario, as encoded in https://github.com/cockroachdb/jepsen/blob/master/cockroachdb/src/jepsen/cockroach.clj#L839:
And then also
select balance from accounts
from time to time to check the intermediate states of the table. That's how the inconsistency is detected.
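The detection step amounts to checking that every read of the balances sums to the same total. A minimal sketch, assuming each read is available as a list of balances: `first_inconsistent_read` is a hypothetical helper, not the checker used by the test, and the two sample reads are the 19+2+19+0 and 21+0+19+2 balances quoted in the analysis comment.

```python
from typing import Optional, Sequence


def first_inconsistent_read(
    reads: Sequence[Sequence[int]], expected_total: int
) -> Optional[int]:
    """Return the index of the first read whose sum differs from expected_total."""
    for index, balances in enumerate(reads):
        if sum(balances) != expected_total:
            return index
    return None


reads = [
    [19, 2, 19, 0],  # sums to 40
    [21, 0, 19, 2],  # sums to 42
]
print(first_inconsistent_read(reads, expected_total=40))
```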
The errors/traces:
http://52.91.194.28/cgi-bin/display.py#20160226T122420
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt
http://52.91.194.28/cgi-bin/display.py#20160226T121719
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121719.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T121719.000Z/history.txt
http://52.91.194.28/cgi-bin/display.py#20160226T121608
http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121608.000Z%2Fresults.edn
http://52.91.194.28/cockroachdb-bank/20160226T121608.000Z/history.txt
In all 3 cases the state becomes inconsistent at about the time a network partition is resolved (:nemesis :info :stop "fully connected").
Nothing special in the server logs (all available from the UI in the rightmost column at http://52.91.194.28/).
Perhaps not relevant, but I also noticed the following oddity on one node:
Source: http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fn1l%2Fcockroach.stderr