roachtest: import/tpch/nodes=32 failed #31732
Ugh, hoped we were done with this sort of thing for a while. A number of ranges simply seem to "freeze up". The SHA notably includes #31689.
However, this seems to be happening across the board, for multiple tests at once:
I guess the only way to really make progress on this is a live repro. Going to kick off a run (with fewer nodes).
At least this repro'ed right away:
(among others)
Got the RangeID from the stack trace (0x6e5 = 1765): http://35.231.122.120:26258/debug/logspy?count=10000&duration=10s&grep=r1765/
Interestingly, this had resolved itself when I just looked again. From the logspy above, I can see that it talked to replicaIDs 1, 3, 5 (there aren't any responses in the 10s window).
This is interesting because that makes four members of the range in total. The range status page shows that replicaIDs 1, 2, 5 are part of the range now, so 3 is gone. But I'm having trouble piecing this together. We only send the MsgApp to 3 -- why not to the other two? We're also only sending a
cc @bdarnell in case you see something I don't.
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982526&tab=buildLog
^- stuck raft proposals until the bitter end.
The Raft snapshot queue is absolutely backed up, and so are some of these other queues. My best guess as to why this is happening so suddenly is that the test is done with the "import" part and begins a restore part (where it actually splits and sends SSTs around). In the situation above, where two replicas need a snapshot, it would make sense for the range to just fall over. And it also makes sense that once you have thousands of pending Raft snapshots, there is big trouble.
logspying on one of the nodes for
The delay of ~32s between "sending raft snapshot" and "sending Raft snapshot d9244af9 at applied index 20" is suspicious. What's taking so long? This is a) creating the snapshot (but that's just opening an iterator) and b) exchanging metadata with the target store. It's not actually transferring any data yet. We see that the actual transfer is more like 5s. The 39747ms log line indicates that we do spend most of that time in
Ah, of course, this will block on the receiver's snapshot apply semaphore: cockroach/pkg/storage/store_snapshot.go Line 366 in 5a37344
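For illustration, here is a minimal, self-contained sketch of that kind of bottleneck -- not the actual store_snapshot.go code, just a channel-based apply semaphore like the one referenced above -- showing how a sender can stall for a long time before streaming a single byte:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// snapshotReceiver bounds concurrent snapshot applications with a
// channel-based semaphore, mimicking the reservation step a sender must
// complete before it starts streaming snapshot data.
type snapshotReceiver struct {
	applySem chan struct{} // capacity 1: one snapshot application at a time
}

// reserve blocks until the receiver is ready to accept another snapshot.
func (r *snapshotReceiver) reserve(ctx context.Context) error {
	select {
	case r.applySem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (r *snapshotReceiver) release() { <-r.applySem }

func main() {
	r := &snapshotReceiver{applySem: make(chan struct{}, 1)}
	ctx := context.Background()

	_ = r.reserve(ctx) // first snapshot reserves immediately
	go func() {
		time.Sleep(3 * time.Second) // simulate a slow apply (think ~30s in the logs above)
		r.release()
	}()

	start := time.Now()
	_ = r.reserve(ctx) // second snapshot waits here before any data moves
	fmt.Printf("second snapshot waited %s before it could start streaming\n", time.Since(start))
}
```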
Presumably all nodes are sending snapshots to all other nodes, so everything's completely clogged. OK, it seems that one very fundamental problem here does indeed have to do with why we're seeing all these snapshots become necessary in the first place. I'm going to make it a priority to look into the related truncation logic discussion in #31409 (comment) to see if we're doing stupid things when AddSSTable is involved (or generally). cc @andreimatei
I'd start by looking at the replica GC queue. We send preemptive snapshots before adding replicas; this leaves a brief window between the preemptive snapshot and the finalization of the replica change during which the replica can be GC'd and require an immediate Raft snapshot. Previously, we relied on the fact that the replica GC queue was not responsive enough to make this scenario likely. Maybe now the replica GC is happening faster and is more likely to catch replicas during this vulnerable window.
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=983962&tab=buildLog
I'm starting to suspect this change: e436e56
I had two successful runs after "fixing" (or rather, disarming) the Raft truncation bugs, but good point about the GC queue. I'll keep an eye out. Our goal really ought to be to get these import tests passing with zero (or barely any) Raft snapshots.
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog
The AddSSTable requests are ~32mb each (cockroach/pkg/ccl/storageccl/import.go, lines 34 to 38 in 7ccc197).
What I don't understand, however, is why things get so bad with the bug. When the next AddSSTable comes in, the
I might be missing something, or the behavior is more subtle (for example, it might be worse when ranges are newly created, or when there are additional small proposals like lease requests involved). I'm going to see if there's something here that I can reproduce with a bit of hacking on a local 3x cluster and
I'm running a local experiment via roachprod with three nodes, where the third node has a 100ms sleep in
I additionally have a diff applied that makes the tracked size of the raft log always 8mb (i.e. well above what would trigger an aggressive truncation, 4mb). I'm watching for Raft snapshots by tailing the logs and by watching
I seem to be able to wedge the cluster in a stuck state: the rightmost replica isn't catching up over minutes. It's clearly in need of a Raft snapshot, but this snapshot doesn't seem to receive the appropriate priority. Instead, snapshots intersecting that replica are tried in an endless loop and, of course, fail. The slow replica is sent for GC repeatedly, but of course it can't be GC'ed -- it's still in the range, but needs to catch up on the split trigger.
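To make the effect of that hack concrete, here is a hedged sketch of the kind of decision it forces (names and exact mechanics are illustrative, not the actual raft log queue code): with the tracked size pinned at 8mb, the "log too big" branch is always taken, so any follower that is even slightly behind needs a Raft snapshot.

```go
package storagehack

// aggressiveTruncationThreshold is the assumed 4mb trigger mentioned above.
const aggressiveTruncationThreshold = 4 << 20

// chooseTruncationIndex picks how far to truncate the raft log. Once the
// tracked size exceeds the threshold, truncation no longer waits for the
// slowest follower, which is exactly what pushes that follower onto the
// Raft snapshot path.
func chooseTruncationIndex(trackedLogSize int64, quorumCommitIndex, slowestFollowerIndex uint64) uint64 {
	if trackedLogSize > aggressiveTruncationThreshold {
		// Log considered too big: truncate up to the quorum commit index even if
		// a follower hasn't caught up; that follower now requires a snapshot.
		return quorumCommitIndex
	}
	// Otherwise keep the entries the slowest follower still needs.
	return slowestFollowerIndex
}
```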
This situation resolved itself after a few minutes, but imagine this happening with thousands of ranges at the same time. Looking at the code that adds replicas into the raft snapshot queue, I see that the same priority (zero) is used for all of them. This suggests that we could use additional smarts. For example, instead of sending a replica that is blocking a snapshot to GC, we should (also?) offer it to the Raft snapshot queue, ideally at a priority that's higher if we suspect that the snapshot will shrink the replica.
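A rough, hypothetical sketch of that suggestion (the names and queue API are illustrative, not CockroachDB's actual code): a replica that is blocking other snapshots, or whose catch-up is expected to shrink it, gets enqueued at a higher priority than the flat zero used today.

```go
package storagehack

import "sync"

// raftSnapshotQueue is a toy stand-in for the real queue; it only records the
// highest priority requested per range.
type raftSnapshotQueue struct {
	mu    sync.Mutex
	prios map[int64]float64 // rangeID -> priority
}

// maybeAdd enqueues a replica for a Raft snapshot with a priority derived
// from why we want the snapshot, instead of zero for everything.
func (q *raftSnapshotQueue) maybeAdd(rangeID int64, blocksOtherSnapshots, expectedToShrink bool) {
	prio := 0.0
	if blocksOtherSnapshots {
		prio += 1 // this replica is holding up snapshots that intersect it
	}
	if expectedToShrink {
		prio += 1 // catching it up should let it split or be GC'ed, freeing things up
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.prios == nil {
		q.prios = map[int64]float64{}
	}
	if prio > q.prios[rangeID] {
		q.prios[rangeID] = prio
	}
}
```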
I can also repro a stats inconsistency (running with
Ok, much simpler:
With the truncation hack that simulates overly aggressive truncations (and the delay on n3), that gives me ~700 successful Raft snapshots and ~1100 failures. Ouch. If I remove my hacks (i.e. run master without aggressive truncations or delay), I typically get <300 Raft snapshot successes and very few failures. I'm just going to go ahead and fix the truncation problems to see what sticks afterwards.
Experimenting some more, I can see that (on master with merges turned off) I still see Raft snapshots if I disable both the replicaGCQueue and the Raft log queue. I think this is due to the race in which the RHS of the split gets created via an incoming message from another replica that processed the split faster.
I tried to "hack" this problem away by disabling the Raft snapshot queue, but this leads to badness. I assume this is because once Raft asks for a snapshot, you have to give it one. You can't just forget about that request. Not 100% sure this is what's happening, but it would make sense.
I was able to hack this problem away by simply reporting all MsgSnap back to the raft group as completed. I opened my repro for this with a comment as a PR so that I don't lose track of it: #31875
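For reference, a minimal sketch of that hack (the transport plumbing is hypothetical; the ReportSnapshot call is the real etcd/raft API): any outgoing MsgSnap is immediately acknowledged as finished instead of being delivered, so the leader resumes normal log replication rather than waiting on a snapshot that never arrives. This is a repro aid, not a fix -- see #31875.

```go
package storagehack

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// dropSnapshotsAndAck filters outgoing raft messages. Instead of sending
// snapshots, it reports them back to the group as successfully applied.
func dropSnapshotsAndAck(rn *raft.RawNode, msgs []raftpb.Message) []raftpb.Message {
	out := msgs[:0]
	for _, m := range msgs {
		if m.Type == raftpb.MsgSnap {
			// Pretend the snapshot made it; the follower is left to catch up by
			// other means (e.g. applying its pending split trigger).
			rn.ReportSnapshot(m.To, raft.SnapshotFinish)
			continue
		}
		out = append(out, m) // everything else goes out over the transport as usual
	}
	return out
}
```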
Attempting to productionize @petermattis' patch https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/tpcc-9x64?expand=1, I went down a little rabbit hole wondering why my test saw unexpected raft log sizes. Turns out we're mixing something up in log truncation. A TruncateRequest has a field
Digging into this with @nvanbenschoten.
Looks mostly like a stats issue. We're subtracting a little more (one entry's worth) from the log size than we should. We generate the correct TruncatedState (in which the Index is the last deleted one, not the first undeleted one) by populating it with
Edit: PR here: https://github.com/cockroachdb/cockroach/pull/31885/files
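To spell out the off-by-one (hypothetical helper, not the actual code): under the convention that TruncatedState.Index is the last deleted entry, the deleted range is (oldIndex, newIndex], and subtracting one entry beyond that over-shrinks the tracked log size.

```go
package storagehack

// truncatedSizeDelta returns how much the tracked raft log size should shrink
// for a truncation that moves TruncatedState.Index from oldIdx to newIdx,
// where Index is the last *deleted* entry (not the first surviving one).
func truncatedSizeDelta(entrySize map[uint64]int64, oldIdx, newIdx uint64) int64 {
	var delta int64
	// Entries oldIdx+1 through newIdx (inclusive) are the ones deleted.
	for idx := oldIdx + 1; idx <= newIdx; idx++ {
		delta -= entrySize[idx]
	}
	// The bug described above is equivalent to also subtracting entry newIdx+1,
	// i.e. treating Index as "first undeleted": one entry's worth too much.
	return delta
}
```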
Oh wow, are you ready for another small dumpster fire?
We were accounting for sideloaded payloads (SSTs) when adding them to the log, but were omitting them during truncations. As a result, the tracked raft log size would permanently skyrocket, which in turn led to extremely aggressive truncations and pathological amounts of Raft snapshots.

I'm still concerned about this logic, as we're now relying on numbers obtained from the file system to exactly match a prior in-memory computation, and there may be other bugs that cause a divergence. But this fixes the blatantly obvious one, so it's a step in the right direction.

The added tests highlight a likely omission in the sideloaded storage code, which doesn't access the file system through the RocksDB env as it seems it should; filed as cockroachdb#31913.

At this point it's unclear whether it fixes the issues below, but at the very least it seems to address a major problem they encountered:

Touches cockroachdb#31732. Touches cockroachdb#31740. Touches cockroachdb#30261. Touches cockroachdb#31768. Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to overly aggressive Raft log truncations that could occur during RESTORE or IMPORT operations.
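A hedged sketch of what the fix amounts to (illustrative signatures, not the actual implementation): when computing how many bytes a truncation frees, the sideloaded (SST) payload on disk has to be counted alongside the entry's encoded size, just as it was counted when the entry was appended.

```go
package storagehack

import "go.etcd.io/etcd/raft/raftpb"

// sizeFreedByTruncation sums the bytes released by deleting the given raft
// entries: their encoded size plus, for sideloaded entries, the size of the
// payload file on disk (the part that was previously omitted).
func sizeFreedByTruncation(
	entries []raftpb.Entry,
	isSideloaded func(raftpb.Entry) bool, // stand-in for the real sideloading check
	sideloadedSize func(index, term uint64) (int64, error), // stand-in for a file-size lookup
) (int64, error) {
	var freed int64
	for _, ent := range entries {
		freed += int64(ent.Size()) // encoded entry size, tracked at append time
		if isSideloaded(ent) {
			sz, err := sideloadedSize(ent.Index, ent.Term)
			if err != nil {
				return 0, err
			}
			freed += sz // on-disk SST payload bytes, now also subtracted
		}
	}
	return freed, nil
}
```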
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=987340&tab=buildLog
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990998&tab=buildLog
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=991714&tab=buildLog
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog
^- this last run has #31988.
Stuck Raft requests all the way through.
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog