roachtest: tpcc/nodes=3/w=max failed #35337
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160398&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1161673&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163706&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163702&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165134&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165148&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168756&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169984&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170799&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176952&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
|
uhm what? This sounds alarming -- @jordanlewis |
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178894&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178890&tab=buildLog
|
I pulled #35337 (comment) into a separate tracking issue: #35812. |
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1180757&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182995&tab=buildLog
|
Ok, this test should really pass if it doesn't highlight specific failures. I'm digging into these weird failure modes now. |
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183682&tab=buildLog
|
One issue is definitely the lease imbalance here:
Test has been running for 45 minutes and
My run (from which the above screenshot was captured) has definitely already made it past the failures above, but we can also see that the import makes some progress but never gets |
This is my ssh invocation running the import on node four:
Doesn't look like keep-alives are on. They're off by default. Might be worth a shot. |
This can't hurt, and could explain cockroachdb#35337 (comment). Release note: None
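For reference, here is a minimal sketch of what enabling client-side keep-alives could look like when shelling out to ssh the way the test harness does. The host address, remote command, and option values are placeholders for illustration, not the actual roachprod change; `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// ServerAliveInterval makes the ssh client send a keep-alive probe after
	// 60s of inactivity; ServerAliveCountMax drops the connection after three
	// unanswered probes.
	args := []string{
		"-o", "ServerAliveInterval=60",
		"-o", "ServerAliveCountMax=3",
		"ubuntu@10.0.0.4", // hypothetical address of node four
		"./workload fixtures import tpcc --warehouses=...", // placeholder remote command
	}
	cmd := exec.Command("ssh", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ssh failed:", err)
	}
}
```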
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195736&tab=buildLog
|
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195714&tab=buildLog
|
So far I have not hit a reproduction of the
Looking at the logs from the failures, it seems like the cluster is overloaded given the slow liveness heartbeats and handle raft ready. This could also cause
Will look at the goroutine dumps to see if there's anything interesting. I've also been wondering if:
@tbg I don't think I would classify this failure in the context of this test as a release blocker, but let me know if you think otherwise. It's essentially a hard-to-hit query timeout. |
I'll let Dan chime in too, but this test currently constitutes our fledgling release whitelist, meaning that if this test fails, we don't release. Perhaps the conclusion can be that we're pushing
The logs of n2 and n3 do have a few slow heartbeats before the
i.e. there's nothing going on except a system hard at work, and boom, suddenly the inbound streams pop up. But when we look at
The log of n1 around the time of the inbound stream problems indicates that the node was under heavy stress there as well. It hovers at around 7k goroutines (other nodes have roughly half that), various
@ajwerner I'd appreciate it if you gave this last repro a good look and maybe ran this yourself, watching out for signs of an overload. It's hard for me to judge whether this is all just reaching its natural breaking point or whether (which I, edit: don't, assume) something stupid is going on. The memory pressure doesn't seem too high, though (if I'm reading the 0.2% gc right). BTW, can someone help me explain this beauty (probably @vivekmenezes)?
These lines repeat at roughly this cadence for around four minutes. Looks like something needs to be fixed there. Line 307 in 8ee8477
|
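As a rough way to watch for the overload signals mentioned above (a goroutine count climbing toward ~7k, GC around 0.2% of CPU), here is a minimal in-process sketch. The threshold and sampling interval are arbitrary choices for illustration, not anything the test or the server actually uses.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchOverload periodically samples the goroutine count and the fraction of
// CPU spent in GC, logging a warning when the goroutine count crosses a
// (hypothetical) limit.
func watchOverload(interval time.Duration, goroutineLimit int) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		n := runtime.NumGoroutine()
		log.Printf("goroutines=%d gc_cpu=%.2f%% heap=%d MiB",
			n, ms.GCCPUFraction*100, ms.HeapAlloc>>20)
		if n > goroutineLimit {
			log.Printf("WARNING: goroutine count %d exceeds %d", n, goroutineLimit)
		}
	}
}

func main() {
	go watchOverload(10*time.Second, 5000)
	select {} // stand-in for the real workload
}
```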
It's also unclear why n1 gets slammed so much more than the other nodes. Is the balance maybe just off? |
@bdarnell to look into this failure mode with @petermattis as a second pair of 👀 |
It's as retriable as other RPC errors. |
I verified that saturating network links does not seem to be the problem. On the hunch that 96 MiB/108 MiB (r/w)net might be coming close to saturating a 1 Gbit network link, I went and measured the network throughput between hosts of this machine type, and I can report that that is not the problem. iperf gets 8.57 Gbits/sec over a single TCP connection. |
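The measurement above used iperf, but the idea is simple enough to sketch: push bytes over a single TCP connection for a fixed window and report the rate. A toy Go version follows; the addresses, ports, and durations are made up for illustration and this is not what was actually run.

```go
package main

import (
	"io"
	"log"
	"net"
	"time"
)

// serve accepts one connection and drains everything the client sends.
func serve(addr string) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatal(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	n, _ := io.Copy(io.Discard, conn)
	log.Printf("received %d bytes", n)
}

// send writes 1 MiB chunks for the given duration and reports the achieved rate.
func send(addr string, d time.Duration) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	buf := make([]byte, 1<<20)
	var total int64
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		n, err := conn.Write(buf)
		total += int64(n)
		if err != nil {
			break
		}
	}
	gbits := float64(total) * 8 / 1e9 / d.Seconds()
	log.Printf("sent %d bytes in %s: %.2f Gbit/s", total, d, gbits)
}

func main() {
	// Hypothetical usage between two hosts: run serve("0.0.0.0:7777") on one
	// node and send("10.0.0.2:7777", 10*time.Second) on another. Here we just
	// demo it over loopback.
	done := make(chan struct{})
	go func() { serve("127.0.0.1:7777"); close(done) }()
	time.Sleep(100 * time.Millisecond)
	send("127.0.0.1:7777", 2*time.Second)
	<-done
}
```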
Isn't that approximately 100 MB transferred over 10s anyway in the log line?
That's closer to 100 Mbit/s. Maybe that log line should be in Mbit/s to
avoid room for interpretation.
…On Wed, Mar 27, 2019, 22:47 ajwerner ***@***.***> wrote:
I verified that saturating network links does not seem to be the problem.
On the hunch that 96 MiB/108 MiB (r/w)net might be coming close to
saturating a 1Gbit network link I went and measured the network throughput
between hosts of this machine type and I can report that that is not the
problem. iperf gets 8.57 Gbits/sec over a single TCP connection.
|
I just blindly assumed that was a per-second number given the others were. I prefer bytes to bits even though it's networking; I'd prefer to never have to deal in bits. I just got thrown by the time unit. |
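For the unit confusion above, converting the log line's per-10-second byte count into a link rate is just arithmetic. A tiny sketch, assuming the 96 MiB figure really covers a 10s window:

```go
package main

import "fmt"

func main() {
	const (
		bytesRead = 96 << 20 // 96 MiB reported for the sample window
		windowSec = 10.0
	)
	mbitPerSec := float64(bytesRead) * 8 / 1e6 / windowSec
	fmt.Printf("%.1f Mbit/s\n", mbitPerSec) // ~80.5 Mbit/s, well below a 1 Gbit link
}
```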
By the way, one other thing I've tried is manually triggering a stats job for
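The table name is elided above, but manually kicking off a stats collection is just a SQL statement. A hedged sketch using database/sql against a hypothetical local node and an example table (tpcc.order_line is only a guess, not what was actually run):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver, works against CockroachDB
)

func main() {
	// Hypothetical connection string for a local insecure node.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// CREATE STATISTICS runs a stats collection job for the named table,
	// similar to what the automatic stats machinery triggers.
	if _, err := db.Exec(`CREATE STATISTICS manual_stats FROM tpcc.order_line`); err != nil {
		log.Fatal(err)
	}
}
```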
Maybe it's the load-based-rebalancing death spiral seen in #34590 ? |
The
This is the error that is expected if an intermediary network device forgets about the connection due to inactivity. However, we're running with ServerAliveInterval=60 to prevent the connection from being completely idle (unless there's something in the path that has a 60s timeout and we need to send our pings every 55s to reliably keep the connection alive; but 60s would be very short for a timeout at this level). We see this in a variety of network conditions, including TeamCity, where everything should be staying within the GCP network.

I would expect problems on the host we're sshing to to result in "connection reset" errors, not "broken pipe". But I can't rule it out; there may be something that could cause this failure mode instead.

Within minutes of first boot we're getting (and rejecting) brute-force ssh attacks. Shouldn't make a difference; just an observation:
In any case, the cockroach processes seem fine even after the test has failed with |
^- on the off chance that it might help, should we try lowering the ServerAliveInterval further, to something that's definitely good enough like 10s? |
Seems unlikely to help, but couldn't hurt. I assume in the CI environment we've sorted out all the versioning issues and are using the right build of roachprod? That's something that still often trips me up running locally. |
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog
|
That last failure is an OOM on node 2. Memory and goroutine counts were stable until they spiked in the last two minutes before the crash.
Right before the crash there were a lot of slow handleRaftReadies:
Maybe the disk just got slow and the lack of admission control let it blow up? The goroutine stacks look fairly normal (although more than usual are shown as "runnable", even those inside Mutex.Lock calls). |
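Since memory was stable until the final spike, one way to catch the culprit on a future run is to snapshot a heap profile as soon as heap use crosses a threshold, so there is something to look at even if the process is then OOM-killed. A minimal sketch; the threshold, output path, and poll interval are made up for illustration.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapOnSpike polls heap usage and writes a single heap profile the first
// time the in-use heap exceeds limitBytes.
func dumpHeapOnSpike(limitBytes uint64, path string) {
	for range time.Tick(5 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		if ms.HeapInuse < limitBytes {
			continue
		}
		f, err := os.Create(path)
		if err != nil {
			log.Printf("heap dump failed: %v", err)
			return
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Printf("heap dump failed: %v", err)
		}
		f.Close()
		log.Printf("heap profile written to %s (heap in use: %d MiB)", path, ms.HeapInuse>>20)
		return
	}
}

func main() {
	go dumpHeapOnSpike(8<<30, "/tmp/heap_spike.pprof") // 8 GiB threshold, hypothetical
	select {}                                          // stand-in for the real server
}
```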
Nothing in |
I believe all of these failure modes (except the "exit status 255") may just be the failure modes that are possible when a cluster is under sustained overload conditions. According to
Use #36097 for discussing why things have gotten slower and what we can do about it.
Make large cuts to deflake the test until we can get enough data to put a tighter bound on it. The AWS case does not appear to have passed since its introduction. Closes cockroachdb#35337 Updates cockroachdb#36097 Release note: None
36401: roachtest: Reduce tpcc/w=max targets r=tbg a=bdarnell Make large cuts to deflake the test until we can get enough data to put a tighter bound on it. The AWS case does not appear to have passed since its introduction. Closes #35337 Updates #36097 Release note: None Co-authored-by: Ben Darnell <ben@bendarnell.com>
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog