roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [overload] #62339
Oomkill
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-21.1@f602e37e31a256980ae897917f45cba9c135b412:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
Likely an OOM kill; the log stops mid-file, but there is no direct evidence.
I've been seeing this a number of times lately. I'm mostly alarmed by the lack of log messages. I was able to reproduce a very similar-looking failure in #62320, but found no evidence of an OOM kill in the journalctl logs. I also saw no evidence from the runtime. I'm presently looking at changing the logging to avoid hijacking stderr, to see if I can confirm that this is an issue with the runtime failing to allocate memory. The heap profile and the memory usage reported by the logging certainly point in that direction. The fact that we have roachtests failing this way is alarming.
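For reference, the journalctl check amounts to grepping the kernel journal for OOM-killer activity. A sketch: the sample lines below illustrate the kernel's usual OOM-kill log format, and the temp path is an assumption made so the grep is self-contained.

```shell
# What `journalctl -k` (the kernel journal) typically shows after the OOM
# killer fires; written to a temp file here so the example is runnable.
cat > /tmp/kern_sample.log <<'EOF'
Mar 20 04:12:01 node4 kernel: cockroach invoked oom-killer: gfp_mask=0x100cca, order=0
Mar 20 04:12:01 node4 kernel: Out of memory: Killed process 2189 (cockroach) total-vm:18811232kB
EOF

# Count lines indicating an OOM kill; on a live node the same grep would
# run against `journalctl -k --no-pager` instead of the sample file.
grep -icE 'oom-kill|killed process|out of memory' /tmp/kern_sample.log
```

A count of zero on every node is what "no evidence of an OOM kill in the journalctl logs" looks like.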
I was wrong in this case; we're failing during . But I don't think I've made that mistake 100% of the time, and I also feel like logs go missing here. I looked at #62320 and agree that that is just a case of straight-up missing files.
Is it missing files? I don't have any evidence that there should be more files. It's more likely that the files we have are missing data. The other nodes stop being able to connect to n4 in that issue before the logs stop. I'm worried that things seem so much worse in this release than in the last one.
Sorry, I meant it is missing evidence of the node stopping, as you outlined. I agree that something has gotten worse. My conjecture would be that we're losing the "runtime: out of memory" messages when we didn't before? At least I haven't seen an "out of memory" message in a long time, despite looking at lots of OOMs.
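One way to test that conjecture against the collected artifacts is to grep them for the Go runtime's fatal banner, which the runtime writes directly to stderr (fd 2) when an allocation fails. A sketch: the artifact path and file name below are assumptions, and the sample content mimics the Go runtime's real banner format.

```shell
# Sample of what a node's captured stderr would contain if the Go runtime's
# dying message had made it to disk (banner format matches the Go runtime's).
mkdir -p /tmp/artifacts/logs
cat > /tmp/artifacts/logs/cockroach.stderr.log <<'EOF'
runtime: out of memory: cannot allocate 8192-byte block (1048576000 in use)
fatal error: out of memory
EOF

# List any artifact files containing the banner; an empty result across all
# nodes is what "losing the message" would look like.
grep -rl 'runtime: out of memory' /tmp/artifacts/logs
```

In practice the grep would run over the real roachtest artifacts directory rather than this sample.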
Likely an OOM, but again no trace: #62763
This has gone cold, and the date of the last failure lines up well with the 21.1 perf regression fixes (see https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=gce), so I'm considering this closed.
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-21.1@2bdb62260a178e5bb63cf15f704944c5384f4347:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:
- roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [overload] #61974 (branch-master, release-blocker)
- roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #59044 (branch-release-20.1)
See this test on roachdash
powered by pkg/cmd/internal/issues