roachtest: cdc/initial-scan/rangefeed=true failed #35327
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163702&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165130&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169980&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178890&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182991&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1185396&tab=buildLog
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
I190320 06:30:56.262630 161 server/status/runtime.go:500 [n3] runtime stats: 13 GiB RSS, 973 goroutines, 8.0 GiB/711 MiB/9.2 GiB GO alloc/idle/total, 3.5 GiB/4.1 GiB CGO alloc/total, 2762.1 CGO/sec, 1551.0/4.3 %(u/s)time, 0.1 %gc (1x), 179 MiB/13 MiB (r/w)net
Looks like both node 1 and node 3 had high memory usage. I downloaded the memprof dumps in the artifacts, but nothing jumped out at me. The node 1 profile was taken at 06_29_32, when memory usage was around …
Node 3 is similar. I'm not sure why the 3962.90MB covered by the memprof doesn't line up with the log line. Ran the test locally and it worked like a charm.
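The mismatch between the memprof total and the RSS in the log line is not surprising: a Go heap profile records sampled live Go allocations, while RSS also includes CGO/RocksDB memory, Go runtime overhead, and heap pages the GC has freed but not returned to the OS. A sketch of how such a dump might be inspected (the file name here is a placeholder, not the actual artifact path):

```shell
# Hypothetical sketch; "memprof.prof" stands in for whatever the artifact
# is actually named. -sample_index is a real pprof flag.

# Top allocation sites by bytes currently in use at collection time:
go tool pprof -sample_index=inuse_space -top memprof.prof

# Cumulative allocated bytes, to spot churn the GC has already reclaimed:
go tool pprof -sample_index=alloc_space -top memprof.prof
```

Comparing `inuse_space` against `alloc_space` is a quick way to tell a genuine leak apart from high allocation churn.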
@danhhz I've just talked with some people about this issue (@lucy-zhang, @dt, some people on the Gophers slack). My current working theory is that the …
The ones I was referencing were collected for me, but I have a copy of the test running right now if you want me to verify the theory. Do you know the argument offhand, or should I go read some docs?
Oh, it's just …
@jordanlewis maybe? Running with …
My method was tailing the log and repeatedly running …
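The command being run repeatedly didn't survive truncation, but one plausible shape for this kind of tail-and-sample loop is sketched below. The log path, HTTP port, and interval are assumptions, not what was actually run; the `/debug/pprof/heap` endpoint is the standard Go pprof handler that CockroachDB serves on its HTTP port.

```shell
# Hypothetical monitoring loop: watch the runtime-stats log lines while
# periodically snapshotting heap profiles for later comparison.
tail -F cockroach-data/logs/cockroach.log | grep --line-buffered 'runtime stats' &

while true; do
  ts=$(date +%H_%M_%S)
  # Assumed HTTP port 8080; adjust to the node's actual --http-addr.
  curl -s "http://localhost:8080/debug/pprof/heap" > "heap.${ts}.prof"
  sleep 30
done
```

Capturing profiles on a timer makes it possible to diff the profile taken just before the RSS spike against an earlier baseline.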
Hm, okay, too bad. Maybe the …
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
Seems like a similar problem as yesterday. I don't see anything new and helpful in the logs. Gonna try and repro with additional logging again.
After some digging, I feel pretty confident that the last two failures are the same issue as the most recent comments in #35947. This potentially affects any tpcc cluster with stats, so I'm going to send a PR to disable auto stats in the changefeed tests until that gets sorted out. Every night's run is important data right now.
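For reference, automatic statistics collection is controlled by a single cluster setting, so the workaround amounts to flipping it off for the test cluster. A minimal sketch, assuming an insecure local cluster (connection flags will differ in the roachtest environment):

```shell
# Disable automatic table statistics collection cluster-wide.
# sql.stats.automatic_collection.enabled is the auto-stats cluster setting;
# --insecure and the default connection target are assumptions.
cockroach sql --insecure -e \
  "SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;"
```

This is a config fragment rather than a runnable program; re-enabling is the same statement with `true` once the optimizer issue is fixed.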
36034: roachtest: temporarily disable auto stats collection for cdc tests r=tbg a=danhhz Workaround for #35947. The optimizer currently plans a bad query for TPCC when it has stats, so disable stats for now. Touches #35327 where local tests saw this happen. Perhaps it's also been the cause of the last two nightly run failures. Release note: None Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1224702&tab=buildLog
@tbg do you know what to make of this ^^? I don't see any failures in the logs or dmesg.
Looked at the debug.zip too, and all I saw was a bunch of ranges marked as raft_log_too_large. Dunno what to make of that, though.
The error originates here: cockroach/pkg/cmd/roachtest/cdc.go, lines 173 to 177 at 5ed62b9
This is talking to the first node, and the first node seems to be running just fine. https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1553980575046300 and recent discussions with @nvanbenschoten come to mind (though I think that latter one had to do with preparing a gazillion statements over a wide area, and I'm not sure the specific failure mode there was understood). Either way, I would be mildly surprised if this were a CDC bug. But there's definitely some bug here, either in the networking setup (in roachtest) or in CRDB, that sometimes creates obscure errors. Also cc @bdarnell, who was also looking at similar issues which I can't find any more.
I was looking at "exit status 255" in #35337 (comment). The most likely cause of "connection reset by peer" is the server process you're talking to crashing (or a NAT/load balancer/proxy in between, although I don't think any of those are in play here). I'm not sure what could cause this if there's nothing in the logs or dmesg.
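When a "connection reset by peer" shows up with nothing obvious in the application logs, a few host-level checks can rule out the usual suspects. A hedged triage sketch (the process name and timestamps are placeholders for this incident):

```shell
# 1. Did the kernel OOM-kill the server process?
dmesg -T | grep -Ei 'out of memory|oom|killed process'

# 2. Did the process silently restart? An unexpectedly short elapsed
#    time (etime) since the cluster started is a clue.
ps -o pid,etime,cmd -C cockroach

# 3. Anything in the system journal around the failure window?
#    (Timestamps here are placeholders.)
journalctl --since "2019-03-20 06:25" --until "2019-03-20 06:35" | tail -50
```

An RST with a clean application log most often means the peer process died, was killed, or something between the endpoints (NAT, conntrack, load balancer) dropped the connection state.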
(Besides this last networking issue), this test hasn't failed since the last round of fixes went in. Gonna go ahead and finally close this. Any new failures are likely to be a new(ly discovered) issue at this point |
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog