Build stall when CAS reset #245
Getting quite a lot of this error from workers:
It looks like the CAS was under a lot of load; the scheduler tried to do some lookups, but the CAS was unable to process more streams due to some other bottleneck, which resulted in the queue filling up and then dropping new connections. We can mitigate some of this with: … However, in this case having more than one CAS might be needed in order to deal with excessive loads.
It was running at a load average of 12 and had 300% CPU usage. I was trying to avoid having multiple CAS instances because it becomes a pain to manage. I may have to migrate to one of the S3 style storage solutions and host Turbo-Cache to front it. It should all have been serving out of the memory store though, so it's surprising that it couldn't handle it.
I wonder if it got caught up in a lock somewhere? It would be great if we could get a core dump if this happens again. The only N^2 problem that I am aware of in this project is in the scheduler matching, and since this happened on the CAS I'm certain it's an externally influenced thing. Do you happen to know the disk usage? Tokio's disk commands are not actually async; it uses blocking calls on a "blocking-safe" thread. By default the runtime is limited to a fixed number of these blocking threads, so maybe it ended up with too much backpressure from the disk I/O?
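For what it's worth, here is a minimal sketch (not turbo-cache code; the thread count and file path are only illustrative) of the blocking-thread limit being described:

```rust
// Tokio's async file APIs run on a bounded pool of blocking threads, so heavy
// disk I/O can pile up behind that limit.
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    let rt = Builder::new_multi_thread()
        // Defaults to 512; every in-flight tokio::fs call occupies one slot.
        .max_blocking_threads(1024)
        .enable_all()
        .build()?;

    rt.block_on(async {
        // tokio::fs::read is implemented with spawn_blocking under the hood.
        let _bytes = tokio::fs::read("/tmp/example").await;
    });
    Ok(())
}
```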
It didn't appear to have a particularly high iowait (wa), but that's anecdotal.
I'm actually getting a lot of this even when it's not heavily loaded:
I think we may have introduced a problem recently. I just updated the cluster to the latest main. The only thing in the CAS logs is:
It would be very useful to get some metrics published under Prometheus.
I'm getting this on the worker:
Can you try reverting that commit? It was significant and is the only major change I can think of that might cause this.
Just doing a build with that reverted now.
Now running with that build and it appears a lot happier, although it's only been up for a few seconds, so I'll try and exercise it to check.
I've just hit it with a totally new branch and it's flying along without any issues. The CAS is running at a load average of 18 and rising, and 300% CPU with a 2.8 wa, but there are no stalls or breakages. Looks like there might be an issue with 67b90e2.
Dang, I will revert that commit. I may need to find another way to do what it's trying to do. I tried a few strategies, but this was the one I hated the least.
I spoke too soon, I'm getting stalls again. When it stalls, the scheduler and remoteexec_proxy CPU usage drops to a tiny amount and I get these:
I think before I got these issues I was running a really old version, back at around 259224b.
Hmmm, would it be too much work to bisect it? What hash were you running before attempting to upgrade?
We might be conflating two issues. The failed to upload stuff looks like it's failing a soft assert:
Triggered in: … I could see a couple of ways this might happen, but none of them should be critical. Does this actually fail the build or just put noise in the logs?
The build doesn't fail; it gets a RETRY entry in the Goma logs and stalls everything. It's like it locks up. I'm currently trying with just 8ac4824 reverted to see what I get.
Reverting that change does appear to have fixed it. And I think I have a reason why: previously, when 128 simultaneous builds produced their outputs, they only generated 128 concurrent uploads of results to the CAS. However, now they upload the stdout and stderr too, so they have 384 simultaneous uploads (128 × 3 streams). Combine that with the downloads and you can see how the limit might be reached on the CAS. This is simply a scale thing, I think.
Having said that, there are still four RETRY entries in Goma for this build, so perhaps it's still happening sometimes?
I think that these logs in the CAS indicate that an issue occurred:
Although they don't really explain what that issue was... |
Looking into this, it's super strange; the reader and the writer are both complaining about each other:
This is super confusing. |
Should there just be a retry in there, perhaps?
I guess we could. I'd rather figure out the root cause though, because it might point to a more fundamental issue. I'm still investigating what might be going on... The entire reason I created …
Can you share your worker config file? |
Well, found one bug. I'm pretty sure it is not the root cause of anything, but it's still wrong (…).
That would explain why the log says it's populating the fast store but it's remote... |
I'm pretty sure what is happening is that the GRPC store (or somewhere downstream) is getting the exact number of bytes needed, then closing the stream before the sender is able to send an EOF. In other words, somewhere downstream says: "oh, I got all the data you are going to send, I might as well close the connection", even though there's a zero-byte message that is still required to be sent (the EOF message). If this is the case, I'd bet money that it's on the fence-post (like 4096 bytes), and somewhere there's a short circuit on "exact bytes received".
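To make that theory concrete, here is a hypothetical sketch of the kind of short circuit being described; the channel type and function are assumptions for illustration, not the actual store code:

```rust
use bytes::Bytes;
use tokio::sync::mpsc::Receiver;

// Hypothetical receive loop illustrating the suspected fence-post bug: it
// short-circuits on "exact bytes received" and drops the receiver before the
// sender has a chance to deliver its zero-byte EOF message.
async fn receive_exact(mut rx: Receiver<Bytes>, expected_size: usize) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(expected_size);
    while let Some(chunk) = rx.recv().await {
        out.extend_from_slice(&chunk);
        if out.len() == expected_size {
            // BUG: returning here drops `rx`, so the sender's follow-up EOF
            // message fails with a "receiver dropped" style error.
            return Ok(out);
        }
    }
    Err("stream closed before all bytes were received".to_string())
}
```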
Another fence post? [edit] I should finish reading your message before commenting :) The stream errors all show the current byte when it fails:
To test a theory, can you change:
For the theory to work …
Yes. But my idea is that there are so many files being uploaded that one of them is bound to be hit.
Unfortunately, I'll have to look at it tomorrow now. I think we're certainly on to something though. It might also explain why the integration tests are sometimes flaky.
Can you also share your CAS config?
Oh, the red herring... This error message is super misleading... All error messages after: … I'm very certain …
Do you still need the CAS config?
Let me see how far I can go, but it'd be nice to have. |
I think I found the problem: it sees that the exact number of bytes has been received, publishes an EOF to the downstream store, then closes the channel without reading the possible EOF from the GRPC client's store. I think to fix this we just need to try to read one last message here: … and if it returns EOF or an error, ignore it. If it returns more data, it means it's an error.
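A hypothetical sketch of that fix, using the same assumed channel shape as the earlier sketch (the real code path will differ):

```rust
use bytes::Bytes;
use tokio::sync::mpsc::Receiver;

// Hypothetical version of the receive loop that drains one extra message so
// the sender can deliver its zero-byte EOF instead of hitting a dropped
// receiver. Anything beyond that EOF is treated as a real error.
async fn receive_exact_then_drain_eof(
    mut rx: Receiver<Bytes>,
    expected_size: usize,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(expected_size);
    while out.len() < expected_size {
        match rx.recv().await {
            Some(chunk) => out.extend_from_slice(&chunk),
            None => return Err("stream closed before all bytes were received".to_string()),
        }
    }
    // Read one last message: a closed channel or an empty (EOF) chunk is fine;
    // more data means the sender sent more than it promised.
    match rx.recv().await {
        None => {}
        Some(chunk) if chunk.is_empty() => {}
        Some(_) => return Err("sender sent more data than expected".to_string()),
    }
    Ok(out)
}
```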
DropCloserWriteHalf.send_eof should check close_after_size before producing an error, right? Or the drop for DropCloserWriteHalf should check before sending the error |
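As a toy illustration of that suggestion (the real DropCloserWriteHalf API almost certainly differs; the fields here are assumptions):

```rust
// Toy write-half showing the proposed check: if the receiver was told to close
// itself after an exact byte count and we have sent exactly that many bytes, a
// dropped receiver at EOF time is a clean close rather than an error.
struct ToyWriteHalf {
    bytes_sent: u64,
    close_after_size: Option<u64>, // hypothetical mirror of the discussed field
    receiver_alive: bool,
}

impl ToyWriteHalf {
    fn send_eof(&mut self) -> Result<(), String> {
        if !self.receiver_alive {
            if self.close_after_size == Some(self.bytes_sent) {
                // Receiver consumed exactly what it asked for; not an error.
                return Ok(());
            }
            return Err("receiver dropped before EOF was sent".to_string());
        }
        // ... otherwise send the zero-byte EOF message over the channel ...
        Ok(())
    }
}
```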
The only way that error can become set is if …
The big problem is that …
I tried adding:
to the unfold in …
It's super weird. I've modified all of the paths to merge errors now, and I don't see any errors from any other paths, so I don't understand how the Receiver is being dropped.
Can I suggest we also add this to avoid closing both ends of the pipe in error cases:
Confirming that I am not seeing this issue with PR #250 applied.
Added your suggestion (but modified).
I had everything stall during a full Chromium build. It did all come back to life after a minute or so, but the cause appears to be this:
I'm not sure why this happened and it's the first time I've seen it.