Trying to understand the bottleneck that some workers experience #114
Comments
Is there anything in the logs about it? e.g.:
No, I could not find anything weird in the logs. But maybe I'm missing something.
I just tried with ocurrent/obuilder#62 and at one point I got an unprecedented 41 jobs actually running concurrently out of a max capacity of 80! It seems to settle between 30 and 41, and it all seems much, much faster than previously. I'll continue monitoring today to see if that changes later.
It's possible that it's this mutex, which I added to prevent running btrfs operations concurrently. Might be worth removing that to see if it's OK now.
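For context, a lock of that shape is easy to picture. Here is a minimal Lwt sketch (not OBuilder's actual store code; `btrfs_lock` and `run_btrfs` are invented names) showing why such a mutex caps concurrency: every wrapped operation runs strictly one at a time, so jobs queue behind it no matter how much worker capacity is free.

```ocaml
(* Illustrative sketch only — not OBuilder's real code. *)
open Lwt.Infix

(* One global mutex guarding every btrfs operation. *)
let btrfs_lock = Lwt_mutex.create ()

(* Hypothetical wrapper: whatever [op] does (snapshot, delete, ...),
   only one such operation can be in flight at any moment. *)
let run_btrfs op = Lwt_mutex.with_lock btrfs_lock op

(* Example: 10 "jobs" that each need a 1s btrfs step take ~10s in
   total, even though nothing else stops them running in parallel. *)
let () =
  Lwt_main.run
    (Lwt_list.iter_p
       (fun i ->
          run_btrfs (fun () -> Lwt_unix.sleep 1.0) >>= fun () ->
          Lwt_io.printlf "job %d past the btrfs step" i)
       (List.init 10 (fun i -> i)))
```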
That might explain the behaviour I've just seen, where the worker waited until all the jobs had actually finished doing something. I had seen the same the past few days, but it wasn't as drastic a change. I'll test with ocurrent/obuilder@fbeea1d reverted locally.
Reverting ocurrent/obuilder@fbeea1d doesn't seem to help much in the long run. The following cycle is still there:
I've eliminated the IO bottleneck with a slightly modified version of ocurrent/obuilder#62, reducing the size of the tmpfs to 4 GB and reducing the worker capacity from 80 to 30. The jobs still stagnate at around ~20 and follow the cycle above.
Here is an example of this issue:
PS: I have to say, #120 is really handy, thanks a lot!
Interesting! And it looks like the cancel really did come just as it was ending by itself:
We should probably just reduce the capacity of pima then. |
I've been running tests on pisto exclusively for the past few days and I'm a bit surprised by how many jobs are actually running at once.
In my case, I have:
However, most of the jobs stop at:
I've only seen a maximum of 23 out of the 79 jobs actually start the said command; I'm not sure what's happening to the rest (runc isn't even started).
Maybe there is some kind of IO bottleneck, partially caused by ocaml/opam#4586, and maybe the `opam-archives` cache might be too big and btrfs is struggling to pull it(?). The load average of the machine in this state is around 15%, so if there is a bottleneck it must be some kind of IO or syscall bottleneck.
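That pattern — low load average, plenty of spare capacity, yet few jobs progressing — is what you'd expect if jobs grab a capacity slot and then block on a serialized store operation rather than on the CPU. A toy Lwt model (all names and timings invented; this is not the worker's code) reproduces the shape of it:

```ocaml
(* Toy model only — not the actual worker. 79 jobs each take a capacity
   slot, then contend on a single store lock before their "runc" phase. *)
open Lwt.Infix

let slots = Lwt_pool.create 79 (fun () -> Lwt.return_unit)
let store_lock = Lwt_mutex.create ()
let in_runc = ref 0
let peak = ref 0

let job _i =
  Lwt_pool.use slots (fun () ->
      (* Serialized store step (standing in for a btrfs operation). *)
      Lwt_mutex.with_lock store_lock (fun () -> Lwt_unix.sleep 0.05)
      >>= fun () ->
      incr in_runc;
      peak := max !peak !in_runc;
      (* The actual "runc" work, fully parallel once reached. *)
      Lwt_unix.sleep 0.5 >|= fun () -> decr in_runc)

let () =
  Lwt_main.run
    (Lwt_list.iter_p job (List.init 79 (fun i -> i)) >>= fun () ->
     (* Expect a peak of roughly 0.5 / 0.05 ≈ 10 jobs in runc, not 79,
        while the rest queue on the lock and the machine sits idle. *)
     Lwt_io.printlf "peak concurrent runc jobs: %d" !peak)
```

If something like that is what's happening here, timestamping the gap between a job claiming a slot and runc actually being exec'd should make the queueing visible.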