Trying to understand the bottleneck that some workers experience #114
Comments
Is there anything in the logs about it? e.g.:
No, I could not find anything weird in the logs. But maybe I'm missing something.
I just tried with ocurrent/obuilder#62 and at one point I got an unprecedented 41 jobs actually running concurrently out of a max capacity of 80! It seems to settle between 30 and 41, and it all seems much, much faster than previously. I'll continue monitoring today to see if that changes later.
It's possible that it's this mutex, which I added to prevent running btrfs operations concurrently. Might be worth removing that to see if it's OK now.
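For context, a lock of that shape is easy to picture. Here is a minimal Lwt sketch (not OBuilder's actual store code; `btrfs_lock` and `run_btrfs` are invented names) showing why such a mutex caps concurrency: every wrapped operation runs strictly one at a time, so jobs queue behind it no matter how much worker capacity is free.

```ocaml
(* Illustrative sketch only — not OBuilder's real code. *)
open Lwt.Infix

(* One global mutex guarding every btrfs operation. *)
let btrfs_lock = Lwt_mutex.create ()

(* Hypothetical wrapper: whatever [op] does (snapshot, delete, ...),
   only one such operation can be in flight at any moment. *)
let run_btrfs op = Lwt_mutex.with_lock btrfs_lock op

(* Example: 10 "jobs" that each need a 1s btrfs step take ~10s in
   total, even though nothing else stops them running in parallel. *)
let () =
  Lwt_main.run
    (Lwt_list.iter_p
       (fun i ->
          run_btrfs (fun () -> Lwt_unix.sleep 1.0) >>= fun () ->
          Lwt_io.printlf "job %d past the btrfs step" i)
       (List.init 10 (fun i -> i)))
```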
That might explain the behaviour I've just seen, where the worker waited until all the jobs had actually finished doing something. I had seen the same the past few days, but it wasn't as drastic a change. I'll test with ocurrent/obuilder@fbeea1d reverted locally.
Reverting ocurrent/obuilder@fbeea1d doesn't seem to help much in the long run. The following cycle is still there:
I've eliminated the IO bottleneck with a slightly modified version of ocurrent/obuilder#62, reducing the size of the tmpfs to 4 GB and reducing the worker capacity from 80 to 30. The jobs still stagnate at around ~20 and follow the cycle above.
Here is an example of this issue:
PS: I have to say, #120 is really handy, thanks a lot!
Interesting! And it looks like the cancel really did come just as it was ending by itself:
We should probably just reduce the capacity of pima then. |
I've been running tests on pisto exclusively for the past few days and I'm a bit surprised by how many jobs are actually running at once.
In my case, I have:
However, most of the jobs stop at:
I've only seen a maximum of 23 out of the 79 jobs actually start the said command; I'm not sure what's happening to the rest (runc isn't even started).
Maybe there is some kind of IO bottleneck, partially caused by ocaml/opam#4586, and maybe the `opam-archives` cache might be too big and btrfs is struggling to pull it(?). The load average of the machine in this state is around 15%, so if there is a bottleneck it must be some kind of IO or syscall bottleneck.
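That pattern — low load average, plenty of spare capacity, yet few jobs progressing — is what you'd expect if jobs grab a capacity slot and then block on a serialized store operation rather than on the CPU. A toy Lwt model (all names and timings invented; this is not the worker's code) reproduces the shape of it:

```ocaml
(* Toy model only — not the actual worker. 79 jobs each take a capacity
   slot, then contend on a single store lock before their "runc" phase. *)
open Lwt.Infix

let slots = Lwt_pool.create 79 (fun () -> Lwt.return_unit)
let store_lock = Lwt_mutex.create ()
let in_runc = ref 0
let peak = ref 0

let job _i =
  Lwt_pool.use slots (fun () ->
      (* Serialized store step (standing in for a btrfs operation). *)
      Lwt_mutex.with_lock store_lock (fun () -> Lwt_unix.sleep 0.05)
      >>= fun () ->
      incr in_runc;
      peak := max !peak !in_runc;
      (* The actual "runc" work, fully parallel once reached. *)
      Lwt_unix.sleep 0.5 >|= fun () -> decr in_runc)

let () =
  Lwt_main.run
    (Lwt_list.iter_p job (List.init 79 (fun i -> i)) >>= fun () ->
     (* Expect a peak of roughly 0.5 / 0.05 ≈ 10 jobs in runc, not 79,
        while the rest queue on the lock and the machine sits idle. *)
     Lwt_io.printlf "peak concurrent runc jobs: %d" !peak)
```

If something like that is what's happening here, timestamping the gap between a job claiming a slot and runc actually being exec'd should make the queueing visible.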