building the bridge requires a lot of memory #106
Discussions of the docker error above all seem to end with "I restarted docker and it came back" (one example). There are also some reports that this can occur if the docker daemon can't get as much memory as it wants, which might have been the case after whatever error state caused the running job to fail.
builds were only failing on the node that had run the job that I terminated.
Would this have been
yep,
Also, to your comment about not getting enough memory: the build that I terminated had indeed failed because it ran out of RAM. From what I can see we are not limiting the RAM permitted for the container, so it should use all that's available (otherwise I'd suggest upping the limit). @nuclearsandwich do you know if the linux nodes are configured with the same RAM? I'm wondering why node linux-c1657d00 has issues building the bridge but linux-707ef141 doesn't (it might have just been bad luck; I'm not sure how repeatable these issues are).
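For reference, one way to confirm that no memory limit is applied to a running build container (assuming shell access to the node; the container name below is a placeholder) would be something like:

```sh
# HostConfig.Memory is the limit in bytes; 0 means "no limit", i.e. the
# container may consume all of the host's RAM.
docker inspect --format '{{.HostConfig.Memory}}' <container-name-or-id>
```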
All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)
We currently don't place any resource limits on running containers. However, if memory spikes like this become frequent, it might be wise to reserve some memory for the system to avoid Docker getting into this particular snit. If the docker daemon crashed outright, the supervisor would just restart it, but instead it stays up in an unrecoverable state.
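A rough sketch of what limiting the container could look like if we went that route (the 6g figure is only an illustration for an 8 GiB host and would need tuning; the image and command are placeholders):

```sh
# Cap the container at 6 GiB and disallow extra swap, so the kernel
# OOM-kills the container instead of starving dockerd and the Jenkins agent.
docker run --memory=6g --memory-swap=6g <build-image> <build-command>
```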
Yeah, I was surprised that the build was hanging instead of just failing; that might be why. I won't do anything for now, but I will mention it in the handoff ticket for the next buildcop to keep an eye on. Thanks for the input!
This one failed! https://ci.ros2.org/job/ci_packaging_linux/70/console. It was on linux-64e6f5f3. After that, this one seemed to run fine on the same node without me needing to restart docker: https://ci.ros2.org/job/ci_linux/4176/. So I'm drawing two conclusions: still not sure why 707ef141 is fine, though?
https://ci.ros2.org/job/packaging_linux/1030 hung, and restarting dockerd on linux-64e6f5f3 was required.
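For anyone else hitting this, the recovery that has worked so far is just restarting the daemon on the affected agent (assuming a systemd-managed host):

```sh
# Restarting dockerd also tears down any running containers unless
# live-restore is enabled in /etc/docker/daemon.json.
sudo systemctl restart docker
```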
This weekend started another round of hangs during packaging. It's unclear whether the currently running build () will complete, but this is now at least a weekly occurrence. The linux hosts are c5.large machines with 4 GiB of RAM, which no longer appears sufficient to run Docker, the Jenkins agent, and build the bridge.

As mentioned previously, we could make the failure easier to recover from by restricting the container memory (which would have Docker kill the container if it exceeds the threshold), or we would need to move to a larger instance size to accommodate the builds as-is. There's a substantial jump in specs and price between the c5.large we're using and the c5.xlarge: https://aws.amazon.com/ec2/pricing/on-demand/. It would also be possible to move laterally to an r4.large instance type, which would give us 15.25 GiB of memory, but we'd lose an elastic compute unit, reducing our overall CPU power relative to the current c5.large instance type.

Edit: the build above failed due to complications of the pypi.org warehouse cutover. A packaging job has since succeeded, so this issue is still frequent but not persistent.
@nuclearsandwich, just want to point out that you previously said "All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)", in case there's actually a mix of node configurations (which would explain why some builds run fine but not others). Anyhow, IMO we could survive with only one linux node capable of running packaging/ci_packaging jobs and just restrict the label. That could be a/the baseline "always on" node, if appropriate, and just have the others as elastic.
Thanks for pointing that out. Looks like I was previously mistaken. The scaling group configuration that all linux hosts are part of is for c5.large instances. I either didn't fact-check myself or was looking at a buildfarm configuration that uses the c4.xlarge/c5.xlarge instance types. We could add one of these larger instances, or we could move the entire configuration up to c5.xlarge and pay the associated cost. We've gotten a lot of value from homogeneous configuration, but we're not going to want to move every host to a big GPU instance just to run rviz display tests on Linux occasionally.
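If we want to double-check whether the fleet is actually homogeneous, a quick spot check from each agent (using the standard EC2 instance metadata endpoint) could be:

```sh
# Report the instance type this agent is actually running on...
curl -s http://169.254.169.254/latest/meta-data/instance-type
# ...and the amount of memory the OS sees.
free -h
```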
I was shelled into the machine shortly before and while it "planked" on https://ci.ros2.org/view/colcon/job/colcon_ci_packaging_linux/14/consoleFull. The fact I noticed:
Since the colcon branch was building with two threads, that obviously pushed the machine over the cliff. This has since been fixed on the colcon branch, but I thought the general idea about the memory usage might be helpful.
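For reference, this is the kind of knob involved; a sketch of capping build parallelism with colcon (the values are illustrative, not necessarily what the colcon branch changed):

```sh
# Build packages one at a time so only one compiler invocation's worth
# of memory is in flight at once.
colcon build --parallel-workers 1
# For make-based packages, additionally cap per-package compile jobs.
MAKEFLAGS="-j2" colcon build --parallel-workers 1
```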
Saw the same error, but not while building the bridge: https://ci.ros2.org/job/ci_linux/5978/console. It happened very early in the build.
Same error again, but early during a linux nightly (not while building the bridge): https://ci.ros2.org/view/nightly/job/nightly_linux_debug/1074/console
Happened on two different linux swarm agents last night: https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/227/console
I got this on a CI run recently: https://ci.ros2.org/job/ci_linux/6044/console, and @nuclearsandwich had a resolution:
The return of these issues suggests that memory pressure on our nodes has increased. We ought to schedule some time to determine whether that's due to infrastructure changes (Ubuntu and Java updates) or increases in the actual build memory footprint.
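One low-effort way to put a number on the build's actual footprint (assuming GNU time is available and the build is invoked roughly like this; the package selection is a placeholder) would be:

```sh
# -v prints "Maximum resident set size" for the command and its
# waited-for children once the build finishes.
/usr/bin/time -v colcon build --packages-up-to ros1_bridge
```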
I figured out what's going on here. When I deployed the most recent round of new agents, I accidentally pulled in a group of smaller nodes from Feb 2018 instead of the freshly created configuration. We should see a downturn in the occurrences of this issue, although the question of memory consumption when building the bridge is still open.
Addressed during #168.
I terminated a hanging packaging build this morning (https://ci.ros2.org/view/packaging/job/packaging_linux/1024/) and straight away the next packaging build started fine (https://ci.ros2.org/view/packaging/job/packaging_linux/1025/), but CI jobs since then have failed, e.g. https://ci.ros2.org/job/ci_linux/4166/