building the bridge requires a lot of memory #106
Discussions of the docker error above all seem to end with "I restarted docker and it came back" (one example). There are also some reports that this can occur if the docker daemon can't get as much memory as it wants, which might have been the case after whatever error state caused the running job to fail.
builds were only failing on the node that had run the job that I terminated.
Would this have been
yep,
Also, to your comment about not getting enough memory: the build that I terminated had indeed failed because it ran out of RAM. From what I can see we are not limiting the RAM permitted for the container, so it should use all that's available (otherwise I'd suggest upping the limit). @nuclearsandwich do you know if the linux nodes are configured with the same RAM? I'm wondering why node linux-c1657d00 has issues building the bridge but linux-707ef141 doesn't (it might have just been bad luck; I'm not sure how repeatable these issues are).
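For reference, one way to confirm that no memory limit is applied to a running build container (assuming shell access to the node; the container name below is a placeholder) would be something like:

```sh
# HostConfig.Memory is the limit in bytes; 0 means "no limit", i.e. the
# container may consume all of the host's RAM.
docker inspect --format '{{.HostConfig.Memory}}' <container-name-or-id>
```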
All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)
We currently don't place any resource limits on running containers. However, if memory spikes like this become frequent, it might be wise to reserve some memory for the system to avoid Docker getting into this particular snit. If the docker daemon crashed outright, the supervisor would just restart it, but instead it stays up in an unrecoverable state.
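A rough sketch of what limiting the container could look like if we went that route (the 6g figure is only an illustration for an 8 GiB host and would need tuning; the image and command are placeholders):

```sh
# Cap the container at 6 GiB and disallow extra swap, so the kernel
# OOM-kills the container instead of starving dockerd and the Jenkins agent.
docker run --memory=6g --memory-swap=6g <build-image> <build-command>
```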
Yeah, I was surprised that the build was hanging instead of just failing; that might be why. I won't do anything for now, but I will mention it in the handoff ticket for the next buildcop to keep an eye on. Thanks for the input!
This one failed! https://ci.ros2.org/job/ci_packaging_linux/70/console. It was on linux-64e6f5f3. After that, this one seemed to run fine on the same node without me needing to restart docker: https://ci.ros2.org/job/ci_linux/4176/. So I'm drawing two conclusions: still not sure why 707ef141 is fine, though?
https://ci.ros2.org/job/packaging_linux/1030 hung, and restarting dockerd on linux-64e6f5f3 was required.
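For anyone else hitting this, the recovery that has worked so far is just restarting the daemon on the affected agent (assuming a systemd-managed host):

```sh
# Restarting dockerd also tears down any running containers unless
# live-restore is enabled in /etc/docker/daemon.json.
sudo systemctl restart docker
```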
This weekend started another round of hangs during packaging. It's unclear whether the currently running build () will complete, but this is now at least a weekly occurrence. The linux hosts are c5.large machines with 4 GiB of RAM, which no longer appears sufficient to run Docker, the Jenkins agent, and build the bridge.

As mentioned previously, we could make the failure easier to recover from by restricting the container memory (which would have Docker kill the container if it exceeds the threshold), or we would need to move to a larger instance size to accommodate the builds as-is. There's a substantial jump in specs and price between the c5.large we're using and the c5.xlarge: https://aws.amazon.com/ec2/pricing/on-demand/. It would also be possible to move laterally to an r4.large instance type, which would give us 15.25 GiB of memory, but we'd lose an elastic compute unit, reducing our overall CPU power relative to the current c5.large instance type.

Edit: the build above failed due to complications of the pypi.org warehouse cutover. A packaging job has since succeeded, so this issue is still frequent but not persistent.
@nuclearsandwich, just want to point out that you previously said "All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)", in case there's actually a mix of node configurations (which would explain why some builds run fine but not others). Anyhow, IMO we could survive with only one linux node capable of running packaging/ci_packaging jobs and just restrict the label. That could be a/the baseline "always on" node, if appropriate, and just have the others as elastic.
Thanks for pointing that out. Looks like I was previously mistaken. The scaling group configuration that all linux hosts are part of is for c5.large instances. I either didn't fact-check myself or was looking at a buildfarm configuration that uses the c4.xlarge/c5.xlarge instance types. We could add one of these larger instances, or we could move the entire configuration up to c5.xlarge and pay the associated cost. We've gotten a lot of value from homogeneous configuration, but we're not going to want to move every host to a big GPU instance just to run rviz display tests on Linux occasionally.
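If we want to double-check whether the fleet is actually homogeneous, a quick spot check from each agent (using the standard EC2 instance metadata endpoint) could be:

```sh
# Report the instance type this agent is actually running on...
curl -s http://169.254.169.254/latest/meta-data/instance-type
# ...and the amount of memory the OS sees.
free -h
```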
I was shelled into the machine shortly before and while it "planked" on https://ci.ros2.org/view/colcon/job/colcon_ci_packaging_linux/14/consoleFull. The fact I noticed:
Since the colcon branch was building with two threads, that obviously pushed the machine over the cliff. This has since been fixed on the colcon branch, but I thought the general idea about the memory usage might be helpful.
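For reference, this is the kind of knob involved; a sketch of capping build parallelism with colcon (the values are illustrative, not necessarily what the colcon branch changed):

```sh
# Build packages one at a time so only one compiler invocation's worth
# of memory is in flight at once.
colcon build --parallel-workers 1
# For make-based packages, additionally cap per-package compile jobs.
MAKEFLAGS="-j2" colcon build --parallel-workers 1
```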
Saw the same error, but not while building the bridge: https://ci.ros2.org/job/ci_linux/5978/console. It happened very early in the build.
Same error again, but early during a linux nightly (not while building the bridge): https://ci.ros2.org/view/nightly/job/nightly_linux_debug/1074/console
Happened on two different linux swarm agents last night: https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/227/console
I got this on a CI run recently: https://ci.ros2.org/job/ci_linux/6044/console, and @nuclearsandwich had a resolution:
The return of these issues suggests that memory pressure on our nodes has increased. We ought to schedule some time to determine whether that's due to infrastructure changes (Ubuntu and Java updates) or increases in the actual build memory footprint.
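One low-effort way to put a number on the build's actual footprint (assuming GNU time is available and the build is invoked roughly like this; the package selection is a placeholder) would be:

```sh
# -v prints "Maximum resident set size" for the command and its
# waited-for children once the build finishes.
/usr/bin/time -v colcon build --packages-up-to ros1_bridge
```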
I figured out what's going on here. When I deployed the most recent round of new agents, I accidentally pulled in a group of smaller nodes from Feb 2018 instead of the freshly created configuration. We should see a downturn in the occurrences of this issue, although the question of memory consumption when building the bridge is still open.
Addressed during #168.
I terminated a hanging packaging build this morning (https://ci.ros2.org/view/packaging/job/packaging_linux/1024/) and straight away the next packaging build started fine (https://ci.ros2.org/view/packaging/job/packaging_linux/1025/), but CI jobs since then have failed, e.g. https://ci.ros2.org/job/ci_linux/4166/