
building the bridge requires a lot of memory #106

Closed
dhood opened this issue Apr 2, 2018 · 20 comments · Fixed by ros2/ros1_bridge#183

dhood (Member) commented Apr 2, 2018

e.g. https://ci.ros2.org/job/ci_linux/4166/

Step 31/43 : RUN echo "2018-04-02"
 ---> Running in a71b44327fca
connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"
Build step 'Execute shell' marked build as failure

I terminated a hanging packaging build this morning (https://ci.ros2.org/view/packaging/job/packaging_linux/1024/), and right away the next packaging build started fine (https://ci.ros2.org/view/packaging/job/packaging_linux/1025/), but CI jobs since then have failed.

nuclearsandwich (Member) commented Apr 2, 2018

Discussions of the docker error above all seem to end with "I restarted docker and it came back" (one example). Some reports suggest this can occur if the docker daemon can't get as much memory as it wants, which might've been the case after whatever error state caused the running job to fail.

dhood (Member, Author) commented Apr 2, 2018

Builds were only failing on the node that had run the job that I terminated.

docker service restart on the node appears to have resolved it.

dhood closed this as completed Apr 2, 2018
nuclearsandwich (Member):

> docker service restart on the node appears to have resolved it.

Would this have been service docker restart? The docker service ... commands are related to docker swarm and, to the best of my knowledge, restart isn't a supported subcommand. Since Ubuntu Xenial uses systemd, the service command serves as a wrapper / init compatibility layer around systemctl. systemctl restart docker is what I'd recommend in the future, though there isn't likely a huge difference on the current system.
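
For reference, a quick sketch of the two equivalent invocations on a systemd host (assuming the unit is simply named docker, which is the default for the Docker packages):

    # via the SysV compatibility wrapper
    sudo service docker restart

    # directly through systemd (what I'd recommend)
    sudo systemctl restart docker
    sudo systemctl status docker    # confirm the daemon came back up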

dhood (Member, Author) commented Apr 2, 2018

yep, service docker restart*

dhood (Member, Author) commented Apr 3, 2018

Also, regarding your comment about not getting enough memory: the build that I terminated had indeed failed because it ran out of RAM. From what I can see we are not limiting the RAM permitted for the container so it should use all that's available (otherwise I'd suggest upping the limit).

@nuclearsandwich do you know if the linux nodes are configured with the same RAM? I'm wondering why node linux-c1657d00 has issues building the bridge but linux-707ef141 doesn't (it might have just been bad luck; I'm not sure how repeatable these issues are).

nuclearsandwich (Member):

> do you know if the linux nodes are configured with the same RAM?

All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)

> From what I can see we are not limiting the RAM permitted for the container so it should use all that's available (otherwise I'd suggest upping the limit).

We currently don't place any resource limits on running containers. However, if memory spikes like this become frequent, it might be wise to reserve some memory for the system so that docker doesn't get into this particular snit. If the docker daemon crashed outright, the supervisor would just restart it; instead it stays up in an unrecoverable state.
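
As a rough sketch (the limit values here are placeholders, not something we currently set), capping the container's memory would let the kernel kill the build container instead of starving dockerd:

    # hypothetical limits; the real docker run invocation lives in the CI scripts
    # --memory is a hard cap: the container gets killed if it exceeds it,
    # which leaves headroom for the docker daemon and the Jenkins agent
    docker run --rm --memory=6g --memory-swap=6g ros2_batch_ci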

dhood (Member, Author) commented Apr 3, 2018

Yeah, I was surprised that the build was hanging instead of just failing; that might be why. I won't do anything for now, but I'll mention it in the handoff ticket for the next buildcop to keep an eye on. Thanks for the input!

dhood (Member, Author) commented Apr 4, 2018

This one failed! https://ci.ros2.org/job/ci_packaging_linux/70/console. It was on linux-64e6f5f3. After that, this one seemed to run fine on the same node, without me needing to restart docker: https://ci.ros2.org/job/ci_linux/4176/.

So I'm drawing two conclusions:
(1) the builds are hanging these days (we've been adding new messages to the bridge lately, e.g. ros2/ros1_bridge#106, ros2/ros2#477), and
(2) the job failing outright looks better than the job hanging.

Still not sure why 707ef141 is fine though..?

dhood reopened this Apr 4, 2018
dhood changed the title from "Linux jobs failing with Docker connection error" to "Linux packaging jobs failing/hanging when building the bridge" Apr 4, 2018
nuclearsandwich (Member):

https://ci.ros2.org/job/packaging_linux/1030 hung and restarting dockerd on linux-64e6f5f3 was required.

nuclearsandwich (Member) commented Apr 16, 2018

This weekend brought another round of hangs during packaging. It's unclear whether the currently running build will complete, but this is now at least a weekly occurrence.

The linux hosts are c5.large machines with 4 GiB of RAM, but that no longer appears to be sufficient to run Docker, run the Jenkins agent, and build the bridge. As mentioned previously, we could make the failure easier to recover from by restricting the container memory (which would have docker kill the container if it exceeds the threshold), or we would need to move to a larger instance size to accommodate the builds as-is. There's a substantial jump in specs and price between the c5.large we're using and the c5.xlarge (https://aws.amazon.com/ec2/pricing/on-demand/). It would also be possible to move laterally to an r4.large instance type, which would give us 15.25 GiB of memory, but we'd lose an elastic compute unit, reducing our overall CPU power relative to the current c5.large instance type.

Edit: the build above failed due to complications of the pypi.org warehouse cutover. A packaging job has since succeeded, so this issue is still frequent but not persistent.

dhood (Member, Author) commented Apr 17, 2018

@nuclearsandwich, just want to point out that you previously said:

> All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)

in case there's actually a mix of node configurations (which would explain why some builds run fine but not others).

Anyhow, IMO we could survive with only one linux node capable of running packaging/ci_packaging jobs, and just restrict the label. That could be a/the baseline "always on" node if appropriate, and just have the others as elastic.

nuclearsandwich (Member):

> @nuclearsandwich, just want to point out that you previously said:

> All linux hosts are c5.xlarge AWS instances (4 vCPUs, 8GiB RAM)

Thanks for pointing that out. Looks like I was previously mistaken: the scaling group configuration that all linux hosts are part of is for c5.large instances. I either didn't fact-check myself or was looking at a buildfarm configuration, which uses the c4.xlarge/c5.xlarge instance types. We could add one of these larger instances, or we could move the entire configuration up to c5.xlarge and pay the associated cost. We've gotten a lot of value from a homogeneous configuration, but we're not going to want to move every host to a big GPU instance just to run rviz display tests on Linux occasionally.

dirk-thomas (Member):

I was shelled into the machine shortly before and while it "planked" on https://ci.ros2.org/view/colcon/job/colcon_ci_packaging_linux/14/consoleFull. Here's what I noticed:

  • When building a single cpp file containing the factories of one package, a single thread used somewhere between 30% and 65% of the available memory (the machine had 4 GiB overall):

    /usr/lib/gcc/x86_64-linux-gnu/5/cc1plus -fpreprocessed /home/rosbuild/.ccache/tmp/std_msgs_f.stdout.79d6b8e6cbb2.6109.BbXvPA.ii -quiet -dumpbase std_msgs_f.stdout.79d6b8e6cbb2.6109.BbXvPA.ii -mtune=generic -march=x86-64 -auxbase-strip CMakeFiles/ros1_bridge.dir/generated/std_msgs_factories.cpp.o -g -O2 -Wall -Wextra -Wno-unused-parameter -std=gnu++14 -fPIC -fstack-protector-strong -Wformat-security -o /tmp/cc4gk954.s

Since the colcon branch was building with two threads, that obviously pushed the machine over the cliff. This has been fixed on the colcon branch by now, but I thought the general information about the memory usage might be helpful.
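
If this comes up again outside the colcon branch, one possible mitigation (just a sketch, not what the CI scripts currently do) would be to throttle parallelism for the bridge so only one of these large translation units is compiled at a time:

    # sketch: build ros1_bridge with a single worker and a single make job
    export MAKEFLAGS=-j1
    colcon build --packages-select ros1_bridge --executor sequential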

sloretz commented Jan 4, 2019

Saw the same error, but not while building the bridge: https://ci.ros2.org/job/ci_linux/5978/console

It happened very early in the build:

# BEGIN SECTION: Run Dockerfile
21:07:35 + export CONTAINER_NAME=ros2_batch_ci
21:07:35 + docker network create -o com.docker.network.bridge.enable_icc=false isolated_network
21:07:35 Error response from daemon: network with name isolated_network already exists
21:07:35 + true
21:07:35 + id -u
21:07:35 + id -g
21:07:35 + pwd
21:07:35 + docker run --rm --net=isolated_network --privileged -e UID=1001 -e GID=1001 -e CI_ARGS=--do-venv --force-ansi-color --workspace-path /home/jenkins-agent/workspace/ci_linux --ignore-rmw rmw_fastrtps_dynamic_cpp rmw_opensplice_cpp --repo-file-url https://gist.githubusercontent.com/sloretz/3e69a10ac93b85f888168f467e81551a/raw/ddfef1b13627c22c4bda95d6ed9ea5f05e9edc9f/ros2.repos --isolated --build-args --event-handlers console_cohesion+ console_package_list+ --cmake-args -DINSTALL_EXAMPLES=OFF -DSECURITY=ON --packages-up-to rclcpp --test-args --event-handlers console_direct+ --executor sequential --retest-until-pass 10 --packages-up-to rclcpp -e CCACHE_DIR=/home/rosbuild/.ccache -i -v /home/jenkins-agent/workspace/ci_linux:/home/rosbuild/ci_scripts -v /home/jenkins-agent/.ccache:/home/rosbuild/.ccache ros2_batch_ci
21:07:36 docker: Error response from daemon: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused".
21:07:36 time="2019-01-04T21:07:36Z" level=error msg="error waiting for container: context canceled"
21:07:36 Build step 'Execute shell' marked build as failure
21:07:36 $ ssh-agent -k

sloretz commented Jan 17, 2019

Same error again, but early during a linux nightly (not while building the bridge):

https://ci.ros2.org/view/nightly/job/nightly_linux_debug/1074/console

07:30:11  ---> Running in b4dcb3fa0d68
07:30:12 connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"
07:30:12 Build step 'Execute shell' marked build as failure

sloretz commented Jan 17, 2019

Happened on two different linux swarm agents last night

https://ci.ros2.org/view/nightly/job/nightly_linux_extra_rmw_release/227/console

07:30:14 Step 40/54 : RUN echo "2019-01-17"
07:30:14  ---> Running in 0c924217f2f6
07:30:15 connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"

mjcarroll (Member):

I got this on a CI run recently:

https://ci.ros2.org/job/ci_linux/6044/console

And @nuclearsandwich had a resolution:

nuclearsandwich [2 days ago]
[...] That's a playbook issue. Usually means the node exhausted memory for a sec
nuclearsandwich [2 days ago]
Recovered that node with a little turning it off and on again (sudo systemctl restart docker)

nuclearsandwich (Member):

The return of these issues suggests that memory issues on our nodes have increased. We ought to schedule some time to determine whether that's due to infrastructure changes (Ubuntu and Java updates) or to increases in the actual build memory footprint.

nuclearsandwich (Member):

I figured out what's going on here. When I deployed the most recent round of new agents, I accidentally pulled in a group of smaller nodes from Feb 2018 instead of the freshly created configuration. We should see a downturn in the occurrences of this issue, although the question of memory consumption when building the bridge is still open.

dirk-thomas changed the title from "Linux packaging jobs failing/hanging when building the bridge" to "building the bridge requires a lot of memory" Apr 10, 2019
dirk-thomas self-assigned this Apr 12, 2019
dirk-thomas added the "in progress" (Actively being worked on) label Apr 12, 2019
dirk-thomas (Member):

Addressed during #168.

dirk-thomas added the "in review" (Waiting for review) label and removed the "in progress" (Actively being worked on) label Apr 12, 2019