This repository has been archived by the owner on Nov 24, 2022. It is now read-only.

Investigate lxc 1.0 / Ubuntu 13.10 issues #150

Closed
fgrehm opened this issue Sep 28, 2013 · 30 comments

Comments

@fgrehm
Owner

fgrehm commented Sep 28, 2013

Fork of @gwillem's comment on #129 (comment)

Same issue with lxc 1.0.0~alpha1-0ubuntu3 from Ubuntu 13.10 beta. @fgrehm's template yields another error:

 INFO subprocess: Starting process: ["/usr/bin/sudo", "/home/willem/bin/lxc-vagrant-wrapper", "lxc-start", "-d", "--name", "precise01-1379533350"]
DEBUG subprocess: Selecting on IO
DEBUG subprocess: stderr: lxc-start: command get_init_pid failed to receive response
DEBUG subprocess: Waiting for process to exit. Remaining to timeout: 31994
DEBUG subprocess: Exit status: 255
ERROR warden: Error occurred: There was an error executing ["sudo", "/home/willem/bin/lxc-vagrant-wrapper", "lxc-start", "-d", "--name", "precise01-1379533350"]

I then added the ubuntu-lxc PPA and installed version 1.0.0alpha1+staging20130916-1417-0ubuntu1ppa1saucy1, but now the error changed to:

lxc-start: command get_cgroup failed to receive response

@rrerolle also reported that he was having issues and @MSch has mentioned that "With lxc 1.0 lxc-shutdown has been renamed to lxc-stop"

@fgrehm
Owner Author

fgrehm commented Sep 28, 2013

@gwillem @rrerolle

After some quick debugging I found that the issue is around these lines, and I've got a potential fix: change those lines in your boxes' ~/.vagrant.d/boxes/<BOX>/lxc/lxc-template to:

    mkdir -p $rootfs
    (cd $rootfs && tar xfz $tarball --strip-components=2)

The problem is that lxc-create uses an intermediary directory under /usr/lib/x86_64-linux-gnu/lxc and the current code ends up extracting the rootfs contents on the wrong path. As a result /var/lib/lxc/<CONTAINER>/rootfs ends up empty and the container is not able to boot.
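In case it helps to see the flag in action, here is a throwaway demo of what --strip-components=2 does. Nothing here is vagrant-lxc specific; it just builds a tarball whose entries are nested two levels deep, which is my assumption about the box tarball's layout:

```shell
# Build a tarball with two-level nesting (a/b/file.txt), then show that
# --strip-components=2 drops the leading directories on extraction, so
# the file lands directly in $rootfs instead of $rootfs/a/b/.
set -e
work=$(mktemp -d)
mkdir -p "$work/src/a/b"
echo hello > "$work/src/a/b/file.txt"
tar -C "$work/src" -czf "$work/rootfs.tar.gz" a

rootfs="$work/rootfs"
mkdir -p "$rootfs"
(cd "$rootfs" && tar xfz "$work/rootfs.tar.gz" --strip-components=2)

ls "$rootfs"    # file.txt, with no a/b/ prefix left over
```

Without the flag, the contents would end up under $rootfs/a/b/, which is the wrong-path symptom described above.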

I'm marking this as a bug on the base boxes and will get to it when I have a chance. It would be awesome if someone could confirm that the fix works as well :)

@gwillem
Contributor

gwillem commented Sep 29, 2013

Fabio, thanks for your continuous efforts! I tried your latest template, including the dirname patch, but something went wrong; see https://gist.github.com/gwillem/6751384

I'm not sure what went wrong, as I can't find any relevant debug output, other than that the polling for success didn't work.
My next step would be to isolate lxc and see if that works on its own, but I'd have to figure out how to operate lxc manually first, and that will take some time.

Cheers!

@gwillem
Contributor

gwillem commented Sep 29, 2013

I think the problem lies within lxc (version 1.0.0alpha1.0+master20130927-1542-0ubuntu1ppa1saucy1):

# lxc-create --template ~willem/.vagrant.d/boxes/precise64/lxc/lxc-template --name test -f /home/willem/.vagrant.d/boxes/precise64/lxc/lxc.conf -- --tarball /home/willem/.vagrant.d/boxes/precise64/lxc/rootfs.tar.gz --auth-key /opt/vagrant/embedded/gems/gems/vagrant-1.3.3/keys/vagrant.pub
Extracting /home/willem/.vagrant.d/boxes/precise64/lxc/rootfs.tar.gz ...

##
# The default user is 'vagrant' with password 'vagrant'!
# Use the 'sudo' command to run tasks as root in the container.
##

# lxc-ls
test                                   
# lxc-info --name test
state:  STOPPED
Segmentation fault (core dumped)

After downgrading to the normal saucy package (not the latest builds) it works! Hurray!

@fgrehm
Owner Author

fgrehm commented Sep 29, 2013

@gwillem awesome! I just need to make sure it won't break things on other distros and will rebuild the base boxes again, tks for the feedback :)

@fgrehm
Owner Author

fgrehm commented Sep 29, 2013

s/rebuild/publish/

@rrerolle

Hi Fabio! Sadly, I had no luck with this patch. I get the same segfault as @gwillem with the lxc daily builds, and when reverting to saucy's stock package I still get the get_init_pid failed to receive response error.

I can confirm that I have a correctly installed rootfs in /var/lib/lxc/<CONTAINER>/rootfs, so I'm a little puzzled about what might have gone wrong.

@fgrehm
Owner Author

fgrehm commented Sep 30, 2013

@rrerolle that's pretty weird. I'll soon update my "vagrant-lxc" playground with support for saucy so we have common ground to debug the problem. I'll keep you posted :)

@leorochael

I've got a similar issue. LXC from ppa on Ubuntu Saucy.

After patching ~/.vagrant.d/boxes/<BOX>/lxc/lxc-template as instructed, it moved past lxc-start: command get_cgroup failed to receive response, but got stuck in a loop trying to obtain the IP address:

 INFO subprocess: Starting process: ["/usr/bin/sudo", "lxc-attach", "--name", "geekie_default-1381858322", "--namespaces", "NETWORK", "--", "/sbin/ip", "-4", "addr", "show", "scope", "global", "eth0"]
DEBUG subprocess: Selecting on IO
DEBUG subprocess: Waiting for process to exit. Remaining to timeout: 32000
DEBUG subprocess: Exit status: 0
 INFO retryable: Retryable exception raised: #<Vagrant::LXC::Errors::ExecuteError: There was an error executing lxc-attach

For more information on the failure, enable detailed logging by setting
the environment variable VAGRANT_LOG to DEBUG.>
 INFO subprocess: Starting process: ["/usr/bin/sudo", "lxc-attach", "--name", "geekie_default-1381858322", "--namespaces", "NETWORK", "--", "/sbin/ip", "-4", "addr", "show", "scope", "global", "eth0"]
DEBUG subprocess: Selecting on IO
DEBUG subprocess: Waiting for process to exit. Remaining to timeout: 32000
DEBUG subprocess: Exit status: 0
 INFO retryable: Retryable exception raised: #<Vagrant::LXC::Errors::ExecuteError: There was an error executing lxc-attach

For more information on the failure, enable detailed logging by setting
the environment variable VAGRANT_LOG to DEBUG.>
 INFO subprocess: Starting process: ["/usr/bin/sudo", "lxc-attach", "--name", "geekie_default-1381858322", "--namespaces", "NETWORK", "--", "/sbin/ip", "-4", "addr", "show", "scope", "global", "eth0"]
 INFO warden: Calling IN action: #<Vagrant::LXC::Action::FetchIpFromDnsmasqLeases:0x0000000146e4d8>
DEBUG fetch_ip_from_dnsmasq_leases: Attempting to load ip from dnsmasq leases (mac: 00:16:3e:49:a0:d9)
DEBUG fetch_ip_from_dnsmasq_leases: 
DEBUG fetch_ip_from_dnsmasq_leases: Ip could not be parsed from dnsmasq leases file
DEBUG fetch_ip_from_dnsmasq_leases: Attempting to load ip from dnsmasq leases (mac: 00:16:3e:49:a0:d9)
DEBUG fetch_ip_from_dnsmasq_leases: 
DEBUG fetch_ip_from_dnsmasq_leases: Ip could not be parsed from dnsmasq leases file

Indeed, if I try to run lxc-attach --name geekie_default-1381858322 -- /bin/bash and then ip addr, I get an output that indicates eth0 doesn't actually have an IPv4 address:

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
11: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:49:a0:d9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::216:3eff:fe49:a0d9/64 scope link 
       valid_lft forever preferred_lft forever

Trying to run dhclient manually results in the system not finding a "broadcast interface", which is odd considering eth0 is in the list above AND has a BROADCAST flag:

# dhclient -d
Internet Systems Consortium DHCP Client 4.2.4
Copyright 2004-2012 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

No broadcast interfaces found - exiting.

But pointing it specifically at eth0 works:

# dhclient -v eth0
Internet Systems Consortium DHCP Client 4.1-ESV-R4
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/00:16:3e:49:a0:d9
Sending on   LPF/eth0/00:16:3e:49:a0:d9
Sending on   Socket/fallback
DHCPREQUEST of 10.0.3.132 on eth0 to 255.255.255.255 port 67
DHCPACK of 10.0.3.132 from 10.0.3.1
bound to 10.0.3.132 -- renewal in 1594 seconds.

At this point, the vagrant up process breaks out of the loop above and proceeds. The same happens if I call ifup eth0 instead of dhclient eth0 above.

However, it then quickly gets stuck in another loop:

 INFO ssh: Attempting SSH connnection...
 INFO ssh: Attempting to connect to SSH...
 INFO ssh:   - Host: 10.0.3.132
 INFO ssh:   - Port: 22
 INFO ssh:   - Username: vagrant
 INFO ssh:   - Key Path: /home/leo/.vagrant.d/insecure_private_key
DEBUG ssh: == Net-SSH connection debug-level log START ==
DEBUG ssh: D, [2013-10-15T14:52:01.781569 #26439] DEBUG -- net.ssh.transport.session[11a4c20]: establishing connection to 10.0.3.132:22

DEBUG ssh: == Net-SSH connection debug-level log END ==
 INFO retryable: Retryable exception raised: #<Errno::ECONNREFUSED: Connection refused - connect(2)>
 INFO ssh: Attempting to connect to SSH...

This loop ends if I also run start ssh in the lxc-attach shell above, where I ran dhclient.

At this point vagrant up considers the machine online, and I can run vagrant ssh normally.

So, apparently, /sbin/init is neither configuring the network nor starting ssh. This could be because it doesn't think it's supposed to go into a runlevel or something. Indeed, calling runlevel returns unknown.
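To save the next reader some typing, the two manual steps above can be done in a single lxc-attach invocation from the host. The container name below is the one from this particular run, and the service names assume an Upstart-based guest like precise; treat this as a sketch of what I did by hand, not a proper fix:

```shell
# Attach to the running container and start what init did not:
# bring eth0 up via DHCP, then start the Upstart ssh job.
sudo lxc-attach --name geekie_default-1381858322 -- /bin/sh -c '
  ifup eth0    # grab a DHCP lease on eth0
  start ssh    # start sshd via its Upstart job
'
```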

@rcarmo

rcarmo commented Oct 15, 2013

Hi. I'm investigating this as well. It happened (also) on Ubuntu 13.04 after upgrading to lxc 0.9.0-0ubuntu-3.5, and I have the following suspicions:

  • A change in AppArmor profiles (haven't tested disabling it yet since I'm on a Mac and currently building a new Linux VM from scratch to reproduce this)
  • Some missing/changed setting in LXC config that pertains to LXC network interfaces and/or broadcast.

(@fgrehm, it's worth noting either of these would probably bite us if lxcbr0 isn't around ;))

We did manage to get Vagrant-LXC working by manually specifying an IP address for the LXC provider to use in the Vagrantfile - the container would still try to talk to dnsmasq, but it would have a valid IP when it gave up.
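A hedged sketch of what that Vagrantfile change might look like: `lxc.customize` does exist in vagrant-lxc and maps to `lxc.network.*` keys in the container config, but the exact key name and the address below are assumptions, so double-check against your vagrant-lxc version:

```ruby
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"
  config.vm.provider :lxc do |lxc|
    # Pin a static IPv4 on lxcbr0's subnet instead of waiting on dnsmasq.
    # Key name is an assumption; it becomes lxc.network.ipv4 in the config.
    lxc.customize "network.ipv4", "10.0.3.50/24"
  end
end
```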

IIRC, there were some recent changes to LXC in order to prevent containers from messing up host networking. They might have a bearing on this.

@rcarmo

rcarmo commented Oct 15, 2013

I think I can rule out AppArmor: I used both aa-complain and aa-disable on both lxc-start and dnsmasq, to no avail.

Based on this I'm thinking it's an LXC thing. Going to try a vanilla container next.

@rcarmo

rcarmo commented Oct 15, 2013

Okay. I fixed my problem by following the advice on #153: adding the checksum rule to iptables made things work for me. So this can most likely be fixed inside vagrant-lxc itself! :)
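For anyone landing here without reading #153: the workaround there amounts to a single iptables rule on the host that fills in the UDP checksums on DHCP replies, which dnsmasq otherwise leaves blank. Reproducing it here as a sketch; check #153 for the exact form used there:

```shell
# Fill in checksums on UDP packets headed for DHCP clients (port 68)
# so the container's dhclient does not discard the replies.
sudo iptables -t mangle -A POSTROUTING -p udp --dport 68 -j CHECKSUM --checksum-fill
```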

@leorochael

Ok, so in my situation above, I don't have a problem with UDP or the bridge.

Doing the workarounds I mentioned above on every boot (ifup eth0; start ssh) allowed me to get to the point of calling vagrant ssh. But I then had problems with Python semaphores not working. I traced that to /run not being mounted as tmpfs, which reinforces the notion that the boot process for the container was incomplete.

Trying to mount tmpfs on /run by hand failed with a weird message claiming first that none was read-only, and then that none was inaccessible. I suspected some sort of permission problem and looked for clues in /var/log/kern.log, where I found AppArmor complaining that it had blocked lxc-start from mounting filesystems.

So I did aa-complain /usr/bin/lxc-start, and now vagrant up runs to completion without issues, and I can immediately call vagrant ssh.

Since lxc-start is now in AppArmor complain mode, I get the following messages during vagrant up:

Oct 17 14:25:55 pelican kernel: [  235.775550] type=1400 audit(1382030755.726:73): apparmor="ALLOWED" operation="getattr" info="Failed name lookup - disconnected path" error=-13 parent=1824 profile="/usr/bin/lxc-start" name="dev/lxc/console" pid=3624 comm="lxc-start" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Oct 17 14:25:55 pelican kernel: [  235.883573] type=1400 audit(1382030755.832:74): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/proc/sys/fs/binfmt_misc/" pid=3691 comm="mount" fstype="binfmt_misc" srcname="none" flags="rw, nosuid, nodev, noexec"
Oct 17 14:25:55 pelican kernel: [  235.886885] type=1400 audit(1382030755.836:75): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/sys/fs/fuse/connections/" pid=3692 comm="mount" fstype="fusectl" srcname="none" flags="rw"
Oct 17 14:25:55 pelican kernel: [  235.896210] type=1400 audit(1382030755.848:76): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/sys/kernel/security/" pid=3694 comm="mount" fstype="securityfs" srcname="none" flags="rw"
Oct 17 14:25:55 pelican kernel: [  235.901495] type=1400 audit(1382030755.852:77): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/run/" pid=3697 comm="mount" fstype="tmpfs" srcname="none" flags="rw, nosuid, noexec"
Oct 17 14:25:55 pelican kernel: [  235.908796] type=1400 audit(1382030755.860:78): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/run/lock/" pid=3699 comm="mount" fstype="tmpfs" srcname="none" flags="rw, nosuid, nodev, noexec"
Oct 17 14:25:55 pelican kernel: [  235.915588] type=1400 audit(1382030755.864:79): apparmor="ALLOWED" operation="mount" info="failed type match" error=-13 parent=3675 profile="/usr/bin/lxc-start" name="/run/shm/" pid=3703 comm="mount" fstype="tmpfs" srcname="none" flags="rw, nosuid, nodev, noexec"
Oct 17 14:31:53 pelican kernel: [  593.747566] perf samples too long (2531 > 2500), lowering kernel.perf_event_max_smple_rate to 50000

So, apparently, my AppArmor profiles are incomplete for running lxc, or at least for running lxc with the official precise64 box from vagrant-lxc.

To recap my setup:

@fgrehm
Owner Author

fgrehm commented Oct 17, 2013

@leorochael tks for all the info :)

Would you be able to create and use a "vanilla container" from scratch, to check whether things work fine over there? That way we can be sure the problem is in vagrant-lxc / the base boxes.

I'm still using Ubuntu 13.04 + stock lxc 0.9.0 down here and don't plan to upgrade to 13.10 before the end of the year; I fear it might impact my current workflow, as things are working just fine for me and I haven't experienced any of the bugs you guys have been reporting.

If you or someone else can get me reproducible steps on a Vagrant VBox VM, it will make things a lot easier for me to debug. For a head start you might want to check https://github.com/fgrehm/vagrant-lxc-vbox-hosts =]

@gwillem
Contributor

gwillem commented Oct 18, 2013

I upgraded my 13.04 office desktop today, to discover a snakepit :)
Initially (with the latest lxc-template from master) I get this:

      lxc-start 1382103524.080 ERROR    lxc_conf - No such file or directory - failed to mount 'proc' on '/usr/lib/x86_64-linux-gnu/lxc/proc'
      lxc-start 1382103524.080 ERROR    lxc_conf - failed to setup the mount entries for 'deploy_precise01-1382103020'

With aa-complain /usr/bin/lxc-start this is resolved, but now I get:

      lxc-start 1382104309.847 INFO     lxc_cgroup - cgroup has been setup
      lxc-start 1382104309.874 DEBUG    lxc_conf - move '(null)' to '16910'
      lxc-start 1382104309.900 ERROR    lxc_sync - invalid sequence number 1. expected 2
      lxc-start 1382104309.900 WARN     lxc_conf - failed to remove interface '(null)'
      lxc-start 1382104309.930 ERROR    lxc_start - failed to spawn 'deploy_precise01-1382103603'
      lxc-start 1382104309.930 ERROR    lxc_commands - command get_init_pid failed to receive response

My home desktop buzzes along fine with 13.10 alpha 2 though. I'll try to investigate the differences.

@fgrehm
Owner Author

fgrehm commented Oct 19, 2013

Quick update: I did some testing with the saucy VBox VM available on vagrant-lxc-vbox-hosts and after patching the lxc-template as I previously pointed out things worked out just fine.

@rcarmo

rcarmo commented Oct 19, 2013

Hmmm. Where's that patch again?

@fgrehm
Owner Author

fgrehm commented Oct 19, 2013

Oh, right above on #150 (comment). I'll copy & paste it here to make things easier, as the thread is too big already :P

... around these lines and I've got a potential fix for it which is to change those lines under your boxes' ~/.vagrant.d/boxes/<BOX>/lxc/lxc-template to:

    mkdir -p $rootfs
    (cd $rootfs && tar xfz $tarball --strip-components=2)

I'll try to release new boxes with that built in over the week :)

@gwillem
Contributor

gwillem commented Oct 20, 2013

FYI, I fixed my broken Ubuntu 13.10 install (upgraded from alpha2) by reinstalling all the lxc components:

apt-get --purge remove lxc
apt-get --purge autoremove
rm -rf /usr/lib/python3/dist-packages/lxc
apt-get install lxc

The versions of Vagrant, Vagrant-lxc and LXC were the same but apparently something went wrong during the upgrade path of the 13.10 prerelease packages.

@leorochael

@gwillem, are you using stock lxc from 13.10 or the daily ppa for lxc?

@gwillem
Contributor

gwillem commented Oct 20, 2013

Stock!

@gwillem
Contributor

gwillem commented Oct 21, 2013

BTW, I also had to remove ~/.vagrant.d and $PROJECTROOT/.vagrant to get it to work

@FrancoTampieri

My Configuration:

  • Saucy Salamander, on kernel 3.11.0-12-generic
  • lxc 1.0.0alpha2+master20131019-0306 from ppa:ubuntu-lxc/stable
  • Using the rootfs fix above from @fgrehm
  • Using the script from vagrant-lxc to build an amd64 lxc container

Now when I create the container with vagrant up, I get this:

>vagrant up
Bringing machine 'default' up with 'lxc' provider...
[default] Importing base box 'vagrant-lxc-precise-amd64'...
[default] Setting up mount entries for shared folders...
[default] -- /vagrant
[default] -- /tmp/vagrant-notify
[default] Starting container...


There was an error executing lxc-attach

For more information on the failure, enable detailed logging by setting
the environment variable VAGRANT_LOG to DEBUG.

After when I start the con

And this is the full output with LOG level DEBUG pastebin.com/KVUwYGZg

I can't login to the container :(

Regards

@fgrehm
Owner Author

fgrehm commented Oct 21, 2013

@drdran looks like you have been bitten by #153. For potential workarounds, you might want to try this or this.

Please let us know if it works for you too :)

@fgrehm
Owner Author

fgrehm commented Oct 24, 2013

Folks, I've done some initial testing with the patch on lxc-template and the other patch related to the missing lxc-shutdown command, and things seem to be working fine. I'd love a 👍 from someone before closing this issue and pushing a new version to RubyGems :)

Feel free to pick a box on https://github.com/fgrehm/vagrant-lxc/wiki/Base-boxes, install the plugin from source, and please LMK how it goes!

@gwillem
Contributor

gwillem commented Oct 25, 2013

👍 🐹

Works like a charm! Tested with host 13.10 and base box http://bit.ly/vagrant-lxc-precise64-2013-10-23

Thanks Fabio for your continuous efforts!

@fgrehm
Owner Author

fgrehm commented Oct 25, 2013

That's awesome! I'll push a new release to rubygems as soon as I get to a computer :)
Thanks for trying it out!


@fgrehm
Owner Author

fgrehm commented Oct 28, 2013

0.6.4 is out! This thread has grown too big so please open up a new issue if you still have problems over there ;)

@fgrehm fgrehm closed this as completed Oct 28, 2013
@gwillem
Contributor

gwillem commented Nov 5, 2013

For people (such as my colleague ;) ) Googling this error lxc-start: command get_cgroup failed to receive response and ending up on this thread: the LXC error is a very generic one and could have many causes. You could try debugging LXC first with:

LXC_START_LOG_FILE=/tmp/lxc-start.log VAGRANT_LOG=debug vagrant up

See also: https://github.com/fgrehm/vagrant-lxc/wiki/Troubleshooting

@fgrehm
Owner Author

fgrehm commented Nov 5, 2013

Thanks for helping out! I've just added another trick over there, which is to start the container in the foreground with lxc-start :)

@vdloo

vdloo commented Nov 2, 2015

Thanks @gwillem, I googled this error!
