Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long running container.execute() fails with "NotFound" #280

Closed
jochumdev opened this issue Mar 2, 2018 · 20 comments
Closed

Long running container.execute() fails with "NotFound" #280

jochumdev opened this issue Mar 2, 2018 · 20 comments
Milestone

Comments

@jochumdev
Copy link
Contributor

I try to run a script that installs a lot of packages through container.execute() but it fails with a pylxd.exceptions.NotFound Exception.

After digging around i found that getting the result from the operation fails, see L299.

Running the same script through lxc exec bootstraptest /root/bootstrap.sh works fine.

@ajkavanagh
Copy link
Contributor

@pcdummy that sounds interesting. Can you supply some more details please:

  • Which version of pylxd
  • Which version of lxd/lxc

Roughly, how to reproduce it. i.e. what types of scripts are being run, how many in the batch, etc.

I'm just trying to get a handle on what could be going on. It looks like lxd may be clearing out the operation_id before pylxd can get a hold of it, or some other problem in lxd. Knowing the version will really help.

@maikeueule
Copy link

maikeueule commented Mar 2, 2018

This seems related to an issue i have observed across all versions of LXD/pylxd but unfortunately I could not so far reproduce it easily to publish it. In my case it occurred both when using lxc exec through ssh, or through pylxd, mostly with commands that terminate fast (eg. ip link show).
https://github.com/lxc/lxd/issues/2797
My initial idea was that it occurs when the websocket enters timeout mode (5s), and the operation in LXD has just been purged (after 5s too).

@jochumdev
Copy link
Contributor Author

jochumdev commented Mar 2, 2018

@ajkavanagh Thanks for your fast response.
@maikeueule Yeah, its very hard to reproduce.

I'm using LXD 2.21 and pylxd 2.2.5 on python 2.7 and ws4py 0.5.1 atm.

I can't reproduce the bug (tried over 2 hours to find a simple case to reproduce), it happens all time in my saltstack bootstrap script but not outside of it.

I tried it with 'sleep' and running my script manualy (over salt), both works as excepted. But when i run
/bin/sleep 2 first then my bootstrap.sh then salt-call state.apply it fails.

@ajkavanagh
Copy link
Contributor

@pcdummy thanks for the updated info. Yup, it looks like a hard-to-reproduce, but very annoying bug.

@mkorcha
Copy link

mkorcha commented Mar 10, 2018

I've run into this issue myself, using LXD version 2.0.11 on Ubuntu and the latest pylxd. An example of what I'm doing that reproduces it for me (given that lxd is a pylxd Client):

base = lxd.containers.create({
		'name': 'ubuntu-1604',
		'source': {
			'type': 'image',
			'protocol': 'simplestreams',
			'server': 'https://images.linuxcontainers.org',
			'alias': 'ubuntu/xenial/amd64'
		}}, wait=True)
base.start(wait=True)
# wait for network to come online
while len(base.state().network['eth0']['addresses']) < 2:
	time.sleep(1)
commands = [['apt-get update'],
		    ['apt-get', 'install', 'openssh-server', 'sudo', 'man', '-y']]
for command in commands:
	base.execute(command)

Hopefully this is helpful!

@maikeueule
Copy link

I confirm i could reproduce with lxd 2.21 from xenial-backports.

@ajkavanagh
Copy link
Contributor

@mkorcha thanks for the example code. I've reproduced the problem using it (there's a slight bug in the first command ['apt-get update'] which needs to be ['apt-get', 'update']).

The good news, is that I've also found and fixed the problem. I'll get a patch set up as soon as I can, once I've done some more rigorous testing with it. It's not a bug in lxd; very much in when the sockets are being closed before fetching the result code from lxd.

@maikeueule
Copy link

Great, i'll be eager to see the patch. As i had the same issue without pylxd, is there a chance the same bug can exist with the sockets of the regular lxd client ?

@ajkavanagh
Copy link
Contributor

@maikeueule hmm, interesting. The bug was a race condition in pylxd in closing the websockets for stderr and stdout and then asking for the return code of the command. I switched it around, so that it leaves the sockets open at the pylxd end until it's got the return code, and then closes them. There still could be a race if lxd closes the sockets and clears out its data before pylxd has managed to get the return code; I'm staring at the lxd socket code to see if it does that!

@maikeueule
Copy link

I had already tried such a fix before you added the ws4py manager. It only partially solved the problem for me. What I observe is a timeout of the websocket which is 5 seconds. Then upon closing them, we try to get the result from LXD, and since the operation terminated about 5s ago, it was removed from the list, and hence the NotFound exception.
My gut feeling was that it was related to SSL somehow... but i might be completely wrong. Also I am very unfamilar with Go which does not help.

@maikeueule
Copy link

@ajkavanagh
Copy link
Contributor

I spent a few hours trying to break the lxc client in the same way and it appears to be rock solid. Tried all sorts of short and long running scripts, with no output and lots of output from the commands, but the lxc client appeared to behave itself. Just sorting out the tests on this change and then I'll PR it.

@mkorcha
Copy link

mkorcha commented Mar 20, 2018

Thank you for your work on this! :)

@jochumdev
Copy link
Contributor Author

Yes, THANKS! :)

@edgarmarkosov
Copy link

edgarmarkosov commented Mar 22, 2018

Can you please tell when this fix will be released?

@ajkavanagh
Copy link
Contributor

@edgarmarkosov I'm trying to shepherd through a couple of PRs and then do a release. It could be a couple of weeks yet. If you need this desperately then you can always pip install from github with the commit sha:

pip install git+git://github.com/lxc/pylxd@<commit>

will let you choose where you want to grab from.

@ajkavanagh ajkavanagh added this to the 2.2.7 milestone Apr 17, 2018
@shuhaowu
Copy link

shuhaowu commented Jul 4, 2018

Not sure if this is fixed.. With 2.2.7, now execute just simply doesn't return for me anymore: it simply stalls forever on the network request.

I'm using lxd 3.2 via the snap install on xenial.

@edgarmarkosov
Copy link

@shuhaowu I also have a problem with the execute using pylxd 2.2.7 with lxd 3.0.1 on ubuntu 18.04. The simplest execute commands last about 2 minutes or more.

@marcogh
Copy link

marcogh commented Jul 4, 2018

same propblem here, with:

lxd: deb package 3.0.1-0ubuntu1
pylxd: pip install pylxd==2.2.7

solved going back to pylxd==2.2.6

running this simple script:

import logging
logging.basicConfig(level=logging.DEBUG)
import pylxd
client = pylxd.Client()
config = {'name': 'c2', 'source': {'type': 'image', "mode": "pull", "server":
    "https://cloud-images.ubuntu.com/daily", "protocol": "simplestreams", 'alias': 'bionic/amd64'}}
container = client.containers.create(config, wait=True)
container.status
container.start(wait=True)
container.status
container.execute(['ls','/'])

executing the script needs 0 seconds on 2.2.6, but it needs over 60 seconds with 2.2.7

debug execution with 2.2.6:

DEBUG:urllib3.connectionpool:http://localhost:None "POST /1.0/containers/c2/exec HTTP/1.1" 202 796
INFO:ws4py:Using epoll
INFO:ws4py:Managing websocket [Bound to '']
INFO:ws4py:Managing websocket [Bound to '']
INFO:ws4py:Removing websocket [Bound to '']
INFO:ws4py:Removing websocket [Bound to '']
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/e99c59dc-178d-4696-a2d0-5584eeff57cb HTTP/1.1" 200 446

debug execution with 2.2.7:

DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:http://localhost:None "POST /1.0/containers HTTP/1.1" 202 497
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/8fe6aa6c-5f2c-4fe8-adc4-3373317a6ce1 HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/8fe6aa6c-5f2c-4fe8-adc4-3373317a6ce1/wait HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/8fe6aa6c-5f2c-4fe8-adc4-3373317a6ce1 HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/containers/c2 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:http://localhost:None "PUT /1.0/containers/c2/state HTTP/1.1" 202 497
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/015471d4-6fc7-4a02-bb72-c80eb9488912 HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/015471d4-6fc7-4a02-bb72-c80eb9488912/wait HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/015471d4-6fc7-4a02-bb72-c80eb9488912 HTTP/1.1" 200 434
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/containers/c2 HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:http://localhost:None "POST /1.0/containers/c2/exec HTTP/1.1" 202 796
INFO:ws4py:Using epoll
INFO:ws4py:Managing websocket [Bound to '']
INFO:ws4py:Managing websocket [Bound to '']
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/26a0cce1-5060-4e7f-aec4-4a290c9c17f2 HTTP/1.1" 200 733
DEBUG:urllib3.connectionpool:http://localhost:None "GET /1.0/operations/26a0cce1-5060-4e7f-aec4-4a290c9c17f2 HTTP/1.1" 200 446
INFO:ws4py:Terminating websocket [Bound to '']
INFO:ws4py:Terminating websocket [Bound to '']

@ajkavanagh
Copy link
Contributor

I've done some testing, and there's definitely something odd going on:

  • Xenial + lxd 2.0.11 -- no delays at all
  • Xenial + lxd 3.2 (snap) -- long delays
  • Bionic + lxd 3.0.1 -- long, long, long delays
  • Bionic + lxd 3.2 (snap) -- long delays

i.e. sadly something has changed between 2.x and 3.x in lxd; I'll open a new bug and dig into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants