Long running container.execute() fails with "NotFound" #280
Comments
@pcdummy that sounds interesting. Can you supply some more details please:
Roughly, how to reproduce it, i.e. what types of scripts are being run, how many in the batch, etc. I'm just trying to get a handle on what could be going on. It looks like lxd may be clearing out the operation_id before pylxd can get hold of it, or there is some other problem in lxd. Knowing the version will really help.
This seems related to an issue I have observed across all versions of LXD/pylxd, but unfortunately I have so far been unable to reproduce it easily enough to publish it. In my case it occurred both when using lxc exec through ssh and through pylxd, mostly with commands that terminate fast (e.g. ip link show).
@ajkavanagh Thanks for your fast response. I'm using LXD 2.21 and pylxd 2.2.5 on python 2.7 and ws4py 0.5.1 atm. I can't reproduce the bug (I tried for over 2 hours to find a simple case to reproduce it); it happens all the time in my saltstack bootstrap script but not outside of it. I tried it with 'sleep' and by running my script manually (over salt); both work as expected. But when I run
@pcdummy thanks for the updated info. Yup, it looks like a hard-to-reproduce but very annoying bug.
I've run into this issue myself, using LXD version 2.0.11 on Ubuntu and the latest pylxd. An example of what I'm doing that reproduces it for me (given that
Hopefully this is helpful!
I confirm I could reproduce it with lxd 2.21 from xenial-backports.
@mkorcha thanks for the example code. I've reproduced the problem using it (there's a slight bug in the first command The good news is that I've also found and fixed the problem. I'll get a patch set up as soon as I can, once I've done some more rigorous testing with it. It's not a bug in lxd; the problem is that the sockets are being closed before the result code is fetched from lxd.
Great, I'll be eager to see the patch. As I had the same issue without pylxd, is there a chance the same bug exists with the sockets of the regular lxd client?
@maikeueule hmm, interesting. The bug was a race condition in pylxd: it closed the websockets for stderr and stdout and then asked for the return code of the command. I switched it around, so that it leaves the sockets open at the pylxd end until it's got the return code, and then closes them. There could still be a race if lxd closes the sockets and clears out its data before pylxd has managed to get the return code; I'm staring at the lxd socket code to see if it does that!
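To make the ordering issue described above concrete, here is a toy model in plain Python. It is not pylxd's or lxd's code: `FakeServer`, its cleanup rule (forget the operation once the last socket closes) and the `NotFound` class are all assumptions made up for illustration. It only shows why "fetch the return code, then close" is safe where "close, then fetch" is not.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the error raised when an operation is gone."""


class FakeServer:
    """Toy server that forgets an operation once every socket is closed."""

    def __init__(self):
        self.operations = {"op-1": {"return": 0}}
        self.open_sockets = {"stdout", "stderr"}

    def close_socket(self, name):
        self.open_sockets.discard(name)
        if not self.open_sockets:
            # Assumption: cleanup is triggered by the last socket closing.
            self.operations.clear()

    def get_return_code(self, op_id):
        try:
            return self.operations[op_id]["return"]
        except KeyError:
            raise NotFound(op_id)


def close_then_fetch(server):
    """Old ordering: close the sockets first, then ask for the result."""
    for name in ("stdout", "stderr"):
        server.close_socket(name)
    return server.get_return_code("op-1")


def fetch_then_close(server):
    """Switched ordering: get the result while the sockets are still open."""
    code = server.get_return_code("op-1")
    for name in ("stdout", "stderr"):
        server.close_socket(name)
    return code
```

In this model `fetch_then_close(FakeServer())` returns the stored return code, while `close_then_fetch(FakeServer())` raises `NotFound`. A real server may clean up on its own schedule, which is the remaining race mentioned in the comment.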
I had already tried such a fix before you added the ws4py manager. It only partially solved the problem for me. What I observe is a websocket timeout, which is 5 seconds. Then, upon closing the websockets, we try to get the result from LXD, and since the operation terminated about 5s ago, it has been removed from the list, hence the NotFound exception.
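The timeout scenario can be sketched in the same spirit. The 5 second websocket timeout comes from the comment above; everything else here is hypothetical, including the `OperationTable` class and the 3 second retention value, which is an arbitrary number chosen only so that it is shorter than the timeout. Time is passed in explicitly so the example does not need to sleep.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the error raised when an operation is gone."""


class OperationTable:
    """Toy table that keeps finished operations for a limited time only."""

    def __init__(self, retention):
        self.retention = retention  # seconds a finished operation is kept
        self._finished = {}         # op_id -> (finished_at, return_code)

    def finish(self, op_id, return_code, now):
        self._finished[op_id] = (now, return_code)

    def get_return_code(self, op_id, now):
        entry = self._finished.get(op_id)
        if entry is None or now - entry[0] > self.retention:
            # The record is missing or has expired, so drop it and fail.
            self._finished.pop(op_id, None)
            raise NotFound(op_id)
        return entry[1]


WEBSOCKET_TIMEOUT = 5.0  # seconds, as reported in the thread
RETENTION = 3.0          # seconds, an assumed value for illustration only

table = OperationTable(retention=RETENTION)
table.finish("op-1", return_code=0, now=100.0)

# Asking shortly after the command finished still works.
early = table.get_return_code("op-1", now=100.5)

# Asking only after the websocket timeout has elapsed is too late.
try:
    table.get_return_code("op-1", now=100.0 + WEBSOCKET_TIMEOUT)
    late = "found"
except NotFound:
    late = "not found"
```

Under these assumptions the early lookup succeeds and the late one fails, which matches the reported symptom for fast commands: the command is long finished by the time the client asks for its result.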
I spent a few hours trying to break the lxc client in the same way and it appears to be rock solid. I tried all sorts of short and long running scripts, with no output and with lots of output from the commands, but the lxc client behaved itself. I'm just sorting out the tests on this change and then I'll PR it.
Thank you for your work on this! :)
Yes, THANKS! :)
Can you please tell us when this fix will be released?
@edgarmarkosov I'm trying to shepherd through a couple of PRs and then do a release. It could be a couple of weeks yet. If you need this desperately then you can always pip install from github with the commit sha: pip install git+git://github.com/lxc/pylxd@<commit> will let you choose where you want to grab from.
Not sure if this is fixed. With 2.2.7, execute simply doesn't return for me anymore: it stalls forever on the network request. I'm using lxd 3.2 via the snap install on xenial.
@shuhaowu I also have a problem with execute, using pylxd 2.2.7 with lxd 3.0.1 on ubuntu 18.04. Even the simplest execute commands take about 2 minutes or more.
Same problem here, with:
Solved by going back to pylxd==2.2.6, running this simple script:
Executing the script takes 0 seconds on 2.2.6, but over 60 seconds with 2.2.7.
Debug execution with 2.2.6:
Debug execution with 2.2.7:
I've done some testing, and there's definitely something odd going on:
i.e. sadly something has changed between 2.x and 3.x in lxd; I'll open a new bug and dig into it.
I'm trying to run a script that installs a lot of packages through container.execute(), but it fails with a pylxd.exceptions.NotFound exception.
After digging around I found that getting the result from the operation fails, see L299.
Running the same script through
lxc exec bootstraptest /root/bootstrap.sh
works fine.
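For anyone hitting the same failure, a small wrapper like the one below may help tell "the command failed" apart from "the command ran but its result could not be fetched". It is only a sketch: `run_and_report` is a hypothetical helper, not part of pylxd, and the exception class is passed in so the function can be exercised without a running LXD daemon. It deliberately does not retry, since re-running a bootstrap script after a lost result could apply it twice.

```python
def run_and_report(container, argv, not_found_exc):
    """Run argv via container.execute() and report a lost result separately.

    container     -- any object with an execute(argv) method
    not_found_exc -- exception class that signals "operation not found"
    """
    try:
        result = container.execute(argv)
    except not_found_exc as exc:
        # The command may well have run; only its result was lost.
        return {"ok": False, "reason": "result lost: %s" % exc}
    return {"ok": True, "result": result}


# Usage with pylxd (not run here; it needs a reachable LXD daemon):
#   import pylxd
#   client = pylxd.Client()
#   container = client.containers.get("bootstraptest")
#   report = run_and_report(container, ["/root/bootstrap.sh"],
#                           pylxd.exceptions.NotFound)
```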