
Handoff often fails in production, perhaps due to a race condition #396

Closed
itamarst opened this issue Jul 30, 2014 · 11 comments · Fixed by #410

Comments

@itamarst
Contributor

Much of the time when doing a handoff, the following output is printed:

cannot open 'flocker/ca753bf2-f674-40e7-a0bd-532a3684c75d.mongodb-volume-example15': dataset does not exist
main function encountered error
Traceback (most recent call last):
Failure: twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'flocker.volume.filesystems.interfaces.FilesystemAlreadyExists'>: 
/usr/lib64/python2.7/site-packages/twisted/internet/endpoints.py:362:processEnded
/usr/lib/python2.7/site-packages/flocker/volume/filesystems/zfs.py:60:connectionLost
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:423:errback
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:490:_startRunCallbacks
--- <exception caught here> ---
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:577:_runCallbacks
/usr/lib/python2.7/site-packages/flocker/volume/filesystems/zfs.py:253:rename_failed
]]
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
--- <exception caught here> ---
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 196, in _worker
    result = context.call(ctx, function, *args, **kwargs)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/flocker/common/_ipc.py", line 90, in get_output
    raise IOError("Bad exit", remote_command, e.returncode, e.output)
exceptions.IOError: ('Bad exit', ['flocker-changestate', '"version": 1\n"nodes":\n  "172.16.255.250": []\n  "172.16.255.251": ["mongodb-volume-example15"]\n', '"version": 1\n"applications":\n  "mongodb-volume-example15":\n    "image": "mongodb-tutorial"\n    "ports":\n    - "internal": 27017\n      "external": 27017\n    "volume":\n      # The location within the container where the data volume will be\n      # mounted:\n      "mountpoint": "/data"\n', '172.16.255.250:\n  applications:\n    mongodb-volume-example15:\n      image: unknown\n      ports: []\n      volume: {mountpoint: null}\n  version: 1\n172.16.255.251:\n  applications: {}\n  version: 1\n', '172.16.255.250'], 1, '')

Sometimes this error does not occur, however. My hypothesis is that there's a race condition: the code thinks the push has finished when it hasn't quite, so the acquire fails. (The FilesystemAlreadyExists error is bogus; it is raised no matter why the "zfs rename" failed.)

@itamarst itamarst added this to the Release 0.1 milestone Jul 30, 2014
@itamarst
Contributor Author

First pass for reproducing this might be trying to do a functional test for handoff.

@itamarst
Contributor Author

Adding a 1-second sleep between push and acquire didn't prevent the issue, so my hypothesis may be wrong. In general, more logging would be very helpful at this point...

@itamarst
Contributor Author

Note also that in the transcript above it's not even clear which machine is hitting the error, the origin or the destination (both of them do zfs renames).

@itamarst
Contributor Author

The origin did do the rename, but the mountpoint was not changed, so plausibly the origin is the one that somehow failed during the rename. It's not clear why it would say "dataset does not exist", though, so maybe not. I will debug some more tomorrow unless someone takes this on and reproduces it with a functional test.

@wallrj wallrj self-assigned this Jul 31, 2014
@wallrj wallrj assigned itamarst and unassigned wallrj Jul 31, 2014
@itamarst
Contributor Author

I still have no idea what's going on.

@itamarst
Contributor Author

I found an error from the zfs command-line program in the logs for flocker-volume acquire, essentially saying "can't change the mountpoint because the directory is not empty".

My hypothesis:
In tests, the ZFS pool is set up with a mountpoint different from the one where we actually mount for the purposes of our higher-level API. We have extra logic in change_owner() that sets the mountpoint to the public value after the rename.

In the real code, the pool's mountpoint and the mountpoint used by the higher-level code are the same. That extra logic tries to change the mountpoint to the value it already has; ZFS doesn't treat this as a no-op and complains about the directory having contents (since the filesystem is already mounted there).
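If that hypothesis is right, one fix would be to skip the redundant mountpoint change. This is a minimal sketch, not Flocker's actual code: the function name and the `zfs_get`/`zfs_set` callables are hypothetical stand-ins for running `zfs get -H -o value mountpoint <dataset>` and `zfs set mountpoint=... <dataset>`.

```python
def set_mountpoint_if_different(zfs_get, zfs_set, dataset, desired_mountpoint):
    """Set the dataset's mountpoint only when it actually differs.

    Hypothetical sketch: ``zfs_get(dataset)`` returns the current
    mountpoint, ``zfs_set(dataset, path)`` changes it. Returns True if a
    change was issued, False if it was skipped as redundant.
    """
    current = zfs_get(dataset)
    if current == desired_mountpoint:
        # Already mounted at the desired path; issuing "zfs set
        # mountpoint=..." anyway can fail with "directory is not empty"
        # because the filesystem is mounted there.
        return False
    zfs_set(dataset, desired_mountpoint)
    return True
```

The guard makes the test setup (distinct mountpoints) and production (identical mountpoints) take the same code path without the failing redundant `zfs set`.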

@itamarst
Contributor Author

The actual error:

2014-07-31 11:12:09-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x15e71b8>: cannot mount '/flocker/30a125b5-0208-4112-b332-242f11b5d907.mongodb-volume-example20': directory is not empty
        property may be set but unable to remount filesystem

@itamarst
Contributor Author

My hypothesis above may be wrong; perhaps the issue is a race condition where the container starts up, and the mount change fails for that reason.

@wallrj
Contributor

wallrj commented Aug 1, 2014

I turned on logging in both flocker-reportstate and flocker-deploy and got the same error.

2014-08-01 04:20:17-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x3307f80>: cannot mount '/flocker/b482eb78-9936-4cde-8d94-179371803faf.mongodb-volume-example': directory is not empty

2014-08-01 04:20:17-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x3307f80>: filesystem successfully created, but not mounted

2014-08-01 04:20:17-0700 [-] main function encountered error
    Traceback (most recent call last):
    Failure: twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback (failure with no frames): <class 'flocker.volume.filesystems.zfs.CommandFailed'>: 
    ]]

@itamarst
Contributor Author

itamarst commented Aug 1, 2014

OK, now I'm seeing the failure happening on the sending side, i.e. the one that is relinquishing control.

@itamarst
Contributor Author

itamarst commented Aug 1, 2014

I think I finally got it. I added a sleep of 20 seconds after _deploy.StopApplication.run and that seems to fix it.

It takes mongod about 10 seconds to stop, but this happens asynchronously: our command to gear returns long before that. So we tell it to stop, do a push, and then try to rename the ZFS dataset, but mongod is still running and using the files, so the rename fails. Worse, we've pushed data from a non-quiescent server.

Either GearClient, or more plausibly StopApplication/StartApplication, should poll geard until the desired action has actually finished.
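The proposed polling can be sketched as follows. This is a hypothetical illustration, not the actual Flocker or geard API: `get_unit_state` stands in for whatever status query geard exposes, and the "inactive" state name is assumed.

```python
import time

def wait_for_unit_stopped(get_unit_state, unit_name,
                          timeout=60.0, interval=0.5, sleep=time.sleep):
    """Poll until the unit reports it has really stopped.

    Hypothetical sketch: ``get_unit_state(unit_name)`` returns the
    unit's current state string. Raises RuntimeError if the unit is
    still not "inactive" after ``timeout`` seconds.
    """
    waited = 0.0
    while waited < timeout:
        if get_unit_state(unit_name) == "inactive":
            # The stop has genuinely completed; it is now safe to push
            # and rename the ZFS dataset.
            return
        sleep(interval)
        waited += interval
    raise RuntimeError("unit %r did not stop within %s seconds"
                       % (unit_name, timeout))
```

In a Twisted code base like Flocker's this would naturally be written with deferreds rather than blocking sleeps, but the shape is the same: the stop operation's result must not fire until the unit is actually gone.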

wallrj added a commit that referenced this issue Aug 5, 2014
Gear.remove now fires its deferred after the unit has been removed

Author: @itamarst
Reviewer: @wallrj

Fixes #396