Handoff often fails in production, perhaps due to a race condition #396
Comments
A first pass at reproducing this might be to write a functional test for handoff.
Adding a 1-second sleep between push and acquire didn't seem to prevent the issue, so maybe my hypothesis is wrong. In general, more logging would be very helpful at this point...
Note also that in the transcript above it's not even clear which machine is having the error, the origin or the destination (both of them do zfs renames).
The origin did do the rename, but the mountpoint was not changed, so plausibly the origin is the one that failed somehow during the rename. It's not clear why it would report "dataset does not exist", though, so maybe not. I will debug some more tomorrow unless someone takes this on and reproduces it with a functional test.
I still have no idea what's going on.
I found an error from the zfs command-line program in logs for My hypothesis: In the real code, the mountpoint of the pool and the mountpoint used by higher-level code are the same. That extra logic is trying to change the mountpoint to the same value, but ZFS doesn't notice and complains about the directory having contents (since it's already mounted there).
The actual error:
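If the same-mountpoint hypothesis holds, one possible guard is to skip the change when the dataset already reports the desired mountpoint. This is a minimal sketch, not Flocker's code: it assumes the standard `zfs get -H -o value mountpoint` and `zfs set mountpoint=...` commands, and `set_mountpoint_if_needed` and its `run` parameter are hypothetical names introduced only for illustration.

```python
import os.path
import subprocess


def _run(argv):
    """Run a command and return its stdout as text; raises on non-zero exit."""
    return subprocess.run(
        argv, check=True, capture_output=True, text=True
    ).stdout


def set_mountpoint_if_needed(dataset, desired, run=_run):
    """Change the mountpoint only if it differs from the current one.

    Returns True if a change was issued, False if it was skipped.
    `run` is injectable so the logic can be exercised without ZFS.
    """
    current = run(
        ["zfs", "get", "-H", "-o", "value", "mountpoint", dataset]
    ).strip()
    if os.path.normpath(current) == os.path.normpath(desired):
        # Already mounted where we want it; skip the redundant change.
        return False
    run(["zfs", "set", "mountpoint=" + desired, dataset])
    return True
```

Whether skipping the redundant change would actually avoid the error seen here is untested; it only removes one suspected trigger.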
My hypothesis above may be wrong, and the issue may instead be a race condition where the container starts up and the mount change fails for that reason.
I turned on logging in both flocker-reportstate and flocker-deploy and got the same error.
OK, now I'm seeing the failure happening on the sending side, i.e. the one that is relinquishing control.
I think I finally got it. I added a sleep of 20 seconds after It takes mongod about 10 seconds to stop... but this happens asynchronously, our command to Either |
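Since the stop appears to complete asynchronously, a fixed sleep only works when it happens to be long enough. One alternative is to poll for the condition with a deadline. The sketch below is a generic, blocking illustration; `wait_until` and its parameters are hypothetical names, and the predicate that checks whether the process has really stopped is left to the caller.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns true or `timeout` seconds pass.

    Returns True if the predicate became true, False on timeout.
    `sleep` and `clock` are injectable to keep tests fast.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass a predicate that reports whether the unit has stopped, and only proceed with the handoff once `wait_until` returns True. Code built on Twisted would need a non-blocking equivalent rather than this loop.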
Much of the time when doing a handoff, the following output is printed:
Sometimes this error does not occur, however. My hypothesis is that there's a race condition where it thinks the push has finished but it hasn't quite, so the acquire fails. (The AlreadyExistError is bogus; it is raised no matter why the "zfs rename" failed.)
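Because every rename failure surfaces as the same exception, the real cause is hidden. One way to keep it visible is to carry the command's exit code and stderr in the raised error. This is a hedged sketch rather than Flocker's implementation: `CommandFailed` and `run_checked` are hypothetical names, and only the standard-library `subprocess.run` call is assumed.

```python
import subprocess


class CommandFailed(Exception):
    """Raised when a command exits non-zero; keeps stderr for diagnosis."""

    def __init__(self, argv, returncode, stderr):
        super().__init__(
            "%r exited with %d: %s" % (argv, returncode, stderr.strip())
        )
        self.argv = argv
        self.returncode = returncode
        self.stderr = stderr


def run_checked(argv):
    """Run `argv`, returning stdout; raise CommandFailed with stderr on error."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise CommandFailed(argv, result.returncode, result.stderr)
    return result.stdout
```

A caller wrapping a rename could then log `stderr` or inspect it before deciding which higher-level exception to raise, instead of assuming the target already exists.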