installing a framework after teardown #462
Comments
From @massenz on user@mesos.apache.org:

I'm not clear what you mean by "re-install the same framework" - do you mean just restarting the binary? And yes, the name can stay exactly the same (in fact, you can have several frameworks with the same name - but different IDs - connect to the same Master). Are you using the C++ API or the new HTTP API? If the latter, please have a look at the example here [0] for how to "terminate" a framework and then reconnect it. If the former, see [1] where I set the

There are many (better!) examples of frameworks in the "Examples" folder in the Mesos source code [3]; you may want to take a look there too.

[0] https://github.com/massenz/zk-mesos/blob/develop/notebooks/Demo-API.ipynb
I believe we need to provide ways to achieve the following:
@philwinder @sadovnikov We should check whether there is a callback to the scheduler when the framework is torn down. If so, we can use that to remove the records from ZK.
(@sadovnikov - thanks for attributing the quote.) Guys, if I can be of any help, feel free to ping me directly (m.massenzio (at) gmail com) - GitHub notifications are unfortunately somewhat lacking and I saw this one with a bit of delay. Cheers,
@sadovnikov The only possible place where we may get a callback is in:
Can you check whether you see a log message in the ES scheduler saying "disconnected"?
The scheduler itself can fail and restart in Marathon.
@tymofii-polekhin @sadovnikov is specifically talking about uninstalling, i.e. the complete removal of a framework. Not failover. |
Callbacks are good; however, there may be nothing left to receive them. I think reusing the framework ID from ZooKeeper is a good idea. However, if that does not work, the scheduler, if given a permitting parameter, should reconnect to the Master and register with a new ID.
That's a noble cause, but it is difficult with the API. The Mesos scheduler API works like this: Note the callback. If the master flips out and decides not to register you (because the framework has already been shut down, for example), then you won't receive a callback. The framework will just sit there waiting. So using their APIs, there is no way of knowing whether it worked. The only way to achieve this is to somehow interrogate Mesos to see whether that framework has already terminated, before sending the registration request. I suspect this is possible, but I would have to look into where to get the information from. Also note that this has nothing to do with ZooKeeper; it's all Mesos. ZK is just a datastore.
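One way a scheduler could avoid waiting forever is to pair the registration attempt with a timeout instead of relying solely on the callback. The following is a minimal sketch of that pattern (the class and method names are hypothetical, not part of the Mesos API): the real `registered()` callback would release a latch, and the scheduler gives up if the latch is not released in time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: rather than blocking indefinitely on Mesos'
// registered() callback, wait on a latch with a timeout so the
// scheduler can detect that registration (most likely) failed.
public class RegistrationWatchdog {
    private final CountDownLatch registered = new CountDownLatch(1);

    // Would be invoked from the Scheduler.registered() callback.
    public void onRegistered() {
        registered.countDown();
    }

    // Returns true only if registration was confirmed within the timeout.
    public boolean awaitRegistration(long timeoutMs) throws InterruptedException {
        return registered.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        RegistrationWatchdog watchdog = new RegistrationWatchdog();
        // No callback ever arrives here: the wait times out instead of hanging.
        boolean ok = watchdog.awaitRegistration(100);
        System.out.println(ok ? "registered" : "timed out");
    }
}
```

This does not tell you *why* registration failed, only that it did not happen in time; deciding what to do next (retry, drop the stored ID, exit) is still up to the scheduler.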
Thanks for the clarification!
I'm not sure why you would want to preserve framework IDs across restarts (or, even more so, re-installs), but as @philwinder correctly pointed out, that's at odds with how Mesos uses them. Even assuming you can hack your way around it (and I doubt it), you are likely to encounter weird (and hard-to-debug) failures. I would see the ID as "throwaway" (mostly opaque to your framework) internal ledger-keeping for Mesos to worry about: a bit like the barcode on a ticket - it matters a lot to the organizers and gatekeepers at the game, but you don't much care about it, so long as your seat is reserved (and vacant!) for you. If you can explain why preserving it matters, maybe I can help find a different way to accomplish the same?
OK, the real solution here is to correctly interpret the SIGKILL/SIGTERM signals sent to the scheduler. We need to check, but when you call teardown, I assume it sends a SIGTERM to the scheduler. Also, when a user wants to shut down all the executors, they can send a SIGTERM. The code should interpret this, kill all the executors, remove the state from ZK, and quit. A SIGKILL, however, cannot be caught, so it should be treated as a failure: the executors stay alive and the ZK state is not removed. This way, if you tear down the framework, it will also remove all executors and the ZK state so a new instance can start correctly. @mwl has just implemented something similar in the Logstash framework.
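In a JVM scheduler, the SIGTERM-vs-SIGKILL distinction described above falls out of a shutdown hook: the hook runs on SIGTERM (and on normal exit), but SIGKILL terminates the process immediately without running it, so the state survives. A minimal sketch, with made-up method names standing in for the real cleanup steps:

```java
// Hypothetical sketch of teardown handling via a JVM shutdown hook.
// The hook fires on SIGTERM and normal exit; SIGKILL kills the JVM
// outright, so no cleanup runs and the ZK state is left in place for
// the next scheduler instance to recover.
public class TeardownHook {
    static volatile boolean stateRemoved = false;

    // Stand-in for the real cleanup: kill executors, clear ZK records.
    static void cleanup() {
        System.out.println("killing executors, removing ZK state");
        stateRemoved = true;
    }

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(TeardownHook::cleanup));
        System.out.println("scheduler running");
        // main returns normally here, so the hook runs on exit.
    }
}
```

Note that shutdown hooks run on a best-effort basis and get no information about *which* signal triggered them, so any "failover vs. uninstall" distinction still has to come from elsewhere (e.g. a flag or the teardown callback, if one exists).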
Tasks:
It's a bit different for Logstash. We rely heavily on Spring, so it's just a matter of using a
Honestly, the easiest is to launch
Fixed in #509. From the sh script, starting java with
Using Marathon I removed the ES Scheduler, but this left the Executors running. In order to remove them I tore down the framework with:

curl -X POST -d 'frameworkId=XXXXXXXX-b036-4cb7-af53-4c837dc9521d-0002' http://${MASTER_IP}:5050/master/teardown

This successfully removed all the framework tasks (the executors). However, now the Mesos cluster rejects my attempts to re-install the framework.

Below I copy an answer from user@mesos.apache.org. Is it possible the scheduler gets the old ID from ZooKeeper? What changes should we introduce to enable re-installation of the framework?
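Following the suggestion above of interrogating Mesos before re-registering: the master's state endpoint returns JSON that lists completed (torn-down) frameworks separately from active ones, so a scheduler could check whether a stored framework ID appears there before trying to reuse it. The sketch below is hypothetical: a real implementation would fetch the JSON over HTTP and use a proper JSON parser, while here a hard-coded sample and a plain substring scan keep it self-contained.

```java
// Hypothetical sketch: before reusing a framework ID stored in ZooKeeper,
// check the master's state JSON for that ID among the completed
// (torn-down) frameworks. The sample string mimics the relevant shape of
// the master state response; a real check would parse the JSON properly.
public class FrameworkIdCheck {
    static boolean isTornDown(String stateJson, String frameworkId) {
        int completed = stateJson.indexOf("\"completed_frameworks\"");
        // Only count matches that occur inside the completed section.
        return completed >= 0 && stateJson.indexOf(frameworkId, completed) >= 0;
    }

    public static void main(String[] args) {
        String sample = "{\"frameworks\":[{\"id\":\"AAAA-0001\"}],"
                      + "\"completed_frameworks\":[{\"id\":\"AAAA-0002\"}]}";
        System.out.println(isTornDown(sample, "AAAA-0002")); // torn down: don't reuse
        System.out.println(isTornDown(sample, "AAAA-0001")); // still active
    }
}
```

If the ID shows up as completed, the scheduler knows the old registration is dead and can discard the record from ZK and register fresh, which is exactly the re-install path this issue is asking for.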