Framework id is being written as a full protobuf object, not a string #47
Conversation
Any clue whether using Marathon to destroy the framework results in a SIGTERM?
I think it is a SIGKILL. Also, interestingly, when you restart the framework and it is up and running, it actually kills all the old instances of the framework. Updating original comment.
Another workaround could be to use the shutdown endpoint in Spring Boot: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
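For reference, a minimal sketch of hitting that actuator shutdown endpoint over HTTP. The endpoint is disabled by default and has to be enabled in the application's configuration, and the host, port and path below are assumptions rather than this project's actual values:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public final class GracefulShutdown {
    public static void main(String[] args) throws Exception {
        // Assumed location of the actuator shutdown endpoint (Spring Boot 1.x style path).
        URL url = new URL("http://localhost:8080/shutdown");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST"); // the shutdown endpoint only accepts POST
        System.out.println("Shutdown endpoint responded: " + connection.getResponseCode());
        connection.disconnect();
    }
}
```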
Definitely no SIGTERM.
I worry that in a real failure case, it will remove all the tasks and restart none. Testing.
How are you stopping the application in Marathon? Curl to some endpoint?
Using the GUI. Click the cog icon, hit destroy. It's the usual practice when messing with Marathon.
There's a failover_timeout set to 60 seconds. Any chance you hit that? Note to self: raise it and make it configurable.
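For context, a sketch of where that value would live, assuming the scheduler registers through the Mesos Java bindings; the class and parameter names are illustrative, not the project's actual code:

```java
import org.apache.mesos.Protos;

public final class FrameworkInfoFactory {

    // failoverTimeoutSeconds would come from external configuration instead of a hard-coded 60s.
    public static Protos.FrameworkInfo create(String frameworkName, double failoverTimeoutSeconds) {
        return Protos.FrameworkInfo.newBuilder()
                .setUser("")                                // empty user lets Mesos pick the current user
                .setName(frameworkName)
                .setFailoverTimeout(failoverTimeoutSeconds) // how long Mesos waits for a replacement scheduler
                .build();
    }
}
```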
What does that setting mean? And how would it affect killing and not restarting tasks? Scratch that. It fails. The restarted scheduler kills all the other tasks, then never restarts any.
So I think the bug actually has nothing to do with Marathon. I think it's something to do with tasks being reaped when they shouldn't be.
It kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. ZooKeeper state isn't being flushed, so that could explain the behaviour?
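A sketch of what "taking over" looks like in code, again assuming the Mesos Java bindings: the restarted scheduler re-registers with the framework ID read back from ZooKeeper, so Mesos treats it as a failover of the old framework rather than a brand-new one. The helper and parameter names are hypothetical:

```java
import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;

public final class SchedulerRestart {

    public static MesosSchedulerDriver reconnect(Scheduler scheduler,
                                                 Protos.FrameworkInfo baseInfo,
                                                 String mesosMaster,
                                                 String storedFrameworkId) {
        Protos.FrameworkInfo.Builder info = baseInfo.toBuilder();
        if (storedFrameworkId != null && !storedFrameworkId.isEmpty()) {
            // Re-use the persisted framework ID so Mesos treats this as a failover
            // and keeps the existing tasks instead of killing them after the timeout.
            info.setId(Protos.FrameworkID.newBuilder().setValue(storedFrameworkId).build());
        }
        return new MesosSchedulerDriver(scheduler, info.build(), mesosMaster);
    }
}
```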
Ah right, that explains the killing behaviour. But I definitely restarted within that timeout, and then I can watch the tasks get killed a few tens of seconds later.
The message indicates that a full protobuf instance is being written to ZooKeeper where an ID string is expected.
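An illustrative sketch of the suspected bug, not the project's actual persistence code: `FrameworkID.toString()` yields the protobuf text form (`value: "..."`), whereas only the raw ID string from `getValue()` should be written:

```java
import java.nio.charset.StandardCharsets;
import org.apache.mesos.Protos;

public final class FrameworkIdSerialisation {

    // Buggy: toString() produces something like 'value: "20160101-000000-1-0001"', i.e. the whole protobuf.
    public static byte[] buggy(Protos.FrameworkID id) {
        return id.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Fixed: getValue() produces just the ID string itself.
    public static byte[] fixed(Protos.FrameworkID id) {
        return id.getValue().getBytes(StandardCharsets.UTF_8);
    }
}
```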
@philwinder Could you take a look at this to verify if it solves the issue? |
LGTM. Tested the fix with docker kill... etc. and a second scheduler; confirmed fixed.
…rameworkid Framework id is being written as a full protobuf object, not a string
Sometimes there is state left over in ZooKeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks when in fact there are none.
To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):
The new framework will then kill all previous tasks and not start any new ones.
Workaround: delete the `/${framework_name}/tasks` zNode in ZooKeeper.
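A minimal sketch of that workaround using the plain ZooKeeper client, assuming direct access to the ensemble; the connect string and path are placeholders:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZKUtil;
import org.apache.zookeeper.ZooKeeper;

public final class ClearStaleTaskState {
    public static void main(String[] args) throws Exception {
        String connectString = args.length > 0 ? args[0] : "localhost:2181";
        // i.e. /${framework_name}/tasks -- "elasticsearch" here is only a placeholder name
        String tasksPath = args.length > 1 ? args[1] : "/elasticsearch/tasks";

        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(connectString, 10_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        try {
            connected.await();
            // Recursively remove the stale task records so a fresh scheduler starts clean.
            ZKUtil.deleteRecursive(zk, tasksPath);
        } finally {
            zk.close();
        }
    }
}
```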