
Framework id is being written as a full protobuf object, not a string #47

Merged 1 commit into master on Mar 21, 2016

Conversation

@mwl (Contributor) commented Mar 21, 2016:

Sometimes there is state left over in ZooKeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks when in fact there are none.

To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):

  1. Start the framework with Marathon.
  2. Use Marathon to destroy the framework.
  3. Start the framework with Marathon again.
     The new framework will then kill all previous tasks and not start any new ones.

Workaround:
Delete the /${framework_name}/tasks zNode in ZooKeeper.
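A minimal sketch of that cleanup, assuming Apache Curator is on the classpath (the connection string, framework name, and class name below are illustrative, not from this project):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class StaleTasksCleanup {
    public static void main(String[] args) throws Exception {
        String zkConnect = "localhost:2181";       // illustrative ZK connection string
        String tasksPath = "/myframework/tasks";   // substitute the real framework name

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (client.checkExists().forPath(tasksPath) != null) {
                // Recursively remove the stale tasks node so a freshly started
                // scheduler does not believe old tasks are still running.
                client.delete().deletingChildrenIfNeeded().forPath(tasksPath);
            }
        } finally {
            client.close();
        }
    }
}
```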

@philwinder added the bug label on Mar 21, 2016
@mwl (Contributor) commented Mar 21, 2016:

Any clue whether "Use Marathon to destroy the framework" results in a SIGTERM or a SIGKILL? Sounds like it's a SIGKILL.

@mwl (Contributor) commented Mar 21, 2016:

On SIGTERM you should see the following line at the end of the log:

2016-03-21 13:04:04.731  INFO 27256 --- [       Thread-1] c.c.mesos.scheduler.UniversalScheduler   : Scheduler stopped
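For context, that line only appears on a clean stop, e.g. when a JVM shutdown hook or the Spring context reacts to SIGTERM; a SIGKILL bypasses any such hook. A rough sketch of the usual pattern (not this project's actual code):

```java
import org.apache.mesos.MesosSchedulerDriver;

public final class GracefulStop {
    // Sketch only: SIGTERM lets the JVM run shutdown hooks, so the driver can be
    // stopped cleanly and a "Scheduler stopped" line gets logged. SIGKILL never
    // reaches this code, which is why the log ends abruptly.
    public static void register(final MesosSchedulerDriver driver) {
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                driver.stop(true); // failover=true keeps tasks running for the next scheduler
                System.out.println("Scheduler stopped");
            }
        }));
    }
}
```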

@philwinder (Contributor, Author):

I think it is a SIGKILL. Also, interestingly, when you restart the framework, once it is up and running it actually kills all the old instances of the framework. Updating the original comment.

@mwl (Contributor) commented Mar 21, 2016:

Another workaround could be to use the shutdown endpoint in Spring Boot: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
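A sketch of what that would look like in a Spring Boot 1.x application.properties (property name as documented for that version; whether it fits this project's setup is an assumption):

```properties
# Expose the actuator shutdown endpoint (disabled and sensitive by default, POST-only).
endpoints.shutdown.enabled=true
```

A POST to /shutdown then closes the application context gracefully, giving the scheduler the same clean stop as a SIGTERM would.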

@philwinder (Contributor, Author):

Definitely no SIGTERM:

2016-03-21 13:02:55.018  INFO 1 --- [      Thread-65] c.c.mesos.scheduler.UniversalScheduler   : Finished evaluating 4 offers. Accepted 0 offers and rejected 4
Killing docker task
Shutting down
<EOF>

@philwinder (Contributor, Author):

I worry that in a real failure case, it will remove all the tasks and restart none. Testing.

@mwl (Contributor) commented Mar 21, 2016:

"Killing docker task" says it all.

How are you stopping the application in Marathon? Curl on some endpoint?

@philwinder (Contributor, Author):

Using the GUI. Click the cog icon, hit destroy. It's the usual practice when messing with Marathon.

@mwl (Contributor) commented Mar 21, 2016:

There's a failover_timeout set to 60 seconds. Any chance you hit that?

Note to self: raise it and make it configurable.
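For reference, that value comes from FrameworkInfo when the scheduler registers. A minimal sketch of setting it, with illustrative names and values (the real project may wire this differently):

```java
import org.apache.mesos.Protos;

public class FrameworkInfoExample {
    // Sketch only: failover_timeout (in seconds) is how long Mesos keeps a
    // disconnected framework's tasks alive before reaping them. Raising it well
    // above 60s gives a restarted scheduler time to re-register.
    static Protos.FrameworkInfo build(double failoverTimeoutSeconds) {
        return Protos.FrameworkInfo.newBuilder()
                .setName("my-framework")   // hypothetical framework name
                .setUser("")               // empty user lets Mesos pick the current user
                .setFailoverTimeout(failoverTimeoutSeconds)
                .setCheckpoint(true)
                .build();
    }
}
```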

@philwinder (Contributor, Author):

What does that setting mean? And how would it affect killing and not restarting tasks?

Scratch that. It fails. The restarted scheduler kills all the other tasks and then never restarts any.

@philwinder (Contributor, Author):

So I think the bug actually has nothing to do with Marathon. It's something to do with tasks being reaped when they shouldn't be.

@mwl (Contributor) commented Mar 21, 2016:

Mesos kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. The ZooKeeper state isn't being flushed, so that could explain the behaviour?

@philwinder (Contributor, Author):

Ah right. That explains the killing behaviour. But I definitely restarted within that time, and I could then watch the tasks get killed a few tens of seconds later.

@philwinder (Contributor, Author):

  1. Start framework in docker mode:
2016-03-21 13:23:30.466  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0004
  2. Kill the scheduler container.
  3. Scheduler restarts:
2016-03-21 13:24:29.402  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=���sr7com.google.protobuf.GeneratedMessageLite$SerializedForm��[�asBytest�[BL�messageClassNamet�Ljava/lang/String;xpur�[B������T��xp+
)68728969-b184-41b6-944f-15606e6b14ce-0004t#org.apache.mesos.Protos$FrameworkID

The message indicates that a full protobuf instance is being written to ZooKeeper where an ID string is expected.
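In other words, the stored bytes are a Java-serialized GeneratedMessageLite$SerializedForm wrapping the FrameworkID message rather than the plain ID string. A hedged sketch of the difference (not the project's actual persistence code, which writes to ZooKeeper rather than returning byte arrays):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.mesos.Protos;

public class FrameworkIdStorageSketch {
    // Buggy shape: Java-serializing the protobuf message stores the
    // GeneratedMessageLite$SerializedForm wrapper and class names, which is
    // exactly the garbage visible in the log above.
    static byte[] storeWrong(Protos.FrameworkID id) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(id); // whole protobuf object, not just the ID
        }
        return bytes.toByteArray();
    }

    // Fixed shape: persist only the ID value and rebuild the message on read.
    static byte[] storeRight(Protos.FrameworkID id) {
        return id.getValue().getBytes(StandardCharsets.UTF_8);
    }

    static Protos.FrameworkID read(byte[] stored) {
        return Protos.FrameworkID.newBuilder()
                .setValue(new String(stored, StandardCharsets.UTF_8))
                .build();
    }
}
```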

@philwinder changed the title from "Stale tasks state in zookeeper prevents running tasks" to "Framework id is being written as a full protobuf object, not a string" on Mar 21, 2016
@mwl self-assigned this on Mar 21, 2016
@mwl (Contributor) commented Mar 21, 2016:

@philwinder Could you take a look at this to verify if it solves the issue?

@philwinder (Contributor, Author):

LGTM. Tested and confirmed fixed. First scheduler:

2016-03-21 14:51:05.680  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

docker kill... etc. Second scheduler:

2016-03-21 14:56:27.641  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

[Screenshot: screen shot 2016-03-21 at 14 59 09]

Approved with PullApprove

mwl added a commit that referenced this pull request on Mar 21, 2016: …rameworkid (Framework id is being written as a full protobuf object, not a string)
@mwl merged commit de77a05 into master on Mar 21, 2016
@mwl deleted the bug/47-storing-wrong-frameworkid branch on March 21, 2016 at 15:03
philwinder added a commit to ContainerSolutions/mesosframework that referenced this pull request Mar 21, 2016