
Framework id is being written as a full protobuf object, not a string #47

Merged 1 commit into master on Mar 21, 2016

Conversation

@mwl (Contributor) commented Mar 21, 2016:

Sometimes there is state left over in ZooKeeper when shutting down a framework. When a new framework starts, it thinks there are three running tasks when in fact there are none.

To replicate (confirmed using Kibana on a real-life Mesos cluster on AWS):

  1. Start the framework with Marathon.
  2. Use Marathon to destroy the framework.
  3. Start the framework with Marathon again.
     The new framework will then kill all previous tasks and not start any new ones.

Workaround:
Delete the /${framework_name}/tasks zNode in ZooKeeper.
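A minimal sketch of that cleanup, assuming Apache Curator is on the classpath (the connection string, framework name, and class name below are illustrative, not from this project):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class StaleTasksCleanup {
    public static void main(String[] args) throws Exception {
        String zkConnect = "localhost:2181";       // illustrative ZK connection string
        String tasksPath = "/myframework/tasks";   // substitute the real framework name

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (client.checkExists().forPath(tasksPath) != null) {
                // Recursively remove the stale tasks node so a freshly started
                // scheduler does not believe old tasks are still running.
                client.delete().deletingChildrenIfNeeded().forPath(tasksPath);
            }
        } finally {
            client.close();
        }
    }
}
```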

@philwinder added the bug label on Mar 21, 2016
@mwl (Contributor) commented Mar 21, 2016:

Any clue whether "Use Marathon to destroy the framework" results in a SIGTERM or a SIGKILL? Sounds like it's a SIGKILL.

@mwl (Contributor) commented Mar 21, 2016:

On SIGTERM you should see the following line at the end of the log:

2016-03-21 13:04:04.731  INFO 27256 --- [       Thread-1] c.c.mesos.scheduler.UniversalScheduler   : Scheduler stopped
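For context, that line only appears on a clean stop, e.g. when a JVM shutdown hook or the Spring context reacts to SIGTERM; a SIGKILL bypasses any such hook. A rough sketch of the usual pattern (not this project's actual code):

```java
import org.apache.mesos.MesosSchedulerDriver;

public final class GracefulStop {
    // Sketch only: SIGTERM lets the JVM run shutdown hooks, so the driver can be
    // stopped cleanly and a "Scheduler stopped" line gets logged. SIGKILL never
    // reaches this code, which is why the log ends abruptly.
    public static void register(final MesosSchedulerDriver driver) {
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                driver.stop(true); // failover=true keeps tasks running for the next scheduler
                System.out.println("Scheduler stopped");
            }
        }));
    }
}
```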

@philwinder (Contributor, Author):

I think it is a SIGKILL. Also, interestingly, when you restart the framework, once it is up and running it actually kills all the old instances of the framework. Updating the original comment.

@mwl (Contributor) commented Mar 21, 2016:

Another workaround could be to use the shutdown endpoint in Spring Boot: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
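A sketch of what that would look like in a Spring Boot 1.x application.properties (property name as documented for that version; whether it fits this project's setup is an assumption):

```properties
# Expose the actuator shutdown endpoint (disabled and sensitive by default, POST-only).
endpoints.shutdown.enabled=true
```

A POST to /shutdown then closes the application context gracefully, giving the scheduler the same clean stop as a SIGTERM would.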

@philwinder (Contributor, Author):

Definitely no SIGTERM:

2016-03-21 13:02:55.018  INFO 1 --- [      Thread-65] c.c.mesos.scheduler.UniversalScheduler   : Finished evaluating 4 offers. Accepted 0 offers and rejected 4
Killing docker task
Shutting down
<EOF>

@philwinder (Contributor, Author):

I worry that in a real failure case, it will remove all the tasks and restart none. Testing.

@mwl (Contributor) commented Mar 21, 2016:

"Killing docker task" says it all.

How are you stopping the application in Marathon? Curl on some endpoint?

@philwinder (Contributor, Author):

Using the GUI. Click the cog icon, hit destroy. It's the usual practice when messing with Marathon.

@mwl (Contributor) commented Mar 21, 2016:

There's a failover_timeout set to 60 seconds. Any chance you hit that?

Note to self: raise it and make it configurable.
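For reference, that value comes from FrameworkInfo when the scheduler registers. A minimal sketch of setting it, with illustrative names and values (the real project may wire this differently):

```java
import org.apache.mesos.Protos;

public class FrameworkInfoExample {
    // Sketch only: failover_timeout (in seconds) is how long Mesos keeps a
    // disconnected framework's tasks alive before reaping them. Raising it well
    // above 60s gives a restarted scheduler time to re-register.
    static Protos.FrameworkInfo build(double failoverTimeoutSeconds) {
        return Protos.FrameworkInfo.newBuilder()
                .setName("my-framework")   // hypothetical framework name
                .setUser("")               // empty user lets Mesos pick the current user
                .setFailoverTimeout(failoverTimeoutSeconds)
                .setCheckpoint(true)
                .build();
    }
}
```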

@philwinder (Contributor, Author):

What does that setting mean? And how would it affect killing and not restarting tasks?

Scratch that. It fails. The restarted scheduler kills all the other tasks and then never restarts any.

@philwinder (Contributor, Author):

So I think the bug actually has nothing to do with Marathon. It's something to do with tasks being reaped when they shouldn't be.

@mwl (Contributor) commented Mar 21, 2016:

Mesos kills the tasks associated with the framework ID if a new scheduler doesn't show up and take over before the timeout. The ZooKeeper state isn't being flushed, so that could explain the behaviour?

@philwinder (Contributor, Author):

Ah right. That explains the killing behaviour. But I definitely restarted within that time, and I could then watch the tasks get killed a few tens of seconds later.

@philwinder (Contributor, Author):

  1. Start framework in docker mode:
2016-03-21 13:23:30.466  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0004
  2. Kill the scheduler container.
  3. Scheduler restarts:
2016-03-21 13:24:29.402  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=���sr7com.google.protobuf.GeneratedMessageLite$SerializedForm��[�asBytest�[BL�messageClassNamet�Ljava/lang/String;xpur�[B������T��xp+
)68728969-b184-41b6-944f-15606e6b14ce-0004t#org.apache.mesos.Protos$FrameworkID

The message indicates that a full protobuf instance is being written to ZooKeeper where an ID string is expected.
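In other words, the stored bytes are a Java-serialized GeneratedMessageLite$SerializedForm wrapping the FrameworkID message rather than the plain ID string. A hedged sketch of the difference (not the project's actual persistence code, which writes to ZooKeeper rather than returning byte arrays):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.mesos.Protos;

public class FrameworkIdStorageSketch {
    // Buggy shape: Java-serializing the protobuf message stores the
    // GeneratedMessageLite$SerializedForm wrapper and class names, which is
    // exactly the garbage visible in the log above.
    static byte[] storeWrong(Protos.FrameworkID id) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(id); // whole protobuf object, not just the ID
        }
        return bytes.toByteArray();
    }

    // Fixed shape: persist only the ID value and rebuild the message on read.
    static byte[] storeRight(Protos.FrameworkID id) {
        return id.getValue().getBytes(StandardCharsets.UTF_8);
    }

    static Protos.FrameworkID read(byte[] stored) {
        return Protos.FrameworkID.newBuilder()
                .setValue(new String(stored, StandardCharsets.UTF_8))
                .build();
    }
}
```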

@philwinder changed the title from "Stale tasks state in zookeeper prevents running tasks" to "Framework id is being written as a full protobuf object, not a string" on Mar 21, 2016
@mwl self-assigned this on Mar 21, 2016
@mwl (Contributor) commented Mar 21, 2016:

@philwinder Could you take a look at this to verify if it solves the issue?

@philwinder (Contributor, Author):

LGTM. Tested and confirmed fixed. First scheduler:

2016-03-21 14:51:05.680  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

docker kill... etc. Second scheduler:

2016-03-21 14:56:27.641  INFO 1 --- [       Thread-5] c.c.mesos.scheduler.UniversalScheduler   : Framework registrered with frameworkId=68728969-b184-41b6-944f-15606e6b14ce-0005

[Screenshot: screen shot 2016-03-21 at 14 59 09]

Approved with PullApprove

mwl added a commit that referenced this pull request on Mar 21, 2016: …rameworkid (Framework id is being written as a full protobuf object, not a string)
@mwl merged commit de77a05 into master on Mar 21, 2016
@mwl deleted the bug/47-storing-wrong-frameworkid branch on March 21, 2016 at 15:03
philwinder added a commit to ContainerSolutions/mesosframework that referenced this pull request Mar 21, 2016