Parent issue: Marathon does not re-use reserved resources for which a lost task is associated #4137
Comments
I suspect #4118 may partially or wholly fix this issue, but the disregard for max-over-capacity is still a problem.
@timcharper thanks for reporting and providing the screencast. tl;dr: top priority on our tech-debt list for 1.2. We're aware of this, and the next work item for @unterstein and me is to provide an implementation for specifying and fixing the task-lost behavior, both for tasks using persistent volumes and for normal tasks. #4118 will not fix this issue on its own, but it contains necessary prerequisites to allow for a clean implementation. The underlying problem is that
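To make "specifying the task-lost behavior" slightly more concrete, here is a minimal Scala sketch of what such a setting could look like; the names and structure are assumptions for illustration, not Marathon's actual API or configuration.

```scala
import scala.concurrent.duration.FiniteDuration

// Hypothetical sketch: illustrates a configurable task-lost behavior.
// These names are assumptions, not Marathon's real types.
sealed trait TaskLostBehavior

object TaskLostBehavior {
  // Keep waiting for the task to come back; the natural choice for resident
  // tasks, whose reservations and persistent volumes should be reused.
  case object WaitForever extends TaskLostBehavior

  // Give up after a timeout and relaunch elsewhere; the natural choice for
  // stateless tasks that hold no reservations or volumes.
  final case class RelaunchAfterTimeout(timeout: FiniteDuration) extends TaskLostBehavior
}
```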
Just in case you're wondering: we're currently organizing part of our work in a closed tracker, which is why there hasn't been an issue for this in GH. We'll change that short-term.
@timcharper is this still an issue or can we close?
@jasongilanfarr it's still an issue.
I can reproduce the issue in 1.4.0-rc1 and will post a video documenting it.
From @meichstedt:
So, consider updating reserved as in
Decided that this is not a release blocker, but it should be fixed soon after the blockers.
Can confirm that
I still need to verify whether this is enough, or whether I still need @meichstedt's patch.
Found and proposed a solution for #5142 while working on this.
Another bug found: #5155
Found and fixed this: #5163
Another one: #5165
Cherry-picked and rebased @meichstedt's patch; it still doesn't work. Will look more tomorrow. https://phabricator.mesosphere.com/D488 should be ready to land.
Summary: Require disabled for resident tasks. Fixes #5163. Partially addresses #4137.
Test Plan: Create a resident task. Make it get lost. Ensure that it doesn't go inactive.
Reviewers: aquamatthias, jdef, meichstedt, jenkins
Reviewed By: aquamatthias, jdef, meichstedt, jenkins
Subscribers: jdef, marathon-team
Differential Revision: https://phabricator.mesosphere.com/D488
Found another one: #5207. The solution proposed here will at least give operators a manual way to recover lost tasks.
With the fix for #5207, operators at least have a valid work-around. The primary (only?) cause of this issue will go away with Mesos 1.2.0, slated for release in a few months, which fixes the problem of agents being assigned a new agent ID on host reboot; this lets Mesos officially declare a task as GONE, which we interpret as a terminal state and which therefore prompts a re-launch. Given the decreased severity thanks to the other fixes, a valid work-around to get resident tasks running, and a planned fix in Mesos 1.2.0, I'm inclined to let this ticket simply be fixed by Mesos 1.2.0.
The kill-while-unreachable approach was ultimately too complex. We tried modifying reconciliation to reconcile with the agent ID, and this did not help. We're going to monitor the offer stream, watch for reservations belonging to unreachable tasks, and map them into terminal Mesos updates.
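A minimal sketch of that idea follows; all type and method names are simplified assumptions, not Marathon's actual internals. The point is only the shape of the logic: when an offer carries a reservation for a task we still consider unreachable, synthesize a terminal state for that task so its reservation and volume become reusable.

```scala
// Simplified sketch of the "watch the offer stream" idea.
// All names here are assumptions, not Marathon's real types or APIs.
final case class Reservation(taskId: String)
final case class Offer(agentId: String, reservations: Seq[Reservation])

trait InstanceTracker {
  def isUnreachable(taskId: String): Boolean
  // Record a synthetic terminal state so the scheduler may reuse the reservation.
  def markTerminal(taskId: String, reason: String): Unit
}

object UnreachableReservationWatcher {
  // If an offer carries a reservation for a task we still consider unreachable,
  // the reservation has evidently outlived its task: treat the task as gone.
  def handleOffer(offer: Offer, tracker: InstanceTracker): Unit =
    offer.reservations
      .map(_.taskId)
      .filter(tracker.isUnreachable)
      .foreach { taskId =>
        tracker.markTerminal(taskId, s"reservation re-offered by agent ${offer.agentId}")
      }
}
```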
Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-1713. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.
This is a parent issue to aggregate the handful of sub-issues related to resident tasks.
(A check mark indicates the fix is merged to master. Please see #5206 for the status of the backport to 1.4.)
-- original --
I've recorded a video to show the problem:
http://screencast.com/t/Lkgdi6tIEG6
In effect, Mesos tells Marathon during a reconciliation that a task was lost (this can happen for a variety of reasons; in the occurrence demonstrated here, the mesos-slave ID was forcibly changed and a new ID came up on the same mesos-slave IP address). Marathon responds by reserving a new set of resources and a new persistent volume, and launching a new task.
The expected behavior is that Marathon reuses the reserved resources, which it currently can't because it thinks there is a task running there (status.state == Unknown, judging from the protobuf hexdump in ZooKeeper). And if it can't reuse the reserved resources because it thinks something might still be running, then it should not create additional persistent volumes: when push comes to shove, if it can't satisfy both the 0% over-capacity and 0% under-capacity thresholds, it should heed the 0% over-capacity limit.
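A rough sketch of the decision order being argued for here; all names and the capacity arithmetic are illustrative assumptions, not Marathon's actual scheduler logic.

```scala
// Rough sketch of the expected decision order. All names are assumptions.
final case class AppState(
  targetInstances: Int,
  runningInstances: Int,   // tasks confirmed running
  unknownInstances: Int,   // tasks in state Unknown: might still be running
  idleReservations: Int,   // reservations whose task is known to be terminal
  maxOverCapacity: Double  // e.g. 0.0 for resident tasks
)

sealed trait Action
case object ReuseReservation extends Action       // launch on an existing reservation + volume
case object ReserveAndCreateVolume extends Action // reserve new resources and a new volume
case object DoNothing extends Action

object ResidentTaskScaling {
  def nextAction(app: AppState): Action = {
    // Everything that may still occupy resources counts against the limit.
    val occupied = app.runningInstances + app.unknownInstances
    val limit = math.floor(app.targetInstances * (1 + app.maxOverCapacity)).toInt
    if (app.runningInstances >= app.targetInstances) DoNothing
    else if (app.idleReservations > 0) ReuseReservation // prefer the existing reservation + volume
    else if (occupied < limit) ReserveAndCreateVolume
    else DoNothing // with 0% over capacity, don't stack a new volume next to an Unknown task
  }
}
```

Under this reading, a task whose state is merely Unknown blocks the creation of a new reservation instead of triggering a duplicate volume, and reuse happens only once the old task is confirmed terminal.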