Idle TaskTrackers #33
Conversation
Looks like a good start!
So this is working pretty well now, after various tweaks and fixes. I ran a small job on a small cluster to help force Hadoop into the situations highlighted, and the new behaviour is very apparent. I first launched a map-heavy job, then a reduce-heavy job, together resulting in the entire cluster being allocated to Hadoop with 58 map slots and 2 reduce slots. Once the first job finished, and the map phase of the second job finished, we're at a point where Hadoop has 58 slots worth of compute power that it isn't putting to good use. The "idle tracker" code kicks in quite quickly and after about 30 seconds revokes slots from those task trackers (but keeps the tracker itself alive to serve map output), releasing the resources back to Mesos while keeping just enough behind to run the task tracker. When those resources became available, Mesos offered them back to Hadoop and Hadoop chose to launch reducers with them (since there's now only demand for reduce slots), resulting in a very fluid map and reduce slot allocation.
This is great Tom, looks brilliant!
Thanks @brndnmtthws. When I started rolling this out in our docker environment (therefore now enforcing the resource limits, I wasn't using isolation in the above tests, wups) I ran into an issue that I'd overlooked. Depending on various timing elements, not all of the resources might have been given to the executor by the time it needs them (when the task tracker starts up). By this I really mean memory resources. In some situations, a lot of executors were being OOM killed (which led me to this kernel bug). Spark solves this problem by assigning all memory to the executor and all CPU to the tasks, because even when there are no tasks the executor will maintain its memory limit and not be OOM killed. This also helps solve the problem of not being able to change the JVM memory limits at runtime, as you never need to. Given the ratio of CPU/RAM on our cluster, and the actual memory usage of our jobs, it will still be very beneficial to have this feature. Even if task trackers with zero slots are allocated many GBs of memory they're not using, there is still plenty of free memory on the cluster to launch more task trackers, thus still allowing us to see the behaviour I described above with the map/reduce job. Note: This problem does not really become noticeable when not enforcing limits with cgroups, because the JVM processes will free up memory over time and they'll just share the entire host's memory space. Thoughts?
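For anyone following along, here's a rough sketch of the Spark-style split described above, using the Mesos Java protobufs. The `buildSlotTask` helper, names and resource amounts are purely illustrative, not what this branch actually does:

```java
import org.apache.mesos.Protos.*;

// Illustrative sketch only: pin the JVM memory to the executor so the
// TaskTracker survives slot revocation, and attach CPU to the killable
// slot task. All names and numbers here are placeholders.
class SparkStyleSplitSketch {
  TaskInfo buildSlotTask(Offer offer) {
    Resource mem = Resource.newBuilder()
        .setName("mem")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(4096))  // MB for the TT JVM
        .build();

    Resource cpus = Resource.newBuilder()
        .setName("cpus")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(4))     // CPU for the slots
        .build();

    ExecutorInfo executor = ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue("executor_tt_1"))
        .setCommand(CommandInfo.newBuilder().setValue("./run-tasktracker.sh"))
        .addResources(mem)   // memory lives with the executor, never revoked
        .build();

    return TaskInfo.newBuilder()
        .setName("tt_slots_1")
        .setTaskId(TaskID.newBuilder().setValue("tt_slots_1"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(executor)
        .addResources(cpus)  // CPU rides on the slot task and can be freed later
        .build();
  }
}
```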
Yeah, that could certainly be done. I'd take it a step further and suggest setting the CPU/mem separately for the executor and the tasks. Since the TT treats all the tasks as a pool, you'd have to treat all the tasks as one giant task, with the TT separate.
So that's kind of what's going on, though I think disk also needs to move over to the executor.
It's annoying that there's currently no way of reliably terminating an executor in Mesos. Having this feature would allow us to not use a task for the TaskTracker itself, just an executor. I think it should be adjusted to look like the following...
Hey Tom, how's this stuff going? We're thinking of doing something similar, too. Thoughts? Is there anything I can do to help out?
Hey Brenden! That's great news. Let me just note down the current state of things; my time has been sucked up by some other stuff recently, so I've not had a chance to finish this off.
TLDR; Some testing of what exists here would be really great. I hope to spend some time on it today or tomorrow and implement the above ideas, at least in a basic way.
Do you have some time to chat about it? Maybe tomorrow morning?
Force-pushed from 46b6b9b to 7d4935a
Force-pushed from 20b73d2 to f474c39
int runningJobs = taskTracker.runningJobs.size();

// Check to see if the number of running jobs on the task tracker is zero
if (runningJobs == 0) {
This check, based entirely on TaskTracker.runningJobs, is not sufficient. It appears this map is not kept 100% up to date, especially where failed jobs/tasks are concerned. Perhaps we should look into TaskTracker.runningTasks instead.
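For illustration, a minimal sketch of what the suggested change might look like, assuming the tracker object exposes a runningTasks collection as described above (field names are assumptions, not confirmed against the branch):

```java
// Sketch of the suggested check (field names assumed): count running tasks
// rather than jobs, since the runningJobs map can go stale when jobs or
// tasks fail.
int runningTasks = taskTracker.runningTasks.size();

// Only treat the tracker as idle when no tasks at all are running on it
if (runningTasks == 0) {
  // ... proceed with the idle / revocation logic ...
}
```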
Check for runningTasks instead of runningJobs
Force-pushed from f7f5cf6 to d141803
@brndnmtthws Hey. I'm getting ready to merge this branch and push out a new release now. We've been running the tip of this branch for a month and have seen zero issues, and it's also pretty memory efficient now. I need to make sure the version numbers and docs are all updated first, but any objections?
Conflicts: src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java
We need to free up the resources (CPU) assigned to the task, so let's do that now.
In some cases when jobs fail, the runningJobs map is not updated correctly.
This reverts commit b4f9556.
This now means that when slots are freed by the framework, not only will the CPU become available but so will some of the memory.
Previously we would check the number of running jobs, however that sometimes returned incorrect values, especially when dealing with failed jobs on the cluster. The result being that some TaskTrackers never commit suicide.
Force-pushed from d141803 to 8028724
No objections from me. 🚢 🇮🇹
This pull request introduces the ability to revoke slots from a running TaskTracker once it becomes idle. It contributes to solving #32 (map/reduce slot deadlock), as the cluster is able to remove slots that are idle and launch more when needed, avoiding a deadlock situation (when resources are available).
TLDR; The JobTracker / Mesos Framework is able to launch and kill map and reduce slots in the cluster as they become idle, to make better use of those resources.
Essentially, what we've done here is separate the TaskTracker process from the slots. This means launching two mesos tasks attached to the same executor, one for the TaskTracker (as a task with potentially no resources) and another task which can be killed to free up mesos resources while keeping the TaskTracker itself alive. We attach the resources for "revokable" slots to the second task, reserving the ability to free up resources later on.
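To make the two-task layout concrete, here's a hedged sketch of the launch and later revocation using the Mesos Java scheduler driver. The task names, IDs and resource amounts are illustrative rather than what the branch actually uses:

```java
import java.util.Arrays;
import org.apache.mesos.SchedulerDriver;
import org.apache.mesos.Protos.*;

// Illustrative only: one executor hosts the TaskTracker, a second task on the
// same executor carries the revokable slot resources. Killing just the slot
// task frees its resources while the TaskTracker keeps serving map output.
class TwoTaskLayoutSketch {
  void launch(SchedulerDriver driver, Offer offer, ExecutorInfo ttExecutor) {
    TaskInfo trackerTask = TaskInfo.newBuilder()
        .setName("tasktracker")
        .setTaskId(TaskID.newBuilder().setValue("tt_host1"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(ttExecutor)       // the TaskTracker itself, minimal resources
        .build();

    TaskInfo slotTask = TaskInfo.newBuilder()
        .setName("tasktracker_slots")
        .setTaskId(TaskID.newBuilder().setValue("tt_host1_slots"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(ttExecutor)       // same executor as the tracker task
        .addResources(Resource.newBuilder()
            .setName("cpus")
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(8)))  // slot CPU
        .build();

    driver.launchTasks(Arrays.asList(offer.getId()),
                       Arrays.asList(trackerTask, slotTask));
  }

  // Later, when the tracker goes idle, kill only the slot task: the executor
  // (and the TaskTracker inside it) stays alive to serve map output.
  void revokeSlots(SchedulerDriver driver) {
    driver.killTask(TaskID.newBuilder().setValue("tt_host1_slots").build());
  }
}
```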
Note: This is kind of hacky, and is nowhere near ready for testing in production yet. Work in progress!
How does it work?
Given the use case, I am only dealing with the situation where a running TaskTracker is completely idle. For example, if we launch 10 task trackers with only map slots and 5 with only reduce slots, then while the reduce phase is running the map slots (and the resources associated with them) can become completely idle. These slots can be killed, as long as the TaskTracker stays alive to serve map output to the reducers. It seems Hadoop copes perfectly fine with TaskTrackers that have zero slots, too.
If we kill all map slots, we introduce potential failure cases where a node serving map data fails and there are no map slots left to re-compute the data. This is skirted around by only revoking a percentage of map slots from each TaskTracker (remaining = max(slots - (slots * 0.9), 1) by default).
Once a TaskTracker becomes alive, we check the "idleness" of the slots every 5 seconds, and if the whole TaskTracker has had no occupied slots for 5 checks, the next time round we'll revoke its slots. Currently the whole task tracker has to be idle for 30 seconds for slots to be revoked. A rough sketch of this policy follows below.
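As a sketch of the policy above (the 5-second check, the idle counter, and the retained-slot formula), with the constants and the revokeSlots helper as placeholders rather than the exact code in this branch:

```java
// Hypothetical sketch of the idle check described above.
class IdlePolicySketch {
  static final int IDLE_CHECKS_BEFORE_REVOKE = 5;  // checks run every 5 seconds
  static final double REVOKE_FRACTION = 0.9;       // revoke up to 90% of the slots

  private int idleChecks = 0;

  // Called on each 5-second tick with the tracker's current slot usage.
  void onIdleCheck(int occupiedSlots, int totalSlots) {
    if (occupiedSlots > 0) {
      idleChecks = 0;                              // any activity resets the counter
      return;
    }
    idleChecks++;
    if (idleChecks > IDLE_CHECKS_BEFORE_REVOKE) {  // roughly 30 seconds of idleness
      // Keep a few slots behind so map output can be re-computed if a node fails:
      // remaining = max(slots - (slots * 0.9), 1)
      int remaining = Math.max(totalSlots - (int) (totalSlots * REVOKE_FRACTION), 1);
      revokeSlots(totalSlots - remaining);
    }
  }

  void revokeSlots(int count) {
    // placeholder: kill the slot task / hand the resources back to Mesos
  }
}
```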
I'd be very interested to hear what the community thinks of this solution. There's no doubt something obvious I've missed, but the idea is worth discussing.