Idle TaskTrackers #33
Conversation
Looks like a good start!
So this is working pretty well now, after various tweaks and fixes. I ran a small job on a small cluster to help force Hadoop into the situations highlighted, and the new behaviour is very apparent. I first launched a map-heavy job, then a reduce-heavy job, together resulting in the entire cluster being allocated to Hadoop with 58 map slots and 2 reduce slots. Once the first job finished, and the map phase of the second job finished, we're at a point where Hadoop has 58 slots worth of compute power that it isn't putting to good use. The "idle tracker" code kicks in quite quickly and after about 30 seconds revokes slots from those task trackers (but keeps the tracker itself alive to serve map output), releasing the resources back to Mesos while keeping just enough behind to run the task tracker. When those resources became available, Mesos offered them back to Hadoop and Hadoop chose to launch reducers with them (since there's now only demand for reduce slots), resulting in a very fluid map and reduce slot allocation.
This is great Tom, looks brilliant!
Thanks @brndnmtthws. When I started rolling this out in our docker environment (therefore now enforcing the resource limits, I wasn't using isolation in the above tests, wups) I ran into an issue that I'd overlooked. Depending on various timing elements, not all of the resources might have been given to the executor by the time it needs them (when the task tracker starts up). By this I really mean memory resources. In some situations, a lot of executors were being OOM killed (which led me to this kernel bug). Spark solves this problem by assigning all memory to the executor and all CPU to the tasks, because even when there are no tasks the executor will maintain its memory limit and not be OOM killed. This also helps solve the problem of not being able to change the JVM memory limits at runtime, as you never need to. Given the ratio of CPU/RAM on our cluster, and the actual memory usage of our jobs, it will still be very beneficial to have this feature. Even if task trackers with zero slots are allocated many GBs of memory they're not using, there is still plenty of free memory on the cluster to launch more task trackers, thus still allowing us to see the behaviour I described above with the map/reduce job. Note: This problem does not really become noticeable when not enforcing limits with cgroups, because the JVM processes will free up memory over time and they'll just share the entire host's memory space. Thoughts?
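For anyone following along, here's a rough sketch of the Spark-style split described above, using the Mesos Java protobufs. The `buildSlotTask` helper, names and resource amounts are purely illustrative, not what this branch actually does:

```java
import org.apache.mesos.Protos.*;

// Illustrative sketch only: pin the JVM memory to the executor so the
// TaskTracker survives slot revocation, and attach CPU to the killable
// slot task. All names and numbers here are placeholders.
class SparkStyleSplitSketch {
  TaskInfo buildSlotTask(Offer offer) {
    Resource mem = Resource.newBuilder()
        .setName("mem")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(4096))  // MB for the TT JVM
        .build();

    Resource cpus = Resource.newBuilder()
        .setName("cpus")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(4))     // CPU for the slots
        .build();

    ExecutorInfo executor = ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue("executor_tt_1"))
        .setCommand(CommandInfo.newBuilder().setValue("./run-tasktracker.sh"))
        .addResources(mem)   // memory lives with the executor, never revoked
        .build();

    return TaskInfo.newBuilder()
        .setName("tt_slots_1")
        .setTaskId(TaskID.newBuilder().setValue("tt_slots_1"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(executor)
        .addResources(cpus)  // CPU rides on the slot task and can be freed later
        .build();
  }
}
```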
Yeah, that could certainly be done. I'd take it a step further and suggest setting the CPU/mem separately for the executor and the tasks. Since the TT treats all the tasks as a pool, you'd have to treat all the tasks as one giant task, with the TT separate.
So that's kind of what's going on, though I think disk also needs to move over to the executor.
It's annoying that there's currently no way of reliably terminating an executor in Mesos. Having this feature would allow us to not use a task for the TaskTracker itself, just an executor. I think it should be adjusted to look like the following...
Hey Tom, how's this stuff going? We're thinking of doing something similar, too. Thoughts? Is there anything I can do to help out?
Hey Brenden! That's great news. Let me just note down the current state of things; my time has been sucked up by some other stuff recently, so I've not had a chance to finish this off.
TLDR; Some testing of what exists here would be really great. I hope to spend some time on it today or tomorrow and implement the above ideas, at least in a basic way.
Do you have some time to chat about it? Maybe tomorrow morning?
Force-pushed from 46b6b9b to 7d4935a
Force-pushed from 20b73d2 to f474c39
int runningJobs = taskTracker.runningJobs.size();

// Check to see if the number of running jobs on the task tracker is zero
if (runningJobs == 0) {
This check, based entirely on TaskTracker.runningJobs, is not sufficient. It appears this map is not kept 100% up to date, especially where failed jobs/tasks are concerned. Perhaps we should look into TaskTracker.runningTasks instead.
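For illustration, a minimal sketch of what the suggested change might look like, assuming the tracker object exposes a runningTasks collection as described above (field names are assumptions, not confirmed against the branch):

```java
// Sketch of the suggested check (field names assumed): count running tasks
// rather than jobs, since the runningJobs map can go stale when jobs or
// tasks fail.
int runningTasks = taskTracker.runningTasks.size();

// Only treat the tracker as idle when no tasks at all are running on it
if (runningTasks == 0) {
  // ... proceed with the idle / revocation logic ...
}
```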
Check for runningTasks instead of runningJobs
Force-pushed from f7f5cf6 to d141803
@brndnmtthws Hey. I'm getting ready to merge this branch and push out a new release now. We've been running the tip of this branch for a month and have seen zero issues, and it's also pretty memory efficient now. I need to make sure the version numbers and docs are all updated first, but any objections?
Conflicts: src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java
We need to free up the resources (CPU) assigned to the task, so let's do that now.
In some cases when jobs fail, the runningJobs map is not updated correctly.
This reverts commit b4f9556.
This now means that when slots are freed by the framework, not only will the CPU become available but so will some of the memory.
Previously we would check the number of running jobs, however that sometimes returned incorrect values, especially when dealing with failed jobs on the cluster. The result being that some TaskTrackers never commit suicide.
Force-pushed from d141803 to 8028724
No objections from me. 🚢 🇮🇹
This pull request introduces the ability to revoke slots from a running TaskTracker once it becomes idle. It contributes to solving #32 (map/reduce slot deadlock), as the cluster is able to remove slots that are idle and launch more when needed, avoiding a deadlock situation (when resources are available).
TLDR; The JobTracker / Mesos Framework is able to launch and kill map and reduce slots in the cluster as they become idle, to make better use of those resources.
Essentially, what we've done here is separate the TaskTracker process from the slots. This means launching two mesos tasks attached to the same executor, one for the TaskTracker (as a task with potentially no resources) and another task which can be killed to free up mesos resources while keeping the TaskTracker itself alive. We attach the resources for "revokable" slots to the second task, reserving the ability to free up resources later on.
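To make the two-task layout concrete, here's a hedged sketch of the launch and later revocation using the Mesos Java scheduler driver. The task names, IDs and resource amounts are illustrative rather than what the branch actually uses:

```java
import java.util.Arrays;
import org.apache.mesos.SchedulerDriver;
import org.apache.mesos.Protos.*;

// Illustrative only: one executor hosts the TaskTracker, a second task on the
// same executor carries the revokable slot resources. Killing just the slot
// task frees its resources while the TaskTracker keeps serving map output.
class TwoTaskLayoutSketch {
  void launch(SchedulerDriver driver, Offer offer, ExecutorInfo ttExecutor) {
    TaskInfo trackerTask = TaskInfo.newBuilder()
        .setName("tasktracker")
        .setTaskId(TaskID.newBuilder().setValue("tt_host1"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(ttExecutor)       // the TaskTracker itself, minimal resources
        .build();

    TaskInfo slotTask = TaskInfo.newBuilder()
        .setName("tasktracker_slots")
        .setTaskId(TaskID.newBuilder().setValue("tt_host1_slots"))
        .setSlaveId(offer.getSlaveId())
        .setExecutor(ttExecutor)       // same executor as the tracker task
        .addResources(Resource.newBuilder()
            .setName("cpus")
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(8)))  // slot CPU
        .build();

    driver.launchTasks(Arrays.asList(offer.getId()),
                       Arrays.asList(trackerTask, slotTask));
  }

  // Later, when the tracker goes idle, kill only the slot task: the executor
  // (and the TaskTracker inside it) stays alive to serve map output.
  void revokeSlots(SchedulerDriver driver) {
    driver.killTask(TaskID.newBuilder().setValue("tt_host1_slots").build());
  }
}
```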
Note: This is kind of hacky, and is nowhere near ready for testing in production yet. Work in progress!
How does it work?
Given the use case, I am only dealing with the situation where a running TaskTracker is completely idle. For example, if we launch 10 task trackers with only map slots and 5 with only reduce slots, then while the reduce phase is running the map slots (and the resources associated with them) can become completely idle. These slots can be killed, as long as the TaskTracker stays alive to serve map output to the reducers. It seems Hadoop copes perfectly fine with TaskTrackers that have zero slots, too.
If we kill all map slots, we introduce potential failure cases where a node serving map data fails and there are no map slots left to re-compute the data. This is skirted around by only revoking a percentage of map slots from each TaskTracker (remaining = max(slots - (slots * 0.9), 1) by default).
Once a TaskTracker becomes alive, we check the "idleness" of the slots every 5 seconds, and if the whole TaskTracker has had no occupied slots for 5 checks, the next time round we'll revoke its slots. Currently the whole task tracker has to be idle for 30 seconds for slots to be revoked. A rough sketch of this policy follows below.
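As a sketch of the policy above (the 5-second check, the idle counter, and the retained-slot formula), with the constants and the revokeSlots helper as placeholders rather than the exact code in this branch:

```java
// Hypothetical sketch of the idle check described above.
class IdlePolicySketch {
  static final int IDLE_CHECKS_BEFORE_REVOKE = 5;  // checks run every 5 seconds
  static final double REVOKE_FRACTION = 0.9;       // revoke up to 90% of the slots

  private int idleChecks = 0;

  // Called on each 5-second tick with the tracker's current slot usage.
  void onIdleCheck(int occupiedSlots, int totalSlots) {
    if (occupiedSlots > 0) {
      idleChecks = 0;                              // any activity resets the counter
      return;
    }
    idleChecks++;
    if (idleChecks > IDLE_CHECKS_BEFORE_REVOKE) {  // roughly 30 seconds of idleness
      // Keep a few slots behind so map output can be re-computed if a node fails:
      // remaining = max(slots - (slots * 0.9), 1)
      int remaining = Math.max(totalSlots - (int) (totalSlots * REVOKE_FRACTION), 1);
      revokeSlots(totalSlots - remaining);
    }
  }

  void revokeSlots(int count) {
    // placeholder: kill the slot task / hand the resources back to Mesos
  }
}
```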
I'd be very interested to hear what the community thinks of this solution. There's no doubt something obvious I've missed, but the idea is worth discussing.