Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW] Schedule workers on consolidated offers #154

Merged

Conversation

JessicaLHartog
Copy link
Collaborator

@JessicaLHartog JessicaLHartog commented Jun 23, 2016

Since @dsKarthick is unavailable and the master branch is fundamentally broken, I am taking over #146 in his absence.

At the moment, this PR includes:

@@ -64,6 +64,8 @@ public String getHostName() {
}

public void add(Protos.Offer offer) {
// We are unable to aggregate offers if they are from different workers
Copy link
Collaborator

@dsKarthick dsKarthick Jul 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

different workers slaves. Note that one can run multiple slaves on the same host. Not sure how we should handle the scenario when see offers coming in from different slaves running on the same worker - abort? or emit an error log and move on?

Karthick Duraisamy Soundararaj and others added 20 commits July 28, 2016 14:50
… and use it to schedule and assign workers with

consolidated offers per node. Adresses the following problems.

Problem 1:
    Types of OfferResources was limited - CPU, MEM and PORTS. Each of them where either RESERVED or UNRESERVED.
    We treated RESERVED or UNRESERVED resources the same way. So the logic in OfferResources was simple. But
    we now have three different types of RESERVED, STATICALLY_RESERVED and DYNAMICALLY_RESERVED and we are starting
    to have special handling on each of the types of resources. One example of such code is
     https://github.com/mesos/storm/pull/111/files#diff-9e83ab25db3f6d262627383d8aa8f815.

Problem 2:
    Currently, offers related logic is spread across OfferResources and RotatingMap.
    What we ultimately need is an interface which could be used across various parts of the framework.
    This commit introduces a 'storm.mesos.resources' package which is the first step in achieving the
    aforementioned goal. This package enables us to define an interface like 'storm.mesos.resources.Aggregator'
    and have various aggregators implement 'storm.mesos.resources.Aggregator' thereby abstracting the aggegation
    logic(including locking) from the rest of the packages.
following error while building java7 binary.

```
   testGetTasksToLaunchForOneTopologyWithOneOffer(storm.mesos.MesosNimbusTest)  Time elapsed: 0.158 sec  <<< ERROR!
   java.lang.NoSuchMethodError: java.lang.String.join(Ljava/lang/CharSequence;Ljava/lang/Iterable;)Ljava/lang/String;
```

Fixing ^^ by using org.apache.commons.lang3.StringUtils.join instead of java.lang.String.join
… Jessica Hartog.

Still TODO:
  1. Refactor MesosNimbus.getTasksToLaunch as per Erik's comment
  2. Refactor SchedulerUtils.getPorts as per Jessica's comment
      1. Refactor MesosNimbus.getTasksToLaunch as per Erik's comment
      2. Refactor SchedulerUtils.getPorts as per Jessica's comment
Also improve readabiltiy to removing unnecessary verbosity of ReservationType enum identifiers,
and then defining their `toString()` method to use the lowercase version of their id.

These new logs retain the same information, but in half the characters.
We can do even better by stripping out unnecessary trailing 0s from the resources.

Example of original log:
```
2016-06-19T07:55:05.924+0000 s.m.u.MesosCommon [INFO] Available resources at workerhostname: cpu : Resource cpus - Total available : 19.500000 Total available by reservation type : [ , [DYNAMICALLY_RESERVED : 0.0, STATICALLY_RESERVED : 0.0, UNRESERVED : 19.5] ], memory: Resource mem - Total available : 88561.000000 Total available by reservation type : [ , [DYNAMICALLY_RESERVED : 0.0, STATICALLY_RESERVED : 0.0, UNRESERVED : 88561.0] ], ports : [31003-31003,31005-32000]
```

Example of new log:
```
2016-06-19T09:21:43.075+0000 s.m.u.MesosCommon [INFO] Available resources at workerhostname: cpus: 19.500000 (dynamic: 0.0, static: 0.0, unreserved: 19.5), mem: 72402.000000 (dynamic: 0.0, static: 0.0, unreserved: 72402.0), ports: [31000-31001,31006-32000]
```
* avoid rescheduling executors when all workers are already running
* throw RuntimeException for nonsensical case where all workers are running but we have unassigned executors
* improve variable naming to be more concise without losing fidelity, and thus improve readability
* removing a call from schedule, since we're pretty sure we don't need it
* fixing the number of slots available for each topology
* spread executors by fixing an unnecessary assignment to executors
* spread executors by not scaling workers down when there are less slots available than assigned slots
…ns, delete few unnecessary/irrelevant methods
…e being

1. Our floating point tabulations in AggregatedOffers can lead to confusing results
where a 1.0 becomes 0.999999999996 for example.  We add a fudge factor by adding 0.01
more CPU resources in one test to work around this until the following issue is fixed:
  * mesos#161
2. Since we are disabling the ability of allSlotsAvailableForScheduling to account
for supervisors already existing when calculating resource needs, we disable the test
that is validating that behavior.  That will be reenabled once we fix this issue:
* mesos#160

Also a few cosmetic-ish changes:
1. Use more standard hostnames with a domain of "example.org", a standard domain
for documentation.  See http://example.org
2. Rearrange/renumber some of the offers to prevent confusion.
@JessicaLHartog JessicaLHartog force-pushed the schedule_workers_on_consolidated_offers branch from 0952c4b to 1032e58 Compare July 29, 2016 19:19
@erikdw
Copy link
Collaborator

erikdw commented Jul 30, 2016

🚢 🇮🇹

🏁 😓

@erikdw erikdw merged commit 575de04 into mesos:master Jul 30, 2016
@salimane
Copy link
Contributor

@erikdw does these changes applies to storm 0.9.6 shim ?

@erikdw
Copy link
Collaborator

erikdw commented Aug 28, 2016

@salimane : yup, this is a fundamental change to the core scheduling logic which applies to both 0.9.6 & 0.10.x. The shims are just slim "adapters" that allow the same storm-mesos codebase to work with different versions of storm's LocalState class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants