-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Scheduling Topology Hints #180
Conversation
29f4d4e
to
4b0c03c
Compare
4b0c03c
to
e3bc10c
Compare
faabric::util::SchedulingDecision publicMakeSchedulingDecision( | ||
std::shared_ptr<faabric::BatchExecuteRequest> req, | ||
bool forceLocal, | ||
faabric::util::SchedulingTopologyHint topologyHint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for this function isn't immediately obvious and it feels like a bit of a hack. In general if an API needs to be changed to support a test you have one of two things going on: (i) your test is too invasive and checking too much internal logic; (ii) the API doesn't expose enough information.
In this case I think it's (i). I think it can be changed relatively easily as this and callFunctions
have the same signature. It might be possible to change the tests to call callFunctions
instead (as they have mock mode turned on). You can then add a check to make sure the underlying function calls have been disaptched to the expected hosts.
If this isn't possible, then we need to work out what's happening in callFunctions
that doesn't work properly in mock mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, switching to callFunctions
is pretty easy with the only caveat that we need to use the TestExecutor
and TestExecutorFactory
classes that currently lived in ./tests/tests/scheduler/test_executor.cpp
.
I have moved the declaration of these classes to fixtures.h
and kept the definition where it is.
I also add a check for the recorded messages in the function call client.
tests/test/util/test_scheduling.cpp
Outdated
#include <faabric/util/func.h> | ||
#include <faabric/util/scheduling.h> | ||
|
||
using namespace faabric::util; | ||
|
||
namespace tests { | ||
|
||
TEST_CASE("Test building scheduling decisions", "[util]") | ||
TEST_CASE("Test building scheduling decisions", "[util][scheduling-decisions]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tags are getting a bit too granular. Admittedly the util
tag is bloated, but as mentioned above, these tests are really testing the scheduler, so once they've been moved, they can be added to the [scheduler]
tag. The usefulness of the tags is to allow someone to quickly run the tests related to the thing that they've changed, however, when they're this granular, I'm not sure it's easy to work out which ones to run.
Catch2 provides various selectors that should make it easy to run subsets of tests in your dev workflow, if you're finding there are too many tests to run in a loop e.g. by file: https://catch2.docsforge.com/v2.13.2/running/command-line/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree thanks. For future reference, you can run all tests in tests/test/scheduler/test_scheduling_decisions.cpp
by running:
faabric_tests -# [#test_scheduling_decisions]
src/scheduler/Scheduler.cpp
Outdated
@@ -296,8 +311,20 @@ faabric::util::SchedulingDecision Scheduler::makeSchedulingDecision( | |||
int available = r.slots() - r.usedslots(); | |||
int nOnThisHost = std::min(available, remainder); | |||
|
|||
for (int i = 0; i < nOnThisHost; i++) { | |||
hosts.push_back(h); | |||
// Under the pairs topology hint, we never allocate a single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut here would instead be:
if(topologyHint == faabric::util::SchedulingTopologyHint::PAIRS &&
nOnThisHost < 2) {
// Move on if we can't colocate function with at least one other
continue;
}
I think this fixes the issue you mention in the PR description too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this would behave as expected. For example:
- We have 4 hosts with 4 slots each.
- We want to schedule 9 requests (with the
NEVER_ALONE
hint). - We expect 4 requests scheduled to the first host, and 5 scheduled to the second.
However, I think your solution would schedule 5 requests on the first host and 4 on the second, as it would exhaust all possible hosts (nOnThisHost == 1
for all of them), and resort to overload the master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After an offline discussion, I use this change together with a change in the overloading logic that makes the issue mentioned in the description disappear.
@@ -29,238 +29,224 @@ namespace tests { | |||
std::atomic<int> restoreCount = 0; | |||
std::atomic<int> resetCount = 0; | |||
|
|||
class TestExecutor final : public Executor | |||
TestExecutor::TestExecutor(faabric::Message& msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes here involve moving from a class declaration + definition to just definition and some forceLocal
refactor.
}; | ||
} | ||
|
||
SECTION("Decreasing to one and increasing slot distribution") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add this test that before would have failed with NEVER_ALONE
.
@csegarragonz, I forgot to add a comment to that approval. Couple of nitpicky comments but other than that LGTM |
In this PR I add support to pass scheduling topology hints as a parameter to
callFunctions
. Scheduling hints affect the way batches are scheduled to the available resources. The currently supported hints are:NORMAL
: will bin-pack requests to hosts until it runs out of slots, and then will overload the master (default).NEVER_ALONE
: will never schedule a request to a host without other requests of the batch. There is one exception: master requests (first in the batch) will be allocated to a new host (even on their own) if the master is out of resources.These hints only affect the
SchedulingDecision
so the separation recently made between making the scheduling decision and actually calling the functions has proven extremely useful for the implementation and testing. I add tooling for testing these and future hints we may add easily, together with a number of tests.