-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
qmanager: send partial-ok with sched.hello #1321
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may have already done this, but I modified Fluxion's parse_R
to print the json_unpack
error and got:
Dec 17 16:40:58.375707 UTC sched-fluxion-resource.err[0]: parse_R: json_unpack: Expected integer, got real
qmanager/modules/qmanager.cpp
Outdated
@@ -560,7 +560,7 @@ static std::shared_ptr<qmanager_ctx_t> qmanager_new (flux_t *h) | |||
} | |||
if (!(ctx->schedutil = | |||
schedutil_create (ctx->h, | |||
SCHEDUTIL_FREE_NOLOOKUP, | |||
SCHEDUTIL_HELLO_PARTIAL_OK, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: if this code is modified to use the SCHEDUTIL_HELLO_PARTIAL_OK
macro only if it is defined, then flux-sched won't need to have a new dependency on an as-yet unreleased version of flux-core.
RFC 20 allows optional microsecond precision in R (FYI - I did make that simple change and your new test passes) |
Changes for reference (probably needs a clang-format done with the flux-sched project style) diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index f8554555..42d3fcae 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1528,8 +1528,8 @@ static int parse_R (std::shared_ptr<resource_ctx_t> &ctx,
int rc = 0;
int version = 0;
int saved_errno;
- uint64_t st = 0;
- uint64_t et = 0;
+ double tstart = 0.;
+ double expiration = 0.;
json_t *o = NULL;
json_t *graph = NULL;
json_error_t error;
@@ -1541,23 +1541,24 @@ static int parse_R (std::shared_ptr<resource_ctx_t> &ctx,
errno = EINVAL;
goto out;
}
- if ((rc = json_unpack (o,
- "{s:i s:{s:I s:I} s?:o}",
- "version",
- &version,
- "execution",
- "starttime",
- &st,
- "expiration",
- &et,
- "scheduling",
- &graph))
- < 0) {
+ if ((rc = json_unpack_ex (o,
+ &error,
+ 0,
+ "{s:i s:{s:F s:F} s?:o}",
+ "version", &version,
+ "execution",
+ "starttime", &tstart,
+ "expiration", &expiration,
+ "scheduling", &graph)) < 0) {
errno = EINVAL;
- flux_log (ctx->h, LOG_ERR, "%s: json_unpack", __FUNCTION__);
+ flux_log (ctx->h,
+ LOG_ERR,
+ "%s: json_unpack: %s",
+ __FUNCTION__,
+ error.text);
goto freemem_out;
}
- if (version != 1 || st < 0 || et < st) {
+ if (version != 1 || tstart < 0 || expiration < tstart) {
rc = -1;
errno = EPROTO;
flux_log (ctx->h,
@@ -1565,8 +1566,8 @@ static int parse_R (std::shared_ptr<resource_ctx_t> &ctx,
"%s: version=%d, starttime=%jd, expiration=%jd",
__FUNCTION__,
version,
- static_cast<intmax_t> (st),
- static_cast<intmax_t> (et));
+ static_cast<intmax_t> (tstart),
+ static_cast<intmax_t> (expiration));
goto freemem_out;
}
if (graph != NULL) {
@@ -1585,8 +1586,8 @@ static int parse_R (std::shared_ptr<resource_ctx_t> &ctx,
format = "rv1exec";
}
- starttime = static_cast<int64_t> (st);
- duration = et - st;
+ starttime = static_cast<int64_t> (tstart);
+ duration = static_cast<uint64_t> (expiration - tstart);
freemem_out:
saved_errno = errno; |
Nice! Thanks! I will add those changes to the PR. |
OK pushed the following
I'm not sure if we need a test that covers partial release with JGF as well. I'll leave this a WIP that's figured out. |
Problem: the raw RPC interfaces are changing to use a size_t instead of int for payload size. Earlier qmanager was changed to use size_t. It didn't occur to me at the time, but we can avoid using the raw RPC interface entirely since all payloads are strings. This allows fluxion to continue to work with older versions. Do that instead.
Another push:
|
Strange. The new test is failing on some builders. The test output includes indicates that the scheduler successfully set the partial-ok flag:
but then later, the job manager behaves as though it's not set:
Bah, I've been staring at this for a while and think I'll step away and try again tonight or tomorrow. |
Reran the failing tests and they all work now!?! |
Is it possible that the flux-core docker images hadn't yet updated? |
That must be it, but the I'm somewhat inclined to just move on from this since it is working now, but it is disconcerting! |
I was thinking it was a delay in available of the switch to preprocessor defines for schedutil flags (flux-framework/flux-core#6520), but since the |
07d200e
to
a91309c
Compare
Just pushed a test with The test does the following:
|
@milroy if you have any hints on this one, much appreciated! Here's the part of the test that's failing
|
@garlick the reason for the failure is that the diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index 56251f83..a2578b3c 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -150,15 +150,17 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg, const
unsigned int prio;
uint32_t uid;
double ts;
+ const char *free_ranks = NULL;
json_t *jobspec = NULL;
flux_future_t *f = NULL;
+ std::string R_free;
/* Don't expect jobspec to be set here as it is not currently defined
* in RFC 27. However, add it anyway in case the hello protocol
* evolves to include it. If it is not set, it must be looked up.
*/
if (flux_msg_unpack (msg,
- "{s:I s:i s:i s:f s?o}",
+ "{s:I s:i s:i s:f s?s s?o}",
"id",
&id,
"priority",
@@ -167,6 +169,8 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg, const
&uid,
"t_submit",
&ts,
+ "free",
+ &free_ranks,
"jobspec",
&jobspec)
< 0) {
@@ -216,6 +220,22 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg, const
"requeue success (queue=%s id=%jd)",
queue_name.c_str (),
static_cast<intmax_t> (id));
+ if (free_ranks) {
+ R_free = "{\"version\":1,\"execution\":{\"R_lite\":[{\"rank\":\"" + std::string (free_ranks) + "\"}]}}";
+ if (queue->remove (static_cast<void *> (h), id, false, R_free.c_str ()) < 0) {
+ flux_log_error (h,
+ "%s: partial cancel (id=%jd queue=%s)",
+ __FUNCTION__,
+ static_cast<intmax_t> (id),
+ queue_name.c_str ());
+ goto out;
+ }
+ flux_log (h,
+ LOG_DEBUG,
+ "partial cancel successful after requeue (queue=%s id=%jd)",
+ queue_name.c_str (),
+ static_cast<intmax_t> (id));
+ }
rc = 0;
out:
flux_future_destroy (f); |
Progress! That does get the tests passing alright! Any worries about the following:
I would have thought the correct solution would be to find where However, I'm fine with this if you are! |
I just pushed that change, but I realized there may be a problem if any of the job's partially released resources had been reallocated to other jobs before the scheduler reload. Then the partially released job would be attempting to allocate the same resources during hello, and fail. |
Problem: RFC 20 allows optional microsecond precision in R starttime and expiration, but the resource module's parse_R() function assumes integer. Update R parser to handle any JSON number for these fields. Also log a more detailed error when parsing fails. Co-authored-by: Mark A. Grondona <mark.grondona@gmail.com>
All the reader is doing in this case is extract the ranks to be canceled from R (so R doesn't have to be valid). I could add the
That could be a problem depending on how the partial cancel fails. I don't think it would be too hard to handle those failures, though.
These two observations are closely related and represent a valid concern. There isn't currently a way to perform the intersection of the JGF and rv1 rank key, but I think adding support for the intersection would address these problems. |
Just pushed an enhancement to the test that demonstrates the failure just described, where partially freed nodes have been reallocated to another job. I see this:
and later when core tries to free resources it thinks are still allocated to the partially released job
|
@garlick I'll work on the JGF and rv1 On a related note, I don't seem to be able to fork your fork. I used to have a fork but must have deleted it and now GitHub thinks it still exists. Any ideas how I might make a PR to your fork? |
Thanks @milroy ! When working with someone else on a PR I usually add their fork as a remote, pull down their working branch, do some work, then push it to my own repo. Like
(if origin is the remote for your fork) Then you can just tell me to cherry pick from your copy of my branch when you have something for me to try. |
@garlick thanks for the reminder, I'm pretty sure that's what I did last time and just forgot. I've pushed my changes here: https://github.com/milroy/flux-sched/tree/partial_ok. Note that I included a revert commit for f27b290 because those changes shouldn't be included. You can just cherry pick the other commits and drop f27b290 from your branch. I also included a fixup for the testsuite fixup to correct the allocated node equality check. |
Problem: the JGF reader doesn't support excluding previously freed ranks during update allocate. This leads to state divergence between sched and core because Fluxion uses the initial, full JGF to reconstruct an allocation state. Add support for identifying and skipping vertices and edges with ranks corresponding to those freed in previous partial cancellations. The JGF reader unpacks the `free_ranks` key inserted in qmanager in the `scheduling` key payload and uses the idset to exclude vertices and edges from the JGF when fetching and updating vertices and edges.
Problem: libschedutil has ignored the SCHEDUTIL_FREE_NOLOOKUP flag (now the default) since flux-core-0.61.0. Drop this flag.
Problem: when the scheduler is reloaded, housekeeping jobs with partially allocated resources are canceled rather than being sent to the scheduler for re-allocation during the hello handshake. This is because there was no way for the job manager to inform the scheduler of the free R subset. However, RFC 27 now specifies that the sched.hello request may contain a 'partial-ok' flag. If set, hello responses may include a 'free' key containing an RFC 22 idset, with ids corresponding to ranks of R that are free. Schedutil wraps this so that if schedutil_create() is called with the SCHEDUTIL_HELLO_PARTIAL_OK flag, then - the hello request includes partial-ok - free ranks, if any, are subtracted from R in each response message before calling the scheduler's callback Note that the scheduling key (JGF), if present, always contains the original, full resource set. Set SCHEDUTIL_HELLO_PARTIAL_OK flag if defined by flux-core.
Problem: the sched.hello RPC now includes a `free` key whose value is an idset of previously partially released ranks. Currently Fluxion doesn't unpack the key and handle the previously freed resources. The rv1_noexec doesn't need the freed ranks idset because core only sends the R that is still allocated in R_lite. However, core doesn't support processing JGF, so rv1 with the scheduling key contains R with a scheduling key JGF representing the initial resource set. This leads to state divergence between core and sched for the rv1 or reader. Add support for unpacking the `free` key and packing it into the JSON `scheduling` payload for queue reconstruction and update allocate with JGF.
I did just, plus reordered the commits in the PR to avoid breaking Thanks! Nice work @milroy! |
Problem: there is no test coverage for reloading the scheduler with partially allocated jobs. Add a test that runs if the scheduler was able to send the hello request with the partial-ok flag. Use the convenience scripts from sharness.d to load/reload the scheduler modules, and load qmanager with synchronization so that tests are not racing with the hello handshake that happens after module loading completes.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1321 +/- ##
======================================
Coverage 75.2% 75.2%
======================================
Files 111 111
Lines 15986 16046 +60
======================================
+ Hits 12029 12076 +47
- Misses 3957 3970 +13
|
This changes qmanager to invoke the
sched.hello
request with the flag that tells it to allow partial allocations through. schedutil then diminishes R in the hello responses by thefree
set (if present) before handing it to the hello callback in qmanager. We were wondering if fluxion would perhaps just handle this.flux-framework/flux-core#6445 must be merged before it can work with flux-core master, however, preliminary testing with that branch appears to show a problem. The sharness test added here fails when I reload the scheduler with one node in housekeeping, and fluxion says:
So there is a little more work to do here to run that down. Marking this as a WIP.