Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: new format for [bootstrap] configuration #2578

Merged
merged 16 commits into from
Dec 23, 2019

Conversation

garlick
Copy link
Member

@garlick garlick commented Dec 14, 2019

As discussed in #2572, this PR changes the [bootstrap] config file for (hopefully) improved clarity and expressiveness. Example:

[bootstrap]

default_port = 8050
default_bind = "tcp://en0:%p"
default_connect = "tcp://e%h:%p"

hosts = [
    {
        host="fluke0",
        bind="tcp://en4:9001",
        connect="tcp://fluke-mgmt:9001"
    },
    { host = "fluke[1-3]" },
]

Here the bind address is tcp://en4:9001 on rank 0 and tcp://en0:8050 on ranks 1-3. The connect address for rank 0 is tcp://fluke-mgmt:9001, tcp://efluke1:8050 for rank 1, tcp://efluke2:8050 for rank 2, etc..

Note: the rank is determined by looking up the hostname in the hosts array, instead of matching IP addresses to local interfaces. The size of the instance is the length of the array. I dropped the ability to explicitly set the rank and size in the config file. That did not seem terribly useful since setting the rank means the config file must be unique for each broker, and the size can be easily inferred if the hosts array is fully specified.

Problem: flux start --size=N where N>1 warns unhelpfully:
"warning: setting --bootstrap=selfpmi due to --size option".

Drop the warning message.
@garlick
Copy link
Member Author

garlick commented Dec 14, 2019

While writing tests I discovered that libtomlc99 doesn't allow heterogeneous type elements in arrays, so the original example I gave above of mixed string and table hosts elements won't work. TOML itself is evolving towards allowing heterogenous arrays, so I got a bit confused referencing the TOML language page. It's brand new though:

toml-lang/toml#665

I updated the example above to reflect this limitation.

Maybe the extra verbosity will be tolerable when we have some approximation of hostlists?

@grondo
Copy link
Contributor

grondo commented Dec 16, 2019

Maybe the extra verbosity will be tolerable when we have some approximation of hostlists?

Yeah, I think it the extra verbosity is perfectly acceptable (perhaps even preferable).

@garlick
Copy link
Member Author

garlick commented Dec 16, 2019

I don't have the code in front of me but maybe I should post this interface for review, for makeshift "hostset" support (for taking them apart, not constructing them):

/* Parse the first bracketed idset embedded in string 's'.
 */
struct idset * idset_parse_embedded (const char *s);

/* Replace the first bracketed idset in 'fmt' with 'id', placing the result in 'buf'
 */
int idset_format (char *buf, size_t *bufsz, const char *fmt, unsigned int id);

So basically this would let you expand something like fluke[0-99] in a config file to fluke0, fluke1, ... by first parsing out the embedded idset, and then iterating over it substituting each id.

This seems like the bare minimum needed to support a hostset expression in place of a host in the above config. It falls far short of what we're used to in hostlist.c, especially doesn't offer the reverse: take a list of hosts and make a hostset for you. But also, the list is always sorted

@grondo
Copy link
Contributor

grondo commented Dec 16, 2019

The interface seems fine to me. My first thought it that most use cases of the low-level functions above would immediately want to turn a "hostset" string into a list over which they could iterate, but how that looks or where that function lives will probably fall out of future use of the config. Keeping the libidset functions low-level seems fine to me.

It did occur to me that you could also simplify the parsing of the config by allowing the hostnames to be split into different fields, e.g.

hosts = [
    {
        prefix="fluke",
        ids = "0",
        suffix="",
     },
]

This would save even more config space because default host prefix and suffix could be set:

default_prefix = "fluke"
hosts = [
    { ids = "[0-99]" }
]

Maybe ids could be named idset to hint that this is not a list of ids -- no duplicates, in order.

Then you can avoid an extra step of parsing what you get out of your toml parser.

(Just an idea to throw out there)

@grondo
Copy link
Contributor

grondo commented Dec 16, 2019

Also I assume idset_parse_embedded() and idset_format() both allow suffixes?

Another thought, if you extend the api (or maybe add separate functions) to allow the nth index to be parsed/replaced, then libidset could parse hostset strings with multiple idsets embedded, e.g. rack[0-47]-host[0-15].

@garlick
Copy link
Member Author

garlick commented Dec 16, 2019

Right, I was thinking iteration would consist of idset_first()...idset_next(), calling idset_format() for each id. The format string would allow the bracketed hostset to be in any position, so you could do stuff like [1-99]fluke or fluke[1-99]compute or fluke[1-99].

It would be pretty easy to add an "index" argument to choose the nth idset in a string with multiple if we felt that was likely to be needed.

@garlick garlick changed the title WIP: change [bootstrap] config file broker: new format for [bootstrap] configuration Dec 19, 2019
@garlick
Copy link
Member Author

garlick commented Dec 19, 2019

Removing WIP from the description as I think this may be ready for another peek.

[Er, I thought I had written something earlier about the changes just pushed but I'm not seeing it now so maybe I didn't press the button before leaving yesterday?]

To handle an arbitrary number of embedded idsets, I went with a "map" pattern where a user supplied function is called for each formatted string. The function recurses for each level of idset and can use the idset_format() function as written. I'm not sure it is the greatest user interface but it was easy to implement, and we could always add something on top of it if we want to make a "hostset" class with iterators, or similar. (As I write this I'm thinking - hmm, should I just go ahead and do that?)

Unrelated commits tacked on here

@grondo
Copy link
Contributor

grondo commented Dec 19, 2019

int idset_map_embedded (const char *s, idset_map_f fun, void *arg);

Any reason why this function can't be called just idset_map()? If you were planning an idset_map that takes a struct idset * argument, then maybe the one that takes a string could be idset_string_map or idset_strmap? I find the embedded term redundant here since all strings with an idset are "embedded" I suppose...

(I actually think the idea of idset_map is kind of powerful in general.)

we could always add something on top of it if we want to make a "hostset" class with iterators, or similar. (As I write this I'm thinking - hmm, should I just go ahead and do that?)

As I tried to say above, I think any users of the config will want to just get a list of hosts (er, set of hosts) from the config, not have to parse the idset and immediately turn it into a list with a call to map, strcpy, list_append(), etc. My guess would be if we don't have this higher level interface at first, then the second person to implement it will create it ;-) (maybe not a bad thing, though)

I would perhaps not spend too much time on the hostset interface though. We had a hostset_t in the old hostlist code and I don't think it ever was used. There are just too many times where order matters. (Perhaps different for flux though, what do I know)

@grondo
Copy link
Contributor

grondo commented Dec 19, 2019

typo in 1ffadf2: stirngs

@garlick
Copy link
Member Author

garlick commented Dec 19, 2019

As I tried to say above, I think any users of the config will want to just get a list of hosts (er, set of hosts) from the config, not have to parse the idset and immediately turn it into a list with a call to map, strcpy, list_append(), etc. My guess would be if we don't have this higher level interface at first, then the second person to implement it will create it ;-) (maybe not a bad thing, though)

Were you thinking we need a generic way to iterate over "hostsets" or whatever, or something specific for the [bootstrap] hosts list? I was thinking [bootstrap] would only have the one user in the broker bootstrap code, but maybe that's short sighted?

I did write up a throwaway class for a struct hostset just now, partly in response to your original comment, but I'm not sure its worth pushing. Also "hostset" seems a misnomer since there's nothing hostname specific here.

struct hostset *hostset_decode (const char *s);
void hostset_destroy (struct hostset *hostset);

const char *hostset_first (struct hostset *hostset);
const char *hostset_next (struct hostset *hostset);

size_t hostset_count (struct hostset *hostset);

@grondo
Copy link
Contributor

grondo commented Dec 19, 2019

Yeah, I had assumed there would be more than just a single use case (eventually). I'm fine with waiting to try to make the higher level abstraction until other use cases show up.

In testing this code I noticed (realized?) that idsets can be specified out-of-order. I'm not sure if it is a problem, but since they are being used in this particular case to specify an ordered array it could be confusing that host[0-3] == host[3,2,1,0]==[ "host0", "host1", "host2", "host3" ]`

@garlick
Copy link
Member Author

garlick commented Dec 19, 2019

Well that's a good point. It would be worrisome if, for example, "service nodes" are spaced out numerically, like fluke[0,63,127] and someone tried to bunch the service nodes at the beginning to force them to have the lower numbered broker ranks, e.g. fluke[0,63,127,1-62,64-126].

Maybe we should state that the hosts array order is ignored (not rank order), and add other config to express the TBON topology, or even the "host role" in terms of hosts entries? Example:

overlay-levels = [
    [ "fluke0" ],
    [ "fluke[64,128]" ],
    [ "fluke[1-63,65-127,129-191]" ],
]

hmmm.

On multiple users, maybe we'll need to capture the mapping of hosts to broker ranks somewhere else for general consumption because this only works for a "configured" instance. With PMI, we might need to collect that info.

@garlick
Copy link
Member Author

garlick commented Dec 19, 2019

Or we could just think of this PR as solving #2572, with more config to come (probably) as we require more configurability for the system instance TBON? Meanwhile maybe a flux_config_bootstrap(7) man page makes sense so we have a place to document the semantics of the hosts array?

Thanks for catching the commit typo. Would idset_format_map() make sense for the map function name? Otherwise I like idset_strmap() or idset_string_map() over idset_map() since the latter seems like it ought to be a map over just the id's in one idset...

@grondo
Copy link
Contributor

grondo commented Dec 19, 2019

I like your idea of separately specifying the TBON topology.

Would idset_format_map() make sense for the map function name? Otherwise I like idset_strmap() or idset_string_map() over idset_map() since the latter seems like it ought to be a map over just the id's in one idset...

Yeah, I figured that (and I think an idset_map() which calls your map function once per integer might actually be a cool feature). I'm fine with any of the alternatives you suggested. I just found embedded confusing.

Problem: when the broker is started with the
'log-filename' attribute set on the command line,
the log file is opened from all broker ranks.
This does not match the documentation or the code.

Move initialization of logbuf->rank in logbuf_initialize()
to before logbuf_register_attrs() is called.  Now the test
for rank 0 is only true on rank 0.

Move the logbuf->h initialization too for consistency.

Fixes flux-framework#2581
Problem: logbuf_initialize() calls flux_attr_set_cacheonly()
on the "rank" attribute.  The broker call to create_dummyattrs()
right after the call to logbuf_initialize() handles this,
and there is no requirement to set it before that.

Drop this call.
Problem: there is no convenient way to decode an idset
from a bounded substring.

Add idset_ndecode() which accepts a length.
Add idset_format_map() to simplify expansion of strings that
have embedded idsets.

Given a string with zero or more embedded bracketed idsets,
expands idsets, calling a map function for each formatted string.
For example, an input string of "r[0-1]n[0-1]" would result in
map function calls for "r0n0", "r0n1", "r1n0", "r1n1".
Returns number of iterations, or -1 on failure with errno set.

Iteration can be halted early from the map function (no error).
The map function can also return -1 with errno set to abort
iteration with an error.
@garlick
Copy link
Member Author

garlick commented Dec 19, 2019

OK forced a push with some changes so the only exposed idset function is idset_format_map().

@grondo
Copy link
Contributor

grondo commented Dec 20, 2019

I don't think anyone uses this naming convention anymore for hosts, but one other difference between idsets and hostlists is that hostlists preserve leading zeros from the range syntax:

~$ hostlist -e "foo[00-10]"
foo00,foo01,foo02,foo03,foo04,foo05,foo06,foo07,foo08,foo09,foo10

In the future, if support for this is needed perhaps idset_format_map() could take an optional zero padding width, which could be specified in the [bootstrap] hosts array.

Or, an alternate syntax for specifying sets of hosts could be supported as above

hosts = [{ 
  prefix = "foo"
  suffix = ""
  zeropad = 2
  ids = "0-10"
} ]

@garlick
Copy link
Member Author

garlick commented Dec 20, 2019

I hesitate to ask, but since I'm not in the same office as you ATM and you can't throw things at me -- should we just pull in hostlist.[ch] to liblsd? It embodies the accumulated wisdom of the ages, is more or less debugged, and has tests. It may be expedient to do so rather than inflict subtly different semantics on our sys admins when they will have many other unfamiliar things to deal with when we roll out a Flux system instance.

@grondo
Copy link
Contributor

grondo commented Dec 20, 2019

hesitate to ask, but since I'm not in the same office as you ATM and you can't throw things at me -- should we just pull in hostlist.[ch] to liblsd?

That might be unfortunate in the long run. Let's proceed with the current direction, we could always support the slightly more verbose but more explicit syntax above, or when someone has time work on a hostlist.c replacement. (It seems likely we'll want something ordered and which can support "compress" methods eventually anyway)

Edit: to be more clear, it is probably fine to pull hostlist.[ch] back in for the short term, but I think the arguments against that are 1) it has a lot of older, ad-hoc, crufty code cobbled together over time, and may not be the most maintainable thing 2) interfaces exported could be improved 3) we really do want to do a rewrite of this thing at some point, but since it seemingly does the minor thing it is supposed to, the code keeps getting dragged along.

On the other hand, my one worry about extending libidset with hostlist-like code is when we do have a hostlist replacement there may be duplicated features/interfaces between libidset which can only support sets by design, and the hostlist replacement, which could be confusing.

Probably this is getting us off track from our near-term goals so I think we should leave things as is and move on.

@garlick
Copy link
Member Author

garlick commented Dec 20, 2019

OK then, full speed ahead. I wasn't planning on adding anything else here, unless you have additional suggestions. I would leave the explicit overlay config to another PR.

@garlick
Copy link
Member Author

garlick commented Dec 20, 2019

Oh, I did have one other thought. One unsatisfying thing is the hostname localhost in the config file interpreted as a wildcard so that a size=1 instance can bootstrap with a generic config file, like the one we are installing by default:

[bootstrap]

default_port = 8020
default_bind = "tcp://127.0.0.1:%p"
default_connect = "tcp://127.0.0.1:%p"

hosts = [
    { host = "localhost" }
]

Any better ideas here? It seems like there needs to be one hosts entry to avoid introducing magical behavior? Allowing the host to be a glob or regex might get confusing if idsets are also supported. Any better ideas? Maybe just allow exactly * to be special but not a full glob/regex?

@grondo
Copy link
Contributor

grondo commented Dec 20, 2019

Any better ideas here?

Hm, no. I think I don't fully understand the use case. When is the "generic" config file read by the broker?

It seems like there needs to be one hosts entry to avoid introducing magical behavior? Allowing the host to be a glob or regex might get confusing if idsets are also supported. Any better ideas? Maybe just allow exactly * to be special but not a full glob/regex?

Since the hosts entry is a full JSON object, maybe a special key like anyhost = true?
Sorry I'm no help here, still don't feel I fully understand the full picture.

On second thought, since the default address maps to localhost I do think special localhost makes a kind of sense...

@garlick
Copy link
Member Author

garlick commented Dec 20, 2019

The use case was for the default installed boot.toml, so that someone could maybe apt-get install flux-core on their laptop and immediately be able to run stuff in a size=1 system instance.

Maybe that case can be handled without a config file, come to think of it.

@garlick
Copy link
Member Author

garlick commented Dec 22, 2019

OK, I got rid of the "localhost" specialness, and instead now assume that if the hosts array is empty or missing, this is a size=1 instance.

I went ahead and made some minor changes to the way the config boot method initializes the overlay, so we can avoid binding to, and therefore configuring, a zeromq socket on leaf nodes, where it will never be used (at least not until we have some strategy for "grow"). Also, for size=1, the single broker now doesn't need zeromq sockets at all, so a) it doesn't need curve keys, and b) it won't start the zeromq service thread.

I have a follow-on PR that implements the same overlay changes for the PMI boot method, and would have tacked that on here, except that it seems to have made the subprocess exec service in the broker inexplicably sad, so I want to debug that first.

Finally, made one minor tweak to the systemd unit file to set local-uri attribute on the command line. If this is not set, the broker puts the socket under the rank in the rundir, but clients using the compiled-in default path are looking in the parent directory.

If that sounds OK, I'll squash the incremental development.

@garlick
Copy link
Member Author

garlick commented Dec 22, 2019

Restarted a builder that failed valgrind with a SIGALRM.

@garlick
Copy link
Member Author

garlick commented Dec 23, 2019

I have a follow-on PR that implements the same overlay changes for the PMI boot method, and would have tacked that on here, except that it seems to have made the subprocess exec service in the broker inexplicably sad, so I want to debug that first.

Found the problem - a new flux_send() failure case when there is no "child" socket (as opposed to one that has nothing connected to it). That was an easy fix, and so I went ahead and added on the boot_pmi.c change also.

The hydra test failures were really getting in my way so as discussed in #1169, I went ahead and removed them here.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick look through the PR and things look good to me. Just a couple minor comments. I also poked at things a bit and didn't find any issues.

such as when being launched by systemd.

The default bootstrap mode is PMI. The Flux systemd unit file forces the
broker to use the config method by specifying `--setattr=boot-method=config`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? boot-method should be boot.method?

t/t0001-basic.t Outdated
echo $ATTR_VAL | grep "^ipc" &&
! echo $ATTR_VAL | grep "%B"
'
test_expect_success 'tbon.endpoint fails on bad endpoint' '
! flux start -o,--setattr=tbon.endpoint='foo://bar' flux getattr tbon.endpoint
! flux start -s2 -o,--setattr=tbon.endpoint='foo://bar' flux getattr tbon.endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are editing these tests anyway, mind switching from ! to test_must_fail? See docs for test_must_fail for a description of why.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Actually that last one fails under test_must_fail because rank 1 broker has to be killed by flux-start. I'll add a comment for housekeepers of the future.

@grondo
Copy link
Contributor

grondo commented Dec 23, 2019

Nice test coverage as well! I did notice from the coverage report that if you want a small bump in coverage, some of the [bootstrap] hosts array errors were not hit during testing, e.g. an invalid idset, invalid host entry (i.e. json_unpack failures), etc.

Change the [bootstrap] configuration to allow bind URI,
and connect URI to be expressed independently, and to
determine the rank by matching the hostname, not the
local interface addresses.

Rename the rank-ordered array from 'tbon-endpoints' to
'hosts' for clarity.  A hosts entry is a table containing
the keys 'host' (required), 'bind' (optional), and 'connect'
(optional).

The keys 'default_port', 'default_bind', and 'default_connect'
may be defined.  The default bind and connect values are
used when those keys are not explicitly set in the hosts entry.
The 'default_port' value is substituted for the token "%p" if
it occurs in bind or connect strings.  The 'host' value in a
hosts entry entry is substituted for the token "%h" if it
occurs in bind or connect strings.

The 'host' key may included embedded bracketed idsets.

Example 1024 node cluster config:

[bootstrap]

default_port = 8050
default_bind = "tcp://en0:%p"
default_connect = "tcp://e%h:%p"

hosts = [
    {
        host="fluke0",
        bind="tcp://en4:9001",
        connect="tcp://fluke-mgmt:9001"
    },
    { host = "fluke[1-1023]" },
]

In the example, the bind address is "tcp:/en4:9001" on rank 0
and "tcp://en0:8050" on ranks 1-1023.  The connect address for
rank 0 is "tcp://fluke-mgmt:9001"; "tcp://efluke1:8050" for rank 1;
"tcp://efluke2:8050" for rank 2, etc..

Update existing sharness test.

Update installed config file.

Fixes flux-framework#2572
Avoid triggering the loading of curve keys and/or starting
of the zeromq service thread if the overlay is not going
to be used.
Problem: the broker sets the local-uri attribute to
local://${rundir}/${rank}/local, but if FLUX_URI is unset,
flux_open() tries to open local://${rundir}/local.

Set local-uri on the command line in the systemd unit file,
to match the flux_open() default.
Problem: instance-level test tries to do too many things
in one test and is hard to debug.

Spit the test into several subtests, and write output to
the file system for post-mortem.
Problem: tests expect tbon.endpoint to be set in a size=1
instance when it need not be.

Check this attribute in rank 0 of a size=2 instance,
and relax expectations of content.

Also: [cleanup] use test_must_fail where appropriate.
When the bootstrap method is PMI, don't bind to a zeromq socket
if there are no downstream peers.

Change broker_response_sendmsg() to silently discard a response
if it has exhausted all other routing option and tries to send
it downstream.  This is zeromq's default behavior for unroutable
messages sent on a ROUTER socket anyway (drop).
Problem: hydra tests fail occasionally

As noted in flux-framework#1169, some versions of hydra might have a problem
capturing stdio, which makes these tests unreliable.  Drop the
tests that are just verifying hydra's PMI behavior, and make the
ones that remain not dependent on stdio.

Fixes flux-framework#1169
@garlick
Copy link
Member Author

garlick commented Dec 23, 2019

Thanks for the helpful comments! Just forced a push with suggested fixes, incremental development squashed, an attempt at more coverage, and a few minor commit message cleanups.

Problem: t2601-job-shell.t (test 8 - PMI cliques are correct
for 2 ppn) just failed in travis with no useful output.

This test, and a few tests surrounding it, redirect the stderr
of flux job attach to a file, then ignore the file.

Allow flux job attach's stderr to be included in the verbose
test output
@garlick
Copy link
Member Author

garlick commented Dec 23, 2019

Hmm, had one builder fail in t2602-job-shell.t here:

expecting success: 
        id=$(flux jobspec srun -N2 -n4 ${PMI_INFO} -c | flux job submit) &&
	flux job attach $id >pmi_clique2.raw 2>pmi_clique2.err &&
	sort -snk1 <pmi_clique2.raw >pmi_clique2.out &&
	sort >pmi_clique2.exp <<-EOT &&
	0: clique=0,1
	1: clique=0,1
	2: clique=2,3
	3: clique=2,3
	EOT
	test_cmp pmi_clique2.exp pmi_clique2.out

not ok 8 - pmi-shell: PMI cliques are correct for 2 ppn

There would be output in the logs for the test_cmp if that was what failed. Tests before and after are OK. For some reason the test redirects the stderr of flux job attach to a file and ignores it. I'll push a fix to make sure stderr gets included in the verbose test output.

@codecov-io
Copy link

Codecov Report

Merging #2578 into master will increase coverage by 0.02%.
The diff coverage is 84.14%.

@@            Coverage Diff             @@
##           master    #2578      +/-   ##
==========================================
+ Coverage   81.45%   81.47%   +0.02%     
==========================================
  Files         247      248       +1     
  Lines       38306    38438     +132     
==========================================
+ Hits        31201    31319     +118     
- Misses       7105     7119      +14
Impacted Files Coverage Δ
src/cmd/flux-start.c 78.26% <ø> (-0.15%) ⬇️
src/common/libidset/idset_decode.c 97.91% <100%> (+0.09%) ⬆️
src/common/libidset/idset_format.c 100% <100%> (ø)
src/broker/log.c 70.67% <100%> (+0.81%) ⬆️
src/broker/boot_pmi.c 63.38% <60%> (-2.04%) ⬇️
src/broker/broker.c 74.29% <78.57%> (+0.12%) ⬆️
src/broker/boot_config.c 83.04% <83.95%> (+0.89%) ⬆️
src/common/libpmi/simple_client.c 86.33% <0%> (-3.28%) ⬇️
src/common/libpmi/pmi.c 91.5% <0%> (-1.89%) ⬇️
src/broker/pmiutil.c 64.95% <0%> (-0.86%) ⬇️
... and 10 more

@garlick
Copy link
Member Author

garlick commented Dec 23, 2019

Whee, OK this is ready IMHO.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, LGTM!

@mergify mergify bot merged commit f9780be into flux-framework:master Dec 23, 2019
@garlick
Copy link
Member Author

garlick commented Dec 24, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants