Tuned-libnbc algorithm configuration upgrade #8435
base: main
Conversation
goncalvt commented on Feb 2, 2021 (edited by jsquyres)
- A new node level is proposed for the tuned configuration file.
- The configuration file feature is now available for the libnbc component.
Can one of the admins verify this patch?
@goncalvt thanks for the PR. Could you please give some more context on what this PR is doing and why? In order to be accepted, all commits have to be signed off. The usual way to have contributions upstreamed is to issue a PR vs the master branch.
Force-pushed from b5346a1 to de24d99 (compare)
Collective algorithm selection provided by the tuned component allows performance improvements. Nevertheless, an issue arises when we want to choose the best algorithm for a given number of MPI ranks: the file format does not allow differentiating between the case of x ranks = N nodes x P processes per node and the case of x ranks = N' nodes x P' processes per node. This leads to poor performance in some configurations. Thus, we propose to add a node level to the configuration file. This feature can also be used in the libnbc component.
If I understood correctly, should I move this PR to the master branch?
Force-pushed from de24d99 to a97352b (compare)
@@ -0,0 +1,302 @@
/*
 * Copyright (c) 2020 Bull SAS. All rights reserved.
Nit: should your copyrights be 2021?
opal_output_verbose(1, ompi_coll_base_framework.framework_output, "Fix errors as listed above and try again.\n");

/* deallocate memory if allocated */
if (alg_rules) ompi_coll_base_free_all_rules (alg_rules, n_collectives);
Minor style nit: please put all blocks in {}, even 1-line blocks.
We have very few style restrictions in Open MPI, but this is one of them. Thanks!
(If you haven't already, you might want to check out https://github.com/open-mpi/ompi/wiki/CodingStyle)
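For example, the one-line block quoted above would become something like:

```c
/* deallocate memory if allocated */
if (alg_rules) {
    ompi_coll_base_free_all_rules(alg_rules, n_collectives);
}
```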
 *
 */

int ompi_coll_base_read_rules_config_file (char *fname, int format_version, ompi_coll_base_alg_rule_t** rules, int n_collectives)
This seems to be major new functionality.
Can you provide some documentation for what it is, how it works, how users use it, why users should use it, etc.?
Can you also provide some test cases in https://github.com/open-mpi/ompi-tests-public?
This is a renaming (and moving) of the tuned capability (aka. ompi_coll_tuned_read_rules_config_file).
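For readers without the diff in front of them, here is a minimal sketch of how a component might call the relocated helper; the variable names below (rules_filename, rules_file_format) and the COLLCOUNT constant are illustrative, borrowed from the existing tuned usage rather than from this PR:

```c
/* Sketch only: load the dynamic rules through the base helper.
 * rules_filename / rules_file_format stand in for the component's MCA
 * parameters; COLLCOUNT is the number of collective types, as in tuned. */
ompi_coll_base_alg_rule_t *all_base_rules = NULL;

int rc = ompi_coll_base_read_rules_config_file(rules_filename,
                                               rules_file_format,
                                               &all_base_rules,
                                               COLLCOUNT);
if (rc < 0) {
    /* parse failure: fall back to the component's fixed decision rules */
    all_base_rules = NULL;
}
```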
END_C_DECLS
#endif /* MCA_COLL_BASE_DYNAMIC_RULES_H_HAS_BEEN_INCLUDED */
Nit: please remove blank lines at the ends of files.
@@ -558,3 +559,69 @@ const char* mca_coll_base_colltype_to_str(int collid)
    }
    return colltype_translation_table[collid];
}

OBJ_CLASS_INSTANCE(ompi_coll_base_hostname_item_t, opal_list_item_t, NULL, NULL);
Do you really need a class here? I think you could probably use a char** here, and the opal_argv functions. It may not be a huge difference, but the opal_argv functions might be slightly lighter-weight...?
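A minimal sketch of what the char** / opal_argv variant might look like, reusing the group/proc accessors that already appear in this PR (includes and error handling are indicative only):

```c
#include <stdbool.h>
#include <string.h>
#include "opal/util/argv.h"

/* Sketch only: count the distinct hostnames in a group using a
 * NULL-terminated char** instead of an opal_list_t of class items. */
static int count_nodes_with_argv(ompi_group_t *group, int group_size)
{
    char **hosts = NULL;
    int nnodes;

    for (int i = 0; i < group_size; i++) {
        ompi_proc_t *proc = ompi_group_get_proc_ptr(group, i, true);
        const char *hostname = opal_get_proc_hostname(&proc->super);
        bool seen = false;

        /* linear de-duplication; acceptable for modest communicator sizes */
        for (int j = 0; NULL != hosts && NULL != hosts[j]; j++) {
            if (0 == strcmp(hosts[j], hostname)) {
                seen = true;
                break;
            }
        }
        if (!seen) {
            opal_argv_append_nosize(&hosts, hostname);
        }
    }

    nnodes = opal_argv_count(hosts);
    opal_argv_free(hosts);
    return nnodes;
}
```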
Most of the features introduced here are interesting and definitely good to have. Basically, I would remove the extension to the dynamic configuration file and pull everything else into master (not into the 4.1 branch). As @ggouaillardet stated, tuned is not meant to support hierarchical collectives, and as such does not have the flexibility required to correctly address this issue. A combination of ADAPT and HAN will be the best way forward; if you have any questions I'll be happy to talk with you about it.
I agree with the principle of the hierarchical approach of ADAPT/HAN, but in my opinion it can't be sufficient in all cases. Depending on the network capabilities, the collective type, the communicator, etc., the hierarchical implementation may lead to performance bottlenecks. The aim here is to provide an alternative solution to deal with the performance issues of collective implementations.
The purpose of the function ompi_coll_base_get_nnodes is to get the number of nodes of a specific communicator. It seems that the PMIx feature you are talking about deals with the whole MPI job. The second solution (PMIX_NODE_LIST or PMIx_Resolve_nodes) may do the job, but we would have to apply it to an OMPI communicator and I'm not sure that can work. I will take a look at the PMIx API anyway.
That is correct. The PMIx data will be for the whole job in this case (MPI_COMM_WORLD). Since PMIx doesn't have knowledge of the MPI communicator, it might not be of much help here.
Do you have any proof of such a bold statement? The literature has plenty of work highlighting the benefits of a hierarchical approach, and based on my understanding of the existing collective algorithms provided via tuned, not a single one provides support for any reasonable, location-based overlay communication tree.
PMIx does know where each proc is located, so one could obtain the desired information easily enough. I don't know the relative speed of the two approaches. As @jjhursey said, the PMIx call is shmem-local, but you'd have to call it on each proc (or get the entire proc map for the job and then parse it). Can't say if that is faster than doing some MPI collective op and then parsing the result. Might be worth investigating.
Actually, we ran some tests and we observed that HAN sometimes obtained worse results than tuned. HAN has an overhead due to the hierarchical initialization cost, which may explain that kind of result. It may also be explained by an algorithm configuration issue. I didn't run these tests personally, so I don't have many details to give you. Also, does HAN implement all collectives?
It seems that the tuned config upgrade is not wanted. I propose to remove that support from this pull request. Even if the tuned upgrade is cancelled, the libnbc configuration feature may still be submitted. So, I propose to close this PR and resubmit with only the libnbc features. In that case, should the libnbc file parser support node-level configuration, or only the number-of-ranks level (as in the previous tuned)? Don't hesitate to give your opinion.
/* For each rank */
for (i = 0; i < group_size; i++) {
    proc = ompi_group_get_proc_ptr(group, i, true);
    hostname = opal_get_proc_hostname(&proc->super);
@rhc54 @jjhursey ompi_coll_base_get_nnodes() returns the number of nodes used by a given communicator. I do not expect PMIx to be aware of MPI communicators.
That being said, the current implementation calls opal_get_proc_hostname() for each rank of the communicator. If I remember correctly, we tried to avoid that (because of both performance and memory overhead). A more efficient implementation would be to use a bitmap: use the nodeid (e.g. an int) of each proc to set the corresponding bit, and then count the number of bits set in the bitmap. Can PMIx be used to retrieve the nodeid of each proc? (Bonus points if PMIx can return the nodeids of a list of procs.)
Note that neither the current nor the proposed implementation uses any collective operations.
I am pretty sure ROMIO and possibly ompio implement a similar subroutine, and they would benefit from a single and optimized implementation.
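To make the bitmap idea concrete, here is a rough, self-contained illustration that assumes one nodeid per rank has already been obtained from the runtime (the nodeid source is out of scope here):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: given one nodeid per rank, count distinct nodes by setting
 * one bit per nodeid and counting the set bits. max_nodes is an upper
 * bound on nodeid + 1 (e.g. the job-wide node count). */
static int count_nodes_from_nodeids(const uint32_t *nodeids, int nranks,
                                    uint32_t max_nodes)
{
    size_t nbytes = (max_nodes + 7) / 8;
    uint8_t *bitmap = calloc(nbytes, 1);
    int nnodes = 0;

    for (int i = 0; i < nranks; i++) {
        bitmap[nodeids[i] / 8] |= (uint8_t)(1u << (nodeids[i] % 8));
    }
    for (size_t b = 0; b < nbytes; b++) {
        /* Kernighan's trick: each iteration clears the lowest set bit */
        for (uint8_t v = bitmap[b]; 0 != v; v &= (uint8_t)(v - 1)) {
            nnodes++;
        }
    }
    free(bitmap);
    return nnodes;
}
```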
I do not expect PMIx to be aware of MPI communicators.
Yeah, it definitely won't be unless PMIx_Connect was called when those communicators are formed. We do that for inter-communicators, so PMIx does know that membership. I'm pretty sure we don't do it for intra-communicators since those procs are already "connected" by definition, so we wouldn't know about the membership there.
Can PMIx be used to retrieve the nodeid of each proc?
Sure - just call PMIx_Get with the name of the proc and the PMIX_NODEID key.
(Bonus points if PMIx can return the nodeids of a list of procs.)
You can, but you would need to use a different interface: you would use PMIx_Query_info with the PMIX_NODEID key and an array of proc names. I'd have to add the backend support for it.
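A hedged sketch of the per-proc lookup described above; the namespace/rank handling is simplified (in OMPI the pmix_proc_t would normally be derived from the peer's process name) and error handling is minimal:

```c
#include <stdint.h>
#include <pmix.h>

/* Sketch only: fetch one peer's node id via PMIx_Get with the
 * PMIX_NODEID key. Returns UINT32_MAX if the lookup fails. */
static uint32_t nodeid_of_peer(const char *nspace, pmix_rank_t rank)
{
    pmix_proc_t proc;
    pmix_value_t *val = NULL;
    uint32_t nodeid = UINT32_MAX;

    PMIX_LOAD_PROCID(&proc, nspace, rank);
    if (PMIX_SUCCESS == PMIx_Get(&proc, PMIX_NODEID, NULL, 0, &val) &&
        NULL != val && PMIX_UINT32 == val->type) {
        nodeid = val->data.uint32;
    }
    if (NULL != val) {
        PMIX_VALUE_RELEASE(val);
    }
    return nodeid;
}
```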
No need for all this fanciness; MPI itself provides support for doing all this. Similar capability is already available in HAN, in the subcomm file.
HAN uses MPI_Allreduce() and MPI_Allgather() to get this kind of information. Wouldn't it be faster to get node ids from PMIx and then have each task independently process them to achieve the very same outcome?
I would rather not rely on yet another component for this kind of information. The MPI version is portable, and will remain so whatever runtime we are using. Plus, HAN's code does more than just what is required here; you only need a comm_split and an allgather. Moreover, if we take this route and will almost always need the leader and local communicators, we can embed their creation into the communicator creation, bringing the cost to a minimum.
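For illustration, a hedged sketch of the comm_split + allgather approach mentioned above, using MPI_Comm_split_type for the node-local split (this is not HAN's actual code, just the general pattern):

```c
#include <stdlib.h>
#include <mpi.h>

/* Sketch only: count the nodes spanned by comm with one split and one
 * allgather; each node-local leader (local rank 0) contributes a 1. */
static int count_nodes_mpi(MPI_Comm comm)
{
    MPI_Comm local;
    int size, local_rank, nnodes = 0;

    MPI_Comm_size(comm, &size);
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    MPI_Comm_rank(local, &local_rank);

    int is_leader = (0 == local_rank) ? 1 : 0;
    int *flags = malloc(size * sizeof(int));

    MPI_Allgather(&is_leader, 1, MPI_INT, flags, 1, MPI_INT, comm);
    for (int i = 0; i < size; i++) {
        nnodes += flags[i];
    }

    free(flags);
    MPI_Comm_free(&local);
    return nnodes;
}
```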
I'm afraid you misinterpreted a large part of the discussion here. Nobody (and certainly not HAN) is ignoring the RM information in order to reinvent the wheel; we already have it in a format that is OMPI-specific and that we can use in a more optimized way than linearly going through the PMIx database.
Let's discuss this on Tuesday's webex -- I don't have the larger context here, but looking at ompi_coll_base_get_nnodes(), I'm not quite grokking how a comm_split + allgather + string compares to test for "unknown" hostnames (which could lead to errors) and further nested string lookup loops would be faster than the put-the-node-id-into-a-bit-field approach.
We talked about this on the webex today, and things were much clearer to me:
- To be clear: @bosilca doesn't think that this functionality belongs here in tuned anyway (please correct me if I got that sentiment wrong).
- But if it is to be used here in tuned, we should probably extract the functionality out of HAN's init-time startup, where this information (and some additional stuff) is already obtained, and put it in coll base. That way, other coll components can use it. From my understanding, the HAN implementation is already good/scalable/happy; it isn't an issue of PMIx vs. MPI -- it was more: we already have the information locally, so there's no need to calculate it again.
- We all recognize that this PR probably needs to be re-filed on master.
In my opinion, topological configuration should be available in tuned. The main reason is that there are collectives which do not have a HAN implementation. Also, it is not guaranteed that all collectives can be easily implemented without risk of bugs using the HAN approach (alltoall with disordered ranks, etc.). In those cases the fallback will be used, the tuned implementations will run, and they can be optimized using topological configuration. In the worst case, that kind of configuration might only be used in tuned fallback situations. This is especially true for libnbc.
Then, I'm not against the idea of merging the topological init of HAN and tuned. It implies that we consider HAN to be enabled and used most of the time, and thus that subcomm init is essential anyway. That's not a problem for me, even if our customers do not use HAN right now (this may change soon). In such an approach, comm rules init in tuned must be delayed (same in libnbc, of course). It implies a lot of changes, but if it facilitates the acceptability of this PR, that doesn't matter.
I will resubmit this PR on the master branch anyway, but before doing so I want to be sure that we agree on the global scope of what this PR can address. I'm at your disposal to find an agreement.
I will evaluate coll/han vs coll/tuned and try to illustrate in which cases (at the moment) node-level configuration in tuned can be useful. I imagine there is an interest when tuned is used at both the intra-node level and the inter-node level with the same number of processes. In that situation the current file format can't differentiate the two configurations, resulting in non-optimal performance. Then, depending on those results, I will decide whether to abandon the node-level format or not.
HAN is dependent on the algorithm selection at the 2 levels of the hierarchy. HAN does not implement all collectives, but just yesterday @EmmanuelBRELLE added support for barrier, so the number of supported algorithms is increasing. I don't know in what cases tuned was able to provide better support than HAN, but I would definitely love to hear more details about it, because tuned is expected to be retired and replaced with HAN. Basically, everything tuned can do should already be supported by HAN, including single-level collective description.
Right, don't update the configuration, but move it into base and use it in libnbc. Take a look at the HAN configuration file; it might be more suitable as a starting point for libnbc.
From my point of view, HAN is complementary to tuned but does not replace it. HAN says how to split the collectives (= HAN algorithms) into smaller collectives (and how to order these collectives), whereas tuned selects the (best?) point-to-point communication pattern (from base) to achieve the collective given the size of the (sub-)communicator and the data. For now tuned is still a nice-to-have in between HAN and base.
The HAN parser handles strings for collective and component names. Is that the kind of feature you expect to see forwarded into libnbc/tuned?
It really depends how you want to see it. You are right, in the current form HAN calls into tuned for some of the collectives, but not because it really needs yet another level of decision; it is simply because we assumed that tuned would provide the best decision for a single layer of our hierarchy (intra- or inter-node). Moving away from this is a simple matter of allowing algorithm naming instead of component naming in the configuration file, et voila.
That would be great, as it drastically simplifies the writing of these decision files. And it does not change the internal structure of the storage; we convert back to an internal indexed naming scheme.
I took the liberty of converting this PR to Draft status so that we don't accidentally merge it before it has been merged to master and cherry-picked to v4.1.x.
A larger question: does this functionality belong in v4.1.x? Technically, it's a whole new feature, and since it wasn't included in 4.1.0, that's not really what we do in subreleases like this. Are there bug fixes that need to be separated out of this PR that should be applied to v4.1.x?
There is no specific need for the 4.1 branch.
@goncalvt Can you re-target this against master instead of the v4.1.x branch?
Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>