Tuned-libnbc algorithm configuration upgrade #8435

goncalvt · 2021-02-02T09:48:27Z

This PR proposes a new node level for the tuned configuration file.
The configuration file feature is also made available to the libnbc component.

ompiteam-bot · 2021-02-02T09:48:29Z

Can one of the admins verify this patch?

ggouaillardet · 2021-02-02T09:58:13Z

@goncalvt thanks for the PR.

could you please give some more context on what this PR is doing and why?

In order to be accepted, all commits have to be signed off.

The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

goncalvt · 2021-02-02T10:19:58Z

> @goncalvt thanks for the PR.
>
> could you please give some more context on what this PR is doing and why?
>
> In order to be accepted, all commits have to be signed off.
>
> The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
> Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

The collective algorithm selection provided by the tuned component improves performance. However, a problem arises when choosing the best algorithm for a given number of MPI ranks: the file format cannot distinguish the case x ranks = N nodes x P processes per node from the case x ranks = N' nodes x P' processes per node. This leads to poor performance in some configurations. We therefore propose adding a node level to the configuration file. This feature can also be used by the libnbc component.
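To illustrate the idea of a node level, here is a minimal sketch of a two-level lookup: first match on the number of nodes, then on the communicator size. The types `comm_rule_t` and `node_rule_t`, the function `select_alg`, and the sample table are hypothetical names made up for this example; they are not the structures or the file format used by the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rule types, for illustration only (not the PR's structures). */
typedef struct {
    int min_comm_size;  /* rule applies when comm size >= this value */
    int alg_id;         /* algorithm to use */
} comm_rule_t;

typedef struct {
    int min_nodes;                  /* rule applies when node count >= this value */
    const comm_rule_t *comm_rules;  /* rank-count rules nested under this node level */
    size_t n_comm_rules;
} node_rule_t;

/* Pick the rule with the largest min_nodes <= nnodes, then, inside it, the
 * rule with the largest min_comm_size <= comm_size. Returns -1 if none match. */
static int select_alg(const node_rule_t *rules, size_t n_rules,
                      int nnodes, int comm_size)
{
    const node_rule_t *node = NULL;
    const comm_rule_t *comm = NULL;

    assert(NULL != rules || 0 == n_rules);

    for (size_t i = 0; i < n_rules; i++) {
        if (rules[i].min_nodes <= nnodes &&
            (NULL == node || rules[i].min_nodes > node->min_nodes)) {
            node = &rules[i];
        }
    }
    if (NULL == node) {
        return -1;
    }
    for (size_t j = 0; j < node->n_comm_rules; j++) {
        const comm_rule_t *r = &node->comm_rules[j];
        if (r->min_comm_size <= comm_size &&
            (NULL == comm || r->min_comm_size > comm->min_comm_size)) {
            comm = r;
        }
    }
    return (NULL == comm) ? -1 : comm->alg_id;
}

/* Sample table (made-up values): node level first, rank level second. */
static const comm_rule_t few_nodes[]  = { {1, 1}, {32, 2} };
static const comm_rule_t many_nodes[] = { {1, 3}, {32, 4} };
static const node_rule_t sample_rules[] = {
    { 1, few_nodes,  2 },
    { 8, many_nodes, 2 },
};
```

With this table, 64 ranks spread over 2 nodes and 64 ranks spread over 16 nodes resolve to different algorithm ids, which a lookup keyed on the rank count alone cannot express.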

goncalvt · 2021-02-02T10:25:41Z

> The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
> Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

If I understood correctly, I should move this PR to the master branch?

ggouaillardet · 2021-02-02T10:34:27Z

yes, master is a better fit.

but let's wait for @bosilca's take on that. The new coll/adapt might replace coll/tuned in the long run.

Also, @rhc54, can PMIx help more efficiently (time and memory overhead) implement ompi_coll_base_get_nnodes() ?

jsquyres · 2021-02-02T14:05:08Z

ompi/mca/coll/base/coll_base_dynamic_file.c

@@ -0,0 +1,302 @@
+/*
+ * Copyright (c) 2020      Bull SAS. All rights reserved.


Nit: should your copyrights be 2021?

jsquyres · 2021-02-02T14:09:50Z

ompi/mca/coll/base/coll_base_dynamic_file.c

+    opal_output_verbose(1,ompi_coll_base_framework.framework_output,"Fix errors as listed above and try again.\n");
+
+    /* deallocate memory if allocated */
+    if (alg_rules) ompi_coll_base_free_all_rules (alg_rules, n_collectives);


Minor style nit: please put all blocks in {}, even 1-line blocks.

We have very few style restrictions in Open MPI, but this is one of them. Thanks!

(If you haven't already, you might want to check out https://github.com/open-mpi/ompi/wiki/CodingStyle)

jsquyres · 2021-02-02T14:11:07Z

ompi/mca/coll/base/coll_base_dynamic_file.c

+ *
+ */
+
+int ompi_coll_base_read_rules_config_file (char *fname, int format_version, ompi_coll_base_alg_rule_t** rules, int n_collectives)


This seems to be major new functionality.

Can you provide some documentation for what it is, how it works, how users use it, why users should use it, ...etc.?

Can you also provide some test cases in https://github.com/open-mpi/ompi-tests-public?

This is a renaming (and moving) of the tuned capability (aka. ompi_coll_tuned_read_rules_config_file).

jsquyres · 2021-02-02T14:11:58Z

ompi/mca/coll/base/coll_base_dynamic_rules.h

+
+END_C_DECLS
+#endif /* MCA_COLL_BASE_DYNAMIC_RULES_H_HAS_BEEN_INCLUDED */
+


Nit: please remove blank lines at the ends of files.

jsquyres · 2021-02-02T14:14:54Z

ompi/mca/coll/base/coll_base_util.c

@@ -558,3 +559,69 @@ const char* mca_coll_base_colltype_to_str(int collid)
    }
    return colltype_translation_table[collid];
 }
+
+OBJ_CLASS_INSTANCE(ompi_coll_base_hostname_item_t, opal_list_item_t, NULL, NULL);


Do you really need a class here? I think you could probably use a char** here, and the opal_argv functions. It may not be a huge difference, but the opal_argv functions might be slightly lighter-weight...?

bosilca · 2021-02-02T17:25:57Z

Most of the features introduced here are interesting and definitely good to have. Basically, I would remove the extension to the dynamic configuration file, and pull everything else into master (not into the 4.1 branch).

As @ggouaillardet stated, tuned is not meant to support hierarchical collectives, and as such does not have the flexibility required to correctly address this issue. A combination of ADAPT and HAN will be the best way forward; if you have any questions, I'll be happy to discuss them with you.

goncalvt · 2021-02-03T08:48:47Z

> Most of the features introduced here are interesting and definitely good to have. Basically, I would remove the extension to the dynamic configuration file, and pull everything else into master (not into the 4.1 branch).
>
> As @ggouaillardet stated, tuned is not meant to support hierarchical collectives, and as such does not have the flexibility required to correctly address this issue. A combination of ADAPT and HAN will be the best way forward; if you have any questions, I'll be happy to discuss them with you.

I agree with the principle of the hierarchical approach of adapt/han, but in my opinion it cannot be sufficient in all cases. Depending on the network capabilities, the collective type, the communicator, etc., the hierarchical implementation may lead to a performance bottleneck. The aim here is to provide an alternative solution to deal with the performance issues of collective implementations.

jjhursey · 2021-02-03T15:33:17Z

For ompi_coll_base_get_nnodes PMIx can help.

If you just need the count of the number of nodes in the job you can do a PMIx_Get on PMIX_NUM_NODES. It's a lookup from local shared memory so should be pretty fast.
If you need the full list of nodes (which I don't think you need here) then you can use either PMIx_Get on PMIX_NODE_LIST or you can use PMIx_Resolve_nodes.

goncalvt · 2021-02-03T15:49:56Z

> For ompi_coll_base_get_nnodes PMIx can help.
>
> If you just need the count of the number of nodes in the job you can do a PMIx_Get on PMIX_NUM_NODES. It's a lookup from local shared memory so should be pretty fast.
>
> If you need the full list of nodes (which I don't think you need here) then you can use either PMIx_Get on PMIX_NODE_LIST or you can use PMIx_Resolve_nodes.

The purpose of the function ompi_coll_base_get_nnodes is to get the number of nodes of a specific communicator. It seems that the PMIx feature you are talking about deals with the whole MPI job. The second solution (PMIX_NODE_LIST or PMIx_Resolve_nodes) may do the job, but we must apply it to an OMPI communicator and I'm not sure that this can work. I will take a look at the PMIx API anyway.

jjhursey · 2021-02-03T16:10:52Z

That is correct. The PMIx data will be for the whole job in this case (MPI_COMM_WORLD). Since PMIx doesn't have knowledge of the MPI communicator it might not be of much help here.

bosilca · 2021-02-03T16:18:29Z

> Depending on the network capabilities, the collective type, the communicator, etc., the hierarchical implementation may lead to a performance bottleneck. The aim here is to provide an alternative solution to deal with the performance issues of collective implementations.

Do you have any proof of such a bold statement? The literature has plenty of work highlighting the benefits of a hierarchical approach, and based on my understanding of the existing collective algorithms provided via tuned, not a single one provides support for any reasonable, location-based overlay communication tree.

rhc54 · 2021-02-03T16:18:51Z

PMIx does know where each proc is located, so one could obtain the desired information easily enough. I don't know the relative speed of the two approaches. As @jjhursey said, the PMIx call is shmem-local, but you'd have to call it on each proc (or get the entire proc map for the job and then parse it). Can't say if that is faster than doing some MPI collective op and then parsing the result. Might be worth investigating.

goncalvt · 2021-02-04T08:40:58Z

> Depending on the network capabilities, the collective type, the communicator, etc., the hierarchical implementation may lead to a performance bottleneck. The aim here is to provide an alternative solution to deal with the performance issues of collective implementations.
>
> Do you have any proof of such a bold statement? The literature has plenty of work highlighting the benefits of a hierarchical approach, and based on my understanding of the existing collective algorithms provided via tuned, not a single one provides support for any reasonable, location-based overlay communication tree.

Actually, we ran some tests and observed that han sometimes obtained worse results than tuned. Han has an overhead due to the hierarchical initialization cost, which may explain this kind of result. It may also be explained by an algorithm configuration issue. I didn't run these tests personally, so I don't have many details to give you. Also, does Han implement all collectives?

goncalvt · 2021-02-04T09:41:41Z

It seems that the tuned config upgrade is not wanted. I propose to remove that support from this pull request. Even if the tuned upgrade is cancelled, the libnbc configuration feature may still be submitted. So I propose to close this PR and resubmit it with only the libnbc features. In this case, should the libnbc file parser support node-level configuration, or only the rank-count level (as in the previous tuned format)? Don't hesitate to give your opinion.

ggouaillardet · 2021-02-04T10:05:19Z

ompi/mca/coll/base/coll_base_util.c

+    /* For each rank */
+    for (i=0 ; i<group_size ; i++) {
+        proc = ompi_group_get_proc_ptr (group, i, true);
+        hostname = opal_get_proc_hostname(&proc->super);


@rhc54 @jjhursey ompi_coll_base_get_nnodes() returns the number of nodes used by a given communicator.
I do not expect PMIx to be MPI communicators aware.

That being said, the current implementation calls opal_get_proc_hostname() for each rank of the communicator. If I remember correctly, we tried to avoid that (because of both performance and memory overhead). A more efficient implementation would be to use a bitmap, use the nodeid (e.g. an int) of each proc to set the corresponding bit, and then count the number of bits in the bitmap. Can PMIx be used to retrieve the nodeid of each proc? (bonus point if PMIx can return the nodeids of a list of procs).

Note both current and proposed implementation do not use any collective operations.

I am pretty sure ROMIO and possibly ompio implement a similar subroutine, and would benefit from a single and optimized implementation.
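The bitmap approach described above can be sketched as follows. This is a standalone illustration under stated assumptions, not Open MPI code: `count_nodes` is a hypothetical helper, the node ids are assumed to be already available as a plain array with one entry per rank, and `max_nodes` is an assumed upper bound on the node id.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Count the distinct node ids in node_ids[0..n_ranks-1].
 * Every id must be < max_nodes; returns -1 on a bad id or allocation failure. */
static int count_nodes(const unsigned *node_ids, size_t n_ranks,
                       unsigned max_nodes)
{
    size_t n_words = ((size_t)max_nodes + WORD_BITS - 1) / WORD_BITS;
    unsigned long *bitmap;
    int count = 0;

    assert(NULL != node_ids || 0 == n_ranks);

    bitmap = calloc(n_words > 0 ? n_words : 1, sizeof(*bitmap));
    if (NULL == bitmap) {
        return -1;
    }
    /* One bit per node: set the bit of each rank's node. */
    for (size_t i = 0; i < n_ranks; i++) {
        if (node_ids[i] >= max_nodes) {
            free(bitmap);
            return -1;
        }
        bitmap[node_ids[i] / WORD_BITS] |= 1UL << (node_ids[i] % WORD_BITS);
    }
    /* Count set bits; each inner iteration clears the lowest set bit. */
    for (size_t w = 0; w < n_words; w++) {
        for (unsigned long v = bitmap[w]; 0 != v; v &= v - 1) {
            count++;
        }
    }
    free(bitmap);
    return count;
}
```

The memory cost is one bit per possible node rather than one hostname string per rank, and no string comparison is needed.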

> I do not expect PMIx to be MPI communicators aware.

Yeah, it definitely won't be unless PMIx_Connect was called when those communicators are formed. We do that for inter-communicators, so PMIx does know that membership. I'm pretty sure we don't do it for intra-communicators since those procs are already "connected" by definition, so we wouldn't know about the membership there.

> Can PMIx be used to retrieve the nodeid of each proc?

Sure - just call PMIx_Get with the name of the proc and the PMIX_NODEID key

> (bonus point if PMIx can return the nodeids of a list of procs).

You can, but you would need to use a different interface. You would call PMIx_Query_info with the PMIX_NODEID key and an array of proc names. I'd have to add the backend support for it.

No need for all this fanciness, MPI provides support for doing all this in MPI. Similar capability is already available in HAN in the subcomm file.

HAN uses MPI_Allreduce() and MPI_Allgather() to get this kind of information.

Wouldn't it be faster to get node ids from PMIx and then have each task independently process them to achieve the very same outcome?

I would rather not rely on yet another component for this kind of information. The MPI version is portable, and will remain so whatever runtime we are using. Plus, HAN's code does more than just what is required here; you only need a comm_split and an allgather. Moreover, if we are taking this route and will almost always need the leader and local communicators, we can embed their creation into the communicator creation, bringing the cost to a minimum.

I'm afraid you misinterpreted a large part of the discussion here, nobody (and certainly not HAN) is ignoring the RM information in order to reinvent the wheel, we already have it in a format that is OMPI-specific and that we can use in a more optimized way than linearly going through the PMIx database.

Let's discuss this on Tuesday's webex -- I don't have the larger context here, but looking at ompi_coll_base_get_nnodes(), I'm not quite grokking how a comm_split+allgather+string compares to test for "unknown" hostnames (which could lead to errors) and further nested string lookup loops would be faster than the put-the-node-id-into-a-bit-field approach.

We talked about this on the webex today, and things were much more clear to me:

To be clear: @bosilca doesn't think that this functionality belongs here in tuned, anyway (please correct me if I got that sentiment incorrect).

But if it is to be used here in tuned, we should probably extract the functionality out of HAN's init-time startup where this information (and some additional stuff) is already obtained, and put that in coll base. That way, other coll components can use it. From my understanding, the HAN implementation is already good/scalable/happy; it isn't an issue of PMIx vs. MPI -- it was more: we already have the information locally; there's no need to calculate that information again.

We all recognize that this PR probably needs to be re-filed on master.

In my opinion, topological configuration should be available in tuned. The main reason is that some collectives do not have a han implementation. Moreover, there is no guarantee that all collectives can be easily implemented with the han approach without risk of bugs (alltoall with disordered ranks, etc.). In those cases a fallback will be used, and the tuned implementations it relies on can be optimized using topological configuration. In the worst case, this kind of configuration might be used only in tuned fallback situations. The same is true for libnbc.

Also, I'm not against the idea of merging the topological init of Han and tuned. It implies that we consider Han to be enabled and used most of the time, and thus that the subcomms init is essential anyway. That's not a problem for me, even if our customer does not use han right now (that may change soon). In such an approach, the comm rules init in tuned must be delayed (same in libnbc, of course). It implies a lot of changes, but if it makes this PR easier to accept, that does not matter.

I will resubmit this PR on the master branch anyway, but before doing so I want to be sure that we agree on the overall scope of what this PR can address. I'm available to work toward an agreement.

I will evaluate coll/han vs. coll/tuned and try to illustrate in which cases (at the moment) node-level configuration in tuned can be useful. I imagine there is an interest when tuned is used for both the intra-node level and the inter-node level with the same number of processes. In this situation the current file format cannot differentiate the two configurations, resulting in non-optimal performance. Then, depending on those results, I will decide whether or not to abandon the node-level format.

bosilca · 2021-02-04T20:15:45Z

> Actually, we ran some tests and observed that han sometimes obtained worse results than tuned. Han has an overhead due to the hierarchical initialization cost, which may explain this kind of result. It may also be explained by an algorithm configuration issue. I didn't run these tests personally, so I don't have many details to give you. Also, does Han implement all collectives?

HAN is dependent on the algorithm selection on the 2 levels of the hierarchy. HAN does not implement all collectives, but just yesterday @EmmanuelBRELLE added support for barrier, so the number of supported algorithms is increasing. I don't know in what cases tuned was able to provide better support than HAN, but I would definitely love to hear more details about it, because tuned is expected to be retired and replaced with HAN. Basically, everything tuned can do should already be supported by HAN, including single level collective description.

bosilca · 2021-02-04T20:21:12Z

> It seems that the tuned config upgrade is not wanted. I propose to remove that support from this pull request. Even if the tuned upgrade is cancelled, the libnbc configuration feature may still be submitted. So I propose to close this PR and resubmit it with only the libnbc features. In this case, should the libnbc file parser support node-level configuration, or only the rank-count level (as in the previous tuned format)? Don't hesitate to give your opinion.

Right, don't update the configuration, but move it into base and use it in libnbc. Take a look at the HAN configuration file; it might be more suitable as a starting point for libnbc.

EmmanuelBRELLE · 2021-02-05T09:28:55Z

> Actually, we ran some tests and observed that han sometimes obtained worse results than tuned. Han has an overhead due to the hierarchical initialization cost, which may explain this kind of result. It may also be explained by an algorithm configuration issue. I didn't run these tests personally, so I don't have many details to give you. Also, does Han implement all collectives?
>
> HAN is dependent on the algorithm selection on the 2 levels of the hierarchy. HAN does not implement all collectives, but just yesterday @EmmanuelBRELLE added support for barrier, so the number of supported algorithms is increasing. I don't know in what cases tuned was able to provide better support than HAN, but I would definitely love to hear more details about it, because tuned is expected to be retired and replaced with HAN. Basically, everything tuned can do should already be supported by HAN, including single level collective description.

From my point of view, Han is complementary to tuned but does not replace it. Han says how to split the collectives (=Han algorithms) into smaller collectives (and how to order these collectives), whereas tuned selects the (best?) point to point communication pattern (from base) to achieve this collective given the size of the (sub-)communicator and the data. For now tuned is still a nice-to-have in between Han and base.

goncalvt · 2021-02-05T09:36:19Z

> It seems that the tuned config upgrade is not wanted. I propose to remove that support from this pull request. Even if the tuned upgrade is cancelled, the libnbc configuration feature may still be submitted. So I propose to close this PR and resubmit it with only the libnbc features. In this case, should the libnbc file parser support node-level configuration, or only the rank-count level (as in the previous tuned format)? Don't hesitate to give your opinion.
>
> Right, don't update the configuration, but move it into base and use it in libnbc. Take a look at the HAN configuration file; it might be more suitable as a starting point for libnbc.

The Han parser handles strings for collectives and components. Is that the kind of feature you expect to see carried over to the libnbc-tuned parser?

bosilca · 2021-02-05T16:08:29Z

> From my point of view, Han is complementary to tuned but does not replace it. Han says how to split the collectives (=Han algorithms) into smaller collectives (and how to order these collectives), whereas tuned selects the (best?) point to point communication pattern (from base) to achieve this collective given the size of the (sub-)communicator and the data. For now tuned is still a nice-to-have in between Han and base.

It really depends how you want to see it. You are right, in the current form HAN calls into tuned for some of the collectives, but not because it really needs yet another level of decision, it is simply because we assumed that tuned will provide the best decision for a single layer of our hierarchy (intra or inter node). Moving away from this, is a simple matter of allowing algorithm naming instead of component naming in the configuration file, et voila.

bosilca · 2021-02-05T16:12:04Z

> The Han parser handles strings for collectives and components. Is that the kind of feature you expect to see carried over to the libnbc-tuned parser?

That would be great, as it drastically simplifies the writing of these decision files. It also does not change the internal structure of the storage; we convert back to an internal indexed naming scheme.
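As a rough illustration of converting names back to an internal index, a parser could look up each string in a table whose position is the internal id. The table contents, its ordering, and the function `coll_name_to_id` below are assumptions made for this sketch; they do not reproduce the HAN parser or Open MPI's actual collective ids.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name table: the array index plays the role of the internal id. */
static const char *const coll_names[] = {
    "allgather", "allreduce", "alltoall", "barrier", "bcast", "reduce"
};

/* Return the index of name in coll_names, or -1 if the name is unknown. */
static int coll_name_to_id(const char *name)
{
    assert(NULL != name);

    for (size_t i = 0; i < sizeof(coll_names) / sizeof(coll_names[0]); i++) {
        if (0 == strcmp(name, coll_names[i])) {
            return (int)i;
        }
    }
    return -1;
}
```

A rules file can then say `barrier` instead of a numeric id, while the stored rules keep using integers, so the internal storage is unchanged.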

jsquyres · 2021-02-06T21:21:52Z

I took the liberty of converting this PR to Draft status so that we don't accidentally merge it before it has been merged to master and cherry-picked to v4.1.x.

jsquyres · 2021-02-09T17:03:28Z

A larger question: does this functionality belong in v4.1.x? Technically, it's a whole new feature, and since it wasn't included in 4.1.0, that's not really what we do in subreleases like this.

Are there bug fixes that need to be separated out of this PR that should be applied to v4.1.x?

goncalvt · 2021-02-10T10:48:47Z

> A larger question: does this functionality belong in v4.1.x? Technically, it's a whole new feature, and since it wasn't included in 4.1.0, that's not really what we do in subreleases like this.
>
> Are there bug fixes that need to be separated out of this PR that should be applied to v4.1.x?

There is no specific need for the 4.1 branch.

jsquyres · 2021-04-26T19:30:46Z

@goncalvt Can you re-target this against master instead of the v4.1.x branch?

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

goncalvt force-pushed the tuned_libnbc branch from b5346a1 to de24d99 Compare February 2, 2021 10:06

goncalvt force-pushed the tuned_libnbc branch from de24d99 to a97352b Compare February 2, 2021 10:51

jsquyres reviewed Feb 2, 2021

View reviewed changes

gpaulsen added this to the v4.1.1 milestone Feb 2, 2021

jsquyres removed this from the v4.1.1 milestone Feb 2, 2021

ggouaillardet reviewed Feb 4, 2021

View reviewed changes

jsquyres changed the title ~~Tuned-libnbc algorithm configuration upgrade~~ v4.1.x: Tuned-libnbc algorithm configuration upgrade Feb 5, 2021

jsquyres added the Target: v4.1.x label Feb 5, 2021

jsquyres added this to the v4.1.1 milestone Feb 5, 2021

jsquyres marked this pull request as draft February 6, 2021 21:21

jsquyres added the enhancement label Feb 8, 2021

rajachan removed this from the v4.1.1 milestone Feb 12, 2021

rajachan removed the Target: v4.1.x label Feb 12, 2021

gpaulsen added the Target: v4.1.x label Apr 26, 2021

gpaulsen added this to the v4.1.2 milestone Apr 26, 2021

jsquyres added Target: main and removed Target: v4.1.x labels Apr 26, 2021

jsquyres removed this from the v4.1.2 milestone Apr 26, 2021

jsquyres changed the title ~~v4.1.x: Tuned-libnbc algorithm configuration upgrade~~ Tuned-libnbc algorithm configuration upgrade Apr 26, 2021

goncalvt changed the base branch from v4.1.x to master April 27, 2021 08:45

goncalvt added 6 commits April 27, 2021 10:49

[MISC] Update .mailmap

025d732

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

[COLL/TUNED] Add node level algorithm configuration

b977b7e

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

[COLL/TUNED] Handle non commutative support in algorithm selection

941666e

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

[COLL/TUNED] Move config file features to base directory

b0a5492

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

[COLL/TUNED] Dump file rules if verbosity level is high

8aca231

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

[COLL/LIBNBC] Upgrade dynamic rules support

ebd1404

Signed-off-by: Thomas Goncalves <thomas.goncalves@atos.net>

goncalvt force-pushed the tuned_libnbc branch from a97352b to ebd1404 Compare April 27, 2021 08:53

