Ofi2 #2285

Merged
merged 1 commit into from
Oct 26, 2016

Conversation

anandhis
Contributor

I have cleaned up the ofi plugin, and it now uses ofi_prov_id internally to identify the different providers in ofi-libfabric. Please let me know if there is a different branch I should issue the pull request against.
Thanks,
Anandhi

@rhc54
Contributor

rhc54 commented Oct 24, 2016

Hmmm...looks like you may have overwritten changes in the RML base or incorrectly fixed the merge conflicts?

Member

@jsquyres jsquyres left a comment

I assume this PR was a first push just to get some comments from others. Thank you for doing so!

I made a bunch of minor comments, I hope they're helpful!

@@ -145,8 +145,9 @@ typedef struct {
opal_object_t super;
opal_event_t ev;
orte_rml_send_t send;
/* conduit_id */
orte_rml_conduit_t conduit_id;
//[Anandhi] fix this, maybe define this within ofi?
Member

Did you mean to fix this before creating the PR?

/* Open the default oob conduit */
/*opal_output_verbose(10, orte_rml_base_framework.framework_output,
"%s Opening the default conduit - oob component",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)); */
Member

Please don't add commented-out code in new commits. Thanks!

orte_set_attribute(&conduit_attr, ORTE_RML_INCLUDE_COMP_ATTRIB, ORTE_ATTR_LOCAL,"oob",OPAL_STRING);
/* To set the default conduit to ofi-sockets, comment above line and uncomment below 2 lines*/
//orte_set_attribute( &conduit_attr, ORTE_RML_INCLUDE_COMP_ATTRIB, ORTE_ATTR_GLOBAL,"ofi",OPAL_STRING);
//orte_set_attribute( &conduit_attr, ORTE_RML_PROVIDER_ATTRIB, ORTE_ATTR_GLOBAL,"sockets",OPAL_STRING);
Member

Do we really want to force the OFI provider to be sockets?


/** RML/OFI key values **/
/* (char*) ofi socket address (type IN) of the node process is running on */
#define OPAL_RML_OFI_FI_SOCKADDR_IN "rml.ofi.fisockaddrin"
#define OPAL_RML_OFI_FI_SOCKADDR_IN "rml.ofi.fisockaddrin"
Member

Please watch the use of whitespace at the end of lines 😄

@@ -40,6 +43,9 @@
#define MULTI_BUF_SIZE_FACTOR 128
#define MIN_MULTI_BUF_SIZE (1024 * 1024)

#define SOCKADDR "ofi-sockaddr"
#define PSMXADDR "ofi-psmxaddr"
Member

Seems odd to have a PSM-specific construct in this code...? Isn't this supposed to be generic libfabric code?
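One way to avoid a per-provider macro such as PSMXADDR is to derive the key from the provider name at run time. The following is a minimal sketch, not code from this PR: `make_addr_key` is a hypothetical helper, and the "ofi-<name>addr" pattern is only inferred from the two macros in the hunk above. In libfabric the provider name would presumably come from `fabric_attr->prov_name` of the `fi_info` entry.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the PR): build a key of the form
 * "ofi-<prov_name>addr" so that no provider needs its own macro.
 * Returns 0 on success, -1 on invalid arguments or if buf is too small. */
int make_addr_key(const char *prov_name, char *buf, size_t buflen)
{
    if (prov_name == NULL || buf == NULL || buflen == 0) {
        return -1;
    }

    int n = snprintf(buf, buflen, "ofi-%saddr", prov_name);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;  /* encoding error or output would have been truncated */
    }
    return 0;
}
```

With this approach, passing "sock" or "psmx" yields the same strings as the existing SOCKADDR and PSMXADDR macros, and a new provider needs no code change.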

/* Alternatively, check the attributes to see if we qualify - we only handle
* "pt2pt" */
OPAL_LIST_FOREACH(attr, attributes, orte_attribute_t) {

Member

Is there something supposed to be in the body of this loop?


/* The returned string will be of format - "<process-name>;ofi-socket:<sin_family,sin_addr,sin_port>;ofi-<provider2>:<prov2epname>" */
for( cur_ofi_prov=0; cur_ofi_prov < orte_rml_ofi.ofi_prov_open_num ; cur_ofi_prov++ ) {
switch ( orte_rml_ofi.ofi_prov[cur_ofi_prov].fabric_info->addr_format) {
Member

Same comment as above: if the fi_av_straddr() could be used, that would be less code to have here, and you won't need to know/understand all addressing formats.

I see similar constructs later -- I'll stop mentioning it here in the review, but see if fi_av_straddr() can be used to avoid these kinds of things.

free(final);
final = tmp;
len = strlen(final);
/* [TODO] check string length to not exceed limit */
Member

Did you mean to do this before making this PR?
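For reference, the length check named in the TODO could look roughly like the sketch below. It is an illustration only: `append_bounded` and its `limit` parameter are hypothetical, the PR does not state what the actual limit is, and the ';' separator is taken from the string format described in the earlier hunk.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the PR): append piece to the heap string
 * *final, separated by ';', but only if the result stays within limit
 * characters (excluding the terminating NUL).
 * Returns 0 on success; on failure returns -1 and leaves *final untouched. */
int append_bounded(char **final, const char *piece, size_t limit)
{
    if (final == NULL || piece == NULL) {
        return -1;
    }

    size_t old_len = (*final != NULL) ? strlen(*final) : 0;
    size_t add_len = strlen(piece);
    size_t sep = (old_len > 0) ? 1 : 0;  /* no separator before the first piece */

    /* Written this way so the comparison cannot overflow. */
    if (add_len > limit || old_len + sep > limit - add_len) {
        return -1;
    }

    char *tmp = malloc(old_len + sep + add_len + 1);
    if (tmp == NULL) {
        return -1;
    }
    if (old_len > 0) {
        memcpy(tmp, *final, old_len);
        tmp[old_len] = ';';
    }
    memcpy(tmp + old_len + sep, piece, add_len + 1);  /* copies the NUL too */

    free(*final);
    *final = tmp;
    return 0;
}
```

The free-then-reassign step at the end mirrors the `free(final); final = tmp;` pattern in the hunk, with the check moved in front of the allocation so an oversized string is never built.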

{
char *tmp, *sin_fly, *sin_port, *sin_addr;
short port;
int res;
Member

Minor: watch the indenting here.

ep_sockaddr->sin_port = htons(port);
res = inet_aton(sin_addr,(struct in_addr *)&ep_sockaddr->sin_addr);

opal_output_verbose(1,orte_rml_base_framework.framework_output,
Member

Minor: watch the indenting here.
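Separately from the indentation, a signed `short` cannot hold port numbers above 32767, so a parser for the "<sin_family,sin_addr,sin_port>" string may want an unsigned type and explicit validation. The sketch below is an assumption-laden illustration, not the PR's code: `parse_sockaddr_in` is a hypothetical name, the family is assumed to be encoded as a decimal number, and `inet_pton()` is used in place of the `inet_aton()` call shown in the hunk.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of the PR): parse
 * "<sin_family>,<sin_addr>,<sin_port>" into *out.
 * Assumes the family is a decimal integer and the address is dotted-quad IPv4.
 * Returns 0 on success, -1 on any malformed or out-of-range field. */
int parse_sockaddr_in(const char *text, struct sockaddr_in *out)
{
    int family;
    unsigned int port;
    char addr[INET_ADDRSTRLEN];
    char extra;

    if (text == NULL || out == NULL) {
        return -1;
    }

    /* %15[^,] limits the address field to INET_ADDRSTRLEN - 1 characters;
     * the trailing %c detects unexpected characters after the port. */
    if (sscanf(text, "%d,%15[^,],%u%c", &family, addr, &port, &extra) != 3) {
        return -1;
    }
    if (family != AF_INET || port > 65535) {
        return -1;
    }

    memset(out, 0, sizeof *out);
    out->sin_family = AF_INET;
    out->sin_port = htons((uint16_t)port);

    /* inet_pton() returns 1 only for a valid address in the given family. */
    if (inet_pton(AF_INET, addr, &out->sin_addr) != 1) {
        return -1;
    }
    return 0;
}
```

Returning an error for each bad field lets the caller log the failure instead of continuing with a partially filled address.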

@anandhis anandhis closed this Oct 24, 2016
@anandhis anandhis reopened this Oct 24, 2016
@rhc54
Contributor

rhc54 commented Oct 25, 2016

Here are the MTT results:

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 3.0.0a1     | 00:01    | 1    |      |          |      | MPI_Install-my_installation-my_installation-3.0.0a1-my_installation.html |
| Test Build  | trivial         | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-3.0.0a1-my_installation.html          |
| Test Build  | ibm             | 3.0.0a1     | 00:37    | 1    |      |          |      | Test_Build-ibm-my_installation-3.0.0a1-my_installation.html              |
| Test Build  | intel           | 3.0.0a1     | 01:18    | 1    |      |          |      | Test_Build-intel-my_installation-3.0.0a1-my_installation.html            |
| Test Build  | java            | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-3.0.0a1-my_installation.html             |
| Test Build  | orte            | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-3.0.0a1-my_installation.html             |
| Test Build  | oshmem          | 3.0.0a1     | 00:26    | 1    |      |          |      | Test_Build-oshmem-my_installation-3.0.0a1-my_installation.html           |
| Test Run    | trivial         | 3.0.0a1     | 00:07    | 8    |      |          |      | Test_Run-trivial-my_installation-3.0.0a1-my_installation.html            |
| Test Run    | ibm             | 3.0.0a1     | 08:51    | 488  |      |          |      | Test_Run-ibm-my_installation-3.0.0a1-my_installation.html                |
| Test Run    | spawn           | 3.0.0a1     | 00:07    | 6    | 1    |          |      | Test_Run-spawn-my_installation-3.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 3.0.0a1     | 02:18    |      | 1    |          |      | Test_Run-loopspawn-my_installation-3.0.0a1-my_installation.html          |
| Test Run    | intel           | 3.0.0a1     | 16:23    | 474  |      |          | 4    | Test_Run-intel-my_installation-3.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 3.0.0a1     | 11:52    | 431  |      |          | 47   | Test_Run-intel_skip-my_installation-3.0.0a1-my_installation.html         |
| Test Run    | java            | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-3.0.0a1-my_installation.html               |
| Test Run    | orte            | 3.0.0a1     | 00:41    | 19   |      |          |      | Test_Run-orte-my_installation-3.0.0a1-my_installation.html               |
| Test Run    | oshmem          | 3.0.0a1     | 20:05    | 202  | 98   | 6        |      | Test_Run-oshmem-my_installation-3.0.0a1-my_installation.html             |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+

@anandhis I think all you need to do now is clean up the warnings and address any further comments. I will also ask on the call today if someone with fabric on their cluster can check this there.

@jsquyres
Member

@anandhis Make sure to also fix the failing tests that show up here on GitHub. We also just voted in a new policy today in the Open MPI community: you need to include a "Signed-off-by" line in your commits. That is, when you commit, use git commit -s .... The -s option automatically adds a Signed-off-by: ... line at the bottom of your commit message.

@hppritcha
Member

I'll give this PR a try using the GNI provider, but will have to wait until early next week.

@jjhursey
Member

bot:ibm:retest
We had a cluster issue that caused the most recent CI to fail with an empty message. It should be resolved now.

@rhc54
Contributor

rhc54 commented Oct 26, 2016

We have this all opal_ignore'd for now so we can commit things without impacting anyone, and then let people selectively turn it on to check it on their system. I'm going to work with @anandhis to squash this down to a single commit before we bring it in.

Adding ofi plugin to allow for opening a conduit to use ethernet/fabric.

	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/base/rml_base_stubs.c
	deleted:    ../orte/mca/rml/ofi/.opal_ignore
	modified:   ../orte/mca/rml/ofi/Makefile.am
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

	Removed stale include directive
	modified:   ../orte/mca/rml/ofi/Makefile.am

The ofi plugin supports multiple providers, and identifies them
by ofi_prov_id,  changed the previous name conduit_id to ofi_prov_id
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_request.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Fixed merge issues, and minor pull-request comments
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Removed trailing space
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Cleaned up test- ofi_conduit_stress.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

cleaned up printing the provider info during initialisation
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Fixing warnings
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

minor cleanup
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

more cleanup
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Sending the ethernet address only in the get_contact_info, rest will be sent through modex
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Adding error logging on failures
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Handling the OPAL_MODEX_SEND/RECV generically for all ofi providers.
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Adding to build ofi for limited people
	new file:   ../orte/mca/rml/ofi/.opal_ignore
	new file:   ../orte/mca/rml/ofi/.opal_unignore

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Removing the error logging for now
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
@rhc54
Contributor

rhc54 commented Oct 26, 2016

Travis stalled/died

@rhc54 rhc54 merged commit 60099c9 into open-mpi:master Oct 26, 2016
@anandhis anandhis deleted the ofi2 branch May 11, 2017 17:15