-
Notifications
You must be signed in to change notification settings - Fork 892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ofi2 #2285
Ofi2 #2285
Conversation
Hmmm...looks like you may have overwritten changes in the RML base or incorrectly fixed the merge conflicts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this PR was a first push just to get some comments from others. Thank you for doing so!
I made a bunch of minor comments, I hope they're helpful!
@@ -145,8 +145,9 @@ typedef struct { | |||
opal_object_t super; | |||
opal_event_t ev; | |||
orte_rml_send_t send; | |||
/* conduit_id */ | |||
orte_rml_conduit_t conduit_id; | |||
//[Anandhi] fix this, maybe define this withing ofi? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to fix this before creating the PR?
/* Open the default oob conduit */ | ||
/*opal_output_verbose(10, orte_rml_base_framework.framework_output, | ||
"%s Opening the default conduit - oob component", | ||
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)); */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't add commented-out code in new commits. Thanks!
orte_set_attribute(&conduit_attr, ORTE_RML_INCLUDE_COMP_ATTRIB, ORTE_ATTR_LOCAL,"oob",OPAL_STRING); | ||
/* To set the default conduit to ofi-sockets, comment above line and uncomment below 2 lines*/ | ||
//orte_set_attribute( &conduit_attr, ORTE_RML_INCLUDE_COMP_ATTRIB, ORTE_ATTR_GLOBAL,"ofi",OPAL_STRING); | ||
//orte_set_attribute( &conduit_attr, ORTE_RML_PROVIDER_ATTRIB, ORTE_ATTR_GLOBAL,"sockets",OPAL_STRING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to force the OFI provider to be sockets?
|
||
/** RML/OFI key values **/ | ||
/* (char*) ofi socket address (type IN) of the node process is running on */ | ||
#define OPAL_RML_OFI_FI_SOCKADDR_IN "rml.ofi.fisockaddrin" | ||
#define OPAL_RML_OFI_FI_SOCKADDR_IN "rml.ofi.fisockaddrin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please watch the use of whitespace at the end of lines 😄
@@ -40,6 +43,9 @@ | |||
#define MULTI_BUF_SIZE_FACTOR 128 | |||
#define MIN_MULTI_BUF_SIZE (1024 * 1024) | |||
|
|||
#define SOCKADDR "ofi-sockaddr" | |||
#define PSMXADDR "ofi-psmxaddr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd to have a PSM-specific construct in this code...? Isn't this supposed to be generic libfabric code?
/* Alternatively, check the attributes to see if we qualify - we only handle | ||
* "pt2pt" */ | ||
OPAL_LIST_FOREACH(attr, attributes, orte_attribute_t) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something supposed to be in the body of this loop?
|
||
/* The returned string will be of format - "<process-name>;ofi-socket:<sin_family,sin_addr,sin_port>;ofi-<provider2>:<prov2epname>" */ | ||
for( cur_ofi_prov=0; cur_ofi_prov < orte_rml_ofi.ofi_prov_open_num ; cur_ofi_prov++ ) { | ||
switch ( orte_rml_ofi.ofi_prov[cur_ofi_prov].fabric_info->addr_format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: if the fi_av_straddr()
could be used, that would be less code to have here, and you won't need to know/understand all addressing formats.
I see similar constructs later -- I'll stop mentioning it here in the review, but see if fi_av_straddr()
can be used to avoid these kinds of things.
free(final); | ||
final = tmp; | ||
len = strlen(final); | ||
/* [TODO] check string length to not exceed limit */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to do this before making this PR?
{ | ||
char *tmp, *sin_fly, *sin_port, *sin_addr; | ||
short port; | ||
int res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: watch the indenting here.
ep_sockaddr->sin_port = htons(port); | ||
res = inet_aton(sin_addr,(struct in_addr *)&ep_sockaddr->sin_addr); | ||
|
||
opal_output_verbose(1,orte_rml_base_framework.framework_output, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: watching the indenting here.
Here are the MTT results:
@anandhis I think all you need to do now is cleanup the warnings and address any further comments. I will also ask on the call today if someone with fabric on their cluster can check this there. |
@anandhis Make sure to also fix the failing tests that show up here in github. We also just voted in a new policy today in the Open MPI community: you need to include a "signed off by" line in your commits. I.e., when you commit, use |
I'll give this PR a try using GNI provider, but will have to wait till early next week. |
bot:ibm:retest |
We have this all opal_ignore'd for now so we can commit things without impacting anyone, and then let people selectively turn it on to check it on their system. I'm going to work with @anandhis to squash this down to a single commit before we bring it in. |
modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/base/rml_base_stubs.c deleted: ../orte/mca/rml/ofi/.opal_ignore modified: ../orte/mca/rml/ofi/Makefile.am modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c modified: ../orte/test/system/ofi_conduit_stress.c Removed stale include directive modified: ../orte/mca/rml/ofi/Makefile.am The ofi plugin supports multiple providers, and identifies them by ofi_prov_id, changed the previous name conduit_id to ofi_prov_id modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_request.h modified: ../orte/mca/rml/ofi/rml_ofi_send.c Adding ofi plugin to allow for opening a conduit to use ethernet/fabric. modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/base/rml_base_stubs.c deleted: ../orte/mca/rml/ofi/.opal_ignore modified: ../orte/mca/rml/ofi/Makefile.am modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c modified: ../orte/test/system/ofi_conduit_stress.c Removed stale include directive modified: ../orte/mca/rml/ofi/Makefile.am The ofi plugin supports multiple providers, and identifies them by ofi_prov_id, changed the previous name conduit_id to ofi_prov_id modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_request.h modified: ../orte/mca/rml/ofi/rml_ofi_send.c Fixed merge issues, and minor pull-request comments modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Adding ofi plugin to allow for opening a conduit to use ethernet/fabric. modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/base/rml_base_stubs.c deleted: ../orte/mca/rml/ofi/.opal_ignore modified: ../orte/mca/rml/ofi/Makefile.am modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c modified: ../orte/test/system/ofi_conduit_stress.c Removed stale include directive modified: ../orte/mca/rml/ofi/Makefile.am The ofi plugin supports multiple providers, and identifies them by ofi_prov_id, changed the previous name conduit_id to ofi_prov_id modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_request.h modified: ../orte/mca/rml/ofi/rml_ofi_send.c Adding ofi plugin to allow for opening a conduit to use ethernet/fabric. modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/base/rml_base_stubs.c deleted: ../orte/mca/rml/ofi/.opal_ignore modified: ../orte/mca/rml/ofi/Makefile.am modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c modified: ../orte/test/system/ofi_conduit_stress.c Removed stale include directive modified: ../orte/mca/rml/ofi/Makefile.am Fixed merge issues, and minor pull-request comments modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Removed trailing space modified: ../orte/mca/rml/ofi/rml_ofi_component.c Cleaned up test- ofi_conduit_stress.c modified: ../orte/test/system/ofi_conduit_stress.c cleaned up printing the provider info during initialisation modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Fixing warnings modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> minor cleanup modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> more cleanup modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Sending the ethernet address only in the get_contact_info, rest will be sent through modex modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Adding error logging on failures modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Handling the OPAL_MODEX_SEND/RECV generically for all ofi providers. modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Adding to build ofi for limited people new file: ../orte/mca/rml/ofi/.opal_ignore new file: ../orte/mca/rml/ofi/.opal_unignore Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Removign the error logging for now modified: ../orte/mca/rml/ofi/rml_ofi_component.c
Travis stalled/died |
I have cleaned up the ofi, and using ofi_prov_id inside the plugin to identify the different providers in ofi-libfabric. Please let me know if you have a different branch to issue pull request on.
thanks,
Anandhi