Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MTL OFI: add support for FI_REMOTE_CQ_DATA. #5004

Merged
merged 1 commit into from
Jun 15, 2018
Merged

MTL OFI: add support for FI_REMOTE_CQ_DATA. #5004

merged 1 commit into from
Jun 15, 2018

Conversation

matcabral
Copy link
Contributor

MTL OFI: add support for FI_REMOTE_CQ_DATA.

Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA.
Signed-off-by: Matias Cabral matias.a.cabral@intel.com

@matcabral matcabral requested review from jsquyres and hppritcha April 3, 2018 01:09
@matcabral matcabral self-assigned this Apr 3, 2018
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really review the majority of this code, but I did have a small number of comments.

if (OPAL_UNLIKELY(0 > ret)) {
char *fi_api;
if (ompi_mtl_ofi.fi_cq_data) {
asprintf( &fi_api, "fi_tinjectddata");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use a string constant here? I know this is an error path, and it doesn't really matter, but it might be slightly simpler to:

char *fi_api = ompi_mtl_ofi.fi_cq_data ? " fi_tinjectddata" : "fi_tinject";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No limitation and indeed looks nicer. Will update.

fi

AC_DEFINE_UNQUOTED([MTL_OFI_ALTERNATIVE_DEFAULT_TAG],$ALTERNATIVE_TAG,
[Use ofi alternative no FI_REMOTE_CQ_DATA tag])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what this configure CLI option is for. It sounds like if the provider does not support FI_REMOTE_CQ_DATA, you use a different bit mapping scheme for tag matching. Is that right?

If so, why isn't that just detected and used a run time -- why does it require a configure-time argument?

Copy link
Contributor Author

@matcabral matcabral Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is indeed a runtime detection of FI_REMOTE_CQ_DATA and fallback to a default OFI tag fields distribution (mtl_ofi_component.c 415 to 458). However, the original implementation offered very few bits of the OFI tag to pack the source rank (16) considerably limiting scalability. @bosilca suggested having 2 options for the default/fallback OFI tag fields distribution that can be selected at configure time, with the "default default" having more bits for the source rank than for the user tag (as openib btl does). See mtl_ofi_types.h 78-81. This option is to choose DEFAULT_2 at build time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to keep the old / unscalable-number-of-bits-for-source default values?

Should the number of source bits be dynamically determined by the job size / size of MPI_COMM_WORLD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to keep the old / unscalable-number-of-bits-for-source default values?

Here is my understanding OMPI "needs" 96 bits 32x3 for source rank, communicator id and MPI tag. However, OFI offers 64 in the tag, plus 32 (or more) in FI_REMOTE_CQ_DATA when supported. When not supported, 64 is all you get and must make it fit. The "old_unscalable" tag was thought to provide the full range of MPI tags. If a user still needs this, can build with the --enable-mtl-ofi-alternative-tag

Should the number of source bits be dynamically determined by the job size / size of MPI_COMM_WORLD?

Maybe, but would still be cases that will fall out of the logic when spawning new ranks and crating new communicators. Maybe an alternative is reading the total number of available slots, it it actually possible see this value in an MTL ? but will have to implement a logic that will still leave some cases out (as long as 64 < 96 😮 ) OR over reserve for slots not used. So, a build time option may still be helpful. The question would be what are apps really needing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsquyres Would it be possible to merge the PR as is? We have a few other patches we want to contribute that depend on this one. I understand and agree that a dynamic approach for defining the tag bits is the optimal way to go. However, a) most of the providers today support FI_REMOTE_CQ_DATA b) as George shared, there risk of the default TAG not being enough is very reduced, the build time flag is just an alternative for when that odd case happens. We can indeed revisit the the dynamic approach if we see the concern grows. Thanks,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this is effectively your+Cray's module, so if you can get @hppritcha's approval, you can go ahead and merge. But I still think:

  1. Having a compile-time decision like this is somewhat un-OMPI-ish. IMHO, it would be more natural to have a decision like this be a run-time / MCA-var-driven decision. But I'm not quite sure what decision you're trying to get from the user -- isn't it all driven by what the underlying libfabric provider provides? I.e., if the fi provider supports 32 bits in FI_REMOTE_CQ_DATA, then you use scheme A. If it doesn't, you use scheme B. Does this need to be a user-driven decision at all?
  2. Does the OFI MTL convey back the max supported tag value back up to the CM PML?

I agree that dynamically sizing the number of bits based on the size of the job could be considered outside the scope of this PR. Random question, though: does the OFI MTL support spawn (i.e., adding more peers)? If not, then you could consider the job size as constant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the name. It doesn't give any indication of why you might want to use the alternative logic. Much prefered would be --enable/disable-remote-cq-data-matching or somesuch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsquyres

  1. Strictly speaking, it can be made a runtime decision. However, given the low chance of occurrence @bolica suggested making it a build time.
    Clarification of the decision being made by the option: (also clarifies @bwbarrett 's question): OMPI needs to send 96 bits with each message (32 source rank, 32 communicator id, 32 user tag) but OFI tag only offers 64. When the OFI provider supports FI_REMOTE_CQ_DATA (detected at runtime) there is no limitation since 32 bit is the minimum required by OFI spec. However, when it is not supported some bits need to be trimmed (96 -> 64). Then is when this option comes into play and offers two alternatives for "fallback tag" with different options to distribute the 64 bits available.

  2. yes, this is shown in the comments belwo

Random question, though: does the OFI MTL support spawn (i.e., adding more peers)?

Yes.

(1UL << 30), /* max tag value - must allow negatives */
(int)((1ULL << MTL_OFI_CID_BIT_COUNT ) - 1), /* max cid */
(int)(1ULL << (MTL_OFI_TAG_BIT_COUNT - 2)),/* max tag value - must allow negatives */
// (1UL << 30), /* max tag value - must allow negatives */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to commit a commented-out line like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think OFI_TAG_BIT_COUNT is right; numbers can be negative. That doesn't mean - 2, that means / 2 for max tag value. I could be wrong; the allocation is hard to figure out. But pretty sure it's / 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to commit a commented-out line like this.

Oops, this slipped out. Will remove

I don't think OFI_TAG_BIT_COUNT is right;

I agree! I asked this in the devel list. I just took a conservative approach and followed what was already there. Let me test this further.

fi

AC_DEFINE_UNQUOTED([MTL_OFI_ALTERNATIVE_DEFAULT_TAG],$ALTERNATIVE_TAG,
[Use ofi alternative no FI_REMOTE_CQ_DATA tag])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this is effectively your+Cray's module, so if you can get @hppritcha's approval, you can go ahead and merge. But I still think:

  1. Having a compile-time decision like this is somewhat un-OMPI-ish. IMHO, it would be more natural to have a decision like this be a run-time / MCA-var-driven decision. But I'm not quite sure what decision you're trying to get from the user -- isn't it all driven by what the underlying libfabric provider provides? I.e., if the fi provider supports 32 bits in FI_REMOTE_CQ_DATA, then you use scheme A. If it doesn't, you use scheme B. Does this need to be a user-driven decision at all?
  2. Does the OFI MTL convey back the max supported tag value back up to the CM PML?

I agree that dynamically sizing the number of bits based on the size of the job could be considered outside the scope of this PR. Random question, though: does the OFI MTL support spawn (i.e., adding more peers)? If not, then you could consider the job size as constant.

} else { \
match_bits |= (MTL_OFI_TAG_MASK & tag); \
} \
#define MTL_OFI_SET_RECV_BITS(match_bits, mask_bits, comm_id, source, tag) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite long for a macro. Is there a reason to not make this an inline function? (some of the others above are also a little long)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No really good reason, just extended the existing approach. OMPI prefers inline for longer macros? sure, I can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were cautious about using inline for long operations because of the jitter it can cause as compilers can't optimize the instruction cache as much as usual? @bwbarrett ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macros vs. inline in this case is probably to-MAY-to vs. to-MAH-to: i.e., you're shoving a bunch of code inline, regardless of the mechanism.

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To merge this patch, I think we should really document (in, say, a README.md in the mtl directory) how this all works. I'm confused, and it's all based on Portals ideas that I've been working on for years. A description of how matching works for the different modes supported in a README would go a long way towards maintainability.

fi

AC_DEFINE_UNQUOTED([MTL_OFI_ALTERNATIVE_DEFAULT_TAG],$ALTERNATIVE_TAG,
[Use ofi alternative no FI_REMOTE_CQ_DATA tag])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the name. It doesn't give any indication of why you might want to use the alternative logic. Much prefered would be --enable/disable-remote-cq-data-matching or somesuch.

(1UL << 30), /* max tag value - must allow negatives */
(int)((1ULL << MTL_OFI_CID_BIT_COUNT ) - 1), /* max cid */
(int)(1ULL << (MTL_OFI_TAG_BIT_COUNT - 2)),/* max tag value - must allow negatives */
// (1UL << 30), /* max tag value - must allow negatives */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think OFI_TAG_BIT_COUNT is right; numbers can be negative. That doesn't mean - 2, that means / 2 for max tag value. I could be wrong; the allocation is hard to figure out. But pretty sure it's / 2.

@@ -244,6 +244,7 @@ ompi_mtl_ofi_send_start(struct mca_mtl_base_module_t *mtl,
ompi_proc_t *ompi_proc = NULL;
mca_mtl_ofi_endpoint_t *endpoint = NULL;
ompi_mtl_ofi_request_t *ack_req = NULL; /* For synchronous send */
fi_addr_t src_addr=0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src_addr = 0; (spaces are important)

src_addr = endpoint->peer_fiaddr;
} else {
MTL_OFI_SET_SEND_BITS(match_bits, comm->c_contextid,
comm->c_my_rank, tag);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the indentation is wrong here.

@@ -100,6 +101,17 @@ AC_DEFUN([OPAL_CHECK_OFI],[
AC_SUBST($1_CPPFLAGS)
AC_SUBST($1_LDFLAGS)
AC_SUBST($1_LIBS)
AC_ARG_ENABLE(mtl-ofi-alternative-tag,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the MTL's configure.m4, not the toplevel opal_check_ofi. This flag does nothing for OSC or a potential OOB, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will move.

@matcabral
Copy link
Contributor Author

I agree on adding a README.md. However, I would like to reach a general agreement on the the build time option being discussed, and document accordingly. Thanks for the feedback.

@bwbarrett
Copy link
Member

@matcabral, honestly, I am having a hard time sorting through the protocol differences without the readme to comment on the high level part of the software. That's why I poked at the README. I think the general idea is ok, although clearly I'd have a preference for runtime selection. We also need the non-cq_data case to work by default because packaging in distros.

@matcabral
Copy link
Contributor Author

@bwbarrett @jsquyres I'm fine with moving it to be a runtime option, but would please ask for some feedback on the following before doing the change. AFAIK, the openib btl only has 16 bits for the user tag and that has not been a problem. So, would it be safe to just follow the same approach for the OFI fallback tag (and not use any option at all)? I'm honestly trying to not add runtime options unless necessary, to avoid overwhelming users with things needed in very odd cases.

README: I will create one and post it for review, would appreciate some input on the above question.

The non-cq data is definitely expected to work by default, subject that 16 bits are good for user tag. OR, can go back to the previous fallback tag that had more bits for tag and less for source rank.

Thanks!

@bwbarrett
Copy link
Member

The BTL design pushes tag sizing into the OB1 PML. The OB1 PML has a max tag size of MAX_INT. The Portals4, PSM, and PSM2 MTLs all limit tag to 2^30. I have trouble seeing us being ok with max tag being only 15 bits; yes that's all the standard requires, but when Portals3.3 used that as a MPI_TAG_UB, there was considerable pushback from users.

@matcabral
Copy link
Contributor Author

@bwbarrett Thanks for the input, will effectively create the runtime mca paramters. Now, these are the two fallback distributions I'm proposing:
Default fallback: 26 b Comm id, 20 b source rank, 16 user tag
Alternative fallback: 12 b Comm id, 18 b source, 32 b user tag

Thoughts?

Note that I am showing 2 more bits for the source than what was shared on this PR since we optimized the sync send protocol to use only 2 bits (instead of 4). However, that is in a different patch.

Thanks,

@bwbarrett
Copy link
Member

I think 18 bits for source rank is so far above average job size that I'd push to make that 2^15 or 2^16. The places that don't have cq_data can't scale as large; that's a reasonable tradeoff.

@matcabral
Copy link
Contributor Author

Just being paranoid and willing to avoid confusions: both of the fallback tags I shared are for the case when cq_data is not supported. So, when no cq_data is available, the MTL will offer 2^20 by default for the source rank, and move to offer 2^18 when (no cq_data is available) and the mca paramter is passed. From your suggestion, I get I should be reducing this.

@hppritcha
Copy link
Member

@matcabral could you update with some of this feedback incorporated? Then I'll run against GNI provider again and review

@matcabral
Copy link
Contributor Author

Hi @hppritcha, I'm actively working on the comments. I will have an update posted in the next couple days. thanks,

@matcabral
Copy link
Contributor Author

Summary of the changes in new patch:

  • Added README file
  • Made the option to select the "fallback tag" distribution for providers that do not support FI_REMOTE_CQ_DATA an mca parameter.
  • Made the default "fallback tag" the option with MAX_INT for MPI tag (fewer communicator IDs). Added more bits to the MPI TAG in the alternative fallback distribution (20 bits for tags).
  • Reviewed the max tag allowed: maximum signed number.
  • Migrated all tag creation macros to inline functions.

NOTE: We have a patch that applies on top of this one that optimized the MPI_Ssend protocol to use 2 bits instead of 4. So there will 2 more bits available in the fallback tags.

@matcabral
Copy link
Contributor Author

Hi @bwbarrett, would please take a look at the latest patch? thanks,

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only thing I really didn't like in this patch was the lack of a mode-set parameter, making it hard to ensure all three cases work in our nightly testing.

fi_strerror(-ret), -ret);
goto error;
}
else if ((NULL != prov_cq_data) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open MPI style is } else if (.... on one line. So the } on line 453 should be on line 454.

Also, why an else if, if your previous test is going to goto NULL. It's a little defensive, but it's also a bit hard to follow.

Copy link
Contributor Author

@matcabral matcabral May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will more the else if to line 453.
else if : The first test checks for (-FI_ENODATA != ret) to consider it an error. So, you still don't know if the provider supports CQ_DATA

@@ -136,7 +137,23 @@ ompi_mtl_ofi_component_register(void)
MCA_BASE_VAR_SCOPE_READONLY,
&ompi_mtl_ofi.ofi_progress_event_count);

free(desc);
free(desc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the unnecessary change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was a minor indentation fix

free(desc);

fallback_alternative_tag = false;
asprintf(&desc, "Use alternative ofi tag bits distribution for providers that do not support FI_REMOTE_CQ_DATA:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this. I've read it three times and I'm still confused. I also don't like that there's no way to force using cq_data (or fail if it's not there). Maybe a 3-option flag that sets a particular mode (would be easier to test as well, wouldn't need a device that doesn't support cq_data to test, could just run through each mode).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are few different things here:

  • The README file was intended to provider further clarity to the fallback tag thing. I'm open to add any additional mechanism to add more clarity, please share how.
  • I understand the complications for testing, but I'm not convinced that CQ_DATA selection should be made available through an mca parameter since this is just a testing requirement with no benefit for the end user (in fact it limits scalability). Unless there is a provider that may have benefits by intentionally not using CQ_DATA, e.g. overhead in supporting FI_DIRECTED_RECIEVE, required to not include the rank in the tag. Thoughts ?

@@ -392,7 +410,7 @@ ompi_mtl_ofi_component_init(bool enable_progress_threads,
if (FI_ENODATA == -ret) {
// It is not an error if no information is returned.
goto error;
} else if (0 != ret) {
} else if (OPAL_UNLIKELY(0 != ret)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the compiler hint; this is initialization, not the critical path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will remove all compiler hints from init time.

@@ -310,11 +327,12 @@ ompi_mtl_ofi_component_init(bool enable_progress_threads,
{
int ret, fi_version;
struct fi_info *hints;
struct fi_info *providers = NULL, *prov = NULL;
struct fi_info *providers = NULL, *prov = NULL, *prov_cq_data = NULL ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiple lines make code easier to read :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

* FI_DIRECTED_RECV is also needed so receives can discrimate the source
*/
prov_tmp_name = strdup(prov->fabric_attr->prov_name);
if(!prov_tmp_name){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in OMPI, we explicitly check against the NULL.

I also don't understand why you're swapping provider names here; a comment would go a long way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not swapping providers names, jut making a copy to avoid issues with fi_freeinfo(). Will add a comment.

else if ((NULL != prov_cq_data) &&
(0 == strncmp (prov_tmp_name, prov_cq_data->fabric_attr->prov_name,
strlen(prov_tmp_name)))) {
prov=prov_cq_data;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment of which case we're in would make this way more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

unsigned long long source_rank_mask;
unsigned long long mpi_tag_mask;
int num_bits_mpi_tag;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the extra newline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

#define MTL_OFI_SYNC_SEND_DATA (0x0000000100000000ULL)
#define MTL_OFI_SYNC_SEND_ACK_DATA (0x0000000900000000ULL)

__opal_attribute_always_inline__ static inline uint64_t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to be inlined? Thinking you're smarter than the inline is generally a path to madness...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function creates the send tag. Please check previous comments where it was asked to move the macros that create tags to inline functions.

@matcabral
Copy link
Contributor Author

matcabral commented Jun 6, 2018

@bwbarrett, please review the updated patch. Main change is the tag selection logic and corresponding mca paramter. Now it addresses all you requirements. Details explained in the README file.
In addition all other minor comments were addressed.

@matcabral
Copy link
Contributor Author

minor pending cleaning done in last patch

@matcabral
Copy link
Contributor Author

@jsquyres would you consider the latest patch addresses your comments? merge request. thanks,

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great; thanks for all the changes.

My only quibble is with the help message for the MCA var. You can accept my suggestion or not -- my only goal with the suggestion is to provide a help message that is oriented towards an end user (not a developer who knows things about libfabric). No need to have me re-review if you update the help message.


ofi_tag_mode = MTL_OFI_TAG_AUTO;
asprintf(&desc, " Mode for OFI tag."
" 1 auto (default): detect if the provider supports FI_REMOTE_CQ_DATA or fallback to 2."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what the explanation for 1 means. Does it mean:

  1. If FI_REMOTE_CQ_DATA is supported, use 4.
  2. Otherwise, use 2.
  3. 3 is available just for the heckuvit (i.e., different bit counts than 2)
  4. How many bits are used for 4? I only ask because they're specified / shown for 2 and 3, but not 4 (or 1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All answers are correct. 4 or 1 do not show this info because there are no limitations: 32b. Should it still show it? or at least mention there are no limitations using it.

" 1 auto (default): detect if the provider supports FI_REMOTE_CQ_DATA or fallback to 2."
" 2 ofi_tag_1: %d bits com_id,%d bits source rank,%d bits mpi_tag."
" 3 ofi_tag_2: %d bits com_id,%d bits source rank,%d bits mpi_tag."
" 4 force_fi_cq_data: try FI_REMOTE_CQ_DATA or fail if not supported.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the user supposed to specify a value of auto, ofi_tag_1, ...etc., or a value of 1, 2, ...etc.? From reading this help message, I think a user might assume that they are supposed to use the number, but I think that since you're registering this as an MCA var enum, they're supposed to specify the names (auto, etc.), right?

Here's my suggestion for the help message:

Mode specifying how many bits to use for various MPI values in OFI/Libfabric communications. Some Libfabric provider network types can support as many bits as Open MPI needs; others can only supply a limited number of bits, which then must be split across the MPI communicator ID, MPI source rank, and MPI tag. Three different splitting schemes are available: ofi_tag_full (%d bits for the communicator, %d bits for the source rank, and %d bits for the tag), ofi_tag_2 (%d bits communicator, %d bits source rank, %d bits tag), or ofi_tag_3 (%d bits communicator, %d bits source rank, %d bits tag). By default, this MCA variable is set to "auto", which will first try to use "ofi_tag_full", and if that fails, fall back to "ofi_tag_2".

Where "ofi_tag_full" -- I think -- is the equivalent to your force_fi_cq_data (FI_REMOTE_CQ_DATA and friends have no meaning to the end user).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing numbers vs strings: both will work (since the enum is registered), but (please correct me if I'm wrong) most of the mca parameters are written suggesting text. I take it in and will update the README accordingly and also update the parameter help message. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks. I had honestly forgotten that we allow both numbers or names for enum values of MCA vars. If you find my suggested text helpful, awesome. 😄

Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA. Add README file to OFI MTL
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
@matcabral
Copy link
Contributor Author

New patch including @jsquyres feedback for the README and parameter help message. Renamed the force_cq_data option to ofi_tag_full

@matcabral matcabral merged commit 10516c1 into open-mpi:master Jun 15, 2018
@matcabral matcabral deleted the mtl_ofi_remote_cq_data branch August 31, 2018 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants