Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not slice unencoded pmi messages #238

Closed
wants to merge 1 commit into from

Conversation

ggouaillardet
Copy link
Contributor

pmix_vallen_max is the max size of an encoded key value, so there
is no need to slice the unencoded pmi message in the first place.

if we really want to keep the unencoded pmi message small

refers #235 but for master only

pmix_vallen_max is the max size of an encoded key value, so there
is no need to slice the unencoded pmi message in the first place.
@rhc54
Copy link
Contributor

rhc54 commented Oct 16, 2014

@hjelmn @hppritcha @elenash Can you folks please take a look at this?

@elenash
Copy link
Contributor

elenash commented Oct 17, 2014

I don't have any objections to this patch, it just makes all pmi_puts to be done during pmix_fence, not earlier (during pmix_put). But could you explain how it fixes the original problem?

@ggouaillardet
Copy link
Contributor Author

@elenash

Without the patch, when you put a key that will increase pmix_packed_data from 200 to 300 bytes, 256 bytes of unencoded data will be sent immediatly
This is two keys :
key-0 is 192 unencoded bytes which means 256 encoded bytes
key-1 is 64 unencoded bytes which means 86 encoded bytes and is minus terminated
At fence, key-2 will be sent

The receiver will not try to get key-2 because key-1 is minus terminated,
and hence the truncated modex.

an other way to fix this is to slice the encoded key :
in this case, only key-0 is sent immediatly since it is not minus terminated

Gilles

elenash notifications@github.com wrote:

I don't have any objections to this patch, it just makes all pmi_puts to be done during pmix_fence, not earlier (during pmix_put). But could you explain how it fixes the original problem?


Reply to this email directly or view it on GitHub.

{"@context":"http://schema.org","@type":"EmailMessage","description":"View this Pull Request on GitHub","action":{"@type":"ViewAction","url":"https://github.com/open-mpi/ompi/pull/238#issuecomment-59513689","name":"View Pull Request"}}

@rhc54
Copy link
Contributor

rhc54 commented Oct 17, 2014

Just to provide some background on why this code even exists. Cray imposes a limit on the number of keys an application can commit to PMI. At large scale, OMPI was running into this limit, and so Nathan created the ability to "pack" our data into "meta-keys" so that every key pushed to PMI contained a full maxlen of info in it. This compressed the data into a minimum number of keys.

Of course, you then had to "uncompress" the meta-key data on the receiving end. Ugly, but it resolved the problem by significantly reducing the number of keys pushed to PMI.

I don't know what broke when we refactored the code to create the pmix framework, but this used to work quite well. I don't really have a stake in this code as pmix never uses it (we pack the data into buffers instead of using this method). My only reason for asking for a review was to ensure that those who do care can verify that the compression remains in operation so that Cray machines don't encounter the key limit problem again.

@elenash
Copy link
Contributor

elenash commented Oct 17, 2014

@rhc54 There shouldn't be such problem with this patch. It still creates meta keys but later when it was previously.

@elenash
Copy link
Contributor

elenash commented Oct 17, 2014

@ggouaillardet I'm not sure that understood what you mean. Why does key-1 become minus terminated?

@ggouaillardet
Copy link
Contributor Author

@elenash because pmi_encode minus terminates the encoded string

In this scenario, pmi_encode is invoked twice by opal_pmix_base_commit_packed :
First by s1_put => key-1 is minus terminated
And then by s1_fence => key-2 that will not even be retrieved

elenash notifications@github.com wrote:

@ggouaillardet I'm not sure that understood what you mean. Why does key-1 become minus terminated?


Reply to this email directly or view it on GitHub.

{"@context":"http://schema.org","@type":"EmailMessage","description":"View this Pull Request on GitHub","action":{"@type":"ViewAction","url":"https://github.com/open-mpi/ompi/pull/238#issuecomment-59527400","name":"View Pull Request"}}

@elenash
Copy link
Contributor

elenash commented Oct 17, 2014

@ggouaillardet I see, thanks!

@mellanox-github
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7/

Build Log
last 50 lines

[...truncated 16292 lines...]
[jenkins01:28155] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 4 (pid 28155, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: jenkins01
PID:        28155
--------------------------------------------------------------------------
[jenkins01:28152] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28160] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28153] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28150] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28163] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28157] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:28146] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50047,1],6]
  Exit code:    255
--------------------------------------------------------------------------
[jenkins01:28061] 7 more processes have sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[jenkins01:28061] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jenkins01:28061] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:28061] 7 more processes have sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@ggouaillardet
Copy link
Contributor Author

#245 is a better implementation that obsoletes this PR

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 21, 2016
Don't register the PSM errhandler until it is certain that the PSM component can be used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants