Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pmix: correctly split pmi messages #245

Merged
merged 1 commit into from
Nov 11, 2014

Conversation

ggouaillardet
Copy link
Contributor

This is a better alternative to #238

messages are split and sent immediatly.

@elenash could you please review this ?

@elenash
Copy link
Contributor

elenash commented Oct 22, 2014

@ggouaillardet I looked though your changes and it's not really clear to me what you added threshold =3/4 * pmix_vallen and manipulation with data_offset %3 in your partial_commit_packed function. What's the key idea of that?
It looks like you missed a check if (NULL != tmp) after the while loop (it's needed since we can get there with freed tmp buffer in case of errors in the loop).

And by the way, what's the necessity of this fix? Your last fix seemed not bad.
Did you test this patch well?

@ggouaillardet
Copy link
Contributor Author

@elenash

  • pmix_vallen is the max size of a base64 encoded key value and base64
    encodes 3 bytes into 4 ascii bytes
    so the threshold for the binary key is 3/4 * pmix_vallen (so the
    encoded key is less than pmix_vallen characters)
  • in order to keep it simple, commit partial encodes the binary key 3
    bytes at a time, hence the %3
  • i removed free(tmp) before the break statement, so there is no more
    need to test (NULL != tmp), thanks for pointing this.
  • with the previous change, all the keys were sent at the last minute,
    and i thought it was suboptimal compared with the
    initial (but buggy) method in which key chunks were sent as soon as
    their length reached pmix_vallen characters.
  • i did test it and i will test it again

i pushed my changes (after a rebase in order to keep the git log short)
could you please have a look at it ?

@elenash
Copy link
Contributor

elenash commented Oct 22, 2014

@ggouaillardet I think there is a typo in line 180:
memmove(encoded_data, encoded_data+max_key-1-_enc_data_offset, max_key - 1 -_enc_data_offset);
It look like number of bytes to memmove here should be strlen(encoded_data)+1 -(max_key - 1 - *enc_data_offset), isn't it?

Everything else seems fine. But I'm still looking at changes in this pull request, these are without your new fixes. You said you pushed it somewhere?

@elenash
Copy link
Contributor

elenash commented Oct 22, 2014

@ggouaillardet at line 274 the same

@ggouaillardet
Copy link
Contributor Author

@elenash
thanks for the review, you are right about the typos
i found an other issue in this poc, and also pushed 248acbb in order to fix a different issue.

i overwrited my branch (git push --force) so the most up to date revision is always avaliable at https://github.com/open-mpi/ompi/pull/245/commits

i will test a bit more and push this branch into the master if you are ok with it

Thanks again !

@elenash
Copy link
Contributor

elenash commented Oct 23, 2014

@ggouaillardet Yes, sure, it looks fine to me. Thanks you!

@mellanox-github
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/6/

Build Log
last 50 lines

[...truncated 16357 lines...]
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[jenkins01:04383] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 7 (pid 4383, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: jenkins01
PID:        4383
--------------------------------------------------------------------------
[jenkins01:04381] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04374] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04379] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04371] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04372] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04376] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:04370] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49134,1],6]
  Exit code:    255
--------------------------------------------------------------------------
[jenkins01:04364] 7 more processes have sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[jenkins01:04364] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jenkins01:04364] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:04364] 7 more processes have sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@mellanox-github
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/11/
Test PASSed.

@mellanox-github
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/15/
Test PASSed.

Thanks to @elenash for all the reviews
ggouaillardet added a commit that referenced this pull request Nov 11, 2014
pmix: correctly split pmi messages
@ggouaillardet ggouaillardet merged commit 43af1e2 into open-mpi:master Nov 11, 2014
@mellanox-github
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/28/

Build Log
last 50 lines

[...truncated 6 lines...]
 > /usr/bin/git init /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace # timeout=60
Fetching upstream changes from https://github.com/open-mpi/ompi
 > /usr/bin/git --version # timeout=60
using .gitcredentials to set credentials
 > /usr/bin/git config --local credential.helper store --file=/tmp/git1174826598023214943.credentials # timeout=60
 > /usr/bin/git fetch --tags --progress https://github.com/open-mpi/ompi +refs/heads/*:refs/remotes/origin/* # timeout=60
 > /usr/bin/git config --local --remove-section credential # timeout=60
 > /usr/bin/git config remote.origin.url https://github.com/open-mpi/ompi # timeout=60
 > /usr/bin/git config remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=60
 > /usr/bin/git config remote.origin.url https://github.com/open-mpi/ompi # timeout=60
Pruning obsolete local branches
Fetching upstream changes from https://github.com/open-mpi/ompi
using .gitcredentials to set credentials
 > /usr/bin/git config --local credential.helper store --file=/tmp/git1376052301993232427.credentials # timeout=60
 > /usr/bin/git fetch --tags --progress https://github.com/open-mpi/ompi +refs/pull/*:refs/remotes/origin/pr/* --prune # timeout=60
 > /usr/bin/git config --local --remove-section credential # timeout=60
 > /usr/bin/git rev-parse origin/pr/245/merge^{commit} # timeout=60
FATAL: Command "/usr/bin/git rev-parse origin/pr/245/merge^{commit}" returned status code 128:
stdout: origin/pr/245/merge^{commit}

stderr: fatal: ambiguous argument 'origin/pr/245/merge^{commit}': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

hudson.plugins.git.GitException: Command "/usr/bin/git rev-parse origin/pr/245/merge^{commit}" returned status code 128:
stdout: origin/pr/245/merge^{commit}

stderr: fatal: ambiguous argument 'origin/pr/245/merge^{commit}': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1435)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1411)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1407)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1110)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1120)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.revParse(CliGitAPIImpl.java:516)
    at hudson.plugins.git.GitAPI.revParse(GitAPI.java:257)
    at hudson.plugins.git.RevisionParameterAction.toRevision(RevisionParameterAction.java:83)
    at hudson.plugins.git.GitSCM.determineRevisionToBuild(GitSCM.java:815)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:915)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1258)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:622)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:528)
    at hudson.model.Run.execute(Run.java:1759)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)

Test FAILed.

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 21, 2016
…-page-fix

man pages: fix problem with MPI_Win_lock_all
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants