ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c #4437
Comments
This looks like version confusion between the two nodes. Note that the |
@rhc54: thx a lot. I don't think that is the case, because I compiled and installed them using exactly the same configuration; you can check the ompi_info output below. Thx again. ompi_info of the local node: ompi_info of the remote node: |
Thanks for providing that output. I gather you built/installed them on each node separately, yes? That is a rather unusual way of doing it and generally not recommended - it is much safer to install on a shared file system directory. Try configuring with
|
@shinechou You might also want to check that there's not some OS/distro-installed Open MPI on your nodes that is being found and used (e.g., earlier in the PATH than your hand-installed Open MPI installations). |
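One way to check for a second installation earlier in the PATH is to list every match for a command, in search order. This is a minimal sketch; `list_on_path` is a hypothetical helper name, not part of Open MPI.

```shell
# List every executable named "$1" found on a PATH-style list, in search order.
# The list defaults to $PATH; more than one line of output suggests that
# several installations are visible.
list_on_path() (
    name=$1
    pathlist=${2:-$PATH}
    IFS=:
    set -f   # disable globbing while splitting the list on ':'
    for dir in $pathlist; do
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running `list_on_path mpirun` and `list_on_path ompi_info` on each node shows which copy would be picked up first. In bash, `type -a mpirun` gives similar information.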
@rhc54: you are right. I installed them separately. What is the proper way to do it? Do u have any guidance for that? Because after I install it I have to install another library on top of the ompi library. I'll try to compile it with the --enable-debug option and run the command u mentioned. @jsquyres: thank u for ur comments. But I'm sure there is no other ompi installed on either of my nodes. |
You could install Open MPI on one node, and then tar up the installation tree on that node, and then untar it on the other node. Then you'd know for sure that you have exactly the same binary installation on both nodes. Something like this:
$ ./configure --prefix=/opt/openmpi-3.0.0
$ make -j 32 install
...
$ cd /opt
$ tar jcf ~/ompi-install-3.0.0.tar.bz2 openmpi-3.0.0
$ scp ~/ompi-install-3.0.0.tar.bz2 othernode:
$ ssh othernode
...login to othernode...
$ cd /opt
$ rm -rf openmpi-3.0.0
$ sudo tar xf ~/ompi-install-3.0.0.tar.bz2
Usually, people install Open MPI either via package (e.g., RPM) on each node, or they install Open MPI on a network filesystem (such as NFS) so that the one, single installation is available on all nodes. Note that I mentioned the multiple Open MPI installation issue because the majority of time people run into this error, it's because users are accidentally / unknowingly using multiple different versions of Open MPI (note that Open MPI currently only supports running exactly the same version of Open MPI on all nodes in a single job). This kind of error almost always indicates that version X of Open MPI is trying to read more data than was sent by Open MPI version Y. Try this exercise:
$ ompi_info | head
$ ssh othernode ompi_info | head
Doing the 2nd line non-interactively is important (i.e., a single command -- not Make sure that both If they do, then there's something configured differently between the two (but which might still be a bug, because "same version but configured differently" should still usually work). |
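The exercise above can be scripted so the two version strings are compared rather than eyeballed. This is a rough sketch, assuming `ompi_info` prints a line of the form `Open MPI: <version>` near the top of its output; the function names are hypothetical, and `othernode` is the placeholder hostname used in this thread.

```shell
# Print the value of the "Open MPI:" line from ompi_info output read on stdin.
ompi_version_from_info() {
    sed -n 's/^[[:space:]]*Open MPI:[[:space:]]*//p' | head -n 1
}

# Compare two captured ompi_info outputs, passed as strings.
# Prints both versions; succeeds only if they are non-empty and equal.
same_ompi_version() (
    v1=$(printf '%s\n' "$1" | ompi_version_from_info)
    v2=$(printf '%s\n' "$2" | ompi_version_from_info)
    printf 'local:  %s\nremote: %s\n' "${v1:-<none>}" "${v2:-<none>}"
    [ -n "$v1" ] && [ "$v1" = "$v2" ]
)
```

A possible invocation is `same_ompi_version "$(ompi_info)" "$(ssh othernode ompi_info)"`, which keeps the remote call non-interactive. Matching versions only rule out one cause: two builds of the same version can still be configured differently.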
@jsquyres: thx again for ur guidance. I tried ur exercise. They return the same version, as below: $ ompi_info | head $ ssh mpiuser@client ompi_info | head |
@rhc54: thx. I've tried ur suggestion to run $ mpirun -npernode 1 -mca plm_base_verbose 5 hostname, but I got the error message like,
or
|
Github pro tip: use three single-tick-marks to denote verbatim regions. See https://guides.github.com/features/mastering-markdown/. Ok, good, so you have the same Open MPI v3.0.0 installed on both sides. But something must be different between them, or you wouldn't be getting these errors. Are both machines the same hardware? Also, I think @rhc54 meant for you to run the |
@jsquyres: thx a lot. Sorry, I'm a noob with Linux and Open MPI, so I didn't know that hostname in that command is not my node's hostname (which is indeed master). I am using the same version of Ubuntu on both nodes (Ubuntu 16.04 64-bit desktop), but the hardware is indeed different: the "master" node is an HP Z820 workstation (Xeon E5-2670, 64G ECC RAM, ASUS GTX1080), and the "client" node is a DIY PC (i3-7100, 32G DDR4 RAM, ASUS GTX1080Ti). Could the difference in HW configuration between the two nodes cause this error? |
Sorry for the confusion - I expected you to retain the |
@rhc54: thx. Could you please help me to figure it out? Pls check the output as below,
'''
An internal error has occurred in ORTE:
[[42458,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)
This is something that should be reported to the developers.
[ryan-z820:15616] [[42458,0],0] plm:base:receive processing msg
''' |
@rhc54: I've provided the log with your debug command, could you please help me to check it? thx a lot in advance. |
I honestly am stumped - it looks like you basically received an empty buffer, and I have no idea why. I can't replicate it. Perhaps you might try with the nightly snapshot of the 3.0.x branch to see if something has been fixed that might have caused the problem? |
@rhc54: thank you. I'll try the nightly version to see what comes out. |
@rhc54: the problem has been resolved. It seems that the problem was caused by the different HW architecture: one node uses a Xeon and the other an i3. I replaced the Xeon node with an i3 one and now it works fine. Another possible reason is that the Xeon workstation has two network adapters, one of which is used for AMT; not sure whether that affects ompi though. |
@shinechou I have the same problem. I have checked the Open MPI versions; they are the same. Could you tell me how to find the hardware problem? I have checked my network and its adapters; they are the same. |
@zhanglistar: for me, one of my nodes is an HP workstation with a Xeon CPU, and it has a different HW configuration than the master node (which is a regular PC). So I stopped using the HP workstation and used another regular PC instead. |
For the benefit of others running into this error or "ORTE_ERROR_LOG: Data unpack had inadequate space": in my case the issue was resolved by switching to the internal hwloc. I had compiled OpenMPI 3.0.0 on two different Ubuntu releases (16.04 and 17.10), both configured identically, and with Running Removing |
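To see which hwloc a given build uses, the hwloc entry in `ompi_info` can be compared across nodes. This is a sketch only, assuming the output contains `MCA hwloc:` lines whose first field is the component name (for example `external` when a system hwloc is used); the exact names vary by release, and `hwloc_components_from_info` is a hypothetical helper.

```shell
# Print the component name from each "MCA hwloc:" line of ompi_info output
# read on stdin, one name per line.
hwloc_components_from_info() {
    sed -n 's/^[[:space:]]*MCA hwloc:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}
```

Running `ompi_info | hwloc_components_from_info` on each node and comparing the results gives a quick indication of whether the builds picked up different hwloc versions.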
I got this same problem with Open MPI 4.0.1, when built locally on each machine (having machines with different generations of Intel CPUs). A Sandybridge machine would not be able to communicate with Skylake nodes. Copying the Sandybridge binaries over to the Skylake nodes fixed the issue. So we have a problem where different architectures produce different binaries (structures ?) which are not compatible protocol-wise. Do you think that should be fixed or just documented (don't build Open MPI locally on each machine) @rhc54 ? |
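To confirm whether two nodes really hold identical binaries, one option is to build a checksum manifest of the installation tree on each node and diff the results. This is a minimal sketch using POSIX `cksum` (a stronger hash such as `sha256sum` could be swapped in where available); `tree_manifest` is a hypothetical helper, and the install prefix in the usage note is the example path used earlier in this thread.

```shell
# Print "checksum size ./relative/path" for every regular file under
# directory "$1", sorted by path so two manifests can be compared with diff.
tree_manifest() (
    cd "$1" || exit 1
    find . -type f -exec cksum {} + | LC_ALL=C sort -k 3
)
```

For example, run `tree_manifest /opt/openmpi-3.0.0 > manifest.txt` on each node, copy one file across, and `diff` the two. Any differing line points at a file that was built or installed differently.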
@sjeaugey It sounds like you built with a different hwloc version on the two types of nodes? |
That was not my impression as I could not find any trace of hwloc anywhere on the nodes (so I assume both were compiled with the internal hwloc). Reading the whole issue, I could not determine whether the fix came from the hwloc change or the fact that the binary was propagated from one machine to the others as suggested by Jeff in #4437 (comment) Now, I did not compile myself the libraries that weren't working properly, nor did I try to re-compile the working version on each node to confirm it would break, so I'm not 100% sure yet. I'll update the bug if I can reproduce it better. |
Open MPI Version: v4.0.0 Output of
Both are installed using common shared network. while running command on s1(master)
while running command separately in s2(slave)
Output of
Output of
Both machines are running on but while running command on distributed giving following error
|
@RahulKulhari Please do not add new issues to a closed issue; thanks. |
@RahulKulhari were you able to resolve the issue? Facing same problem! |
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI v3.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Following the installation guidance of FAQ,
Please describe the system on which you are running
Details of the problem
I got the error as below,
An internal error has occurred in ORTE:
This is something that should be reported to the developers.
Thanks a lot in advance.