issue: Inconsistent communication between nodes #6584
Comments
@YarShev Could you guide me on how to set up the environment for reproducing the issue?
Sure, here are the steps you can follow.
Note that the benchmark consumes ~270 GB of RAM on each node.
```
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ sh Miniconda3-latest-Linux-x86_64.sh
$ conda create -n mpich-test python=3.8
$ conda activate mpich-test
$ pip install modin
$ conda install -c conda-forge mpi4py mpich
$ pip install msgpack
$ pip install unidist
$ git clone https://github.com/intel-ai/timedf.git
$ cd timedf
$ git diff
diff --git a/setup.py b/setup.py
index 99ff536..7c62da6 100644
--- a/setup.py
+++ b/setup.py
@@ -30,7 +30,7 @@ setup(
*find_packages(include=["timedf*"]),
*find_namespace_packages(include=["timedf_benchmarks.*"]),
],
- install_requires=parse_reqs("base.txt"),
+ # install_requires=parse_reqs("base.txt"),
extras_require={"reporting": reporting_reqs, "all": all_reqs},
python_requires=">=3.8",
entry_points={
$ pip install -e .
$ pip install kaggle
$ benchmark-load hm_fashion_recs ./datasets/hm_fashion_recs.
$ mpiexec -n 1 -env UNIDIST_CPUS 44 -env MODIN_CPUS 44 -env UNIDIST_MPI_HOSTS host1,host2 python timedf/scripts/benchmark_run.py hm_fashion_recs -data_file ./datasets/hm_fashion_recs/ -pandas_mode Modin_on_unidist_mpi -verbosity 1 -no_ml
```
If you have any questions or issues reproducing it, please let me know. You can also contact me directly at piterok123@gmail.com.
Note that we also run the benchmark without dynamic spawning of worker processes. The full command line for running the benchmark without dynamic spawn:

```
$ mpiexec -n 46 -env UNIDIST_CPUS 44 -env MODIN_CPUS 44 -host host1,host2 -env UNIDIST_IS_MPI_SPAWN_WORKERS False python timedf/scripts/benchmark_run.py hm_fashion_recs -data_file ./datasets/hm_fashion_recs/ -pandas_mode Modin_on_unidist_mpi -verbosity 1 -no_ml
```
We have been able to reproduce the issue using mpi4py code only.

```python
from mpi4py import MPI
import time
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

hostname = socket.gethostname()
host = socket.gethostbyname(hostname)
with open('out.txt', 'w') as f:
    print(f"host: {host}", file=f)

tag = 1
if rank == 0:
    dest_rank = 1
    r1 = comm.isend(1.00001, dest=dest_rank, tag=tag)
    r2 = comm.isend(2.0002, dest=dest_rank, tag=tag)
    r1.wait()
    r2.wait()
elif rank == 1:
    source_rank = 0
    backoff = 0.01
    while not comm.Iprobe(source=source_rank, tag=tag):
        time.sleep(0.01)
    r1 = comm.irecv(source=source_rank, tag=tag)
    r1.cancel()
    r1.wait()
    r2 = comm.irecv(source=source_rank, tag=tag)
    while True:
        status, data = r2.test()
        if status:
            with open('out2.txt', 'w') as f:
                print(f"data: {data}", file=f)
            break
        else:
            time.sleep(backoff)
```

In the second irecv we get the data from the first, cancelled isend (1.00001) instead of the data from the second one (2.0002).
The command to run is the following:

```
$ mpiexec -n 2 -host host1,host2 python mpich-repr.py
```
First of all, you should check the status after wait() to see whether the cancellation actually succeeded.
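For illustration only (my reading of this advice, not code from the thread), a minimal mpi4py sketch: complete the cancelled request with a Status object and inspect Is_cancelled() to learn whether the operation was really cancelled.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
tag = 1

if rank == 0:
    comm.isend(1.00001, dest=1, tag=tag).wait()
elif rank == 1:
    r1 = comm.irecv(source=0, tag=tag)
    r1.cancel()
    status = MPI.Status()
    r1.wait(status=status)           # completes the request either way
    if status.Is_cancelled():
        # Cancellation succeeded: the message was not consumed and is still
        # pending, so a later receive will match it.
        data = comm.recv(source=0, tag=tag)
    else:
        # Cancellation failed: wait() already delivered the message.
        pass
```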
If the user cancels a communication request, why should he receive the data from the cancelled request? Seems wrong to me.
If you successfully cancelled an irecv, the message it would have matched is not consumed; it remains pending and will be matched by a subsequent receive. EDIT: to clarify, you are not canceling the "request". You are canceling the operation represented by the request. A request object is simply a handle representing an asynchronous operation.
I just read the standard again and yes, this is probably the case. I wonder if it is possible to cancel an irecv at all so that the subsequent irecv would get the data from a subsequent isend but not from the previous one, which was canceled on the receiver side?
You don't need to cancel the irecv.
In that case the communication finishes successfully, i.e., the receiver will get the data eventually, which may bloat memory a little if the data is huge. Is there a way to avoid such a case, i.e., as I said, to fully cancel the communication request?
I see. You may issue an Irecv with a dummy, zero-size buffer to match and drain the message you want to discard. EDIT: I think you can even free that request instead of waiting on it.
I don't quite get you. If I execute the following code, I get an error. Please elaborate on the point you are making.

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(10000000, dtype='i')
    r = comm.Isend([data, MPI.INT], dest=1, tag=77)
    r.Wait()
elif rank == 1:
    data = numpy.empty(1, dtype='i')
    r = comm.Irecv([data, MPI.INT], source=0, tag=77)
    r.Wait()
```
The truncation error is expected here. EDIT: I think that means you can just remove the Wait and free the request instead.
Let me try this on the HM benchmark and get back to you.
Before running the HM benchmark I tried this simple example and it hangs. Is that some side effect of MPI_Request_free? If I change it to Wait, everything works well.

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(1000000, dtype='i')
    r = comm.Isend([data, MPI.INT], dest=1, tag=77)
    print("before")
    r.Wait()
    print("after")
elif rank == 1:
    data = numpy.empty(1000000, dtype='i')
    r = comm.Irecv([data, MPI.INT], source=0, tag=77)
    r.Free()
    print(data)
```

EDIT: Intel MPI just crashed.
I'll check your example later. Meanwhile, let's consult an mpi4py expert -- @dalcinl.
Reference example in C: [...]
Does it hang in C too?
@hzhou This is a modification to your C code:

```c
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int send_buf[N];

int main(int argc, char** argv)
{
    int tag = 10;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(NULL, NULL);

    int mpi_size, mpi_id;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);

    if (mpi_id == 0) {
        MPI_Request send_req;
        MPI_Isend(send_buf, N, MPI_INT, 1, tag, comm, &send_req);
        MPI_Wait(&send_req, MPI_STATUS_IGNORE);
        puts("send 10 ints done.");
    } else if (mpi_id == 1) {
        MPI_Request recv_req;
        MPI_Irecv(NULL, 0, MPI_INT, 0, tag, comm, &recv_req);
        printf("recv_req = %x\n", recv_req);
        MPI_Request_free(&recv_req);
    }

    MPI_Finalize();
    printf("[%d] done\n", mpi_id);
    return 0;
}
```

I'm running it on macOS Ventura with Homebrew MPICH 4.1.2 (ch4:ofi). This definitely looks like an issue in MPICH (eager vs. rendezvous protocol implementations?).

@YarShev Maybe you can work around the message truncation error in the usual Python way:

```python
try:
    r.Wait()
except MPI.Exception as exc:
    if exc.error_class == MPI.ERR_TRUNCATE:
        pass   # ignore the expected truncation error
    else:
        raise  # but do not ignore other MPI errors
```
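To show how this try/except might be used end to end, here is a self-contained sketch (my own assembly, not code from the thread): rank 1 matches the unwanted message with a zero-size Irecv and ignores the expected MPI.ERR_TRUNCATE, so it never allocates a full-size buffer.

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(1000000, dtype='i')
    comm.Isend([data, MPI.INT], dest=1, tag=77).Wait()
elif rank == 1:
    dummy = numpy.empty(0, dtype='i')                  # deliberately too small
    r = comm.Irecv([dummy, MPI.INT], source=0, tag=77)
    try:
        r.Wait()                                       # matches the message, drops the payload
    except MPI.Exception as exc:
        if exc.error_class == MPI.ERR_TRUNCATE:
            pass    # expected: the message was matched but did not fit the buffer
        else:
            raise   # any other MPI error is a real problem
```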
@dalcinl, thanks for the workaround, but it seems a little clumsy to me. Moreover, we have to cancel multiple recvs in a row. This workaround looks like the only way for now to completely cancel the communication request, though. That said, we'll probably wait for a fix so that we can use request.Free.
Indeed. The issue is that process 1 exits before finishing the rendezvous protocol with process 0. Adding an MPI_Barrier before exiting works around it. The proper fix is probably to check for pending requests and try to complete them before a timeout at MPI_Finalize. Ref. #6513.
@YarShev If you can make sure the receiving processes do not exit before the sender processes finish sending, just freeing the request should work. Depending on your application, you may simply add a barrier before exiting.
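If I read this suggestion correctly (my interpretation, not code from the thread), the earlier hanging example would become something like the following sketch: the receive request is freed, and a barrier keeps the receiver alive until the sender has completed.

```python
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(1000000, dtype='i')
    r = comm.Isend([data, MPI.INT], dest=1, tag=77)
    r.Wait()
elif rank == 1:
    buf = numpy.empty(1000000, dtype='i')
    r = comm.Irecv([buf, MPI.INT], source=0, tag=77)
    r.Free()   # drop the handle; the receive itself still runs to completion

# Keep both ranks around until the sender has completed the transfer, so
# neither process reaches finalization with the rendezvous still in flight.
comm.Barrier()
```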
For now we just let the remaining operations between workers complete when rank 0 exits the program. Looking forward to seeing a fix for request.Free.
@hzhou, which upcoming release will include the fix for this issue?
MPICH 4.2b1; it is planned for release this November.
I see, thanks!
Hi there,
I am a developer of Modin and unidist. We use MPI as a backend in unidist, and mpich can be used as one of the implementations in that backend. Modin uses unidist to distribute computation.
We have been looking at one of our benchmarks recently. The benchmark is HM, which has to do with data processing and uses Modin for this purpose. Modin distributes computation with unidist on MPI.
We have checked the benchmark on a single node with mpich, Open MPI, and Intel MPI, and everything works well. Then we tried running the benchmark on a cluster consisting of two nodes. Open MPI and Intel MPI work well, but mpich fails.
How does unidist work? We have a root process (rank 0) and multiple workers (ranks 1, ..., N), which sit in an infinite while loop and wait for tasks to execute. The root sends out tasks to workers, which start executing them. Workers themselves can communicate with each other to send/recv the data necessary for task execution.
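For readers unfamiliar with this kind of architecture, here is a very rough sketch (my own simplification, not unidist's actual code) of the root/worker pattern described above:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Root: send one dummy task to every worker, collect the results,
    # then send a termination signal to each worker.
    for worker in range(1, comm.Get_size()):
        comm.send({"op": "task", "payload": worker}, dest=worker, tag=0)
    for worker in range(1, comm.Get_size()):
        print("result from", worker, ":", comm.recv(source=worker, tag=0))
    for worker in range(1, comm.Get_size()):
        comm.send({"op": "exit"}, dest=worker, tag=0)
else:
    # Worker: loop forever, executing tasks until the termination signal arrives.
    while True:
        msg = comm.recv(source=0, tag=0)
        if msg["op"] == "exit":
            break
        comm.send(msg["payload"] * 2, dest=0, tag=0)   # "execute" the task and reply
```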
How does HM fail? The root process gets to the end of the benchmark, sends out a termination signal to the workers, and finishes execution. The workers themselves still have pending asynchronous requests obtained from Isends issued during the flow. Once the workers receive the termination signal, they gradually finish all remaining tasks; in particular, they call Wait on every request to make sure there is no pending isend operation, and they also call recv and irecv to receive/cancel communication. However, all of a sudden, a worker on one node receives wrong data from a worker on the other node. We see the error in the workers.

I tried using mpich from conda and got the error on the unidist side because mpi.recv returns wrong data, i.e., we expect to receive different data. Also, there is the following output from MPI internals: [...]

I also tried mpich built from source and only get the error on the unidist side because mpi.recv returns wrong data. There is no output from MPI internals.
I would really appreciate it if you could help handle this issue. Maybe there has already been a similar issue with internode communication using isend.
Thanks in advance.