Cannot return in multi-node MPI applications #429

Open

parrotsky opened this issue Jul 22, 2022 · 2 comments
parrotsky commented Jul 22, 2022

Hi, first I would like to thank the contributors for providing such an elegant and easy-to-use library for profiling MPI programs.
My problem:
I built an MPI cluster on a LAN with up to 8 devices (Linux Ubuntu 20.04) following the MPI tutorial.
I want to use Caliper to profile my applications across multiple devices. Before that, I wrote a simple hello-world program to test whether it works.
The code is below:

#include <mpi.h>
#include <stdio.h>
#include <caliper/cali.h>
#include <caliper/cali-manager.h>
// ...
// ...
int main(int argc, char** argv) {

	// Initialize the MPI environment
	cali::ConfigManager mgr;
	mgr.add("runtime-report,event-trace(output=trace.cali)");
	int provided;
	MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
	if (provided < MPI_THREAD_MULTIPLE) {
		fprintf(stderr, "xxx MPI does not provide needed thread support!\n");
		return -1;
		// Error - MPI does not provide needed threading level
	}

	//     MPI_Init(&argc, &argv);

	mgr.start(); 
	// ...
	// Get the number of processes
	int world_size;
	MPI_Comm_size(MPI_COMM_WORLD, &world_size);


	// Get the rank of the process
	int world_rank;
	//   CALI_MARK_BEGIN("iemann_slice_precompute");
	MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
	//CALI_MARK_END("iemann_slice_precompute");
	// Get the name of the processor
	char processor_name[MPI_MAX_PROCESSOR_NAME];
	int name_len;
	MPI_Get_processor_name(processor_name, &name_len);

	// Print off a hello world message
	printf("Hello world from processor %s, rank %d out of %d processors\n",
			processor_name, world_rank, world_size);

	// Finalize the MPI environment.
	//
	mgr.flush();
	mgr.stop();
	MPI_Finalize();

	return 0;
}

The program works perfectly with multiple ranks on a single device:

sky@nx01:~/cloud$ mpirun -np 2 ./hello
Hello world from processor nx01, rank 0 out of 2 processors
Hello world from processor nx01, rank 1 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.000952      0.001182      0.001067 13.165525 
MPI_Get_processor_name      0.000133      0.000193      0.000163  2.011228 
Function               Count (min) Count (max) Time (min) Time (max) Time (avg) Time %    
                                 9          13   0.040653   0.040994   0.040823 92.516799 
MPI_Comm_dup                     2           2   0.001527   0.002249   0.001888  4.278705 
MPI_Recv                         4           4   0.000935   0.000935   0.000935  1.059478 
MPI_Comm_free                    1           1   0.000170   0.000287   0.000228  0.517841 
MPI_Get_processor_name           1           1   0.000170   0.000285   0.000228  0.515575 
MPI_Send                         4           4   0.000421   0.000421   0.000421  0.477048 
MPI_Finalize                     1           1   0.000069   0.000134   0.000102  0.230026 
MPI_Probe                        2           2   0.000186   0.000186   0.000186  0.210762 
MPI_Get_count                    2           2   0.000171   0.000171   0.000171  0.193766 

When I test it across two devices (nodes), the program cannot return normally and gets stuck somewhere:

sky@nx01:~/cloud$ mpirun -np 2 --host nx01,nx02 ./hello
Hello world from processor nx02, rank 1 out of 2 processors
Hello world from processor nx01, rank 0 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.003007      0.003007      0.003007 29.905520 
MPI_Get_processor_name      0.000132      0.000132      0.000132  1.312780 

Has anyone encountered the same issue or figured out where the bug is located?
Thanks a lot for answering.

daboehme (Member) commented Jul 25, 2022

Hello @parrotsky ,

This is unusual; Caliper shouldn't affect MPI progress when going from intra- to inter-node communication. Does this only happen when Caliper is enabled? It's possible the issue is in the underlying program. In particular, pay close attention to the order of communications between the processes and make sure you're not stuck in a blocking MPI_Send. It's possible that an MPI_Send finishes immediately for a target process on the same node but waits for a matching MPI_Recv to be called first when it goes over the network.
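
For illustration, here is a standalone sketch (not taken from your program) of the kind of ordering that often works intra-node but hangs inter-node; the buffer size is only an assumption chosen to exceed typical eager-protocol limits:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
	MPI_Init(&argc, &argv);

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	int peer = 1 - rank;                  // assumes exactly 2 ranks
	const int n = 1 << 20;                // ~8 MB of doubles, above typical eager limits
	double* buf = (double*) malloc(n * sizeof(double));

	// Unsafe ordering: both ranks send first. Small messages on the same
	// node often complete via eager buffering, but over the network each
	// MPI_Send may block until the matching MPI_Recv is posted -> deadlock.
	MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
	MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

	// A safe alternative is MPI_Sendrecv, which pairs the two operations:
	// MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, peer, 0,
	//              recvbuf, n, MPI_DOUBLE, peer, 0,
	//              MPI_COMM_WORLD, MPI_STATUS_IGNORE);

	free(buf);
	MPI_Finalize();
	return 0;
}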

parrotsky (Author) commented


Hi @daboehme, thanks for your reply.
The MPI_Send entries in the profiling log remind me that Caliper may duplicate the MPI communicator, and that the program may be stuck in MPI_Send or MPI_Recv.
In the multi-process, single-device profiling report, we can see that both MPI_Send and MPI_Recv are called.
However, in the hello-world example I only call MPI_Comm_rank, without any MPI_Send or MPI_Recv of my own.
So I agree the problem may be in Caliper, where some processes' MPI_Send/Recv finish too early.
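
One way to narrow this down (a sketch only; the printf markers and the barrier below are purely diagnostic additions to the hello-world program above, and the run could also be repeated with just "runtime-report", without event-trace, to see whether the trace output is involved):

	// Diagnostic additions just before shutdown in the hello-world program above:
	printf("rank %d: before barrier\n", world_rank); fflush(stdout);
	MPI_Barrier(MPI_COMM_WORLD);   // do all ranks get this far?

	printf("rank %d: before mgr.flush()\n", world_rank); fflush(stdout);
	mgr.flush();                   // flush may gather/write profiling data across ranks
	printf("rank %d: after mgr.flush()\n", world_rank); fflush(stdout);

	mgr.stop();
	MPI_Finalize();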
