Description
Open-MPI 4.0.1
Packaged with SMPI
Operating system/version: RedHat
Computer hardware: Power9
Network type: Shared Memory / IB
Details of the problem
When I run a simple shmem_atomic_set test case (provided below) with OSHMEM over UCX, the value that arrives at the target is wrong whenever the data type is float or double.
Eyeballing the code, it seems that there are two issues:
1 - In oshmem/mca/atomic/ucx/atomic_ucx_module.c, the "value" field in every definition appears to be of type uint64_t, so any float or double passed through it would be converted to an integer and lose its fractional part. The ucp_atomic_fetch_nb call appears to have the same issue.
2 - The test case shows that this is not just a loss of resolution: the data at the target appears to be overwritten with a wrong value.
Testcase used:
//This test does an atomic set on both a float value and a double value. It is expected to be run with 2 tasks, each task doing the set to the other task.
//Expected values are:
//TASK 0 FLOAT = 101.0 DOUBLE = 201.0
//TASK 1 FLOAT = 100.0 DOUBLE = 200.0
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include "shmem.h"
#define BUF_LEN 1024
//
// MAIN
//
int main( int argc, char *argv[] ) {
float *float_target;
double *dbl_target;
int *int_target;
float float_origin;
float float_ver;
int int_origin;
int int_ver;
double dbl_origin;
double dbl_ver;
int rank, n_pes, source, dest;
shmem_init();
rank = shmem_my_pe();
n_pes = shmem_n_pes();
if( !(float_target = shmem_malloc(BUF_LEN)) || !(dbl_target = shmem_malloc(BUF_LEN)) || !(int_target = shmem_malloc(BUF_LEN))){
fprintf(stderr, "shmem_malloc failed\n");
shmem_finalize();
exit(1);
}
fprintf(stderr, "float_target: %p dbl_target: %p \n", (void *)float_target, (void *)dbl_target);
*float_target = 1.0;
*int_target = 1;
*dbl_target = 1.0;
fprintf(stderr, "*float_target: %f *dbl_target: %f \n",*float_target, *dbl_target);
if (rank == 0){
dest = 1;
float_origin = rank + 100;
float_ver = dest + 100;
int_origin = rank + 100;
int_ver = dest + 100;
dbl_origin = rank + 200;
dbl_ver = dest + 200;
}
else{
dest = 0;
float_origin = rank + 100;
float_ver = dest + 100;
int_origin = rank + 100;
int_ver = dest + 100;
dbl_origin = rank + 200;
dbl_ver = dest + 200;
}
shmem_barrier_all();
shmem_atomic_set(float_target, float_origin, dest);
shmem_atomic_set(int_target, int_origin, dest);
shmem_atomic_set(dbl_target, dbl_origin, dest);
shmem_barrier_all();
if (*float_target != float_ver){
fprintf(stderr, "ERROR, expected value = %f, received value = %f float_target: %p\n", float_ver, *float_target, (void *)float_target);
}
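if (*dbl_target != dbl_ver){
fprintf(stderr, "ERROR, expected value = %f, received value = %f\n", dbl_ver, *dbl_target);
}
// The int case is checked as well, for comparison with the floating-point cases.
if (*int_target != int_ver){
fprintf(stderr, "ERROR, expected value = %d, received value = %d\n", int_ver, *int_target);
}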
shmem_free(float_target);
shmem_free(dbl_target);
shmem_free(int_target);
shmem_finalize();
return 0;
}
I run the test with:
oshrun -np 2 simple_set
This can be recreated in shared memory or over IB.