
OSHMEM/mca/atomic/ucx: Data Integrity issue in shmem_atomic_set with float and double datatypes #7137

@sssharka

Open-MPI 4.0.1
Packaged with SMPI

Operating system/version: RedHat
Computer hardware: Power9
Network type: Shared Memory / IB


Details of the problem

When running a simple shmem_atomic_set test case (provided below) with OSHMEM/UCX, the data at the target is wrong when the data type is float or double.

From reading the code, there seem to be two issues:
1 - In oshmem/mca/atomic/ucx/atomic_ucx_module.c, the "value" field in every definition seems to be of type uint64_t, so float and double values lose their resolution. ucp_atomic_fetch_nb has the same issue.
2 - The test case shows that this is not just a loss of resolution: the data appears to be overwritten with a wrong value.

Testcase used:

//This test does a set on both a double value and a float value. It is expected to be run with 2 tasks, each task doing the set to the other task.
//Expected values are:
//TASK 0 FLOAT = 101.0 DOUBLE = 201.0
//TASK 1 FLOAT = 100.0 DOUBLE = 200.0

#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

#include "shmem.h"

#define BUF_LEN 1024


//
// MAIN
//


int main( int argc, char *argv[] ) {
    float       *float_target;
    double      *dbl_target;
    int         *int_target;

    float float_origin;
    float float_ver;
    int int_origin;
    int int_ver;
    double dbl_origin;
    double dbl_ver;

    int rank, n_pes, source, dest;

    shmem_init();

    rank = shmem_my_pe();

    n_pes = shmem_n_pes();

    if( !(float_target = shmem_malloc(BUF_LEN)) || !(dbl_target = shmem_malloc(BUF_LEN)) || !(int_target = shmem_malloc(BUF_LEN))){
        fprintf(stderr, "shmem_malloc failed\n");
        shmem_finalize();
        exit(1);
    }
    fprintf(stderr, "float_target: %p dbl_target: %p \n", (void *)float_target, (void *)dbl_target);
    *float_target = 1.0;
    *int_target = 1;
    *dbl_target = 1.0;
    fprintf(stderr, "*float_target: %f *dbl_target: %f \n", *float_target, *dbl_target);

    // Each of the two tasks targets the other one.
    dest = (rank == 0) ? 1 : 0;

    // *_origin is the value this task writes to its peer;
    // *_ver is the value it expects the peer to write here.
    float_origin = rank + 100;
    float_ver = dest + 100;
    int_origin = rank + 100;
    int_ver = dest + 100;
    dbl_origin = rank + 200;
    dbl_ver = dest + 200;


    shmem_barrier_all();

    shmem_atomic_set(float_target, float_origin, dest);
    shmem_atomic_set(int_target, int_origin, dest);
    shmem_atomic_set(dbl_target, dbl_origin, dest);

    shmem_barrier_all();

    if (*float_target != float_ver){
        fprintf(stderr, "ERROR, expected value = %f, received value = %f float_target: %p\n", float_ver, *float_target, (void *)float_target);
    }


    if (*dbl_target != dbl_ver){
        fprintf(stderr, "ERROR, expected value = %f, received value = %f\n", dbl_ver, *dbl_target);
    }

    shmem_free(float_target);
    shmem_free(dbl_target);
    shmem_free(int_target);

    shmem_finalize();

    return 0;

}

I run the test with:
oshrun -np 2 simple_set

The failure can be reproduced over shared memory or over IB.
