pack external32 long double conversion (extended 80 / quad 128) #8941

markalle · 2021-05-10T07:30:02Z

[updated the top-level github description to match the commit message]:

On architectures that store long doubles as 80 bit extended precisions
or as 64 bit "float64"s, we need conversions to 128 bit quad precision to
satisfy MPI_Pack_external/Unpack_external. I added a couple more
arguments to pFunction to know what architecture the 'to' and 'from'
buffers are. Previously we had architecture info 'local' and 'remote'
but I don't know how to correlate local/remote with to/from without
adding more arguments as I did.

With the increased information about the context, the conversion function
can now convert the long double as needed.

I'm using code Lisandro Dalcin contributed for the floating point
conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64.
These conversion functions require the data to be in local endianness,
but one of the sides in pack/unpack is always local so operations can
be done in an order that allows the long double conversion to see the
data in local endianness.

I also added a path to use __float128 for the conversion
for #ifdef HAVE___FLOAT128 as that ought to be the more reliable
method than rolling our own bitwise conversions.

The reason for all the arch.h changes is the former code was
inconsistent as to how bits were labeled within a byte, and had
masks like LONGISxx that didn't match the bits they were supposed
to contain.

markalle · 2021-05-10T07:32:23Z

Fixes #8918
not well tested yet.

So far just tested on Mac.

Needs testing on a big-endian machine, and on one where __float128 is defined, since I only just added that path and haven't tested it at all yet. A testcase is available at
https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048

dalcinl · 2021-05-10T07:40:27Z

@markalle I'll insist on a point I made before. I believe my code is overall correct for BE arches, but what I'm not sure about is the helper union layout for long double on big endian, either 32-bit or 64-bit. Once the union layout is confirmed, the rest of the conversion code should work correctly.

markalle · 2021-05-10T20:23:30Z

I wasn't saying your code is wrong for big-endian machines. I'm saying it only converts data that's in the local endianness: it can't be used directly on big-endian data on a little-endian machine. That's okay, because each conversion we do goes either to or from the local endianness, so I've ordered the endianness conversion and the f80/f128 conversions so that they only have to operate on local-endianness data.

I'm undecided how much to invest in performance. I think with all the steps involved and multiple copies already taking place, and using memcpy() instead of dereference it's going to be insanely slow. But it's definitely possible to be packing on a 3-byte boundary for example, so I think it has to use memcpy() or at least have that as an option anyway.

I still haven't studied the code carefully enough to be confident that the big-endian side is right. I ran two QEMU setups (ppc / mips) and both were storing long double as a 64-bit double, so I didn't actually get to see how a big-endian with 80-bit extended precision lays out its data. My guess though is that the current code has the padding in the wrong place, although I'm only guessing. Current code for big-endian 80-bit extended:

    unsigned sign  :  1;
    unsigned exp   : 15;
    unsigned pad   : 16;
    unsigned frac1 : 32;
    unsigned frac0 : 32;

I would have expected the pad at the bottom, so sign,exp,frac1,frac0 would all be adjacent. Is there a reason you initially coded it as above? Without a test machine I'm not 100% confident, but I'm inclined to move the padding and make the data adjacent
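
One way to sidestep the bit-field ordering question is to decode the fields from raw bytes with shifts and masks instead of going through a union. The sketch below does this only for the little-endian x86 layout of the 80-bit format, which is the one layout not in doubt here. The names `f80_fields` and `f80_decode_le` are hypothetical and are not taken from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of an 80-bit extended-precision value. */
typedef struct {
    unsigned sign;  /* 1 bit */
    unsigned exp;   /* 15 bits, biased by 16383 */
    uint64_t frac;  /* 64 bits, explicit integer bit in the top position */
} f80_fields;

/* Decode 10 bytes stored least-significant byte first (x86 layout).
 * Any padding bytes after the first 10 are simply not read. */
static f80_fields f80_decode_le(const unsigned char *b)
{
    f80_fields f;
    uint64_t m = 0;
    unsigned top;

    assert(b != NULL);
    for (int i = 7; i >= 0; i--) {
        m = (m << 8) | b[i];          /* bytes 0..7 hold the significand */
    }
    top = (unsigned)b[8] | ((unsigned)b[9] << 8); /* sign + exponent */

    f.frac = m;
    f.exp = top & 0x7FFFu;
    f.sign = (top >> 15) & 1u;
    return f;
}
```

Because the decoder never relies on how the compiler orders bit-fields, the position of the pad only matters when deciding which 10 of the stored bytes to hand to it.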

markalle · 2021-05-10T23:04:40Z

I didn't go too far with performance, but those memcpy() were awfully expensive in at least some cases, so I put an alignment check and conditionally avoid the memcpy now, and repushed.
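
The alignment check described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `load_double` is a hypothetical helper, and it uses `double` rather than `long double` to keep the example short. The direct dereference is only valid when the bytes at that address really are a `double` object; whether it beats `memcpy()` depends on the compiler and the call site.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a double from an address that may be
 * unaligned, e.g. the middle of a packed buffer. */
static double load_double(const void *p)
{
    double v;

    assert(p != NULL);
    if (((uintptr_t)p % _Alignof(double)) == 0) {
        /* Aligned: a plain dereference is allowed and avoids the copy. */
        v = *(const double *)p;
    } else {
        /* Unaligned (for example a 3-byte offset): copy byte-wise. */
        memcpy(&v, p, sizeof v);
    }
    return v;
}
```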

And I'm making a guess about what an extended double precision looks like natively on a big endian machine and moving the padding vs what @dalcinl had. I'm not positive though, just suspicious that it ought to be this way.

dalcinl · 2021-05-11T05:30:40Z

> Is there a reason you initially coded it as above?

I followed the layout in ieee754.h from glibc. Look for union ieee854_long_double in that header [link]

However, I found it suspicious, too. Look at GCC's software-fp implementation [link]
They have the pad at the beginning! That cannot be right on i386, right?

The truth is, I don't really know whether there exists any big-endian architecture where long double is extended precision (float80) instead of just double or quad precision. If there is none, then all our care about float80 big-endian is pointless. I would say go ahead with any union layout you want, leave a comment in the code, and let users complain if they get the wrong thing.

markalle · 2021-05-11T13:50:02Z

Thanks, that's a good argument for that format then. So I just repushed to put the padding back where that header had it.

gpaulsen · 2021-05-13T19:28:57Z

@dalcinl are you happy with this PR? Should we merge it?

jsquyres · 2021-05-13T21:51:59Z

@bosilca is out on vacation this week 🍹 ; he probably needs to review this before we merge.

dalcinl · 2021-05-14T10:16:30Z

@markalle I believe we missed a few important points in our original discussions.

1. I have a Raspberry Pi at home (armv7l). There, long double is the same as double, so we need float64 <-> float128 conversion routines to handle that case.
2. I managed to use a qemu VM to install Debian 8 on 32-bit PowerPC. There, long double uses 16 bytes, but it is a different format than Intel's float80. For a quick explanation see section 1.1 here. This format is still used in POWER7/POWER8 and ppc64le Linux builds.

So, in short, a long double <-> quad precision implementation should be done the following way:

1. If __float128 is available, use it and move on. After handling byte order, you are done.
2. Otherwise, look at the LDBL_MANT_DIG and LDBL_MAX_EXP macros from float.h:
   a) if LDBL_MANT_DIG == 53 and LDBL_MAX_EXP == 1024, then long double is float64 and we need new conversion routines to handle f64 <-> f128. Incidentally, note that this is the case on Windows with Microsoft compilers, even when running on Intel architectures.
   b) if LDBL_MANT_DIG == 64 and LDBL_MAX_EXP == 16384, then long double should be Intel's float80 extended precision. You already have all the needed bits.
   c) if LDBL_MANT_DIG == 113 and LDBL_MAX_EXP == 16384, then long double is actually IEEE quad precision float128, so you just need to care about byte order. Note: POWER9 has hardware support for float128, so we may see long double becoming quad precision soon. It seems Fedora has plans for it.
   d) if LDBL_MANT_DIG == 106 and LDBL_MAX_EXP == 1024, this is the IBM extended precision format (relevant for POWER7 and POWER8). I do not know the details of this format (other than that it is built out of two double values), but conversion routines should not be hard. We could add support in a subsequent PR.
   e) Otherwise, ERROR! We do not know the fp format, and errors should never pass silently.
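
The detection rules above can be written as a compile-time classification. This is a sketch under assumptions: the enum `ld_format` and the functions `detect_long_double_format` and `ld_format_name` are hypothetical names, not the ones used in Open MPI. It uses 113 for IEEE quad, since LDBL_MANT_DIG counts the implicit leading bit just as 53 does for double.

```c
#include <assert.h>
#include <float.h>
#include <string.h>

/* Hypothetical labels for the long double formats discussed above. */
typedef enum {
    LD_F64,     /* long double is IEEE binary64 */
    LD_F80,     /* Intel 80-bit extended precision */
    LD_F128,    /* IEEE binary128 (quad) */
    LD_IBM_DD,  /* IBM extended: a pair of doubles */
    LD_UNKNOWN  /* unrecognized: callers should treat this as an error */
} ld_format;

static ld_format detect_long_double_format(void)
{
#if LDBL_MANT_DIG == 53 && LDBL_MAX_EXP == 1024
    return LD_F64;
#elif LDBL_MANT_DIG == 64 && LDBL_MAX_EXP == 16384
    return LD_F80;
#elif LDBL_MANT_DIG == 113 && LDBL_MAX_EXP == 16384
    return LD_F128;
#elif LDBL_MANT_DIG == 106 && LDBL_MAX_EXP == 1024
    return LD_IBM_DD;
#else
    return LD_UNKNOWN;
#endif
}

static const char *ld_format_name(ld_format f)
{
    switch (f) {
    case LD_F64:    return "float64";
    case LD_F80:    return "float80";
    case LD_F128:   return "float128";
    case LD_IBM_DD: return "ibm-double-double";
    default:        return "unknown";
    }
}
```

A caller would typically check for `LD_UNKNOWN` once at startup and report it, in line with case e) above.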

Sorry for not figuring this out before. Fortunately, none of your work so far goes to waste. We just need more.
I could contribute the f64 <-> f128 routines, but you will have to wait three to four days for them.

@gpaulsen @jsquyres At this point, I think this is not ready for merge, IMHO we can do better.

ggouaillardet · 2021-05-14T10:24:51Z

FWIW and IIRC, the existing conversion subroutine was used between x86_64 and sparcv9

dalcinl · 2021-05-14T13:29:25Z

@markalle I've updated my gist with float64 <-> float128 conversion routines. Please note I also changed the union names used in the implementation.
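
For readers without access to the gist, the widening direction (float64 to float128) can be sketched with plain integer arithmetic. This is an independent illustration, not the gist's code: `f128_bits` and `f64_to_f128_bits` are hypothetical names, the quad value is returned as two 64-bit halves, and `double` is assumed to be IEEE binary64. Widening is exact, so no rounding is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container: the 128 bits of an IEEE quad value.
 * hi = sign (1) | exponent (15) | top 48 fraction bits, lo = low 64 bits. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} f128_bits;

static f128_bits f64_to_f128_bits(double d)
{
    const uint64_t frac_mask = 0xFFFFFFFFFFFFFULL; /* 52 fraction bits */
    uint64_t u, sign, exp, frac, qexp;
    f128_bits q;

    memcpy(&u, &d, sizeof u);
    sign = u >> 63;
    exp = (u >> 52) & 0x7FF;
    frac = u & frac_mask;

    if (exp == 0x7FF) {
        qexp = 0x7FFF;               /* infinity or NaN */
    } else if (exp != 0) {
        qexp = exp - 1023 + 16383;   /* normal: rebias the exponent */
    } else if (frac == 0) {
        qexp = 0;                    /* signed zero */
    } else {
        /* Subnormal double: normalize, since quad can represent it
         * as a normal number. */
        uint64_t shift = 0;
        while ((frac & (1ULL << 52)) == 0) {
            frac <<= 1;
            shift++;
        }
        frac &= frac_mask;           /* drop the now-implicit leading 1 */
        qexp = 16383 - 1022 - shift;
    }

    /* 52 fraction bits become the top 52 of quad's 112 fraction bits. */
    q.hi = (sign << 63) | (qexp << 48) | (frac >> 4);
    q.lo = frac << 60;
    return q;
}
```

The result is in host terms (two integers), so the byte-order step for external32 still has to be applied when writing `hi` and `lo` out.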

markalle · 2021-05-25T20:20:08Z

Repushed. I used the new code conversions where @dalcinl added f64 as well as f80, but renamed the functions to f80_to_f128() etc.

The usage from the OMPI level is in the macro that defines a pFunction for LONG_DOUBLE conversion. If it's not converting LONG_DOUBLE, or if the long double format is the same in both the to and from architectures, it does a memcpy. Otherwise, when it has to convert, it does:

1. endianness to local
2. long double in "from_arch" format to f128, if the from_arch doesn't already have its long double == f128
3. f128 to long double in "to_arch" format, if the to_arch isn't asking for long double == f128
4. endianness to to_arch
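
The ordering of those steps can be sketched for a single element as below. This is only an illustration of the sequencing, assuming a 16-byte scratch buffer and simple byte reversal for the endianness steps. The names `ld_convert_fn`, `host_is_big_endian`, `reverse_bytes`, and `convert_one` are hypothetical, and the real pFunction macro in the PR has a different shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-place format converter. It works on a 16-byte scratch
 * buffer whose contents are in host byte order. */
typedef void (*ld_convert_fn)(unsigned char *scratch);

static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

static void reverse_bytes(unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char t = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = t;
    }
}

/* Convert one long double held in scratch. Pass NULL for to_f128 when the
 * source format is already f128, and NULL for from_f128 when the
 * destination wants f128. */
static void convert_one(unsigned char *scratch,
                        size_t from_len, int from_big_endian,
                        ld_convert_fn to_f128,
                        size_t to_len, int to_big_endian,
                        ld_convert_fn from_f128)
{
    const int host_be = host_is_big_endian();

    assert(from_len <= 16 && to_len <= 16);
    if (!!from_big_endian != host_be) {
        reverse_bytes(scratch, from_len);  /* 1. endianness to local */
    }
    if (to_f128 != NULL) {
        to_f128(scratch);                  /* 2. from_arch format -> f128 */
    }
    if (from_f128 != NULL) {
        from_f128(scratch);                /* 3. f128 -> to_arch format */
    }
    if (!!to_big_endian != host_be) {
        reverse_bytes(scratch, to_len);    /* 4. endianness to to_arch */
    }
}
```

Both format converters only ever see host-order data, which is the property the conversion routines require.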

ibm-ompi · 2021-05-25T20:40:29Z

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/053f2148801eacbef27f8da18a5bd33d

ibm-ompi · 2021-05-25T20:51:26Z

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0bf5c2ff6dbfecebf2f574543c22194a

markalle · 2021-05-25T23:24:43Z

Update: okay, testing shows not quite ready still. I think the detection is still a slight weak spot

markalle · 2021-05-26T00:28:51Z

Repushed based on ppc testing and my comment above. That path worked fine in the default mode (where __float128 is available), but I also tested with HAVE___FLOAT128 artificially turned off just to exercise another path, and in that mode it didn't have a conversion for MANTISSA=106 and EXP=10. There's an argument for erroring out when we don't have a conversion for the detected long double format, but at least for now I decided to allow it: the detection is new, and I'd hate to produce a false failure where we error out on a conversion when we didn't have to.

And besides, in the regular mode where I don't artificially push it through an alternate path the code was fine using __float128.

ibm-ompi · 2021-05-26T00:39:37Z

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e86c3001047ea2c0d1440f392563b96f

gpaulsen · 2021-05-26T03:07:55Z

    opal_copy_functions_heterogeneous.c:522:35: error: __float128 is not supported on this target
                      f128_buf_to += sizeof(__float128);
    opal_copy_functions_heterogeneous.c:580:37: error: __float128 is not supported on this target
                      f128_buf_from += sizeof(__float128);

dalcinl · 2022-06-02T14:32:38Z

@jsquyres Could you please add the Target: v5.0.0 label as well?

gpaulsen · 2022-06-20T14:03:16Z

@bosilca Do you have time to review?

gpaulsen · 2022-06-20T14:05:15Z

retest

gpaulsen · 2022-06-20T14:09:18Z

bot:retest

dalcinl · 2022-08-21T15:26:02Z

How can we get this one out of the backlog?

markalle · 2022-09-26T22:13:07Z

I just rebased and repushed, no changes. I'd love to be able to get this enhancement in. It's been tested through a decent variety of architectures including the multiple paths it has for the conversions

bwbarrett · 2022-09-28T02:05:30Z

bot:aws:retest

dalcinl · 2022-10-06T22:00:02Z

@bwbarrett Could you please trigger retest of the failing row?

jsquyres · 2022-10-07T12:33:17Z

bot:ompi:retest

awlauria · 2022-10-07T14:24:58Z

@gpaulsen please review and we'll get this in.

gpaulsen · 2022-10-11T19:52:12Z

@bosilca I don't have enough experience in this important area of code to review this before v5.0.0... Is this something you're interested in? Could you review sometime soon, otherwise this might need to delay until post v5.0.0.

dalcinl · 2022-10-11T20:03:09Z

opal/util/arch.h

@@ -178,7 +178,7 @@
 **   bits 1 & 2: length of long double: 00=64, 01=96,10 = 128
 **   bits 3 & 4: no. of rel. bits in the exponent: 00 = 10, 01 = 14)
 **   bits 5 - 7: no. of bits of mantisse ( 000 = 53,  001 = 64, 010 = 105,


@markalle Your next line change is good, but to keep columns aligned (notice the two spaces after 53,), I think you should fix this one too:

Suggested change

** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105,

** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105,

making the change, thx

dalcinl · 2022-10-11T20:28:14Z

@gpaulsen @jsquyres @awlauria Why did you guys remove the v5.0.x label? This PR has been waiting for about a year and a half! @markalle did a terrific job. The code paths have been tested on a variety of architectures to make sure things are working properly. And more importantly, this PR fixes things that are otherwise broken right now. Isn't all the testing we have done on a variety of arches enough for you guys to have some confidence the implementation is sound?

awlauria · 2022-10-11T21:00:32Z

@dalcinl PRs are traditionally labeled with their target branch.

We're still open to bringing this back to v5.0.x but for that to happen it needs a review, and to date we have not gotten one yet.

bosilca · 2022-10-12T16:14:21Z

opal/datatype/opal_copy_functions_heterogeneous.c

+Summaryizing the logic of the pFunc copy functions
+with regard to long doubles:
+
+For terminology I''ll use


typo (double single quote)

Some versions of gcc complain about unmatched quotes inside "#if 0" blocks, but I guess the solution is I just shouldn't be using "#if 0" to make my comment blocks. I'll switch

Signed-off-by: Mark Allen <markalle@us.ibm.com>

awlauria · 2022-10-17T15:01:58Z

Thanks @bosilca and @markalle

markalle mentioned this pull request May 10, 2021

Pack/Unpack external32 with long double still broken #8918

Closed

gpaulsen requested a review from bosilca May 10, 2021 13:29

markalle force-pushed the pack_long_double_conversion branch from 50dce95 to 1db7464 Compare May 10, 2021 23:00

markalle force-pushed the pack_long_double_conversion branch from 1db7464 to 25dd2e8 Compare May 11, 2021 13:48

dalcinl reviewed May 11, 2021

opal/datatype/opal_copy_functions_heterogeneous.c

dalcinl mentioned this pull request May 12, 2021

test: Update testsuite to run with MPICH and Open MPI development mpi4py/mpi4py#37

Merged

dalcinl added a commit to mpi4py/mpi4py that referenced this pull request May 12, 2021

TODO: See open-mpi/ompi#8941

7ff93fa

dalcinl added a commit to mpi4py/mpi4py that referenced this pull request May 13, 2021

TODO: See open-mpi/ompi#8941

c2136a4

dalcinl mentioned this pull request May 13, 2021

README.md: trivial change to force mpi4py testing jsquyres/ompi#4

Open

markalle force-pushed the pack_long_double_conversion branch from 25dd2e8 to 2817bf5 Compare May 25, 2021 20:10

markalle force-pushed the pack_long_double_conversion branch from 2817bf5 to b737f3a Compare May 25, 2021 22:34

markalle force-pushed the pack_long_double_conversion branch from b737f3a to ff138cd Compare May 26, 2021 00:21

awlauria added the Target: v5.0.x label Jun 2, 2022

gpaulsen removed the Target: v5.0.x label Sep 26, 2022

markalle force-pushed the pack_long_double_conversion branch from 3aea80f to 1272ff5 Compare September 26, 2022 22:10

awlauria requested a review from gpaulsen October 7, 2022 14:24

gpaulsen removed their request for review October 11, 2022 19:52

dalcinl reviewed Oct 11, 2022

bosilca reviewed Oct 12, 2022

markalle force-pushed the pack_long_double_conversion branch 2 times, most recently from 43add0f to 8488fb9 Compare October 14, 2022 06:38

bosilca reviewed Oct 14, 2022

opal/datatype/opal_copy_functions_heterogeneous.c

bosilca reviewed Oct 14, 2022

opal/util/arch.h

markalle force-pushed the pack_long_double_conversion branch from 8488fb9 to a54b204 Compare October 14, 2022 19:05

markalle force-pushed the pack_long_double_conversion branch from a54b204 to 308a94e Compare October 14, 2022 20:31

bosilca approved these changes Oct 14, 2022

awlauria merged commit 0349a5d into open-mpi:main Oct 17, 2022
