pack external32 long double conversion (extended 80 / quad 128) #8941

Merged: 1 commit into open-mpi:main on Oct 17, 2022

Contributor

@markalle markalle commented May 10, 2021

[updated the top-level github description to match the commit message]:

On architectures that store long doubles as 80 bit extended precisions
or as 64 bit "float64"s, we need conversions to 128 bit quad precision to
satisfy MPI_Pack_external/Unpack_external. I added a couple more
arguments to pFunction to know what architecture the 'to' and 'from'
buffers are. Previously we had architecture info 'local' and 'remote'
but I don't know how to correlate local/remote with to/from without
adding more arguments as I did.

With the increased information about the context, the conversion function
can now convert the long double as needed.

I'm using code Lisandro Dalcin contributed for the floating point
conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64.
These conversion functions require the data to be in local endianness,
but one of the sides in pack/unpack is always local so operations can
be done in an order that allows the long double conversion to see the
data in local endianness.

I also added a path to use __float128 for the conversion
for #ifdef HAVE___FLOAT128 as that ought to be the more reliable
method than rolling our own bitwise conversions.
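
A minimal sketch of what such a compiler-assisted path can look like, assuming a GCC- or Clang-style compiler. The predefined `__SIZEOF_FLOAT128__` macro is used here as a stand-in for the configure-time `HAVE___FLOAT128` check, and `ldbl_to_f128_bytes` is a hypothetical name, not a function from this PR:

```c
#include <string.h>

/* Hypothetical helper, not part of Open MPI: widen a native long double to
 * binary128 and copy the 16 bytes out in local byte order.
 * Returns 1 on success, 0 if the compiler offers no __float128 type. */
static int ldbl_to_f128_bytes(long double x, unsigned char out[16])
{
#if defined(__SIZEOF_FLOAT128__)   /* assumption: stand-in for HAVE___FLOAT128 */
    __float128 q = (__float128) x; /* the compiler performs the format conversion */
    memcpy(out, &q, sizeof q);
    return 1;
#else
    (void) x;
    (void) out;
    return 0;                      /* caller must fall back to bitwise routines */
#endif
}
```

The bytes come out in local byte order, so a little-endian host would still need a byte swap before writing them to a big-endian destination.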

The reason for all the arch.h changes is the former code was
inconsistent as to how bits were labeled within a byte, and had
masks like LONGISxx that didn't match the bits they were supposed
to contain.

@markalle
Contributor Author

Fixes #8918
not well tested yet.

So far just tested on Mac.

Needs testing somewhere big-endian, and somewhere that __float128 is defined, since I only just added that path and haven't tested it at all yet. A testcase is available at
https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048

@dalcinl
Contributor

dalcinl commented May 10, 2021

@markalle I'll insist on a point I made before. I believe my code is overall correct for BE arches, but what I'm not sure about is the helper union layout for long double on big-endian, either 32-bit or 64-bit. Once the union layout is confirmed, the rest of the conversion code should work correctly.

@gpaulsen gpaulsen requested a review from bosilca May 10, 2021 13:29
@markalle
Contributor Author

I wasn't saying your code is wrong for big-endian machines. I'm saying it only converts data that's in the local endianness: it can't be used directly on big-endian data on a little-endian machine. That's okay, because each conversion we do involves going either to or from the local endianness, so I've ordered the endianness conversion and the f80/f128 conversions so that they only have to operate on local-endianness data.

I'm undecided how much to invest in performance. I think with all the steps involved and multiple copies already taking place, and using memcpy() instead of dereference it's going to be insanely slow. But it's definitely possible to be packing on a 3-byte boundary for example, so I think it has to use memcpy() or at least have that as an option anyway.

I still haven't studied the code carefully enough to be confident that the big-endian side is right. I ran two QEMU setups (ppc / mips) and both were storing long double as a 64-bit double, so I didn't actually get to see how a big-endian with 80-bit extended precision lays out its data. My guess though is that the current code has the padding in the wrong place, although I'm only guessing. Current code for big-endian 80-bit extended:

    unsigned sign  :  1;
    unsigned exp   : 15;
    unsigned pad   : 16;
    unsigned frac1 : 32;
    unsigned frac0 : 32;

I would have expected the pad at the bottom, so sign, exp, frac1, and frac0 would all be adjacent. Is there a reason you initially coded it as above? Without a test machine I'm not 100% confident, but I'm inclined to move the padding and make the data adjacent.
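
One way to settle a layout question like this on a real machine is to dump the in-memory bytes of a known value and compare them with the bitfield order. A minimal sketch, where `dump_long_double` is a hypothetical helper name and not part of the PR:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the in-memory bytes of x into out as hex pairs
 * separated by spaces, lowest address first. Returns the number of characters
 * written. For 80-bit formats stored in 12 or 16 bytes, the padding bytes
 * hold unspecified values. */
static size_t dump_long_double(long double x, char *out, size_t outlen)
{
    unsigned char bytes[sizeof(long double)];
    size_t pos = 0;

    if (outlen == 0) {
        return 0;
    }
    out[0] = '\0';
    memcpy(bytes, &x, sizeof bytes);
    /* Each byte needs 3 characters plus room for the terminating NUL. */
    for (size_t i = 0; i < sizeof bytes && pos + 4 <= outlen; i++) {
        pos += (size_t) snprintf(out + pos, outlen - pos, "%02x ", bytes[i]);
    }
    return pos;
}
```

Dumping a value such as -1.0L makes the sign and exponent bytes easy to spot, which shows where the padding sits relative to them.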

@markalle markalle force-pushed the pack_long_double_conversion branch from 50dce95 to 1db7464 Compare May 10, 2021 23:00
@markalle
Contributor Author

I didn't go too far with performance, but those memcpy() were awfully expensive in at least some cases, so I put an alignment check and conditionally avoid the memcpy now, and repushed.
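
A rough sketch of the kind of alignment check described here, not the PR's actual code. The helper name `load_long_double` is hypothetical, and `_Alignof` assumes a C11 compiler:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read one long double from an address that may be
 * unaligned, e.g. a pack buffer that starts on a 3-byte boundary. */
static long double load_long_double(const void *src)
{
    long double v;

    if ((uintptr_t) src % _Alignof(long double) == 0) {
        v = *(const long double *) src;  /* aligned: plain load */
    } else {
        memcpy(&v, src, sizeof v);       /* unaligned: byte-wise copy */
    }
    return v;
}
```

The memcpy() branch stays as the fallback because packed buffers give no alignment guarantee.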

And I'm making a guess about what an extended double precision looks like natively on a big endian machine and moving the padding vs what @dalcinl had. I'm not positive though, just suspicious that it ought to be this way.

@dalcinl
Contributor

dalcinl commented May 11, 2021

Is there a reason you initially coded it as above?

I followed the layout in ieee754.h from glibc. Look for union ieee854_long_double in that header [link]

However, I found it suspicious, too. Look at GCC's software-fp implementation [link]
They have the pad at the beginning! That cannot be right on i386, right?

The truth is, I don't really know whether there exists any big-endian architecture where long double is extended precision (float80) instead of just double or quad precision. If there is none, then all our care about float80 big-endian is pointless. I would say go ahead with any union layout you want, leave a comment in the code, and let users complain if they get the wrong thing.

@markalle markalle force-pushed the pack_long_double_conversion branch from 1db7464 to 25dd2e8 Compare May 11, 2021 13:48
@markalle
Contributor Author

Thanks, that's a good argument for that format then. So I just repushed to put the padding back where that header had it.

dalcinl added a commit to mpi4py/mpi4py that referenced this pull request May 12, 2021
dalcinl added a commit to mpi4py/mpi4py that referenced this pull request May 13, 2021
@gpaulsen
Member

@dalcinl are you happy with this PR? Should we merge it?

@jsquyres
Member

@bosilca is out on vacation this week 🍹 ; he probably needs to review this before we merge.

@dalcinl
Contributor

dalcinl commented May 14, 2021

@markalle I believe we missed a few important points in our original discussions.

  • I have a Raspberry Pi at home (armv7l). In there, long double is the same as double. So we need to handle this case and implement float64 <-> float128 conversion routines to handle that case.

  • I managed to use a qemu VM to install Debian8 on PowerPC 32bit. In there, long double uses 16 bytes, but it is a different format than Intel's float80. For a quick explanation see first section 1.1 here. This format is still used in POWER7/POWER8 and ppc64le Linux builds.

So, in short, a long double <-> quad precision implementation should be done the following way:

  1. If __float128 is available, use it and move on. After handling byte order, you are done.
  2. Otherwise, we look at the LDBL_MANT_DIG and LDBL_MAX_EXP macros from float.h.
    a) if LDBL_MANT_DIG == 53 and LDBL_MAX_EXP == 1024, then long double is float64 and we need new conversion routines to handle f64 <-> f128. Incidentally, note that this is the case on Windows with Microsoft compilers, even if running on Intel architectures.
    b) if LDBL_MANT_DIG == 64 and LDBL_MAX_EXP == 16384, then long double should be Intel's float80 extended precision. You already have all the needed bits.
    c) if LDBL_MANT_DIG == 113 and LDBL_MAX_EXP == 16384, then long double is actually IEEE quad precision float128 (113 counts the implicit leading bit on top of the 112 stored fraction bits). So you just need to care about byte order. Note: POWER9 has hardware support for float128, so we may see long double becoming quad precision soon. It seems Fedora has plans for it.
    d) if LDBL_MANT_DIG == 106 and LDBL_MAX_EXP == 1024 this is IBM extended precision format (relevant for POWER7 and POWER8). I do not know the details of this format (other than it is built out of two double values), but conversion routines should not be hard. We could add support in a subsequent PR.
    e) Otherwise, ERROR! We do not know the fp format, and errors should never pass silently.
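
Step 2 of the scheme above can be sketched with the preprocessor. The enum and function names are hypothetical labels for this illustration, not Open MPI identifiers. Note that IEEE binary128 reports LDBL_MANT_DIG as 113, because the macro counts the implicit leading bit in addition to the 112 stored fraction bits:

```c
#include <float.h>

/* Hypothetical labels for this sketch; they are not Open MPI identifiers. */
enum ldbl_format {
    LDBL_FMT_F64,     /* long double is IEEE binary64 (e.g. armv7l, MSVC) */
    LDBL_FMT_F80,     /* Intel 80-bit extended precision */
    LDBL_FMT_F128,    /* IEEE binary128 quad precision */
    LDBL_FMT_IBM_DD,  /* IBM extended format built from two doubles */
    LDBL_FMT_UNKNOWN  /* caller should report an error */
};

/* Classify the native long double from the float.h limits. */
static enum ldbl_format detect_ldbl_format(void)
{
#if LDBL_MANT_DIG == 53 && LDBL_MAX_EXP == 1024
    return LDBL_FMT_F64;
#elif LDBL_MANT_DIG == 64 && LDBL_MAX_EXP == 16384
    return LDBL_FMT_F80;
#elif LDBL_MANT_DIG == 113 && LDBL_MAX_EXP == 16384
    return LDBL_FMT_F128;
#elif LDBL_MANT_DIG == 106 && LDBL_MAX_EXP == 1024
    return LDBL_FMT_IBM_DD;
#else
    return LDBL_FMT_UNKNOWN;
#endif
}
```

Because the checks run at compile time, an unknown format could also be rejected with `#error` instead of a runtime value, which matches the "errors should never pass silently" rule in case e).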

Sorry for not figuring this out before. Fortunately, none of your work so far goes to waste. We just need more.
I could contribute the f64 <-> f128 routines. But you will have to wait three to four days for them.

@gpaulsen @jsquyres At this point, I think this is not ready for merge, IMHO we can do better.

@ggouaillardet
Contributor

FWIW and IIRC, the existing conversion subroutine was used between x86_64 and sparcv9

@dalcinl
Contributor

dalcinl commented May 14, 2021

@markalle I've updated my gist with float64 <-> float128 conversion routines. Please note I also changed the union names used in the implementation.

@markalle markalle force-pushed the pack_long_double_conversion branch from 25dd2e8 to 2817bf5 Compare May 25, 2021 20:10
@markalle
Contributor Author

Repushed. I used the new conversion code, in which @dalcinl added f64 as well as f80, but renamed the functions to f80_to_f128() etc.

The usage from the OMPI level is in the macro that defines a pFunction for LONG_DOUBLE conversion. If it's not converting LONG_DOUBLE, or if the long double format is the same in both the to and from architectures, it does a memcpy. Otherwise, when it has to convert, it does the following:

  • endianness to local
  • long double in "from_arch" format to f128, if the from_arch doesn't already have its long double == f128
  • f128 to long double in "to_arch" format, if the to_arch isn't asking for long double == f128
  • endianness to to_arch
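
The four steps above can be sketched for one concrete case: a from_arch whose long double is binary64, packed into external32, which stores long double as 16-byte big-endian binary128. All names below are hypothetical and the code works on raw bit patterns, so it is an illustration under those assumptions and not the routines used in the PR:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for a binary128 bit pattern: hi holds the sign,
 * the 15-bit exponent and the top 48 fraction bits; lo holds the rest. */
typedef struct { uint64_t hi, lo; } f128_bits;

/* Step 1: read 8 bytes stored in the given byte order into a native integer. */
static uint64_t load_u64(const unsigned char *p, int big_endian)
{
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++) {
        v = (v << 8) | p[big_endian ? i : 7 - i];
    }
    return v;
}

/* Step 2: widen a binary64 bit pattern to binary128. */
static f128_bits f64_to_f128_bits(uint64_t d)
{
    const uint64_t frac_mask = (1ULL << 52) - 1;
    uint64_t sign = d >> 63;
    uint64_t exp = (d >> 52) & 0x7FF;
    uint64_t frac = d & frac_mask;
    uint64_t exp128;
    f128_bits q;

    if (exp == 0x7FF) {              /* infinity or NaN */
        exp128 = 0x7FFF;
    } else if (exp != 0) {           /* normal number: rebias the exponent */
        exp128 = exp + (16383 - 1023);
    } else if (frac == 0) {          /* signed zero */
        exp128 = 0;
    } else {                         /* subnormal: becomes a normal binary128 */
        unsigned shift = 0;
        while (!(frac & (1ULL << 52))) {
            frac <<= 1;
            shift++;
        }
        frac &= frac_mask;           /* drop the now-implicit leading bit */
        exp128 = (1 + 16383 - 1023) - shift;
    }
    q.hi = (sign << 63) | (exp128 << 48) | (frac >> 4);
    q.lo = frac << 60;
    return q;
}

/* Step 4: write the 16 bytes in big-endian order. */
static void store_f128_be(f128_bits q, unsigned char out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char) (q.hi >> (56 - 8 * i));
        out[8 + i] = (unsigned char) (q.lo >> (56 - 8 * i));
    }
}

static void pack_f64_as_external32(const unsigned char *from,
                                   int from_big_endian, unsigned char to[16])
{
    uint64_t d = load_u64(from, from_big_endian); /* 1. endianness to local */
    f128_bits q = f64_to_f128_bits(d);            /* 2. from_arch format to f128 */
    /* 3. to_arch already wants f128, so no second format conversion */
    store_f128_be(q, to);                         /* 4. endianness to to_arch */
}
```

Doing the byte-order handling first and last keeps the format conversion in the middle working on native integers only, which is the ordering constraint discussed earlier in the thread.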

@ibm-ompi

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/053f2148801eacbef27f8da18a5bd33d

@ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0bf5c2ff6dbfecebf2f574543c22194a

@markalle markalle force-pushed the pack_long_double_conversion branch from 2817bf5 to b737f3a Compare May 25, 2021 22:34
@markalle
Contributor Author

Update: okay, testing shows not quite ready still. I think the detection is still a slight weak spot

@markalle markalle force-pushed the pack_long_double_conversion branch from b737f3a to ff138cd Compare May 26, 2021 00:21
@markalle
Contributor Author

Repushed based on ppc testing and my comment above. That path worked fine in the default mode (where __float128 is available), but I also tested artificially turning off HAVE___FLOAT128 just to make it test another path, and in that mode it didn't have a conversion for MANTISSA=106 and EXP=10. There's an argument for erroring out in that case if we don't have a conversion for the detected long double format, but at least for now I decided to allow it, since the detection is new and I'd hate to produce a false failure where we error out on a conversion when we didn't have to.

And besides, in the regular mode where I don't artificially push it through an alternate path the code was fine using __float128.

@ibm-ompi

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e86c3001047ea2c0d1440f392563b96f

@gpaulsen
Member

opal_copy_functions_heterogeneous.c:522:35: error: __float128 is not supported on this target
                  f128_buf_to += sizeof(__float128);
opal_copy_functions_heterogeneous.c:580:37: error: __float128 is not supported on this target
                  f128_buf_from += sizeof(__float128);

@dalcinl
Contributor

dalcinl commented Jun 2, 2022

@jsquyres Could you please add the Target: v5.0.0 label as well?

@gpaulsen
Member

@bosilca Do you have time to review?

@gpaulsen
Member

retest

@gpaulsen
Member

bot:retest

@dalcinl
Contributor

dalcinl commented Aug 21, 2022

How can we get this one out of the backlog?

@markalle markalle force-pushed the pack_long_double_conversion branch from 3aea80f to 1272ff5 Compare September 26, 2022 22:10
@markalle
Contributor Author

I just rebased and repushed, no changes. I'd love to be able to get this enhancement in. It's been tested on a decent variety of architectures, including the multiple paths it has for the conversions.

@bwbarrett
Member

bot:aws:retest

@dalcinl
Contributor

dalcinl commented Oct 6, 2022

@bwbarrett Could you please trigger retest of the failing row?

@jsquyres
Member

jsquyres commented Oct 7, 2022

bot:ompi:retest

@awlauria awlauria requested a review from gpaulsen October 7, 2022 14:24
@awlauria
Contributor

awlauria commented Oct 7, 2022

@gpaulsen please review and we'll get this in.

@gpaulsen
Member

@bosilca I don't have enough experience in this important area of code to review this before v5.0.0... Is this something you're interested in? Could you review sometime soon, otherwise this might need to delay until post v5.0.0.

@gpaulsen gpaulsen removed their request for review October 11, 2022 19:52
opal/util/arch.h Outdated
@@ -178,7 +178,7 @@
** bits 1 & 2: length of long double: 00=64, 01=96,10 = 128
** bits 3 & 4: no. of rel. bits in the exponent: 00 = 10, 01 = 14)
** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105,
Contributor

@markalle Your next line change is good, but to keep columns aligned (notice the two spaces after 53,), I think you should fix this one too:

Suggested change
** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105,
** bits 5 - 7: no. of bits of mantisse ( 000 = 53,  001 = 64, 010 = 105,

Contributor Author

making the change, thx

@dalcinl
Contributor

dalcinl commented Oct 11, 2022

@gpaulsen @jsquyres @awlauria Why did you guys remove the v5.0.x label? This PR has been waiting for about a year and a half! @markalle did a terrific job. The code paths have been tested on a variety of architectures to make sure things are working properly. And more importantly, this PR fixes things that are otherwise broken right now. Isn't all the testing we have done on a variety of arches enough for you guys to have some confidence the implementation is sound?

@awlauria
Contributor

@dalcinl PRs are traditionally labeled with their target branch.

We're still open to bringing this back to v5.0.x but for that to happen it needs a review, and to date we have not gotten one yet.

Summaryizing the logic of the pFunc copy functions
with regard to long doubles:

For terminology I''ll use
Member

typo (double single quote)

Contributor Author

Some versions of gcc complain about unmatched quotes inside "#if 0" blocks, but I guess the solution is that I just shouldn't be using "#if 0" to make my comment blocks. I'll switch.

@markalle markalle force-pushed the pack_long_double_conversion branch 2 times, most recently from 43add0f to 8488fb9 Compare October 14, 2022 06:38
@markalle markalle force-pushed the pack_long_double_conversion branch from 8488fb9 to a54b204 Compare October 14, 2022 19:05
Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle markalle force-pushed the pack_long_double_conversion branch from a54b204 to 308a94e Compare October 14, 2022 20:31
@awlauria awlauria merged commit 0349a5d into open-mpi:main Oct 17, 2022
@awlauria
Contributor

Thanks @bosilca and @markalle
