-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack external32 long double conversion (extended 80 / quad 128) #8941
Conversation
Fixes #8918 So far just tested on Mac. Needs tested somewhere bigendian, and somewhere that __float128 is defined since I only just added that path and didn't test it at all yet. A testcase is available at |
@markalle I'll insist with a point I made before. I believe my code is overall correct for BE arches, but what I'm not sure about is the helper union layout for |
I wasn't saying your code is wrong for big-endian machines, I'm saying it only converts data that's in the local endianness, it can't be used directly on big-endian data on a little-endian machine, but that's okay because each conversion we do involves either going to or from the local endianness so I've ordered the endianness-conversion and the f80/f128 conversions so it only has to operate on local endianness data. I'm undecided how much to invest in performance. I think with all the steps involved and multiple copies already taking place, and using memcpy() instead of dereference it's going to be insanely slow. But it's definitely possible to be packing on a 3-byte boundary for example, so I think it has to use memcpy() or at least have that as an option anyway. I still haven't studied the code carefully enough to be confident that the big-endian side is right. I ran two QEMU setups (ppc / mips) and both were storing long double as a 64-bit double, so I didn't actually get to see how a big-endian with 80-bit extended precision lays out its data. My guess though is that the current code has the padding in the wrong place, although I'm only guessing. Current code for big-endian 80-bit extended:
I would have expected the pad at the bottom, so sign,exp,frac1,frac0 would all be adjacent. Is there a reason you initially coded it as above? Without a test machine I'm not 100% confident, but I'm inclined to move the padding and make the data adjacent |
50dce95
to
1db7464
Compare
I didn't go too far with performance, but those memcpy() were awfully expensive in at least some cases, so I put an alignment check and conditionally avoid the memcpy now, and repushed. And I'm making a guess about what an extended double precision looks like natively on a big endian machine and moving the padding vs what @dalcinl had. I'm not positive though, just suspicious that it ought to be this way. |
I followed the layout in However, I found it suspicious, too. Look at GCC's software-fp implementation [link] The truth is, I don't really know whether there exists any big-endian architecture where |
1db7464
to
25dd2e8
Compare
Thanks, that's a good argument for that format then. So I just repushed to put the padding back where that header had it. |
@dalcinl are you happy with this PR? Should we merge it? |
@bosilca is out on vacation this week 🍹 ; he probably needs to review this before we merge. |
@markalle I believe we missed a few important points in our original discussions.
So, in short, a
Sorry for not figuring this out before. Fortunately, your none of work so far goes to waste. We just need more. @gpaulsen @jsquyres At this point, I think this is not ready for merge, IMHO we can do better. |
FWIW and IIRC, the existing conversion subroutine was used between |
25dd2e8
to
2817bf5
Compare
Repushed. I used the new code conversions where @dalcinl added f64 as well as f80, but renamed the functions to The usage from the OMPI level is in the macro that defines a pFunction for LONG_DOUBLE conversion. If it's not converting LONG_DOUBLE or the long double format in both the to/from architectures is the same it does memcpy, then if it has to convert, it does
|
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/053f2148801eacbef27f8da18a5bd33d |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/0bf5c2ff6dbfecebf2f574543c22194a |
2817bf5
to
b737f3a
Compare
Update: okay, testing shows not quite ready still. I think the detection is still a slight weak spot |
b737f3a
to
ff138cd
Compare
Repushed based on ppc testing and my comment above. That path worked fine in the default mode (where __float128 is available) but I also tested artificially turning off HAVE__FLOAT128 just to make it test another path, and in that mode it didn't have a conversion for MANTISSA=106 and EXP=10. There's an argument for erroring out in that case if we don't have a conversion for the detected long double format, but at least for now I decided to allow it since the detection is new and I'd hate to produce a false failure where we error out on a conversion when we didn't have to. And besides, in the regular mode where I don't artificially push it through an alternate path the code was fine using __float128. |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/e86c3001047ea2c0d1440f392563b96f |
|
@jsquyres Could you please add the |
@bosilca Do you have time to review? |
retest |
bot:retest |
How can we get this one out of the backlog? |
3aea80f
to
1272ff5
Compare
I just rebased and repushed, no changes. I'd love to be able to get this enhancement in. It's been tested through a decent variety of architectures including the multiple paths it has for the conversions |
bot:aws:retest |
@bwbarrett Could you please trigger retest of the failing row? |
bot:ompi:retest |
@gpaulsen please review and we'll get this in. |
@bosilca I don't have enough experience in this important area of code to review this before v5.0.0... Is this something you're interested in? Could you review sometime soon, otherwise this might need to delay until post v5.0.0. |
opal/util/arch.h
Outdated
@@ -178,7 +178,7 @@ | |||
** bits 1 & 2: length of long double: 00=64, 01=96,10 = 128 | |||
** bits 3 & 4: no. of rel. bits in the exponent: 00 = 10, 01 = 14) | |||
** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markalle Your next line change is good, but to keep columns aligned (notice the two spaces after 53,
), I think you fix this one too:
** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105, | |
** bits 5 - 7: no. of bits of mantisse ( 000 = 53, 001 = 64, 010 = 105, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making the change, thx
@gpaulsen @jsquyres @awlauria Why did you guys remove the v5.0.x label? This PR has been waiting for about a year and a half! @markalle did a terrific job. The code paths have been tested on a variety or architectures to make sure things are working properly. And the more important thing is that this PR fixes things that are otherwise broken right now. Isn't all the testing we have done on a variety of arches enough for you guys to have some confidence the implementation is sound? |
@dalcinl PR's are traditionally labeled with their target branch. We're still open to bringing this back to v5.0.x but for that to happen it needs a review, and to date we have not gotten one yet. |
Summaryizing the logic of the pFunc copy functions | ||
with regard to long doubles: | ||
|
||
For terminology I''ll use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo (double single quote)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some versions of gcc complain about unmatched quotes inside "#if 0" blocks, but I guess the solution is I just shouldn't be using "#if 0" to make my comment blocks. I'll switch
43add0f
to
8488fb9
Compare
8488fb9
to
a54b204
Compare
On architectures that store long doubles as 80 bit extended precisions or as 64 bit "float64"s, we need conversions to 128 bit quad precision to satisfy MPI_Pack_external/Unpack_external. I added a couple more arguments to pFunction to know what architecture the 'to' and 'from' buffers are. Previously we had architecture info 'local' and 'remote' but I don't know how to correlate local/remote with to/from without adding more arguments as I did. With the incresed information about the context, the conversion function can now convert the long double as needed. I'm using code Lisandro Dalcin contributed for the floating point conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64. These conversion functions require the data to be in local endianness, but one of the sides in pack/unpack is always local so operations can be done in an order that allows the long double conversion to see the data in local endianness. I also added a path to use __float128 for the conversion for #ifdef HAVE___FLOAT128 as that ought to be the more reliable method than rolling our own bitwise conversions. The reason for all the arch.h changes is the former code was inconsistent as to how bits were labeled within a byte, and had masks like LONGISxx that didn't match the bits they were supposed to contain. Signed-off-by: Mark Allen <markalle@us.ibm.com>
a54b204
to
308a94e
Compare
[updated the top-level github description to match the commit message]:
On architectures that store long doubles as 80 bit extended precisions
or as 64 bit "float64"s, we need conversions to 128 bit quad precision to
satisfy MPI_Pack_external/Unpack_external. I added a couple more
arguments to pFunction to know what architecture the 'to' and 'from'
buffers are. Previously we had architecture info 'local' and 'remote'
but I don't know how to correlate local/remote with to/from without
adding more arguments as I did.
With the incresed information about the context, the conversion function
can now convert the long double as needed.
I'm using code Lisandro Dalcin contributed for the floating point
conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64.
These conversion functions require the data to be in local endianness,
but one of the sides in pack/unpack is always local so operations can
be done in an order that allows the long double conversion to see the
data in local endianness.
I also added a path to use __float128 for the conversion
for #ifdef HAVE___FLOAT128 as that ought to be the more reliable
method than rolling our own bitwise conversions.
The reason for all the arch.h changes is the former code was
inconsistent as to how bits were labeled within a byte, and had
masks like LONGISxx that didn't match the bits they were supposed
to contain.