Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fortran: first try if type(*), dimension(..) works as-is #11591

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

Refs. #11582

@ggouaillardet
Copy link
Contributor Author

:bot:ibm:retest

@ggouaillardet ggouaillardet marked this pull request as draft April 17, 2023 04:47
@ggouaillardet ggouaillardet force-pushed the topic/fortran_ignore_tkr branch 5 times, most recently from 34f6114 to e9fe244 Compare April 17, 2023 08:59
@ggouaillardet ggouaillardet marked this pull request as ready for review April 18, 2023 00:18
Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments:

  1. you've removed the option to fall-back to TYPE(), DIMENSION() if TYPE(*), DIMENSION(..) doesn't work. I think we can check DIMENSION(..) first, but we should fall back to what we have previously, as I don't think it was ever wrong.

  2. You've added ASYNCHRONOUS but only to the new option. It seems inconsistent, why not update it in all the options? In my mind CONTIGUOUS is probably more important, but perhaps that's a different PR.

@ggouaillardet
Copy link
Contributor Author

  1. fair point, I updated the PR accordingly
  2. the thing is some gfortran versions pass the configury test with TYPE(*), DIMENSION(..), but fail to build Open MPI because even if ASYNCHRONOUS alone works fine, TYPE (*), DIMENSION(..), ASYNCHRONOUS is not supported (!), and to make things worse, we test ignore_tkr before ASYNCHRONOUS since the former is required by use mpi but the latter is only necessary for use mpi_f08. I will welcome any ideas on how to better test this feature.

@lrbison
Copy link
Contributor

lrbison commented Apr 19, 2023

I re-scanned the mpi 4.0 standard here. Disregard my comment about contiguous, and I see your point about asynchronous.

That is surprising to me that DIMENSION(..) is allowed but not with ASYNCHRONOUS. I wonder why.

After thinking about it some more, I do have a concern that we are changing many interfaces to include ASYNCHRONOUS, when the standard only suggests the non-blocking routines include the keyword. My gut tells me this won't cause any compilation problems, but could potentially cause a performance regression as it ties the compilers hands during optimization somewhat. It would be good to ensure we don't introduce a regression with this change.

Better yet, perhaps we should introduce a separate ignore_trk_f08_async, and use this macro rather than independently adding ignore_trk and OMPI_ASYNCHRONOUS.

@ggouaillardet
Copy link
Contributor Author

After thinking about it some more, I do have a concern that we are changing many interfaces to include ASYNCHRONOUS

ASYNCHRONOUS is only used when testing TYPE(*), DIMENSION(..), but is not added to the definitions of the MPI subroutines that do not have it.

@lrbison
Copy link
Contributor

lrbison commented Apr 20, 2023

ASYNCHRONOUS is only used when testing

Ah, I missed that. Thank you.

lrbison
lrbison previously approved these changes Apr 20, 2023
Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. I'm not familiar enough with the IBM CI to comment on those errors.

Can someone explain why they are failing?

@ggouaillardet
Copy link
Contributor Author

@jjhursey could you please (have someone) have a look at the CI issue on IBM platforms?

  • it seems the use_mpi test timeout with GNU compiler
  • there is a linking error with IBM compilers, that could indicate something is missing when creating the library. I do not know what exactly nor how to detect and prevent that.

@jjhursey
Copy link
Member

jjhursey commented May 5, 2023

@awlauria @gpaulsen I don't have cycles to look at the CI failure.

Let's retest to get the logs back. bot:ibm:retest

@jsquyres
Copy link
Member

bot:ibm:retest

@lrbison lrbison self-requested a review May 18, 2023 14:36
@lrbison lrbison dismissed their stale review May 18, 2023 14:36

CI failures

@gpaulsen
Copy link
Member

bot:ibm:retest

Replace old pragma declarations with the newr
@OMPI_FORTRAN_IGNORE_TKR_PREDECL@ syntax

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
First try if "type(*), dimension(..) :: foo" works as-is
(e.g. without any pragma/directive).

Thanks Chris Parrott for bringing this to our attention

Refs. open-mpi#11582

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@jsquyres
Copy link
Member

jsquyres commented Jul 9, 2024

@ggouaillardet This PR came up again in the last few days. I took the liberty of rebasing it to the tip of main (since it had some conflicts).

Here's what my configure found, on a Mac M2 with gfortran 14 and Fedora 38 with gfortran 14:

checking for Fortran compiler support of TYPE(*), DIMENSION(..)... yes
checking Fortran compiler ignore TKR syntax... 1:type(*), DIMENSION(..):!

And then Open MPI built and installed fine, and built all the example programs correctly.

However, I notice that if I try to run example_usempi or example_usempif08, it hangs here:

$ mpirun --oversubscribe -np 4 ring_usempif08
Process 0 sending 10 to  1 tag 201 ( 4 processes in ring)
Process 0 sent to  1
Process 0 decremented value:  9
Process 0 decremented value:  8
Process 0 decremented value:  7
Process 0 decremented value:  6
Process 0 decremented value:  5
Process 0 decremented value:  4
Process 0 decremented value:  3
Process 0 decremented value:  2
Process 0 decremented value:  1
Process 0 decremented value:  0
Process  0 exiting

ring_c and ring_mpifh work fine.

I ran out of time before being able to investigate further.

It's a bit weird that it does a bunch of sends and receives seemingly correctly (i.e., the ring value decrements properly) and then hangs at the end.

Any idea what's happening here?

FYI @edgargabriel

@ggouaillardet ggouaillardet marked this pull request as draft July 10, 2024 06:00
@ggouaillardet
Copy link
Contributor Author

I think I see what is going on.

in ompi/mpi/fortran/use-mpi-f08/bindings/mpi-f-interfaces.h we (oversimplify) declare

subroutine ompi_send_f(buf) BIND(C, name="ompi_send_f")
   OMPI_FORTRAN_IGNORE_TKR_TYPE, INTENT(IN) :: buf 
end subroutine ompi_send_f

so when the type for buf is type(*), DIMENSION(..) I suspect the buf passed to ompi_send_f is a CFI_desc_t but the subroutine still expects a buffer address.

In the case of ring_usempif08.f90, it means the messages are sent/received from/to the wrong addresses, and rank 1 believes it received 0 instead of 10 and leaves immediately,

I moved the PR to draft from now, and I will try to think of a simple fix.

@jsquyres
Copy link
Member

Mm; right. I was thinking that the first thing in the descriptor was the pointer to the value, and therefore it would be transparent to OMPI's code (i.e., we would just de-reference, and would get the value, just like we did before descriptors). But I guess that's wrong: OMPI will need to double de-reference.

Meaning: accepting type(*), dimension(..) would be a gigantic change. Right?

@ggouaillardet
Copy link
Contributor Author

That will be quite a change, but might not be that big via automation.

I have a proof of concept in mind I will investigate from tomorrow.

Also, that will be a first step towards supporting Fortran subarrays.

@ggouaillardet
Copy link
Contributor Author

I will have to revisit this issue with the latest Nvidia compilers.

As long as the mpif-h bindings are supported, we need to use the pragmas for the use mpi module.
And if the pragmas do not work, then we cannot build use mpi, we currently do not even try to build use mpi_f08.

@ggouaillardet
Copy link
Contributor Author

Currently, we (generally, it depends on the compiler) use the pragmas for both use mpi and use mpi_f08.
Though the standard does not specify how to make use mpi work, the type of the buffers is TYPE(*), DIMENSION(..) for use mpi_f08.
So unless I missed something, we are doing it wrong.
I can appreciate that was done when Fortran 2008 bindings were partially supported by most Fortran compilers, but I guess we are past that. Since fixing it will break the ABI, I would suggest we only build use mpi_f08 bindings if TYPE(*), DIMENSION(..) is supported by the Fortran compiler, and hard code the right type everywhere under use-mpi-f08.

@jsquyres any thoughts?

@jsquyres
Copy link
Member

jsquyres commented Jul 12, 2024

It's true that -- at least before Fortran compilers widely supported type(*), dimension(..) -- we could claim correct support for mpi_f08, because there's language in the spec about what you can do if compilers didn't support all the language features. That being said, Fortran compilers that support type(*), dimension(..) are much more widespread nowadays. It's probably true that Open MPI has a (mostly) functional mpi_f08 module, but it isn't technically adherent to the standard any more.

We have known that this was coming for a long time, but there's never been enough priority and resources to fix it, unfortunately.

That being said, I'm pretty convinced that the way to fix it is to generate all 3 of the Fortran bindings (mpif.h, mpi, and mpi_f08). The Fortran bindings are pretty formulaic -- much more so than the C bindings, and this should be a pretty do-able (albeit large) task. Paired with the Python library out of the MPI Forum that is used to maintain the meta data of the bindings themselves (or something like it -- even if it's just some kind of machine-readable index listing file of the bindings), we should be able to generate both the Fortran declarations and the corresponding implementations.

There have been multiple attempts at doing this over the past few years, but none were completed.

The most recent attempt was out of Los Alamos. If memory serves, binding generation was a byproduct of embiggening / "big count". But that may still be the best path forward.

True: we'll need to have some discussions about ABI. It might even be simplest to take a rip-the-band-aid-off type of solution -- just do a whole new mpi_f08 module and not worry about ABI. We could even support both the new and old mpi_f08 implementations at the same time (e.g., give a configure-type option as to which to build, and ultimately sunset / remove the old mpi_f08 implementation after a few releases).

@ggouaillardet
Copy link
Contributor Author

Thanks @jsquyres.

Could you please discuss this during the next call and provide some directions?
Is Open MPI 6 scheduled ? Will it support the new "standard" MPI ABI? if not, do we already know whether we will break the ABI with Open MPI 5?

@jsquyres
Copy link
Member

A few points for ya -- feel free to ask for more data (I forget that your timezone is inconvenient to attend the OMPI engineering webexes, etc.):

  • OMPI 6 is aimed at end-of-year(ish).
  • There is no "standard" ABI yet -- there have been a bunch of drafts, but nothing has passed the Forum yet. Hence, we paused work on ABI and focused on embiggening (and all of its ripple effects throughout the code base).
    • We had a bunch of discussions and planning about the Forum-sponsored ABI, however.
    • The plan was to support both the Forum ABI and the OMPI ABI (so that we don't screw our existing customers).
    • The short version is that the intent is -- for the C bindings, for example -- to have 2 different libraries: one for the Forum ABI and one for the OMPI ABI. The Forum ABI library would probably do a little swizzling of the Forum data structures to the back-end OMPI data structures and then invoke the back end ompi_BLAH() functionality.

@jsquyres
Copy link
Member

@ggouaillardet Given that I don't think we'll be able to have a quick-fix for supporting type(*), dimension(..), is there a different approach possible to #11582 that might work for v5.0.x (and/or v4.1.x)?

@ggouaillardet
Copy link
Contributor Author

it seems the root cause was a false positive, i will investigate how to harden the test.

@hppritcha
Copy link
Member

incorrect info here. Jake's big count fortran side will include the assumed rank changes - #12226. I think this PR can be closed.

@jsquyres
Copy link
Member

incorrect info here. Jake's big count fortran side will include the assumed rank changes - #12226.

That's good news!

Will it include all Fortran APIs, or just the ones with bigcount?

I think this PR can be closed.

If there's a solution for the issue described in #11582 for v5.0.x (and potentially v4.1.x), this is worth keeping open.

@ggouaillardet
Copy link
Contributor Author

Believe it or not, that did the trick!

I will make some more testing and issue a new PR this week

diff --git a/config/ompi_fortran_check_ignore_tkr.m4 b/config/ompi_fortran_check_ignore_tkr.m4
index 3686bca82e..ac23f42b91 100644
--- a/config/ompi_fortran_check_ignore_tkr.m4
+++ b/config/ompi_fortran_check_ignore_tkr.m4
@@ -200,12 +200,12 @@ AC_DEFUN([OMPI_FORTRAN_CHECK_IGNORE_TKR_SUB], [
   end subroutine force_assumed_shape
 
   module check_ignore_tkr
-  interface
-     subroutine foobar(buffer, count)
+  interface foobar
+     subroutine foobar_x(buffer, count)
        $1 buffer
        $2, intent(in) :: buffer
        integer, intent(in) :: count
-     end subroutine foobar
+     end subroutine foobar_x
   end interface
   end module
 

@bcornille
Copy link

To give some info from flang-new using ggouallardet:topic/nvfortran_ignore_tkr, it produces

OMPI_FORTRAN_F08_PREDECL='!'
OMPI_FORTRAN_F08_TYPE='type(*), dimension(..)'
OMPI_FORTRAN_IGNORE_TKR_PREDECL='!DIR$ IGNORE_TKR'
OMPI_FORTRAN_IGNORE_TKR_TYPE='type(*)'

This leads to the mpi_f08 interfaces using type(*) with !DIR$ IGNORE_TKR

interface
subroutine mpi_allreduce_f08(sendbuf,recvbuf,count,datatype,op,comm,ierror)
use mpi_f08_types,only:mpi_comm
use mpi_f08_types,only:mpi_datatype
use mpi_f08_types,only:mpi_op
type(*),intent(in)::sendbuf
!dir$ ignore_tkr(tkrdm) sendbuf
type(*)::recvbuf
!dir$ ignore_tkr(tkrdm) recvbuf
integer(4),intent(in)::count
type(mpi_datatype),intent(in)::datatype
type(mpi_op),intent(in)::op
type(mpi_comm),intent(in)::comm
integer(4),intent(out),optional::ierror
end
end interface

I don't know if the intention is that OMPI_FORTRAN_F08_TYPE be used when it does work, but that doesn't seem to be the behavior. So despite seeming to support assumed rank, that interface is not used for mpi_f08.

Here are the outputs for the checks

configure:50970: checking for Fortran compiler support of TYPE(*), DIMENSION(..)
configure:51054: flang-new -c   conftest.f90 >&5
./conftest.f90:7:17: warning: The global subprogram 'force_assumed_shape' is not compatible with its local procedure declaration (incompatible dummy argument #1: incompatible dummy data object types: REAL(4) vs COMPLEX(4))
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:56:14: Declaration of 'force_assumed_shape'
    subroutine force_assumed_shape(a, count)
               ^^^^^^^^^^^^^^^^^^^
./conftest.f90:7:17: Declaration of 'force_assumed_shape'
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:14:17: warning: The global subprogram 'foo' is not compatible with its local procedure declaration (incompatible procedure attributes: ImplicitInterface)
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:59:10: Declaration of 'foo'
      call foo(a, count)
           ^^^
./conftest.f90:14:17: Declaration of 'foo'
       subroutine foo(buffer, count)
                  ^^^
configure:49118: checking for Fortran compiler support of TYPE(*), DIMENSION(*)
configure:49202: flang-new -c   conftest.f90 >&5
error: Semantic errors in conftest.f90
./conftest.f90:7:17: warning: The global subprogram 'force_assumed_shape' is not compatible with its local procedure declaration (incompatible dummy argument #1: incompatible dummy data object types: REAL(4) vs COMPLEX(4))
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:56:14: Declaration of 'force_assumed_shape'
    subroutine force_assumed_shape(a, count)
               ^^^^^^^^^^^^^^^^^^^
./conftest.f90:7:17: Declaration of 'force_assumed_shape'
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:14:17: warning: The global subprogram 'foo' is not compatible with its local procedure declaration (incompatible procedure attributes: ImplicitInterface)
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:59:10: Declaration of 'foo'
      call foo(a, count)
           ^^^
./conftest.f90:14:17: Declaration of 'foo'
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:41:12: error: Rank of dummy argument is 0, but actual argument has rank 1
    call foo(buffer1, count)
             ^^^^^^^
configure:49236: checking for Fortran compiler support of !GCC$ ATTRIBUTES NO_ARG_CHECK
configure:49320: flang-new -c   conftest.f90 >&5
error: Semantic errors in conftest.f90
./conftest.f90:7:17: warning: The global subprogram 'force_assumed_shape' is not compatible with its local procedure declaration (incompatible dummy argument #1: incompatible dummy data object types: REAL(4) vs COMPLEX(4))
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:56:14: Declaration of 'force_assumed_shape'
    subroutine force_assumed_shape(a, count)
               ^^^^^^^^^^^^^^^^^^^
./conftest.f90:7:17: Declaration of 'force_assumed_shape'
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:14:17: warning: The global subprogram 'foo' is not compatible with its local procedure declaration (incompatible procedure attributes: ImplicitInterface)
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:59:10: Declaration of 'foo'
      call foo(a, count)
           ^^^
./conftest.f90:14:17: Declaration of 'foo'
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:77:5: error: No specific subroutine of generic 'foobar' matches the actual arguments
      call foobar(var(1,1,1), 1)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^
configure:49354: checking for Fortran compiler support of !DIR$ IGNORE_TKR
configure:49438: flang-new -c   conftest.f90 >&5
./conftest.f90:7:17: warning: The global subprogram 'force_assumed_shape' is not compatible with its local procedure declaration (incompatible dummy argument #1: incompatible dummy data object types: REAL(4) vs COMPLEX(4))
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:56:14: Declaration of 'force_assumed_shape'
    subroutine force_assumed_shape(a, count)
               ^^^^^^^^^^^^^^^^^^^
./conftest.f90:7:17: Declaration of 'force_assumed_shape'
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:14:17: warning: The global subprogram 'foo' is not compatible with its local procedure declaration (incompatible procedure attributes: ImplicitInterface)
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:59:10: Declaration of 'foo'
      call foo(a, count)
           ^^^
./conftest.f90:14:17: Declaration of 'foo'
       subroutine foo(buffer, count)
                  ^^^

@bcornille
Copy link

Backporting the LLVM !$DIR IGNORE_TKR check would be helpful for flang-new regardless.

@ggouaillardet
Copy link
Contributor Author

@bcornille This is the message I get with LLVM 18.1.8

configure:49354: checking for Fortran compiler support of !DIR$ IGNORE_TKR
configure:49438: flang-new -c   conftest.f90 >&5
./conftest.f90:7:17: warning: The global subprogram 'force_assumed_shape' is not compatible with its local procedure declaration (incompatible dummy argument #1: incompatible dummy data object types: REAL(4) vs COMPLEX(4))
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:56:14: Declaration of 'force_assumed_shape'
    subroutine force_assumed_shape(a, count)
               ^^^^^^^^^^^^^^^^^^^
./conftest.f90:7:17: Declaration of 'force_assumed_shape'
       subroutine force_assumed_shape(a, count)
                  ^^^^^^^^^^^^^^^^^^^
./conftest.f90:14:17: warning: The global subprogram 'foo' is not compatible with its local procedure declaration (incompatible procedure attributes: ImplicitInterface)
       subroutine foo(buffer, count)
                  ^^^
./conftest.f90:59:10: Declaration of 'foo'
      call foo(a, count)
           ^^^
./conftest.f90:14:17: Declaration of 'foo'
       subroutine foo(buffer, count)
                  ^^^
error: loc("/home/users/u0001043/build/ompi-llvm-18.1.8/conftest.f90":41:3): /tmp/llvm-project-18.1.8.src/flang/lib/Lower/CallInterface.cpp:949: not yet implemented: support for polymorphic types
LLVM ERROR: aborting

flang-new clearly states this requires a feature that is not yet implemented.
(same behavior with main branch and #12681, both are unable to build the mpi_f08 bindings.

But if I used LLVM main branch (the release/19.x branch should be shortly), both main and #12681 successfully build the mpi_f08 bindings with the !DIR$ IGNORE_TKR directive.

I am not sure how to interpret your previous message. I think this PR will be dropped (wrong approach that would require way too many changes and break the ABI) in favor of #12681. As far as flang-new is concerned, Open MPI will only support LLVM 19 and higher. Does that make sense to you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants