Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable GPU exection of atm_rk_integration_setup via OpenACC #1223

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

abishekg7
Copy link
Collaborator

@abishekg7 abishekg7 commented Aug 6, 2024

This PR enables the GPU execution of atm_rk_integration_setup subroutine

An initial OpenACC port of the array assignments in atm_rk_integration_setup. Needed to split up ACC loops with gang vector collapse(2) into separate ACC loop statements to achieve correct results.

@abishekg7 abishekg7 changed the base branch from master to develop August 6, 2024 16:52
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 249a183 to 7badda5 Compare August 6, 2024 17:03
@abishekg7 abishekg7 marked this pull request as ready for review August 6, 2024 17:28
@mgduda mgduda self-requested a review August 6, 2024 17:45
@mgduda mgduda added Atmosphere OpenACC Work related to OpenACC acceleration of code labels Aug 6, 2024
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 7badda5 to 40b7b75 Compare August 6, 2024 17:51
@gdicker1
Copy link
Collaborator

gdicker1 commented Sep 9, 2024

@abishekg7 I tried the changes in this PR, everything works as expected. Though I wonder if we could still use collapse statements?

I think mixing the levels of parallelism and the collapse statements causes issues. I was able to get things to reproduce results by just using acc loop collapse(2) (collapse(3) in one place). From the best practices guide I also think it might be fine to do something like acc loop gang collapse(2) OR acc loop vector collapse(2). My preference is not to state the level of parallelism so the compiler gets it right for me.

@mgduda
Copy link
Contributor

mgduda commented Sep 12, 2024

@gdicker1 I don't quite think I follow your comment. If my understanding is correct, we found that collapsing loops with different levels of parallelism leads to incorrect results, and the only place where we could collapse vector loops already has a collapse(2) clause. In which instances, specifically, would you suggest not explicitly stating the level of parallelism? Perhaps I'm distrustful of compilers, but it seems to me that being explicit about our intent as developers should lead to better outcomes than hoping the compiler will make good decisions.

@mgduda
Copy link
Contributor

mgduda commented Sep 12, 2024

I think it might be worth taking a fresh look at the commit message

    initial OpenACC port of atm_rk_integration_setup
    
    - splitting up the loop gang vector collapse(2) into two separate loops, as it leads
      to erroneous results otherwise

and PR description. The text about splitting up the loop might not make sense to anyone who doesn't know the history of the porting of the atm_rk_integration_setup routine, and so that remark could either be omitted or rewritten. Perhaps we could consider it a true unless stated otherwise, but even still it might also be good to state in the commit message that there are no changes to results.

@gdicker1
Copy link
Collaborator

@gdicker1 I don't quite think I follow your comment...

@mgduda, I agree that being proscriptive with the parallelism should be best practice.

With what I'd call a fully collapse-able loop (e.g. a 2-level loop, with code only in the innermost loop body, and not inside another loop) I am simply unsure what level of parallelism is assigned by the compiler when collapsed. So, that's where my preference to let "the compiler get it right for me" comes from - and it applies just to the fully collapse-able loops.

I think the following example would be fine:

      !$acc loop collapse(2)
      do iEdge = edgeStart,edgeEnd
         do k = 1,nVertLevels
            ru_save(k,iEdge) = ru(k,iEdge)
            u_2(k,iEdge) = u_1(k,iEdge)
         end do
      end do

( I also would have expected !$acc loop gang vector collapse(2) to be correct before this PR, since I think it matches with the build output I got for the code example above: "1663, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x" )

- Removing the condition for obtaining num_scalars in subroutine atm_srk3. This
  condition introduced issues when running the Jablonowski-Williamson dycore
  case
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 40b7b75 to 88859a6 Compare October 17, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Atmosphere OpenACC Work related to OpenACC acceleration of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants