
[Woptim#24] update spack, uberenv, radiuss spack configs and use spack user_cache_path #1284

Merged: 37 commits into develop, Aug 5, 2022

Conversation

adrienbernede
Member

@adrienbernede adrienbernede commented Jun 16, 2022

Summary

  • This PR is an update of CI parameters (Spack, Uberenv, RADIUSS-Spack-Configs)

Design review

Fundamentally, the changes in this branch are the same as in task/add-tioga-ci-pipeline (#1272), minus the Tioga specifics.
Thanks @kab163 for doing a good part of the update work.

Then I added changes to make sure Spack is used in an isolated environment, especially regarding the clingo install. Those changes are based on @trws's suggestions found here: spack/spack#31030 (comment).

I tested those changes using my CI account. With this PR, the same tests should run under @davidbeckingsale's account and give the same results, because the clingo setup is now independent of the user config.
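For illustration, isolating Spack's per-user state in CI might look like the sketch below. The variable values and paths here are assumptions for the example, not what this PR actually sets:

```shell
# Hypothetical sketch: keep Spack's user-level state (including the clingo
# bootstrap) inside the CI workspace instead of ~/.spack, so results do not
# depend on which user account runs the pipeline.
export SPACK_USER_CACHE_PATH="${CI_PROJECT_DIR:-$PWD}/.spack-user-cache"
export SPACK_DISABLE_LOCAL_CONFIG=true  # ignore per-user ~/.spack configuration
mkdir -p "$SPACK_USER_CACHE_PATH"
```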

CI results:

  • There is one job failing, which should be looked at and fixed by the RAJA team.
  • After comparing with Umpire, there is one last change needed; things were going too well...

Misc.

@adrienbernede adrienbernede changed the title [WOPTIM#24] update spack, uberenv, radiuss spack configs and use spack user_cache_path [Woptim#24] update spack, uberenv, radiuss spack configs and use spack user_cache_path Jun 22, 2022
@rhornung67
Member

@adrienbernede I don't see how your changes could cause the CMake issue related to desul atomics. Do you have any ideas?

@rhornung67
Member

@adrienbernede @davidbeckingsale After other PRs were merged, I merged develop into this. Now, it appears that all the target export stuff is broken. I don't know what's going on here.

@adrienbernede
Member Author

David thought it could be caused by a hidden BLT update.
The initial error message suggests that the cuz target is defined in two different export sets; it would work if it were in only one.
I can try to pinpoint the BLT version, but that will have to wait until Thursday.

@rhornung67
Member

@adrienbernede that's OK. We'll track it down.

@trws
Member

trws commented Jul 6, 2022

@adrienbernede, @rhornung67 and I poked at this some, and I just pushed what seems to be a minimal fix. The short version is that the RAJA export set didn't contain the exported BLT targets, and the targets were getting exported twice somehow. I'm not sure this will solve the whole problem, but hopefully one step closer.
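A rough CMake illustration of the failure mode described above (target and export-set names here are made up, not RAJA's actual ones): a target can belong to only one export set, and every target a library links against publicly must be reachable from the export set it is installed with.

```cmake
# Hypothetical layout, for illustration only -- not the actual RAJA CMake code.
add_library(blt_hip_stub INTERFACE)           # stands in for an exported BLT target
add_library(RAJA STATIC raja.cpp)
target_link_libraries(RAJA PUBLIC blt_hip_stub)

# Both targets must be installed into the same export set. Installing RAJA
# alone fails with an error like "... requires target blt_hip_stub that is
# not in any export set", while exporting blt_hip_stub from a second export
# set as well produces the duplicate-export error seen in this thread.
install(TARGETS RAJA blt_hip_stub EXPORT RAJATargets)
install(EXPORT RAJATargets NAMESPACE RAJA:: DESTINATION lib/cmake/raja)
```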

@rhornung67
Member

@trws asking @davidbeckingsale to chime in. He says your change will break other things.

@davidbeckingsale
Member

> @trws asking @davidbeckingsale to chime in. He says your change will break other things.

I'm second-guessing myself now; let's see what the CI says. My concern was the fact that raja-config.cmake looks for camp. Both depend on the BLT targets, but if those targets are only in the RAJA export set, they won't be there when find_dependency(camp) is called.
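The ordering concern can be sketched with a package config file shaped roughly like this (file and target names are hypothetical, not the actual generated files):

```cmake
# Hypothetical raja-config.cmake shape, for illustration only.
include(CMakeFindDependencyMacro)

# camp's imported targets may reference BLT targets; if those BLT targets are
# only defined by RAJATargets.cmake below, they do not exist yet at this point.
find_dependency(camp)

include("${CMAKE_CURRENT_LIST_DIR}/RAJATargets.cmake")
```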

@trws
Member

trws commented Jul 6, 2022

They do. For some reason camp generates a camp::openmp exported target despite not having a single line of code requesting it (I'm still not sure how or why), and it depends on that. RAJA depends on the BLT target and needs to have it exported in the same export set. The error about multiple exports of the same target still has me worried, but I'm hoping this gets us closer.

@trws
Member

trws commented Jul 6, 2022

I should say, the thing I'm not sure about is the downstream impact. You reworked the export setup for a reason, as I recall, @davidbeckingsale, and I certainly don't want to break app builds or anything else that needs the BLT targets as well.

@davidbeckingsale
Member

The export split fixed finding a camp that was built as part of RAJA, since camp could depend on the BLT targets, which were not imported until the RAJA targets were pulled in. But adding target export to camp perhaps made that fix redundant, because camp now puts the targets in its own export set too.

@rhornung67
Member

When we think we have this stuff straightened out and we have tioga CI working, we should have a couple of apps try it out and then do a new coordinated release of RAJA, Umpire, camp, etc.

@rhornung67
Member

@davidbeckingsale something's amiss with the new CMake test you added. That appears to be what's breaking GitLab CI.

@davidbeckingsale
Member

Yup, that's what I was thinking would happen! But it's not the test that's wrong :p

@trws
Member

trws commented Jul 6, 2022

OK, the change I made means that file doesn't exist anymore. Based on the errors we were getting, I'm not sure we can both fix the RAJA and camp targets and still generate that file. Do we need it?

@trws
Member

trws commented Jul 28, 2022

OK, I think this is finally ready to go. It still has workarounds for the AMD issues, and the raja package embedded for uberenv refers explicitly to camp@main. Some of this can go away when spack/spack#31739 lands, the rest when a new release is cut for everything.

After this goes in, based on discussions we've had, I think we need to look at the following cleanup to avoid having this be quite so painful next time:

  1. promote some things to BLT
    1. SYCL support
    2. target export, with the protections we developed here, so they can be inherited across projects
  2. add support, in either build_and_test.sh or somewhere in the uberenv machinery, for pulling the git hashes of the submodules used in RAJA and passing them as constraints on the Spack spec. For example, ^camp/deadbeef might be generated, so we don't have to hack this manually to get CI working.
  3. update our CMake requirement to 3.20 (anything over 3.16 could work, but 3.20 has things we really want and is just shy of the version that breaks a lot of projects)
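Item 2 above might be sketched as a small helper like the one below. The function name, submodule path, and spec syntax are assumptions built on the ^camp/deadbeef example in the list, not an existing script:

```shell
# Hypothetical helper: turn a submodule checkout into a Spack spec constraint.
submodule_constraint() {
  pkg="$1"; path="$2"
  sha=$(git -C "$path" rev-parse --short HEAD) || return 1
  printf '^%s/%s\n' "$pkg" "$sha"
}

# Usage sketch, assuming camp is checked out at tpl/camp:
#   spack spec "raja $(submodule_constraint camp tpl/camp)"
```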

@davidbeckingsale
Member

@trws do we need the logic you added to camp to look for blt:: prefixed targets here as well?

@trws
Member

trws commented Jul 28, 2022

Good question. The final version of the camp and RAJA updates I cooked up here doesn't actually do that. But if, as part of the BLT update, we also unconditionally add the aliases and protect their creation in BLT from errors caused by them already existing, then we probably should. It would make it easier to ensure composed projects always get the appropriate targets, but we didn't need it right now, so I didn't put it in.
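The guard described above might look something like this in CMake (target names are hypothetical):

```cmake
# Hypothetical sketch: create the blt:: alias only when the underlying target
# exists and the alias doesn't already, so a project composed of several BLT
# users can include this logic more than once without errors.
if(TARGET openmp AND NOT TARGET blt::openmp)
  add_library(blt::openmp ALIAS openmp)
endif()
```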

@davidbeckingsale
Member

Ah gotcha, that makes sense.

@trws
Member

trws commented Jul 29, 2022

And of course, now that lassen is back up and everything else works, corona's TCE update is breaking us. 😖

More importantly, the icpc 18 build is broken, though 19 is fine. Are we still intentionally supporting 18 at this point @rhornung67?

@rhornung67
Member

No. Let's trash Intel 18. A few apps are still on Intel 19. I'm trying to persuade them to move to Intel 2022 if they want an Intel compiler. We can remove Intel 19 once everyone is off it.


@rhornung67 left a comment


Thanks for working through all this mess.

@trws
Member

trws commented Jul 29, 2022

Thanks @rhornung67, the new corona TCE is currently breaking us because the rocm 5.1.0 doesn't exist anymore on there. Corresponding PR on radiuss-spack-configs is already up.

@adrienbernede
Member Author

Awesome work, thank you so much!

@rhornung67
Member

> Thanks @rhornung67, the new corona TCE is currently breaking us because the rocm 5.1.0 doesn't exist anymore on there. Corresponding PR on radiuss-spack-configs is already up.

Why does LC insist on removing earlier versions of compilers so quickly? We need them around so we can do comparisons when issues arise. I thought we mentioned this to them before.

@trws
Member

trws commented Jul 29, 2022

Tioga still has them all, and corona still has some other older versions. I'm guessing something unfortunate happened on corona that broke it or corrupted it or something.

@rhornung67
Member

Nothing like debugging CI when things happen that you can't control, huh?

@rhornung67
Member

I think that when this PR is merged, we can turn off corona (if we don't need it) and move to tioga. Correct?

@trws
Member

trws commented Jul 29, 2022

Yes, and honestly we'll get more relevant test results that way too. Be aware, though, that it will be flipping to Flux in a few weeks, shortly after Corona does.

@adrienbernede
Member Author

adrienbernede commented Jul 29, 2022

Hmm, that's good to know. I'll be attending the Flux AWS tutorial; hopefully I'll still have enough time to move the scripts from Slurm to Flux...

@adrienbernede
Member Author

Also, we are planning to use radiuss-shared-ci on RAJA, so I'll have to implement a Tioga pipeline there.
It's probably a good idea to test on Umpire first; what do you think, @davidbeckingsale?
Do the RAJA and Umpire service accounts already have access to Tioga?

@trws
Member

trws commented Jul 29, 2022

> hmm, that's good to know. I'll be attending the Flux AWS tutorial, hopefully I'll still have enough time to move scripts from slurm to flux...

On this, I'm hoping it will take minimal work to get going. Ryan Day is maintaining a set of wrappers for Flux commands, so srun, sbatch, and similar should keep working for common options. Only more complex cases will likely need meaningful tweaking.

If you have access to rzalastor, there's already a 4-node Flux partition there that's ready for testing; I'm not sure whether the wrappers have made it on there yet, though.
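For reference, the common Slurm-to-Flux command mapping looks roughly like this. The exact Flux command names here are assumptions based on the tooling of that era, and the wrappers mentioned above are meant to make the left-hand forms keep working:

```shell
# Rough Slurm -> Flux equivalents (sketch only; check local documentation):
#   srun   -N2 -n8 ./app    ->  flux mini run   -N2 -n8 ./app
#   sbatch -N2 job.sh       ->  flux mini batch -N2 job.sh
#   squeue                  ->  flux jobs
#   scancel <jobid>         ->  flux job cancel <jobid>
```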

```yaml
SPEC: " tests=none %intel@18.0.2"
DEFAULT_TIME: 40
extends: .build_and_test_on_ruby
# icpc_18_0_2:
```

👍

```yaml
DEFAULT_TIME: 40
extends: .build_and_test_on_ruby
# icpc_18_0_2:
# variables:
```

👍

@rhornung67
Member

@adrienbernede RAJA has not switched to using a service account for GitLab CI yet. It runs as @davidbeckingsale.

@trws
Member

trws commented Jul 30, 2022

It's green, it's finally ready to go.

@rhornung67
Member

Merge this or wait for other PRs to be merged (camp, blt, radiuss-spack-configs, etc.) first?

@trws
Member

trws commented Aug 1, 2022

The others are all merged except Spack, but on Spack I'm seeing Umpire builds failing. @davidbeckingsale?

@trws trws merged commit b6f4646 into develop Aug 5, 2022
@adrienbernede adrienbernede deleted the bernede1/ci/update branch August 10, 2022 10:00