ci: Action with Pangea-3 installation reproduction and ppc64le emulation #3159
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## develop #3159 +/- ##
========================================
Coverage 56.59% 56.59%
========================================
Files 1064 1064
Lines 89752 89752
========================================
+ Hits 50791 50793 +2
+ Misses 38961 38959 -2
A few notes:
I am moving this out of the merge queue because it has conflicts and still needs code owner approval. Hopefully, after merging develop, you will be able to build this with shared libs and the compilation time will go down significantly, because having a 2.5h job occupying the 32-core runner is not really sustainable IMO.
Hi @CusiniM,
@CusiniM: The shared library compilation has been broken on the powerpc architecture for a while, due to symbol overflows when linking the library (it was already broken before the geosx->geos PR at the beginning of the month).
@rrsettgast: Once again, the last modifications of the … I understand that the PR takes time, but it is just not possible for us to contribute to Geos and to focus on our research:
My personal opinion is that it is worth waiting a few hours to check that the ci succeeds, and allowing other teams to work, rather than merging errors into develop. For people who are not members of LLNL, ci failure or not, we wait for weeks, if not months, before our work is integrated.
Erratum (I hadn't looked at the new failure before writing):
Hello @Algiane, I don't think @CusiniM is saying that we shouldn't have this PR merged, or that we don't want this PR merged. We all agree that CI coverage for P3 is a worthy goal. We are trying to work around the bottleneck this PR will create in our CI workflow. While the enabling of shared libraries for GEOS in #3282 needed to be done eventually, it was done now specifically to help you get around the linking problems you were seeing so that we could merge this PR.
The decision to integrate this PR or not is not my concern: it is between the code owners and the TotalEnergy managers. Let me just gather here all the relevant information (we had a lot of private exchanges).
Aim of the PR
The current ci process doesn't test the Pangea3 environment (
Today it is the main bottleneck of the Inria-TotalEnergy team.
Solutions that have been investigated
The main reproduction issue is the
Access to a power-pc node
Cross compilation
Dead end: we would not be able to run the binary we produce, because the emulation is not easily compatible with GPU use inside a docker image (it would also be very hard, if even possible, to get the entire project built, and hell to maintain).
Emulation of
Job config | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster
---|---|---|---
Runner | streak-1 | streak2-32core |
ncores used | 16 | 32 | 32
emulation | no | yes | no

Job length (no cache) | Ubuntu CUDA clang ci job | Pangea3 ci job | manual build on pangea 3 cluster
---|---|---|---
geosx build | 1h10 | 1h43 | 32m49
Unit tests build | 13m | 32m29 | 8m35
Total build (+install) time | 1h26 | 2h20 |
geosx build with wave solver only | | 55m | 14m9
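To make the cost of the emulated build easier to read off the timings above, the ratio between the Pangea3 ci job and the manual build on the cluster can be computed with a few lines of Python. This is only a rough sketch: the helper names are hypothetical, the wave-solver row is assumed to belong to the Pangea3 ci job and manual build columns, and the two builds run on different hardware, so the ratio is an indicator of the overall slowdown rather than a pure measure of the emulation overhead.

```python
def to_minutes(text):
    """Convert durations written like '1h43', '32m49' or '13m' to minutes."""
    if "h" in text:
        hours, minutes = text.split("h")
        return int(hours) * 60 + int(minutes or 0)
    minutes, seconds = text.split("m")
    return int(minutes) + int(seconds or 0) / 60


def slowdown(emulated, native):
    """Ratio of the emulated ci job duration to the manual build duration."""
    return to_minutes(emulated) / to_minutes(native)


# (Pangea3 ci job, manual build on the cluster), copied from the tables.
# The column assignment of the wave-solver row is an assumption.
TIMINGS = {
    "geosx build": ("1h43", "32m49"),
    "unit tests build": ("32m29", "8m35"),
    "geosx build, wave solver only": ("55m", "14m9"),
}

if __name__ == "__main__":
    for step, (emulated, native) in TIMINGS.items():
        print(f"{step}: {slowdown(emulated, native):.1f}x slower in the ci job")
```

With these numbers every step comes out between three and four times slower in the emulated ci job than in the manual build.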
Possible solutions for the dev bottleneck and develop instabilities
- much more rigorous PR reviews. Some errors that have been merged recently should not have passed a review. Very probably, as these were very large PRs, not all the files were examined.
- PRs often contain unrelated modifications. While this is acceptable in very small PRs, it should not be in PRs that modify 50 or more files.
Possible solutions for the ci bottleneck
- take advantage of the code modularity and build only the geos binary and the WaveSolver solver;
- do not run the integrated tests job (which runs on streak2 too) when not needed;
- set job dependencies in such a way that the Pangea3 job is triggered only when all other jobs succeed;
- add a specific flag to manually trigger this job when everything else is OK;
- in both cases, the all_job_succeed job has to fail without the P3 job;
- the purchase of a machine dedicated to this job, hosted and managed at Inria, is under discussion between managers.
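The gating rules proposed in the list above can be written down as a small decision sketch. It is not the workflow itself (which would be expressed in the ci configuration); the function and parameter names are hypothetical, and the result strings simply mirror the job result values used by GitHub Actions ("success", "failure", "cancelled", "skipped").

```python
def should_run_p3(other_results, require_flag=False, flag_set=False):
    """Decide whether the Pangea3 job should start.

    other_results maps the name of every other job to its result.
    With require_flag=True the job also needs the manual trigger.
    """
    others_ok = all(result == "success" for result in other_results.values())
    if not others_ok:
        return False
    return flag_set if require_flag else True


def all_job_succeed(other_results, p3_result):
    """Final gate: it must fail when the P3 job was skipped or did not pass."""
    others_ok = all(result == "success" for result in other_results.values())
    return others_ok and p3_result == "success"


if __name__ == "__main__":
    others = {"ubuntu_cuda_clang": "success", "integrated_tests": "success"}
    print(should_run_p3(others))                     # dependency-only variant
    print(should_run_p3(others, require_flag=True))  # manual-flag variant, flag not set
    print(all_job_succeed(others, "skipped"))        # gate fails without the P3 job
```

The point of the last rule is that a skipped P3 job is treated the same as a failed one, so the expensive job can be delayed but never silently bypassed.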
Status of the PR
It has been ready to be integrated since June (with no conflicts with the develop branch): I have worked to maintain it, to integrate the new developments (TPLs + Geos), to fix the introduced bugs and to resolve the introduced conflicts (I have spent more time doing that than developing the PR itself).
Concretely, the PR boils down to ~10 lines that are pretty easy to understand. Everything is documented in the PR description.
The job I was asked to do is done, so I'll leave you in charge of solving conflicts until you make up your mind and decide either to reject or to integrate this work.
GCP seems to be listening to the market: https://cloud.google.com/blog/products/compute/ibm-power-systems-now-available-on-google-cloud
Post #3159 (comment) edited on September 12th to add timing information and new ideas in the "solutions" section.
@Algiane @rrsettgast I think there is a good point here. It's not the first time that the computational expenses have been highlighted.
…ulation (#3340) * Add pangea-3 job and emulation step. * Replace relative path in cmake for P3-wave-solver host-config file. * Build wave solver only in Pangea 3 job. * Build Geos executable only in Pangea 3 job. --------- Co-authored-by: Gaetan <159525405+Bubusch@users.noreply.github.com>
No, sorry for the oversight
New job that:
- emulates a ppc64le architecture (using the docker/setup-qemu-action, which relies on qemu through the qemu-user-static image);
- deploys an AlmaLinux-8 image with prebuilt TPLs and Geos' dependencies installed with respect to the required pangea3 modules:
- adds a HOST_ARCH matrix variable to the job matrix to trigger the installation and the use of the emulation layer;
- builds Geos and the unit tests on the streak-2 self-hosted runner.
Associated with the TPLs PR 257.
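The role of the HOST_ARCH variable can be illustrated with a short sketch: emulation is only needed when the requested architecture differs from the runner's, and in that case the container is started for the foreign platform. The helper names and the architecture mapping are hypothetical and are not taken from the PR; `--platform` is the standard `docker run` option for selecting the image platform.

```python
# Assumed mapping from machine names to Docker platform strings.
DOCKER_PLATFORMS = {
    "x86_64": "linux/amd64",
    "ppc64le": "linux/ppc64le",
}


def needs_emulation(host_arch, runner_arch="x86_64"):
    """True when HOST_ARCH is set and differs from the runner architecture."""
    return bool(host_arch) and host_arch != runner_arch


def docker_run_command(image, host_arch, runner_arch="x86_64"):
    """Build the docker command line, adding --platform only when emulating."""
    command = ["docker", "run", "--rm"]
    if needs_emulation(host_arch, runner_arch):
        command += ["--platform", DOCKER_PLATFORMS[host_arch]]
    command.append(image)
    return command


if __name__ == "__main__":
    # 'pangea3-image' is a placeholder, not the real image name.
    print(" ".join(docker_run_command("pangea3-image", "ppc64le")))
    print(" ".join(docker_run_command("ubuntu-image", "")))
```

Jobs that leave HOST_ARCH empty keep their current behaviour, which is why a single matrix variable is enough to switch the emulation layer on for the Pangea 3 job only.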
Remark
Unit tests are not run because, due to the emulation layer, the GPUs cannot be used inside docker (the x86_64 drivers coming from the host are not usable inside the ppc64le image). Using them is theoretically possible but not so easy: if I am not wrong, at least one GPU has to be dedicated to the ppc64le image, the suitable drivers have to be installed and the GPU restarted with this driver.
Tasks
ci: run CUDA builds