Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Action with Pangea-3 installation reproduction and ppc64le emulation #3159

Closed
wants to merge 52 commits into from

Conversation

Algiane
Copy link
Contributor

@Algiane Algiane commented Jun 4, 2024

New job that:

  • emulates a ppc64 architecture (using the docker/setup-qemu-action that relies on the use of qemu through the qemu-user-static image);

  • deploys a AlmaLinux-8 image with prebuilt TPLs and Geos' dependencies installed with respect to the required pangea3 modules:

    • CMake-3.26
    • gcc-9.4.0
    • ompi-4.1.2
    • cuda-11.5.0
    • openblas-0.3.18
    • lsf-10.1
  • adds a HOST_ARCH matrix variable to the job matrix to trigger the installation and the use of the emulation layer;

  • builds Geos and the unit tests on streak-2 self-hosted runner

  • associated to the TPLs PR 257.

Remark
Unit tests are not run because, due to the emulation layer, the GPUs cannot be used inside docker (the x86_64 drivers coming from the host are not usable inside the ppc64le image). Using them is theoritically possible but not so easy: if I am not wrong, at least one GPU has to be dedicated to the ppc64le image, the suitable drivers have to be installed and the GPU restarted with this driver.

Tasks

@Algiane Algiane added type: feature New feature or request ci: run CUDA builds Allows to triggers (costly) CUDA jobs flag: requires updated TPL(s) Needs a specific TPL PR type: CI Concerns github workflows or generic CI DO NOT MERGE ! flag: no rebaseline Does not require rebaseline labels Jun 4, 2024
@Algiane Algiane self-assigned this Jun 4, 2024
@Algiane Algiane marked this pull request as draft June 4, 2024 09:44
@Algiane Algiane marked this pull request as ready for review June 4, 2024 09:48
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch 4 times, most recently from d4bb015 to 316161f Compare June 6, 2024 08:19
Copy link

codecov bot commented Jun 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.59%. Comparing base (fda7ed2) to head (cfab88c).
Report is 29 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3159   +/-   ##
========================================
  Coverage    56.59%   56.59%           
========================================
  Files         1064     1064           
  Lines        89752    89752           
========================================
+ Hits         50791    50793    +2     
+ Misses       38961    38959    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch from 316161f to 8dc2beb Compare June 6, 2024 12:30
@Algiane Algiane force-pushed the feature/algiane/pangea3-action branch 10 times, most recently from 32a9af3 to bdbf521 Compare June 21, 2024 13:09
@Algiane Algiane added the ci: run integrated tests Allows to run the integrated tests in GEOS CI label Aug 14, 2024
@Algiane
Copy link
Contributor Author

Algiane commented Aug 19, 2024

Few notes:

  • for now it is not possible to use another runner than streak2-32cores because the compilation inside the emulated image is too slow;
  • the PR ci: Action with Pangea-3 installation reproduction and ppc64le emulation #3159 has been (partially) merged inside the current PR to solve a deadlock when using ninja to build the project (it allows the use of unix makefiles instead of ninja). It also reproduce exactly the environment used on pangea (we don't use ninja);
  • we build static libraries because the shared library compilation has been broken for a while on powerpc (link edition of libgeos_core fail due to R_PPC_REL24 relocation error;

@Algiane Algiane requested review from rrsettgast and CusiniM August 19, 2024 16:06
@CusiniM
Copy link
Collaborator

CusiniM commented Aug 26, 2024

I am moving this out of the merge queue coz it has conflicts and still needs code owner approval. Hopefully, after merging develop, you will be able to build this with shared libs and the compilation time will go down significantly coz having a 2.5h job taking the 32 cores runnes is not really sustainable IMO.

@Algiane
Copy link
Contributor Author

Algiane commented Aug 27, 2024

Hi @CusiniM ,

@CusiniM: The shared library compilation is broken on powerpc architecture since... a while... (due to symbols overflows when linking the library (it was already broken before the geosx->geos PR of the beginning of the month)).

@rrsettgast : Once again, the last modifications of the develop branch has broken the compilation for the Pangea3 users... even if it is easy to fix, it is really annoying.....

I understand that the PR asks time but it is just not possible for us to contribute to Geos and to focus on our researches:

  • the develop branch compilation has been broken 3 times in one month on P3
  • when it is broken, the erroneous commits are not reverted

My personal opinion is that it worth to wait for few hours to check the ci success and allow other teams to work before merging errors in the develop. For people that are not members of LNLL, ci failure or not, we are waiting for weeks when it is not months before the integration of our work....

@Algiane
Copy link
Contributor Author

Algiane commented Aug 27, 2024

Erratum (I haven't looked at the new failure before writting):

  • the compilation of the geos shared lib itself now works on P3 (commit d8c31662): splitting the compilation into smaller parts should have solved to symbol overflow we were having;
  • the newly introduced error is different and more likely due to an issue when installing the built shared libraries or in the LD_LIBRARY_PATH setting.

@rrsettgast
Copy link
Member

Hello @Algiane,
What is the the latest build error on P3?

I don't think @CusiniM is saying that we shouldn't have this PR merged, or that we don't want this PR merged. We are all agreed that the goal to have CI coverage for P3 is a worthy goal. We are trying to work around the bottleneck this PR will create in our CI workflow. While enabling of shared libraries for GEOS in #3282 needed to be done eventually, it was done now specifically to help you get around the linking problems you were seeing so that we could merge this PR.

@Algiane
Copy link
Contributor Author

Algiane commented Aug 29, 2024

The decision of integreating or not this PR is not my concern: it is between code owners and TotalEnergy managers.

Let me just gather here all the relevant informations (we had a lot of private exchanges).

Aim of the PR

The current ci process doesn't test the Pangea3 environment (ppc64le architecture, redhat-8 operating system, CPU and GPU compilers versions, pangea3 host-config file) :

Today it is the main bottleneck of the Inria-TotalEnergy team.

Solutions that have been investigated

The main reproduction issue is the ppc64le architecture so I will focus on this specific issue.

Acces to a power-pc node

  • at TotalE: not possible: self-hosted runners are not authorized to be triggered on the infra due to cybersecurity concerns, so it cannot be used with GEOS repo;
  • at Inria / on national clusters: not possible: the ci request is too great for the resources at our disposal;
  • on the cloud: I did a very quick search and couldn't find native powerpc hours but it could worth to investigate this a little more if a recurrent bill for cloud hours is possible;

Cross compilation

Dead-end because we will not be able to run the binary we will produce as the emulation is not easily compatible with the GPU use inside a docker image (it will also be very hard, if even possible, to have the entire project built and a hell to maintain).

Emulation of ppc64le architecture

It is the only solution that has been proven to be feasible and maintainable.

Main features are described in the inital PR descritption (#3159 (comment)). AlmaLinux-8 is a 1:1 binary copy of RedHat and since it was ready, the PR highlighted all the regressions encountered on pangea3.

The docker image built by the TPL ci can directly be used to work in the P3 environment, either on a ppc arch, or on another arch with emulation.

Main drawback is the compilation slowdown: it has been evaluated and commented in February here: GEOS-DEV/thirdPartyLibs#257 (comment). You can expect a slown by a factor about 14;

Another possible issue can be the CI failure while the actual pangea3 build would succeed due to weird emulation errors (see PR #3276 : the error when linking the project with ninja seems to occurs only with the emulation layer).
To my experience, this has only happened once in the last few months but I don't have the hindsight to know whether it will be a frequent problem.
My proposition was to help to find and solve this kind of bugs and to remove the pangea3 job if emulation is found to be unstable

Slowdown issue and possible future CI bottleneck

For now we know that we have a development bottleneck creating a lost of time for developers. it can be solved but the solution will perhaps (probably) create a ci bottleneck: I consider the human time more valuable than the machine one but it is a personal opinion.

Comparative build time

Job configuration and duration (without cache):

Job config Ubuntu CUDA clang ci job Pangea3 ci job manual build on pangea 3 cluster
Runner streak-1 streak2-32core
ncores used 16 32 32
emulation no yes no
Job length (no cache) Ubuntu CUDA clang ci job) Pangea3 ci job manual build on pangea 3 cluster
geosx build 1h10 1h43 32m49
Unit tests build 13m 32m29 8m35
Total build (+install) time 1h26 2h20
geosx build with wave solver only 55m 14m9

Possible solutions for the dev bottlneck and develop unstabilities

  • much more rigorous PR reviews. Some errors that have been merged recently shouldn't have been validated by a review. Very probably, as these were very large PRs, not all the files were examined.
  • PRs often contains unrelated modifications. If it can be admitted on very small PRs, it shouldn't on PR that modifies 50 or more files.

Possible solutions for the ci bottleneck

  • take advantage from the code modularity and build only the geos binary and the WaveSolver solver;
  • do not run the integrated tests job (that run on streak2 too) when not needed;
  • setting job depencies in such a way that the Pange3 job will be triggered only when all other jobs succeed;
  • adding a specific flag to manually trigger this job when everything else is OK;
  • in both cases, the all_job_succeed job has to fail without the P3 job;
  • the purchase of a machine dedicated to this job and hosted and managed at inria is under discussion between managers.

Status of the PR

It is ready to be integrated since June (with no conflicts with the develop branch): I have worked to maintain, integrate the new developments (TPLs + Geos), to fix the introduced bugs and to resolve the introduced conflicts (I have passed more time to do that than develop the PR itself).

Concretely, the PR holds in ~10 lines pretty easy to understand. Everything is documented in the PR description.

The job I have been asked for is done so I'll leave you in charge to solve conflicts until you make up your mind and decide either to reject or to integrate this work.

@untereiner
Copy link
Contributor

untereiner commented Aug 30, 2024

Access to a power-pc node

GCP seems listening to the market: https://cloud.google.com/blog/products/compute/ibm-power-systems-now-available-on-google-cloud
I don't think you may have access to GPU nodes but it may worth asking for pricing

@Algiane
Copy link
Contributor Author

Algiane commented Sep 12, 2024

Post #3159 (comment) edited on September, the 12th, to add timer infos and new ideas in the "solutions" section.

@Algiane Algiane marked this pull request as draft September 12, 2024 08:08
@untereiner
Copy link
Contributor

untereiner commented Sep 12, 2024

@Algiane @rrsettgast I think there is good point here. It's not the first time that the computational expenses are highlighted.
Maybe go a bit further with sccache and reorganize the workflows as proposed here #2579
Here #3266 is a first step of reorganization

rrsettgast pushed a commit that referenced this pull request Sep 13, 2024
…ulation (#3340)

* Add pangea-3 job and emulation step.

* Replace relative path in cmake for P3-wave-solver host-config file.

* Build wave solver only in Pangea 3 job.

* Build Geos executable only in Pangea 3 job.

---------

Co-authored-by: Gaetan <159525405+Bubusch@users.noreply.github.com>
rrsettgast pushed a commit that referenced this pull request Sep 17, 2024
…ulation (#3340)

* Add pangea-3 job and emulation step.

* Replace relative path in cmake for P3-wave-solver host-config file.

* Build wave solver only in Pangea 3 job.

* Build Geos executable only in Pangea 3 job.

---------

Co-authored-by: Gaetan <159525405+Bubusch@users.noreply.github.com>
rrsettgast pushed a commit that referenced this pull request Sep 17, 2024
…ulation (#3340)

* Add pangea-3 job and emulation step.

* Replace relative path in cmake for P3-wave-solver host-config file.

* Build wave solver only in Pangea 3 job.

* Build Geos executable only in Pangea 3 job.

---------

Co-authored-by: Gaetan <159525405+Bubusch@users.noreply.github.com>
@rrsettgast
Copy link
Member

@Algiane @sframba
Now that we have merged #3340 , does this PR contain anything that you think needs to be merged?

@Algiane
Copy link
Contributor Author

Algiane commented Sep 23, 2024

No, sorry for the oversight

@Algiane Algiane closed this Sep 23, 2024
@Algiane Algiane deleted the feature/algiane/pangea3-action branch September 23, 2024 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI DO NOT MERGE ! flag: no rebaseline Does not require rebaseline flag: requires updated TPL(s) Needs a specific TPL PR type: CI Concerns github workflows or generic CI type: feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[EPIC] Update TTE builds
9 participants