This repository has been archived by the owner on Jun 13, 2023. It is now read-only.

Version upgrades #38

Merged
merged 7 commits into from
Apr 24, 2023

hellkite500
Contributor

@hellkite500 hellkite500 commented Apr 16, 2023

This image also builds and runs successfully on arm64, so this should allow cross compiling of x86 and arm images.

These commits do a few things worth noting.

  1. Upgrade all rocky image layers to version 9.1
  2. Update the t-route build and install to use the current master branch
  3. Patch ngen and the submodules it uses to prevent an infinite loop bug in certain I/O code.
  4. Refactor the image build steps to build and run ALL tests for both serial and parallel builds.

Note that in step 4, the tests built and run will automatically be adjusted based on the build flags/args, and if any test fails, the build will error/fail.

Closes #24 from a technical perspective: ngen parallel should work, but the realization in that issue may still need to be debugged with correct library paths.
Closes #25

Can also close #20

@hellkite500
Contributor Author

hellkite500 commented Apr 16, 2023

Why does the docker build fail on the PR runner 😕

HDF5_DIR=/usr pip3 install -v --install-option="\'--hdf5 \${HDF5_DIR}\'" --install-option="\'--jobs=$(nproc) \'"  --no-build-isolation tables

with a pip error no such option: --install-option

but doesn't fail locally and it looks like it worked fine in PR 37? 🤔

@hellkite500
Contributor Author

After removing build caches and trying the build again, I was able to reproduce this locally, so something in pip/tables changed in the last few days and turned what used to be a working option into a non-working one. I'll work on resolving this.

@hellkite500
Contributor Author

Failing now with exit code 143, which comes from the host, not the build. The --install-option flags have been removed and the tables portion is completing. This error came up towards the end of the ngen build steps.

Exit Code 143 means that the container received a SIGTERM signal from the operating system, which asks the container to gracefully terminate, and the container succeeded in gracefully terminating (otherwise you will see Exit Code 137).
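
For reference, exit statuses above 128 conventionally encode the terminating signal as 128 + signal number, so 143 is SIGTERM (15) and 137 is SIGKILL (9). A minimal shell sketch of that decoding follows; the function name `describe_exit_code` is hypothetical, and only the two signals mentioned above are named explicitly.

```shell
# Describe a process/container exit status.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
describe_exit_code() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        signum=$((code - 128))
        # Only the two signals discussed in this thread are named here.
        case "$signum" in
            9)  name="SIGKILL" ;;
            15) name="SIGTERM" ;;
            *)  name="signal ${signum}" ;;
        esac
        echo "terminated by ${name}"
    else
        echo "exited with error status ${code}"
    fi
}

describe_exit_code 143
describe_exit_code 137
```

On a CI runner, a SIGTERM or SIGKILL that the build did not send itself usually points at the host (timeouts, memory pressure) rather than at a failing build step.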

@jameshalgren
Collaborator

jameshalgren commented Apr 17, 2023

Exit Code 143 in this case is probably a resource issue: actions/runner-images#6680
Tested on the default runner on the enterprise repository and saw the same failure:
https://github.com/CIROH-UA/CloudInfra_dev/actions/runs/4725665169

@jameshalgren
Collaborator

jameshalgren commented Apr 18, 2023

Merging this PR is high priority. It looks like the GitHub runners are all failing. My suspicion (I'm inexperienced) is that the 2-core, 8 GB runner is getting overwhelmed. We will not have larger runners available quickly, though we should probably work on that.

If we can confirm that this builds on several external platforms, does anyone disagree with merging the PR over the failed tests?

I'd propose that we run the test sequence below, or confirm already-completed tests, on the following environments, then merge the PR:

@hellkite500 @arpita0911patel -- please briefly weigh in here.

test sequence:

git clone -b version-upgrades https://github.com/hellkite500/CloudInfra.git test_docker_build
cd test_docker_build/docker
docker buildx build -f Dockerfile -t local/dmod_ngen_test_DOESTHISBUILDWORK .

@hellkite500
Contributor Author

I'm ok with merging for now. I built on apple arm locally and I'm working on testing an x86 cross build. Some additional validation as mentioned would be good.

@jameshalgren
Collaborator

Update: On macOS, I tried the build with and without specifying the target platform.
When building with the platform set to amd64, the error occurred in the blosc build:

docker buildx build --platform=linux/amd64 -f Dockerfile -t local/dmod_ngen_test_003_amd64 .

image

image

@jameshalgren
Collaborator

When running on macOS without specifying the platform, the error was somewhere in the ngen build.

docker buildx build -f Dockerfile -t local/dmod_ngen_test_003_arm64 .

image

image

@benlee0423
Collaborator

/test_docker_build/docker$ docker buildx build -f Dockerfile -t local/dmod_ngen_test_DOESTHISBUILDWORK .
[+] Building 0.0s (0/0)
ERROR: invalid tag "local/dmod_ngen_test_DOESTHISBUILDWORK": repository name must be lowercase

Use the command below instead.

docker buildx build -f Dockerfile -t local/dmod_ngen_test_doesthisbuildwork .
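
If the tag is generated from a variable, it can be normalized before it reaches Docker. A small sketch, assuming it is acceptable to lowercase the whole reference (Docker only requires the repository name to be lowercase); `to_docker_tag` is a hypothetical helper name:

```shell
# Lowercase an image reference so Docker accepts the repository name.
to_docker_tag() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

tag="$(to_docker_tag "local/dmod_ngen_test_DOESTHISBUILDWORK")"
echo "$tag"

# The normalized tag can then be passed to the build, for example:
# docker buildx build -f Dockerfile -t "$tag" .
```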

@benlee0423
Collaborator

benlee0423 commented Apr 19, 2023

I was able to install docker-ce on Ubuntu.

docker buildx build -f Dockerfile -t local/dmod_ngen_test_ubuntu_arm64 . &> build.log

Getting an error, obviously due to disk space. What is the minimum disk space needed to run this build?

 > [rocky_build_ngen 1/8] COPY --chown= --from=rocky_init_repo /ngen/ngen /ngen/ngen:
------
Dockerfile:462
--------------------
 460 |     ARG BUILD_SLOTH
 461 |     
 462 | >>> COPY --chown=${USER} --from=rocky_init_repo ${WORKDIR}/ngen ${WORKDIR}/ngen
 463 |     COPY --chown=${USER} --from=rocky_build_troute ${WORKDIR}/t-route/wheels /tmp/t-route-wheels
 464 |     COPY --chown=${USER} --from=rocky_build_troute ${WORKDIR}/t-route/requirements.txt /tmp/t-route-requirements.txt
--------------------
ERROR: failed to solve: failed to copy files: copy file range failed: no space left on device
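
A quick pre-flight check of free space can catch this before the build starts. The sketch below is only illustrative: the thread does not establish a minimum, so `REQUIRED_GIB` is an arbitrary placeholder, and `CHECK_PATH` should point at wherever Docker actually stores its data on the host (both names are hypothetical).

```shell
# Print the free space, in whole GiB, on the filesystem holding a path.
# Uses POSIX `df -Pk` output, where column 4 is the available 1 KiB blocks.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Placeholder values, NOT measured requirements for this image.
REQUIRED_GIB="${REQUIRED_GIB:-20}"
CHECK_PATH="${CHECK_PATH:-/}"

available="$(free_gib "$CHECK_PATH")"
if [ "$available" -lt "$REQUIRED_GIB" ]; then
    echo "warning: only ${available} GiB free on ${CHECK_PATH}, wanted ${REQUIRED_GIB} GiB" >&2
else
    echo "ok: ${available} GiB free on ${CHECK_PATH}"
fi
```

Build cache from earlier failed attempts also counts against this space, so clearing it before retrying may help.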

@benlee0423
Collaborator

I built 4 times, and each time the build process was killed while running the step below.
I have also attached the entire build log and the system log.
EC2 instance spec:
OS: Ubuntu 22.04.2 LTS
RAM: 8GB
SSD: 60GB

#41 183.2 In file included from /ngen/ngen/include/forcing/GenericDataProvider.hpp:5,
#41 183.2                  from /ngen/ngen/include/core/catchment/HY_CatchmentRealization.hpp:7,
#41 183.2                  from /ngen/ngen/include/core/catchment/HY_CatchmentArea.hpp:4,
#41 183.2                  from /ngen/ngen/include/realizations/catchment/Catchment_Formulation.hpp:8,
#41 183.2                  from /ngen/ngen/include/realizations/catchment/Bmi_Formulation.hpp:9,
#41 183.2                  from /ngen/ngen/test/realizations/catchments/Bmi_Testing_Util.hpp:55,
#41 183.2                  from /ngen/ngen/test/realizations/catchments/Bmi_Cpp_Multi_Array_Test.cpp:6:
#41 183.2 /ngen/ngen/include/forcing/DataProviderSelectors.hpp:138:5: warning: converting ‘CSVDataSelector’ to a reference to a base class ‘CatchmentAggrDataSelector’ will never use a type conversion operator [-Wclass-conversion]
#41 183.2   138 |     operator const CatchmentAggrDataSelector&() const { return *this; }
#41 183.2       |     ^~~~~~~~
#41 183.2 /ngen/ngen/include/forcing/DataProviderSelectors.hpp:153:5: warning: converting ‘BMIDataSelector’ to a reference to a base class ‘CatchmentAggrDataSelector’ will never use a type conversion operator [-Wclass-conversion]
#41 183.2   153 |     operator const CatchmentAggrDataSelector&() const { return *this; }
#41 183.2       |     ^~~~~~~~
#41 183.2 /ngen/ngen/include/forcing/DataProviderSelectors.hpp:178:5: warning: converting ‘NetCDFDataSelector’ to a reference to a base class ‘CatchmentAggrDataSelector’ will never use a type conversion operator [-Wclass-conversion]
#41 183.2   178 |     operator const CatchmentAggrDataSelector&() const { return *this; }
#41 183.2       |     ^~~~~~~~

build.log

i-09337d5c8429f1036.log

@hellkite500
Contributor Author

This is getting strange. @benlee0423 I don't actually see an error in the build log; it just cuts off, but at around the same place where @jameshalgren's build was getting a segfault from the compiler. I have been able to reproduce similar behavior with an amd64 build on my Mac, though it is not always at the same place or even building the same program. On my last run, gfortran segfaulted while building the Fortran unit test model after the ngen builds, but I saw it fail in the ngen build a couple of times as well.

I'm thinking this may be some environment setting that is hindering the compiler -- I'll keep investigating and see what I can find.

@hellkite500
Contributor Author

@jameshalgren on a fresh build (I had to factory reset rancher, again...) I was able to hit what looks like the same error you did, but I noticed a segmentation fault a little further up the error stack. Can you look and see if you have that as well?

#0 346.2       
#0 346.2         failed with:
#0 346.2       
#0 346.2          Segmentation fault
#0 346.2       
#0 346.2       
#0 346.2       CMake Error at /tmp/pip-build-env-f6lupk8c/overlay/lib64/python3.9/site-packages/cmake/data/share/cmake-3.26/Modules/Internal/CheckSourceCompiles.cmake:101 (try_compile):
#0 346.2         Failed to generate test project build system.
#0 346.2       Call Stack (most recent call first):
#0 346.2         /tmp/pip-build-env-f6lupk8c/overlay/lib64/python3.9/site-packages/cmake/data/share/cmake-3.26/Modules/CheckCSourceCompiles.cmake:76 (cmake_check_source_compiles)
#0 346.2         blosc2/c-blosc2/internal-complibs/zlib-ng-2.0.7/CMakeLists.txt:419 (check_c_source_compiles)
#0 346.2       
#0 346.2       
#0 346.2       -- Configuring incomplete, errors occurred!
#0 346.2       Traceback (most recent call last):
#0 346.2         File "/tmp/pip-build-env-f6lupk8c/overlay/lib/python3.9/site-packages/skbuild/setuptools_wrap.py", line 666, in setup
#0 346.2           env = cmkr.configure(
#0 346.2         File "/tmp/pip-build-env-f6lupk8c/overlay/lib/python3.9/site-packages/skbuild/cmaker.py", line 356, in configure
#0 346.2           raise SKBuildError(msg)
#0 346.2       
#0 346.2       An error occurred while configuring with CMake.

@jameshalgren
Collaborator

jameshalgren commented Apr 20, 2023

@hellkite500
I see the segmentation fault with a similar but not identical build step. I only see this error when building for amd64, e.g., docker buildx build --platform=linux/amd64 -f Dockerfile -t local/dmod_ngen_test_003_amd64 .

#0 338.5       -- Performing Test HAVE_SSE2_INTRIN
#0 338.5       -- Performing Test HAVE_SSE2_INTRIN - Success
#0 338.5       -- Performing Test HAVE_SSSE3_INTRIN
#0 338.5       CMake Error:
#0 338.5         Running
#0 338.5
#0 338.5          '/tmp/pip-build-env-54n_6kvc/overlay/lib64/python3.9/site-packages/ninja/data/bin/ninja' '-C' '/tmp/ngen-deps/netcdf-cxx4/build/python-blosc2/_skbuild/linux-x86_64-3.9/cmake-build/CMakeFiles/CMakeScratch/TryCompile-TKS0Tt' '-t' 'recompact'
#0 338.5
#0 338.5         failed with:
#0 338.5
#0 338.5          Segmentation fault
#0 338.5
#0 338.5
#0 338.5       CMake Error at /tmp/pip-build-env-54n_6kvc/overlay/lib64/python3.9/site-packages/cmake/data/share/cmake-3.26/Modules/Internal/CheckSourceCompiles.cmake:101 (try_compile):
#0 338.5         Failed to generate test project build system.
#0 338.5       Call Stack (most recent call first):
#0 338.5         /tmp/pip-build-env-54n_6kvc/overlay/lib64/python3.9/site-packages/cmake/data/share/cmake-3.26/Modules/CheckCSourceCompiles.cmake:76 (cmake_check_source_compiles)
#0 338.5         blosc2/c-blosc2/internal-complibs/zlib-ng-2.0.7/CMakeLists.txt:506 (check_c_source_compiles)
#0 338.5         blosc2/c-blosc2/internal-complibs/zlib-ng-2.0.7/CMakeLists.txt:545 (check_c_source_compile_or_run)
#0 338.5
#0 338.5
#0 338.5       -- Configuring incomplete, errors occurred!
#0 338.5       Traceback (most recent call last):
#0 338.5         File "/tmp/pip-build-env-54n_6kvc/overlay/lib/python3.9/site-packages/skbuild/setuptools_wrap.py", line 666, in setup
#0 338.5           env = cmkr.configure(
#0 338.5         File "/tmp/pip-build-env-54n_6kvc/overlay/lib/python3.9/site-packages/skbuild/cmaker.py", line 358, in configure
#0 338.5           raise SKBuildError(msg)
#0 338.5

@hellkite500
Contributor Author

We may not be able to cross compile from M1 macs, see docker/for-mac#6204 which references an upstream qemu bug/issue that I bet we are running into.

Other references to the same issue:
https://erlangforums.com/t/segfault-for-docker-image-on-non-native-platform/1871/4
https://erlangforums.com/t/otp-25-0-rc3-release-candidate-3-is-released/1317/25

I imagine we are running into similar issues. We will need to test on a native x86 build.
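
One way to tell up front whether a build will run natively or under emulation is to compare the host architecture with the requested `--platform`. A rough sketch, assuming `uname -m` reports `x86_64`, `aarch64`, or `arm64` on the hosts in question; both helper names are hypothetical:

```shell
# Map a `uname -m` machine name to the matching Docker platform string.
machine_to_platform() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        aarch64|arm64) echo "linux/arm64" ;;
        *)             echo "unknown" ;;
    esac
}

# Succeed only when the requested target platform matches the host.
is_native_build() {
    [ "$(machine_to_platform "$(uname -m)")" = "$1" ]
}

target="linux/amd64"
if is_native_build "$target"; then
    echo "${target}: native build on this host"
else
    echo "${target}: not native on this host, likely emulated"
fi
```

A segfault that only appears in the non-native case is a hint to retry on matching hardware before debugging the build itself.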

@hellkite500
Contributor Author

@jameshalgren on your native mac build, how much RAM do you have allocated to your rancher vm?

@jameshalgren
Collaborator

jameshalgren commented Apr 20, 2023

on your native mac build, how much RAM do you have allocated to your rancher vm?

16 GB. Retrying with 48 GB.
@hellkite500
With additional RAM, this apparently finished on the native mac build. (Meaning, it finished, but I have tried a couple of simulations; they are not working but it is most likely a configuration issue and I'm still sorting that out.)

@hellkite500
Contributor Author

hellkite500 commented Apr 20, 2023

As for the x86 issues, I was able to get an environment to test this build in, and I think I can generally reproduce the failure seen by @benlee0423. The problem there isn't an explicit failure, but it looks to me like a potential fork bomb is eating up all the resources and just hanging/crashing the machine. Might be related to this issue documented on ngen. I'll dig into that a little more and see.

Here is what I see when the build hangs and crashes the machine. But this is definitely the compiler spawning, not the testing process, so it may not be related at all to the mentioned issue...
image

@hellkite500
Contributor Author

So I think I sorted this out....we asked make to do it, and make did it...from the make documentation (which is the build generator used)

If the ‘-j’ option is followed by an integer, this is the number of recipes to execute at once; this is called the number of job slots. If there is nothing looking like an integer after the ‘-j’ option, there is no limit on the number of job slots. 

and in the ngen build line, we use -j ${BUILD_PARALLEL_JOBS}. But this ARG is never actually set, so make is spawning up to an unlimited number of parallel build processes, depending on how many concurrent jobs a given dependency in the build can support. This causes the system to eat up all its free memory and all the swap, and essentially hang....

I'll push a fix momentarily that ensures BUILD_PARALLEL_JOBS <= nproc on the system, then I think this will succeed.
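
The idea can be sketched in shell as follows. This is not the actual change pushed to the Dockerfile, just one way to guarantee that `-j` always receives an integer no larger than the CPU count; `resolve_jobs` is a hypothetical helper name.

```shell
# Resolve a safe job count for `make -j`.
# Falls back to the CPU count when the requested value is unset, empty,
# not a positive integer, or larger than the CPU count.
resolve_jobs() {
    requested="$1"
    max="$2"
    case "$requested" in
        ''|*[!0-9]*) echo "$max"; return ;;
    esac
    if [ "$requested" -lt 1 ] || [ "$requested" -gt "$max" ]; then
        echo "$max"
    else
        echo "$requested"
    fi
}

# nproc is part of GNU coreutils; fall back to 1 if it is unavailable.
cpus="$(nproc 2>/dev/null || echo 1)"
jobs="$(resolve_jobs "${BUILD_PARALLEL_JOBS:-}" "$cpus")"

# Print the command instead of running it, since this is only a sketch.
echo "make -j ${jobs}"
```

With an empty `BUILD_PARALLEL_JOBS`, the unguarded command line collapses to a bare `-j`, which is exactly the unlimited case the make documentation describes.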

@hellkite500
Contributor Author

My local arm build and an AWS x86 build are both successful for me now!

@benlee0423
Collaborator

Docker build is successful in AWS arm64.
EC2 instance spec:
OS: Ubuntu 22.04.2 LTS
RAM: 8GB
SSD: 60GB

docker buildx build -f Dockerfile -t local/dmod_ngen_test_ubuntu_arm64 . &> build.log &

docker images
REPOSITORY                          TAG       IMAGE ID       CREATED          SIZE
local/dmod_ngen_test_ubuntu_arm64   latest    c5dac6248e95   32 minutes ago   3.41GB

Nice job @hellkite500 for getting this to work.

Collaborator

@jameshalgren jameshalgren left a comment

Works as advertised.

@jameshalgren jameshalgren merged commit 449a602 into AlabamaWaterInstitute:main Apr 24, 2023

Successfully merging this pull request may close these issues.

Routing needs to be added to the latest docker image.
ngen-parallel mode is not working.