Issue 0042 (Dockerfile and related environments) #49

SouzaEM · 2024-06-07T13:30:19Z

I'm adding a Dockerfile and a README explaining how to build and use the Docker images. To avoid multiple Dockerfiles with the same configurations (which would be harder to maintain), I'm proposing the use of multi-stage builds. The idea is to have three images: one for release, one for running the automated tests, and one for development.

Could you check whether the implementation addresses the use cases of the spyro project? I'm creating this pull request as a draft so we can discuss how to improve these images.
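To make the three-image idea concrete, here is a minimal sketch of how each stage could be built from the single Dockerfile using `docker build --target`. Only the `spyro_release` stage name appears in this PR's diff; `spyro_test` and `spyro_development` are placeholder stage names, `releasetag:1.0` is an invented tag, and `testtag:1.0` and `devtag:1.0` mirror the tags used in the README. The helper only assembles the command; it does not call Docker.

```python
import shlex

DOCKERFILE = "docker/Dockerfile"  # Dockerfile path used in this PR

# Stage name -> image tag. "spyro_release" comes from the Dockerfile diff;
# "spyro_test" and "spyro_development" are assumed names for illustration.
STAGES = {
    "spyro_release": "releasetag:1.0",
    "spyro_test": "testtag:1.0",
    "spyro_development": "devtag:1.0",
}


def build_command(stage: str, context: str = ".") -> list[str]:
    """Return the docker CLI arguments that build one stage of the Dockerfile."""
    if stage not in STAGES:
        raise KeyError(f"unknown stage: {stage}")
    return [
        "docker", "build",
        "--target", stage,      # stop at this stage of the multi-stage build
        "-t", STAGES[stage],    # tag the resulting image
        "-f", DOCKERFILE,
        context,
    ]


if __name__ == "__main__":
    # Print the three commands instead of running them.
    for stage in STAGES:
        print(shlex.join(build_command(stage)))
```

Because all stages share the same base layers, a change to the common dependencies only has to be made once.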

codecov · 2024-06-07T13:42:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.46%. Comparing base (fbd2ba8) to head (f97931b).

❗ Current head f97931b differs from pull request most recent head 5b8d876

Please upload reports for the commit 5b8d876 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #49      +/-   ##
==========================================
- Coverage   82.02%   80.46%   -1.57%     
==========================================
  Files          47       47              
  Lines        3361     3173     -188     
==========================================
- Hits         2757     2553     -204     
- Misses        604      620      +16

Olender · 2024-06-07T14:24:59Z

docker/Dockerfile

+
+# spyro dependencies
+USER root
+# Is there a reason for using development versions?


I am not sure. I haven't tried with other versions. Line 20 comes from the SeismicMesh dependencies and the first iteration of its installation tutorial. I guess that only the dev version of CGAL is needed, and that is only for compatibility purposes in older Ubuntu installations.

Ok. I will remove this comment

Olender · 2024-06-07T14:45:19Z

docker/README.md

+
+Then, start a container and share your local repository:
+````
+docker run -v $PWD/spyro:/home/firedrake/shared/spyro -it devtag:1.0


I think we should change this line to just:
docker run -v $PWD:/home/firedrake/shared/spyro -it devtag:1.0

That way we still have access to all the other folders and files in spyro-1 (such as the tutorials, tests, paper scripts, etc.)
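The suggested change can be sketched as a small helper that assembles the `docker run` arguments with the whole working copy mounted. The function name is hypothetical; the mount point and the `devtag:1.0` tag are the ones from the README. The absolute-path check is there because `-v` treats a source that is not an absolute path as a named volume rather than a host directory.

```python
from pathlib import Path

CONTAINER_MOUNT = "/home/firedrake/shared/spyro"  # mount point from the README


def dev_run_command(repo_root: Path, image: str = "devtag:1.0") -> list[str]:
    """Return docker CLI arguments that mount the whole working copy.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not repo_root.is_absolute():
        # With -v, a non-absolute source is interpreted as a named volume.
        raise ValueError(f"repo_root must be absolute, got {repo_root}")
    return [
        "docker", "run",
        "-v", f"{repo_root}:{CONTAINER_MOUNT}",
        "-it", image,
    ]


if __name__ == "__main__":
    # Equivalent to `docker run -v $PWD:... -it devtag:1.0` from the repo root.
    print(" ".join(dev_run_command(Path.cwd())))
```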

Yes. I will update the README

Olender · 2024-06-07T15:00:25Z

docker/Dockerfile

+
+FROM spyro_base AS spyro_release
+
+RUN . /home/firedrake/firedrake/bin/activate; pip install git+https://github.com/Olender/spyro-1.git


I think it would make sense for the base image to also have access to the spyro repository through a git clone and a pip install. That way a new user has access to the notebook_tutorials and can test whether everything installed correctly. It is also important to have the shots, results, and meshes folders, since those are the default locations for files generated during simulations (they would already be there with a git clone).

During development, I believe users will want to use their local copy of spyro, so adding the main spyro branch to the image may lead to confusion about which version is being used.

For testing, the idea is to make the runner clone the commit under test, build the latest test image (to make sure that the runner is using a compatible image), and run the tests. In this case, I think only the spyro under test is important for the image.

These are the reasons why I didn't install spyro in the base image. The last line of the Dockerfile makes spyro available to the user (from the user's current branch). Do you think the current version is enough for accessing notebooks, shots, results, etc.?

Actually, the test image is always testing the main branch. I have to figure out a way to make it test the current version.

Maybe we could have a base version prior to spyro_base with all dependencies installed but no spyro. Then we git clone and checkout outside of the docker before running the tests?

I believe the checkout has to be made before building the image, because the Docker configuration file is now part of spyro. Although it will change less frequently than other parts of the code, the test routine should run with an image produced from the source code under test. Otherwise, we could run into inconsistencies, such as a developer forgetting to commit changes to the Dockerfile, so that the new code does not work with the previous Docker image.

I think the solution would be something similar to the development version. Maybe we end up with only two images (release and dev/test). That would be even better.
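The ordering argued for in this thread (check out the commit, build the image from that checkout, then run the tests in it) can be sketched as a dry-run plan. This is only an illustration under assumptions: the `spyro_test` stage name and the idea of tagging the image with the commit hash are not part of this PR, and `--rm` replaces the README's `-it` because a CI runner has no interactive terminal. The pytest invocation is the one from the README.

```python
# pytest invocation copied from the README in this PR
TEST_COMMAND = (
    "source /home/firedrake/firedrake/bin/activate; "
    "python3 -m pytest --maxfail=1 ."
)


def plan_test_run(commit: str) -> list[list[str]]:
    """Return the ordered commands a runner could execute for one commit.

    Assumptions: a stage called "spyro_test" exists in docker/Dockerfile,
    and the image is tagged with the commit so image and source always match.
    """
    image = f"spyro_test:{commit}"
    return [
        # 1. Check out the code under test first: the Dockerfile is part of it.
        ["git", "checkout", commit],
        # 2. Build the test image from that exact checkout.
        ["docker", "build", "--target", "spyro_test", "-t", image,
         "-f", "docker/Dockerfile", "."],
        # 3. Run the test suite inside the freshly built image.
        ["docker", "run", "--rm", image, "/bin/bash", "-c", TEST_COMMAND],
    ]


if __name__ == "__main__":
    for step in plan_test_run("5b8d876"):
        print(step)
```

Building after the checkout means an uncommitted or forgotten Dockerfile change shows up as a test failure instead of being masked by an older image.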

Olender · 2024-06-07T15:03:32Z

docker/README.md

+
+Then, the following command may be called for running the tests:
+````
+docker run -it testtag:1.0 /bin/bash -c "source /home/firedrake/firedrake/bin/activate; python3 -m pytest --maxfail=1 ."


The tests are currently failing inside the Docker container. I am going to try to figure out why.

The first test to fail produces the following log (tested in firedrake tags 2024-05, 2024-04, and 2024-03):

shared/spyro/test/test_MMS.py:36: in run_solve
    Wave_obj.forward_solve()
shared/spyro/spyro/solvers/acoustic_wave.py:46: in forward_solve
    self.wave_propagator()
shared/spyro/spyro/io/basicio.py:86: in wrapper
    u, u_r = func(*args, **dict(kwargs, source_num=snum))
shared/spyro/spyro/solvers/acoustic_wave.py:113: in wave_propagator
    usol, usol_recv = time_integrator(self, source_id=source_num)
shared/spyro/spyro/solvers/time_integration.py:10: in time_integrator
    return time_integrator_mms(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration.py:24: in time_integrator_mms
    return central_difference_MMS(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration_central_difference.py:280: in central_difference_MMS
    Wave_object.solver.solve(X, B)
petsc4py/PETSc/Log.pyx:188: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
    ???
petsc4py/PETSc/Log.pyx:189: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
    ???
firedrake/src/firedrake/firedrake/linear_solver.py:159: in solve
    b = self._lifted(b)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <firedrake.linear_solver.LinearSolver object at 0x7f787f451d20>
b = Coefficient(WithGeometry(FunctionSpace(<firedrake.mesh.MeshTopology object at 0x7f7889c7aef0>, FiniteElement('Q', quadrilateral, 4, variant='spectral'), name=None), Mesh(VectorElement(FiniteElement('Q', quadrilateral, 1), dim=2), 1)), 31)

    def _lifted(self, b):
        u, update, blift = self._rhs
        u.dat.zero()
        for bc in self.A.bcs:
            bc.apply(u)
        update()
        # blift contains -A u_bc
>       blift += b
E       TypeError: unsupported operand type(s) for +=: 'Cofunction' and 'Function'

firedrake/src/firedrake/firedrake/linear_solver.py:129: TypeError

This error has been present for the last three months of Firedrake releases, so it's either a long-standing bug or a deprecated feature. Could you check 2023-12? My latest local Firedrake install was 2023-05, but yours should be newer. For now, we need a stable version to use with our tests.

I am currently installing the latest Firedrake locally to test Spyro and check whether it reproduces the same error as the Docker image. I have also taken a quick look at the open bugs in the Firedrake repo to see if any match ours. After that, I plan to figure out exactly what causes the error and submit a minimal reproducing example as a bug report in the repo.

I'm running a test with 2023-08 (the oldest image in DockerHub with the exception of experimental) and this error has not happened yet (the test_MMS.py is still running). As this error happens before the code actually runs, I believe that version 2023-08 will not have this issue.

I am running the tests locally using the latest version. Not only is there an "unsupported operand type(s) for +=: 'Cofunction' and 'Function'" error in the MMS test, but there are also numerical instabilities and more significant errors in the other forward tests that do run. My initial guess is that something broke the mass-lumped triangular elements in Firedrake, and since we are probably the only ones who use them, no one noticed until now. They still build diagonal mass matrices and pass Firedrake's unit tests. I still have more tests to run to diagnose whether this is the issue. If it is, I will try to submit an issue and/or a fix for it on their GitHub (since I already have Firedrake code for some new ML elements that I need to submit and am familiar with the relevant FIAT and TSFC files).

I think the problem is also related to our LinearSolver object. In the compatibility branch, I am going to try updating the RHS everywhere to cofunctions and changing our solver object to LinearVariationalSolver. It is possible that this will make the code compatible with newer versions of Firedrake.
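The Python-level failure in the traceback can be reproduced with a toy model. The two classes below are plain stand-ins and are not Firedrake's `Function` and `Cofunction`; how Firedrake implements the check internally is not verified here. The sketch only shows why an in-place add between two unrelated types raises this `TypeError`, and why passing the right-hand side as the matching type avoids it.

```python
class Function:
    """Toy stand-in for a primal-space object (NOT firedrake.Function)."""

    def __init__(self, values):
        self.values = list(values)


class Cofunction:
    """Toy stand-in for a dual-space object (NOT firedrake.Cofunction)."""

    def __init__(self, values):
        self.values = list(values)

    def __iadd__(self, other):
        # Only accumulate objects that live in the same (dual) space.
        if not isinstance(other, Cofunction):
            return NotImplemented  # Python turns this into a TypeError
        self.values = [a + b for a, b in zip(self.values, other.values)]
        return self


def lifted(blift, b):
    """Mimics the failing line in the traceback: `blift += b`."""
    blift += b
    return blift


if __name__ == "__main__":
    # Works: both operands are of the dual-space type.
    print(lifted(Cofunction([1.0, 2.0]), Cofunction([0.5, 0.5])).values)
    # Fails: the right-hand side is still a primal-space object.
    try:
        lifted(Cofunction([1.0, 2.0]), Function([0.5, 0.5]))
    except TypeError as exc:
        print("TypeError:", exc)
```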

Unfortunately, you do not have access to an older Firedrake version. I had a look at all the available Firedrake versions after 2023-08, and they are all very slow and have larger errors overall.

I think we can install older versions (such as 'https://github.com/firedrakeproject/firedrake/tree/Firedrake_20221208.0'). It may be more complex to install them with Docker, but at least we will have the same performance as before.

I noticed that I skipped testing 2023-09 before. I retested using all available Firedrake Docker versions and found that the 2023-09 version passes all tests and performs better than the 2023-08 version. I suggest we prioritize adjusting the Dockerfile to target this version for now.

If version 2023-09 is working, it would save us a lot of time. I was trying to install version Firedrake_20230316.0, but the process is slow because the install script does not work out of the box with Docker and some errors happen after the PETSc installation (which is slow). Can I use the Docker image of version 2023-09 as our base installation?

That sounds like a good plan. Using the Docker image of version 2023-09 should streamline the process and save us time. Installing older versions and managing dependencies can be really annoying. Let's go with 2023-09 for now. We can revisit and update to the latest Firedrake version when we are compatible.

SouzaEM · 2024-06-07T15:07:33Z

The test_gradient_2d.py test is failing. The first issue is that it is calculating a log of negative error values. After changing the code to calculate the log of the absolute error, the test still fails because the error is not converging as required by the test.
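As an illustration of the first issue, here is a minimal sketch of an observed-convergence-rate calculation that takes the logarithm of the absolute error. The formula actually used in test_gradient_2d.py is not shown in this thread, so the function name, the rate definition, and the sample numbers are all assumptions.

```python
import math


def convergence_rates(steps, errors):
    """Observed order between consecutive (step, error) pairs.

    Uses abs(error), so signed errors do not break the logarithm.
    Assumed definition: rate = log(|e0| / |e1|) / log(h0 / h1).
    """
    if len(steps) != len(errors):
        raise ValueError("steps and errors must have the same length")
    rates = []
    for i in range(len(steps) - 1):
        h0, h1 = steps[i], steps[i + 1]
        e0, e1 = abs(errors[i]), abs(errors[i + 1])
        rates.append(math.log(e0 / e1) / math.log(h0 / h1))
    return rates


if __name__ == "__main__":
    # Made-up signed errors that shrink by 4x each time the step is halved.
    print(convergence_rates([0.1, 0.05, 0.025], [-1e-2, 2.5e-3, -6.25e-4]))
```

Taking the absolute value removes the domain problem, but it cannot fix a rate that is genuinely below what the test requires, which matches the second failure described above.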

Olender · 2024-06-07T15:14:42Z

Every test fails for me

It gives the error:
TypeError: unsupported operand type(s) for +=: 'Cofunction' and 'Function'
for basically every test. My initial guess would be that something is broken in the Firedrake version used, but if you are not seeing the same error with the same container, then it must be something else.

Olender · 2024-06-07T15:35:11Z

I am unsure how Docker works, but is it possible that my FROM firedrakeproject/firedrake:2024-05 is different from yours? If you ran it at the middle of the month, could you have a different version that got cached, and every time a new container is built from it, it calls the cache? I am going to try out 2024-04. Firedrake sometimes breaks.

SouzaEM · 2024-06-07T16:17:30Z

The failure in test_gradient_2d.py happened in the runner (without the Docker image). I guess the run was triggered by the pull request, but this pull request did not change any Python files.

I checked on my computer and all the tests fail in the Docker image. I believe the problems are related to the Firedrake version. The Docker tag 2024-05 should be the same for everybody. I can test different versions of the Firedrake images. You don't have to worry.

Olender · 2024-06-20T13:01:09Z

I think I have almost figured out the changes in our code needed to get the triangular elements working in the newest Firedrake versions. Quadrilaterals are going to be a challenge.

SouzaEM · 2024-06-20T13:29:22Z

I think I have almost figured out the changes in our code needed to get the triangular elements working in the newest Firedrake versions. Quadrilaterals are going to be a challenge.

Shall we use an older version until we fix the quadrilateral elements?

Olender · 2024-06-21T15:00:39Z

We also currently have an institutional docker hub account (https://hub.docker.com/r/ndfti/spyro).

SouzaEM · 2024-06-24T13:38:27Z

I changed the Firedrake base image to 2023-09 and the SeismicMesh reference to GitHub (instead of PyPI). Now test_gradient_2d.py is running, but I'm having issues with test/test_cpw_calc.py and test/test_cpw_calc_analytical_gen.py. I'm trying to figure out why these tests are failing.

Olender · 2024-06-24T13:55:59Z

Could it be running out of memory in test_cpw_calc_analytical.py? Could you run with the -s flag in pytest and send me the outputs? I am going to run them again here and try to reproduce the error.

SouzaEM · 2024-06-24T14:02:25Z

The test_cpw_calc.py is failing when mpiexec is used:

python3 -m pytest test/test_cpw_calc.py
1 passed, 251 warnings in 384.50s (0:06:24)

mpiexec -n 4 python3 -m pytest test/test_cpw_calc.py
FAILED test/test_cpw_calc.py::test_cpw_calc - ZeroDivisionError: float divisi...
1 failed, 19 warnings in 5.15s

Also, the test holds the terminal until I press Ctrl+C.

Could you run with the -s flag in pytest and send me the outputs? I am going to run them again here and try to reproduce the error.

I will work on that.

Olender · 2024-06-24T14:46:30Z

So there are some issues with our SeismicMesh API related to parallelism. Specifically, for 3D problems, we often need to use different numbers of cores for meshing and for FWI. Because of this, I usually call SeismicMesh directly in the bash script on our cluster for 3D cases.

The cpw tests generate multiple different 2D meshes to determine the cells per wavelength parameter for a given error. SeismicMesh handles parallel meshing well, but it usually requires a different communicator to be passed to it (it should be comm.comm in Spyro). For 2D meshes, I typically leave it running only on rank 0. I currently run the cpw_calc in serial mode, but I can look into debugging it for parallel execution as a future issue.

It used to work in parallel, but because of our current coverage issue (which ends up requiring that every parallel test has to be run twice, once for testing and once for coverage), only tests in the test_parallel and test_3d folders are run in parallel to reduce testing runtime. I suggest opening an issue related to debugging and adding an appropriate smaller parallel test, and leaving it as it is for now (and running it in serial, like the .yaml file).

For parallel debugging I recommend using VS Code and following Firedrake's suggestions: https://github.com/firedrakeproject/firedrake/wiki/Parallel-MPI-Debugging-with-debugpy

SouzaEM · 2024-06-24T16:54:12Z

The issue was running the tests in test/test_cpw_calc.py in parallel. The tests run successfully in serial. However, there is an issue in the estimate_timestep method of spyro/utils/estimate_timestep.py, which falls back to a very large value for dt:

if np.sqrt(max_eigval) > 0.0:
    max_dt = float(2 / np.sqrt(max_eigval))
else:
    max_dt = 100000000
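A standalone sketch of this branch makes the behaviour easier to see. The function name and the `strict` flag are hypothetical and not part of spyro; raising an error is just one possible alternative to the silent fallback, not an agreed fix.

```python
import math

FALLBACK_DT = 100000000  # fallback value used in the snippet quoted above


def max_stable_dt(max_eigval: float, strict: bool = False) -> float:
    """Return 2 / sqrt(max_eigval), mirroring the quoted snippet.

    With strict=False a non-positive eigenvalue silently yields the huge
    fallback, as in the original. With strict=True (hypothetical option)
    it raises instead, so the caller notices that the estimate failed.
    """
    if max_eigval > 0.0:
        return 2.0 / math.sqrt(max_eigval)
    if strict:
        raise ValueError(f"cannot estimate dt from max_eigval={max_eigval}")
    return float(FALLBACK_DT)


if __name__ == "__main__":
    print(max_stable_dt(4.0))   # regular case
    print(max_stable_dt(0.0))   # falls back to the huge value
```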

Regarding the other tests, the cases in test_3d and test_parallel run without problems. The test_ad/test_gradient_AD.py is failing because the 'parallelism_type' is not set in the input dictionary.

As these issues are not directly related to the Dockerfile implementation, I believe we can solve them in another issue to make the docker image available. What do you think?

Olender · 2024-06-24T18:02:28Z

The test_ad tests aren't currently run in the CI. They are leftover code from even before the refactoring. We could open an issue to re-add the adjoint-based gradient tests.

SouzaEM added 4 commits May 7, 2024 11:35

Add first version of Dockerfile with release image

ae9b3d7

Add testing environment to Docker images

5dadc19

Add development environment to Docker images

4707465

Add python3-tk to enable visualization with the container and fix ins…

f97931b

…truction to mount the working copy to the container

SouzaEM requested a review from Olender June 7, 2024 13:30

SouzaEM self-assigned this Jun 7, 2024

Olender reviewed Jun 7, 2024

This was referenced Jun 13, 2024

Sometimes gradient test does not pass #51

Closed

Sometimes gradient test does not pass NDF-Poli-USP/spyro#106

Open

Change firedrake and SeismicMesh versions in Dockerfile. The test_gra…

e8cb6c8

…dient_2D.py is now working (if the absolute errors are used

Merge remote-tracking branch 'origin/main' into issue_0042

6fe8f64

SouzaEM marked this pull request as ready for review June 25, 2024 11:20

Olender approved these changes Jun 25, 2024

Olender and others added 2 commits June 25, 2024 11:27

Merge pull request NDF-Poli-USP#104 from Olender/main

52372a5

Major code refactoring

Merge branch 'main' into issue_0042

5b8d876

SouzaEM merged commit 19f54f0 into main Jun 25, 2024
0 of 2 checks passed
