
Issue 0042 (Dockerfile and related environments) #49

Merged · 8 commits into main · Jun 25, 2024

Conversation

SouzaEM
Collaborator

@SouzaEM SouzaEM commented Jun 7, 2024

I'm adding a Dockerfile and a README explaining how to build and use the Docker images. To avoid multiple Dockerfiles with the same configurations (which would be harder to maintain), I'm proposing the use of multi-stage builds. The idea is to have three images: one for release, one for running the automated tests, and one for development.

Could you check if the implementation addresses the use cases of the spyro project? I'm creating this pull request as a draft so we can discuss how to improve these images.
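
For readers less familiar with multi-stage builds, this is roughly how the three images would be produced from the single Dockerfile. The stage and tag names below are illustrative (only devtag:1.0 and testtag:1.0 appear later in this thread), not the exact names used in the PR:

````
# Each image is a named stage of the same Dockerfile, selected with --target
docker build --target spyro_release -t spyro:release .
docker build --target spyro_test    -t testtag:1.0 .
docker build --target spyro_dev     -t devtag:1.0 .
````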

@SouzaEM SouzaEM requested a review from Olender June 7, 2024 13:30
@SouzaEM SouzaEM self-assigned this Jun 7, 2024

codecov bot commented Jun 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.46%. Comparing base (fbd2ba8) to head (f97931b).

Current head f97931b differs from pull request most recent head 5b8d876

Please upload reports for the commit 5b8d876 to get more accurate results.

Additional details and impacted files
````
@@            Coverage Diff             @@
##             main      #49      +/-   ##
==========================================
- Coverage   82.02%   80.46%   -1.57%
==========================================
  Files          47       47
  Lines        3361     3173     -188
==========================================
- Hits         2757     2553     -204
- Misses        604      620      +16
````

☔ View full report in Codecov by Sentry.


````
# spyro dependencies
USER root
# Is there a reason for using development versions?
````
Owner

I am not sure. I haven't tried with other versions. Line 20 comes from SeismicMesh dependencies and the first iteration of its installation tutorial. I guess that only the dev version of CGAL is needed, and that is only for compatibility purposes on older Ubuntu installations.

Collaborator Author

Ok. I will remove this comment

docker/README.md Outdated

Then, start a container and share your local repository:
````
docker run -v $PWD/spyro:/home/firedrake/shared/spyro -it devtag:1.0
````
Owner

I think we should change this line to just:
docker run -v $PWD:/home/firedrake/shared/spyro -it devtag:1.0

That way we still have access to all the other folders and files in spyro-1 (such as the tutorials, tests, paper scripts, etc.)

Collaborator Author

Yes. I will update the README


````
FROM spyro_base AS spyro_release

RUN . /home/firedrake/firedrake/bin/activate; pip install git+https://github.com/Olender/spyro-1.git
````
Owner

I think it would make sense for the base package to also have access to the spyro repository via a git clone and a pip install. That way a new user has access to the notebook_tutorials and can test if everything installed correctly. It is also important to have shots, results, and meshes folders, since those are the default locations for files generated during simulations (they would already be there with a git clone).
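
Something along these lines in the base stage would probably be enough. This is only a sketch: the repository URL and venv path come from the RUN line quoted above, while the clone destination and the editable install are assumptions:

````
# Possible content for a Dockerfile RUN step in the base stage:
# clone spyro so notebook_tutorials and the default shots/results/meshes
# folders are available, then install it into the Firedrake venv
git clone https://github.com/Olender/spyro-1.git /home/firedrake/shared/spyro
. /home/firedrake/firedrake/bin/activate
pip install -e /home/firedrake/shared/spyro
````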

Collaborator Author

During development, I believe the user will want to use their local copy of spyro, so adding the main spyro branch to the image may lead to confusion regarding which version is being used.

For testing, the idea is to make the runner clone the commit under test, build the latest test image (to make sure that the runner is using a compatible image), and run the tests. In this case, I think only the spyro under test is important for the image.

These are the reasons why I didn't install spyro in the base image. The last line of the Dockerfile makes spyro available to the user (in the user's current branch). Do you think the current version is enough for accessing notebooks, shots, results, etc.?

Collaborator Author

Actually, the test image is always testing the main branch. I have to figure out a way to make it test the current version.

Owner

Maybe we could have a base version prior to spyro_base with all dependencies installed but no spyro. Then we could git clone and check out outside of Docker before running the tests?

Collaborator Author

I believe the checkout has to be made before building the image, because the Docker configuration file is now part of spyro. Although it will change less frequently than other parts of the code, the test routine should run with an image produced from the source code under test. Otherwise, we could run into inconsistencies, such as a developer forgetting to commit changes to the Dockerfile while their new code does not work with the previous Docker image.

I think the solution would be something similar to the development version. Maybe we end up with only two images (release and dev/test). That would be even better.
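
A rough sketch of the checkout-before-build flow I have in mind (the stage name and the commit variable are placeholders, not something already in this PR):

````
# Clone and check out the commit under test before building, so the test image
# is produced from the same source tree that the tests will exercise
git clone https://github.com/Olender/spyro-1.git spyro && cd spyro
git checkout "$COMMIT_UNDER_TEST"   # hypothetical variable provided by the CI runner

# Build the test image from the checked-out tree and run the suite inside it
docker build --target spyro_test -t testtag:1.0 .
docker run testtag:1.0 /bin/bash -c "source /home/firedrake/firedrake/bin/activate; python3 -m pytest --maxfail=1 ."
````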

docker/README.md Outdated

Then, the following command may be called for running the tests:
````
docker run -it testtag:1.0 /bin/bash -c "source /home/firedrake/firedrake/bin/activate; python3 -m pytest --maxfail=1 ."
````
Owner

The tests are currently failing inside the Docker container. I am going to try to figure out why.

Collaborator Author

The first test to fail produces the following log (tested with Firedrake tags 2024-05, 2024-04, and 2024-03):

````
shared/spyro/test/test_MMS.py:36: in run_solve
    Wave_obj.forward_solve()
shared/spyro/spyro/solvers/acoustic_wave.py:46: in forward_solve
    self.wave_propagator()
shared/spyro/spyro/io/basicio.py:86: in wrapper
    u, u_r = func(*args, **dict(kwargs, source_num=snum))
shared/spyro/spyro/solvers/acoustic_wave.py:113: in wave_propagator
    usol, usol_recv = time_integrator(self, source_id=source_num)
shared/spyro/spyro/solvers/time_integration.py:10: in time_integrator
    return time_integrator_mms(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration.py:24: in time_integrator_mms
    return central_difference_MMS(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration_central_difference.py:280: in central_difference_MMS
    Wave_object.solver.solve(X, B)
petsc4py/PETSc/Log.pyx:188: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
    ???
petsc4py/PETSc/Log.pyx:189: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
    ???
firedrake/src/firedrake/firedrake/linear_solver.py:159: in solve
    b = self._lifted(b)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <firedrake.linear_solver.LinearSolver object at 0x7f787f451d20>
b = Coefficient(WithGeometry(FunctionSpace(<firedrake.mesh.MeshTopology object at 0x7f7889c7aef0>, FiniteElement('Q', quadrilateral, 4, variant='spectral'), name=None), Mesh(VectorElement(FiniteElement('Q', quadrilateral, 1), dim=2), 1)), 31)

    def _lifted(self, b):
        u, update, blift = self._rhs
        u.dat.zero()
        for bc in self.A.bcs:
            bc.apply(u)
        update()
        # blift contains -A u_bc
>       blift += b
E       TypeError: unsupported operand type(s) for +=: 'Cofunction' and 'Function'

firedrake/src/firedrake/firedrake/linear_solver.py:129: TypeError
````

Owner

This error has been present for the last three months of Firedrake releases, so it's either a long-standing bug or a deprecated feature. Could you check 2023-12? My latest local Firedrake install was 2023-05, but yours should be newer. For now, we need a stable version to use with our tests.

I am currently installing the latest Firedrake locally to test spyro and check if it reproduces the same error as the Docker image. I have also taken a quick look at the open bugs in the Firedrake repo to see if any match ours. After that, I plan on figuring out exactly what causes the error and submitting a minimal reproducer as a bug report in their repo.

Collaborator Author

I'm running a test with 2023-08 (the oldest image on Docker Hub, with the exception of experimental) and this error has not happened yet (test_MMS.py is still running). As this error happens before the code actually runs, I believe that version 2023-08 will not have this issue.

Owner

I am running the tests locally using the latest version. Not only is there an "unsupported operand type(s) for +=: 'Cofunction' and 'Function'" error in the MMS test, but there are also numerical instabilities and larger errors in the other forward tests that do run. My initial guess is that something broke the mass-lumped triangular elements in Firedrake, and since we are probably the only ones who use them, no one noticed until now. They still build diagonal mass matrices and pass Firedrake's unit tests. I still have more tests to run to diagnose whether this is the issue. If it is, I will try to submit an issue and/or a fix in their GitHub (since I already have Firedrake code for some new ML elements that I need to submit and am familiar with the relevant FIAT and TSFC files).

Owner

I think the problem is also related to our LinearSolver object. I am going to try, in the compatibility branch, to update the RHS everywhere to cofunctions and to change our solver object to LinearVariationalSolver. It is possible that this will make the code compatible with newer versions of Firedrake.

Collaborator Author

Unfortunately, we do not have access to an older Firedrake version. I had a look at all the available Firedrake images after 2023-08, and they are all very slow and have greater errors overall.

I think we can install older versions (such as https://github.com/firedrakeproject/firedrake/tree/Firedrake_20221208.0). It may be more complex to install them with Docker, but at least we will have the same performance as before.

Owner

I noticed that I skipped testing 2023-09 before. I retested using all available Firedrake Docker versions and found that the 2023-09 version passes all tests and performs better than the 2023-08 version. I suggest we prioritize adjusting the Dockerfile to focus on this version for now.

Collaborator Author

If version 2023-09 is working, it would save us a lot of time. I was trying to install version Firedrake_20230316.0, but the process is slow because the install script does not work out of the box with Docker, and some errors only show up after the PETSc installation (which takes a long time). Can I use the Docker image of version 2023-09 as our base installation?

Owner

That sounds like a good plan. Using the Docker image of version 2023-09 should streamline the process and save us time. Installing older versions and managing dependencies can be really annoying. Let's go with 2023-09 for now. We can revisit and update to the latest Firedrake version when we are compatible.
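
In practice the change should be small: point the Dockerfile's FROM line at the pinned tag and rebuild. A sketch (the spyro_dev stage name is an assumption; only the firedrakeproject/firedrake:2023-09 tag comes from this discussion):

````
# Pull the pinned Firedrake base image and rebuild the spyro images on top of it
docker pull firedrakeproject/firedrake:2023-09
docker build --target spyro_dev -t devtag:1.0 .
````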

@SouzaEM
Collaborator Author

SouzaEM commented Jun 7, 2024

The test_gradient_2d.py test is failing. The first issue is that it is calculating a log of negative error values. After changing the code to calculate the log of the absolute error, the test still fails because the error is not converging as required by the test.

@Olender
Owner

Olender commented Jun 7, 2024

Every test fails for me.

It gives the error:
TypeError: unsupported operand type(s) for +=: 'Cofunction' and 'Function'
for basically every test. My initial guess would be something broken with the Firedrake version used, but if you are not seeing the same error with the same container, then it must be something else.

@Olender
Owner

Olender commented Jun 7, 2024

I am unsure how Docker works, but is it possible that my FROM firedrakeproject/firedrake:2024-05 is different from yours? If you pulled it in the middle of the month, could you have a different version that got cached, so that every time a new container is built it reuses that cached image? I am going to try out 2024-04. Firedrake sometimes breaks.

@SouzaEM
Collaborator Author

SouzaEM commented Jun 7, 2024

The failure in test_gradient_2d.py happened in the runner (without the Docker image). I guess the run was triggered by the pull request, but this pull request did not change any Python files.

I checked on my computer and all the tests fail in the Docker image. I believe the problems are related to the Firedrake version. The Docker tag 2024-05 should be the same for everybody. I can test different versions of the Firedrake images; you don't have to worry.
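
If a stale cached image were the cause, these standard Docker commands would show it and force a refresh (nothing project-specific here):

````
# Compare the local digest of the tag against what Docker Hub currently serves
docker image inspect --format '{{index .RepoDigests 0}}' firedrakeproject/firedrake:2024-05

# Force a fresh pull of the base tag and rebuild without reusing cached layers
docker pull firedrakeproject/firedrake:2024-05
docker build --pull --no-cache -t devtag:1.0 .
````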

@Olender
Owner

Olender commented Jun 20, 2024

I think I have almost figured out the changes in our code needed to get the triangular elements working in the newest Firedrake versions. Quadrilaterals are going to be a challenge.

@SouzaEM
Collaborator Author

SouzaEM commented Jun 20, 2024

> I think I have almost figured out the changes in our code needed to get the triangular elements working in the newest Firedrake versions. Quadrilaterals are going to be a challenge.

Shall we use an older version until we fix the quadrilateral elements?

@Olender
Owner

Olender commented Jun 21, 2024

We also currently have an institutional Docker Hub account (https://hub.docker.com/r/ndfti/spyro).

…dient_2D.py is now working (if the absolute errors are used
@SouzaEM
Collaborator Author

SouzaEM commented Jun 24, 2024

I changed the Firedrake base image to 2023-09 and the SeismicMesh reference to GitHub (instead of PyPI). Now test_gradient_2d.py is running, but I'm having issues with test/test_cpw_calc.py and test/test_cpw_calc_analytical_gen.py. I'm trying to figure out why these tests are failing.
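
For reference, the SeismicMesh change amounts to installing it from its Git repository instead of the PyPI release. A sketch of the pip command (the repository URL here is an assumption, not copied from the Dockerfile):

````
# Install SeismicMesh from GitHub rather than the PyPI release
. /home/firedrake/firedrake/bin/activate
pip install git+https://github.com/krober10nd/SeismicMesh.git
````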

@Olender
Owner

Olender commented Jun 24, 2024

Could it be running out of memory in test_cpw_calc_analytical.py? Could you run with the -s flag in pytest and send me the outputs? I am going to run them again here and try to reproduce the error.

@SouzaEM
Collaborator Author

SouzaEM commented Jun 24, 2024

The test_cpw_calc.py is failing when mpiexec is used:

````
python3 -m pytest test/test_cpw_calc.py
1 passed, 251 warnings in 384.50s (0:06:24)

mpiexec -n 4 python3 -m pytest test/test_cpw_calc.py
FAILED test/test_cpw_calc.py::test_cpw_calc - ZeroDivisionError: float divisi...
1 failed, 19 warnings in 5.15s
````

Also, the test holds the terminal until I press Ctrl+C.

> Could you run with the -s flag in pytest and send me the outputs? I am going to run them again here and try to reproduce the error.

I will work on that.
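
For the -s output, I will run it like this (same pattern as the README command quoted earlier; testtag:1.0 is the test image built from this PR):

````
# Run the failing test with pytest's -s flag so stdout is not captured
docker run -it testtag:1.0 /bin/bash -c \
  "source /home/firedrake/firedrake/bin/activate; python3 -m pytest -s test/test_cpw_calc.py"

# Same thing under MPI, to reproduce the parallel failure
docker run -it testtag:1.0 /bin/bash -c \
  "source /home/firedrake/firedrake/bin/activate; mpiexec -n 4 python3 -m pytest -s test/test_cpw_calc.py"
````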

@Olender
Owner

Olender commented Jun 24, 2024

So there are some issues with our SeismicMesh API related to parallelism. Specifically, for 3D problems, we often need to use different numbers of cores for meshing and for FWI. Because of this, I usually call SeismicMesh directly in the bash script on our cluster for 3D cases.

The cpw tests generate multiple different 2D meshes to determine the cells per wavelength parameter for a given error. SeismicMesh handles parallel meshing well, but it usually requires a different communicator to be passed to it (it should be comm.comm in Spyro). For 2D meshes, I typically leave it running only on rank 0. I currently run the cpw_calc in serial mode, but I can look into debugging it for parallel execution as a future issue.

It used to work in parallel, but because of our current coverage issue (which ends up requiring that every parallel test has to be run twice, once for testing and once for coverage), only tests in the test_parallel and test_3d folders are run in parallel to reduce testing runtime. I suggest opening an issue related to debugging and adding an appropriate smaller parallel test, and leaving it as it is for now (and running it in serial, like the .yaml file).

For parallel debugging I recommend using VS Code and following Firedrake's suggestions: https://github.com/firedrakeproject/firedrake/wiki/Parallel-MPI-Debugging-with-debugpy

@SouzaEM
Collaborator Author

SouzaEM commented Jun 24, 2024

The issue was running the tests in test/test_cpw_calc.py in parallel. The test runs successfully in serial. However, there is an issue in the estimate_timestep method of spyro/utils/estimate_timestep.py that sets a large value for dt:

````
if np.sqrt(max_eigval) > 0.0:
    max_dt = float(2 / np.sqrt(max_eigval))
else:
    max_dt = 100000000
````

Regarding the other tests, the cases in test_3d and test_parallel run without problems. The test_ad/test_gradient_AD.py is failing because the 'parallelism_type' is not set in the input dictionary.

As these issues are not directly related to the Dockerfile implementation, I believe we can address them in separate issues so that the Docker image can be made available. What do you think?

@Olender
Owner

Olender commented Jun 24, 2024

The test_ad tests aren't currently run in the CI. They are leftover code from even before the refactoring. We could open an issue to re-add the adjoint-based gradient tests.

@SouzaEM SouzaEM marked this pull request as ready for review June 25, 2024 11:20
@SouzaEM SouzaEM merged commit 19f54f0 into main Jun 25, 2024
0 of 2 checks passed