-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 0042 (Dockerfile and related environments) #49
Conversation
…truction to mount the working copy to the container
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #49 +/- ##
==========================================
- Coverage 82.02% 80.46% -1.57%
==========================================
Files 47 47
Lines 3361 3173 -188
==========================================
- Hits 2757 2553 -204
- Misses 604 620 +16 ☔ View full report in Codecov by Sentry. |
docker/Dockerfile
Outdated
|
||
# spyro dependencies | ||
USER root | ||
# Is there a reason for using development versions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure. I haven't tried with other versions. Line 20 comes from SeismicMesh dependencies and the first iteration of its installation tutorial. I guess that only the dev version of Cgal is needed, and that is only for compatibility purposes in older Ubuntu installations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I will remove this comment
docker/README.md
Outdated
|
||
Then, start a container and share your local repository: | ||
```` | ||
docker run -v $PWD/spyro:/home/firedrake/shared/spyro -it devtag:1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should change this line to just:
docker run -v $PWD:/home/firedrake/shared/spyro -it devtag:1.0
That way we still have access to all the other folders and files in spyro-1 (such as the tutorials, tests, paper scripts, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I will update the README
|
||
FROM spyro_base AS spyro_release | ||
|
||
RUN . /home/firedrake/firedrake/bin/activate; pip install git+https://github.com/Olender/spyro-1.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense that the base package also has access to the spyro repository with a git clone and a pip install. That way a new user has access to the notebook_tutorials and can test if everything installed correctly. It is also important to have a shots, results, and meshes folders, since those are the default locations for files generated during simulations (they would already be there with a git clone).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During development, I believe the user will want to use his local copy of spyro, so adding the main spyro branch to the image may led to confusions regarding which version is being used.
For testing, the idea is to make the runner clone the commit under test, build the latest test image (to make sure that the runner is using a compatible image), and run the tests. In this case, I think only the spyro under test is important for the image.
These are the reasons why I didn't installed spyro in the base image. The last line of the Dockerfile makes spyro available for the user (in the current branch of the user). Do you think that the current version is enough for accessing notebooks, shots, results, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the test image is always testing the main branch. I have to figure out a way to make it test the current version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have a base version prior to spyro_base with all dependencies installed but no spyro. Then we git clone and checkout outside of the docker before running the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the checkout has to be made before building the image, because the Docker configuration file is now part of spyro. Although it will change less frequently than the other parts of the code, the test routine should run with an image that would be produced with the source code that is under test. Otherwise, we could run in inconsistencies, such as a developer forgetting to commit his changes to the Dockerfile and his new code does not work with previous Docker image.
I think the solution would be something similar to the development version. Maybe we end up with only two images (release and dev/test). That would be even better.
docker/README.md
Outdated
|
||
Then, the following command may be called for running the tests: | ||
```` | ||
docker run -it testtag:1.0 /bin/bash -c "source /home/firedrake/firedrake/bin/activate; python3 -m pytest --maxfail=1 ." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are currently failing inside the docker. I am going to try to figure out why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first test to fail produces the following log (tested in firedrake tags 2024-05, 2024-04, and 2024-03):
shared/spyro/test/test_MMS.py:36: in run_solve
Wave_obj.forward_solve()
shared/spyro/spyro/solvers/acoustic_wave.py:46: in forward_solve
self.wave_propagator()
shared/spyro/spyro/io/basicio.py:86: in wrapper
u, u_r = func(*args, **dict(kwargs, source_num=snum))
shared/spyro/spyro/solvers/acoustic_wave.py:113: in wave_propagator
usol, usol_recv = time_integrator(self, source_id=source_num)
shared/spyro/spyro/solvers/time_integration.py:10: in time_integrator
return time_integrator_mms(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration.py:24: in time_integrator_mms
return central_difference_MMS(Wave_object, source_id=source_id)
shared/spyro/spyro/solvers/time_integration_central_difference.py:280: in central_difference_MMS
Wave_object.solver.solve(X, B)
petsc4py/PETSc/Log.pyx:188: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
???
petsc4py/PETSc/Log.pyx:189: in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
???
firedrake/src/firedrake/firedrake/linear_solver.py:159: in solve
b = self._lifted(b)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <firedrake.linear_solver.LinearSolver object at 0x7f787f451d20>
b = Coefficient(WithGeometry(FunctionSpace(<firedrake.mesh.MeshTopology object at 0x7f7889c7aef0>, FiniteElement('Q', quadrilateral, 4, variant='spectral'), name=None), Mesh(VectorElement(FiniteElement('Q', quadrilateral, 1), dim=2), 1)), 31)
def _lifted(self, b):
u, update, blift = self._rhs
u.dat.zero()
for bc in self.A.bcs:
bc.apply(u)
update()
# blift contains -A u_bc
> blift += b
E TypeError: unsupported operand type(s) for +=: 'Cofunction' and 'Function'
firedrake/src/firedrake/firedrake/linear_solver.py:129: TypeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this error has been present, it has been present for the last three months of Firedrake. It's either a long-standing bug or a deprecated feature. Could you check 2023-12? My latest local Firedrake install was 2023-05, but yours should be newer. For now, we need a stable version to use with our tests.
I am currently installing the latest Firedrake locally to test Spyro and check if it reproduces the same error as the docker. I have also done a quick look at the open bugs in the Firedrake repo to see if any match ours. After that, I plan on figuring out exactly what causes the error and submitting a minimal code with it as a bug in the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running a test with 2023-08 (the oldest image in DockerHub with the exception of experimental) and this error has not happened yet (the test_MMS.py
is still running). As this error happens before the code actually runs, I believe that version 2023-08 will not have this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am running the tests locally using the latest version. Not only is there an ¨unsupported operand type(s) for +=: 'Cofunction' and 'Function'¨ error in the MMS test, but there are numerical instabilities and more significant errors in other forward tests that go through. My initial guess is that something broke the mass lumped triangular elements in Firedrake and since we are probably the only ones that use them no one noticed it until now. They still build diagonal mass matrices and pass Firedrake's unit tests. I still have more tests to run to diagnose if this is the issue. If it is, I will try to submit an issue and/or a fix for it in their GitHub (since I already have Firedrake code for some new ML elements that I need to submit and am familiar with the relevant FIAT and TSFC files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is also related to our LinearSolver object. I am going to try, in the compatibility branch, to update the RHS everywhere to cofunctions and to change our solver object to LinearVariationalSolver. It is possible that this will make the code compatible with newer versions of Firedrake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, you do not have access to an older Firedrake version. I had a look at all the available Firedrakes after 2023-08, and they all are very slow and have greater errors overall.
I think we can install older versions (such as 'https://github.com/firedrakeproject/firedrake/tree/Firedrake_20221208.0'). It may be more complex to install them with Docker, but at least we will we have the same performance as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that I skipped testing for 2023-09 before. I retested using all available Firedrake Docker versions, and found that the 2023-09 version passes all tests and performs better than the 2023-08 version. I suggest we prioritize adjusting the Docker file to focus on this version for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If version 2023-09 is working, it would save us a lot of time. I was trying to install version Firedrake_20230316.0
, but the process is slow because the install script does not work out of the box with Docker and some errors happen after the PETSc installation (which is slow). Can I use the Docker image of version 2023-09 as our base installation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good plan. Using the Docker image of version 2023-09 should streamline the process and save us time. Installing older versions and managing dependencies can be really annoying. Let's go with 2023-09 for now. We can revisit and update to the latest Firedrake version when we are compatible.
The |
Every test fails for me It gives the error : |
I am unsure how Docker works, but is it possible that my FROM firedrakeproject/firedrake:2024-05 is different from yours? If you ran it at the middle of the month, could you have a different version that got cached, and every time a new container is built from it, it calls the cache? I am going to try out 2024-04. Firedrake sometimes breaks. |
The fail in I checked in my computer and all the tests fail in the Docker image. I believed the problems are related to the Firedrake version. The Docker tag 2024-05 should be the same for everybody. I can test different versions of Firedrake images. You don't have to worry. |
I think I have almost figured out the changes in our code needed to get the triangular elements working in the newest Firedrake versions. Quadrilaterals are going to be a challenge. |
Let's use an older version until we fix the quadrilaterals elements? |
We also currently have an institutional docker hub account (https://hub.docker.com/r/ndfti/spyro). |
…dient_2D.py is now working (if the absolute errors are used
I changed the firedrake base image to 2023-09 and the SeismicMesh reference to GitHub (instead of PyPI). Now the |
Could it be running out of memory in test_cpw_calc_analytical.py? Could you run with the -s flag in pytest and send me the outputs? I am going to run them again here and try to reproduce the error. |
The
Also, the test holds the terminal until I press Ctrl+C.
I will work on that. |
So there are some issues with our SeismicMeh API related to parallelism. Specifically, for 3D problems, we often need to use different numbers of cores for meshing and for FWI. Because of this, I usually call SeismicMesh directly in the bash script on our cluster for 3D cases. The cpw tests generate multiple different 2D meshes to determine the cells per wavelength parameter for a given error. SeismicMesh handles parallel meshing well, but it usually requires a different communicator to be passed to it (it should be comm.comm in Spyro). For 2D meshes, I typically leave it running only on rank 0. I currently run the cpw_calc in serial mode, but I can look into debugging it for parallel execution as a future issue. It used to work in parallel, but because of our current coverage issue (which ends up requiring that every parallel test has to be run twice, once for testing and once for coverage), only tests in the test_parallel and test_3d folders are run in parallel to reduce testing runtime. I suggest opening an issue related to debugging and adding an appropriate smaller parallel test, and leaving it as it is for now (and running it in serial, like the .yaml file). For parallel debugging I recommend using VS Code and following Firedrake's suggestions:https://github.com/firedrakeproject/firedrake/wiki/Parallel-MPI-Debugging-with-debugpy |
The issue was running the tests in
Regarding the other tests, the cases in As these issues are not directly related to the Dockerfile implementation, I believe we can solve them in another issue to make the docker image available. What do you think? |
The test_ad tests aren't currently run in the CI. They are left over code from even before the refactoring. We could open an issue to re-add the adjoint based gradient tests. |
Major code refactoring
I'm adding a Dockerfile and a README explaining how to build and use the Docker images. To avoid multiple Dockerfiles with the same configurations (which would be harder to maintain), I'm proposing the use of multi-stage builds. The idea is to have three images: one for release, one for running the automated tests, and one for development.
Could you check if the implementation addresses the use cases of spyro project? I'm creating this pull request as a draft so we can discuss how to improve these images.