OpenFOAM error when using more than one thread in Docker #497
Hi, after some digging into this I think I have a working solution. I did it directly in the container, but I guess you could add it to the Dockerfile or the source code.

And that's it! We have the Docker container running with momentum conservation and 25 threads.

Note that I am not an expert on this; it is simply a working solution I got by googling and playing a bit with the Docker setup. PS: The Docker approach is important for us to run Azure HPC Batching with momentum conservation. For our local HPC I will have to do some testing, because we use Singularity and it is more restrictive with privileges. In case it fails, maybe using the OpenFOAM image as a base instead of Ubuntu 20.04 could be a solution (well, just thinking out loud).
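The exact commands from this fix aren't quoted here, but the run.sh posted later in this thread exports OpenMPI's allow-run-as-root variables before calling the solver, which is the usual hurdle for MPI inside Docker. A minimal sketch of that kind of in-container setup (the config path, and the idea that these two variables are the core of the fix, are assumptions on my part):

```bash
# inside the container (or baked into the Dockerfile)
source /opt/openfoam8/etc/bashrc

# OpenMPI refuses to run as root by default; Docker containers usually run as root
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

# momentum-conservation run, with the thread count set in the cfg file
WindNinja_cli cli_momentumSolver_diurnal.cfg
```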
@santiagoMonedero Glad you got something working for now. I'm also not a Docker or MPI expert, so I'm not sure off the top of my head whether there is a better way to handle this. I'll leave this ticket open and will try to take a closer look soon. Thanks for reporting your fix here.
Meeting notes 12/5/2024:
Sathwik is also going to check in the scripts he's including in his local docker builds so that others attempting to replicate the results will have the same container image he is working with.
I would recommend keeping the example as small as possible. So just one or two total WindNinja simulations on one node, or one or two total WindNinja simulations per node on just two nodes. Also, it would be good to go back to keeping the NINJAFOAM directory for such cases; it helps for debugging. And heck, maybe two WindNinja simulations per node might be overkill as well. But sometimes, when testing the ability to run multiple simulations at once, it helps to have more than one, hence I thought maybe two. Much appreciated :).
Testing required before further progress is recorded:
After all three of these succeed, you may use what you learned above to adapt the Python script to correctly launch windninja_cli in the slurm environment.
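For reference, a single-simulation slurm smoke test along these lines might be a starting point (a sketch only; the image name, bind paths, and config location are placeholders, not the actual test plan):

```bash
#!/bin/bash
#SBATCH --job-name=wn_smoke_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --output=wn_smoke_test.log

# one WindNinja run inside the container; keep the case small so failures are easy to inspect
srun --ntasks=1 singularity exec \
    -B /path/to/case:/output \
    /path/to/windninja.sif \
    WindNinja_cli /output/cli.cfg
```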
Meeting notes from 12/6/2024:
Meeting notes 12/9/2024, on Ultron:
Sathwik will be solving the problem of why …
Just spitballing, but shouldn't there be a … (referring to lines 41 to 43 in a1340ee)?
Or since it looks like there's all of two files, you could maybe add a couple of `install` or `cp` commands yourself?
wmake is wmake; any `make install` type stuff seems to be done internally in wmake. Even for standard builds, we just run `wmake libso` at the upper level to link libs, and `wmake` for the executables.
OpenFOAM finds the libs and executables built this way using something like `$(FOAM_USER_LIBBIN)` and `$(FOAM_USER_APPBIN)`, rather than the standard OpenFOAM installation's `$(FOAM_LIBBIN)` and `$(FOAM_APPBIN)`; all of these paths are set when running `source /opt/openfoam8/etc/bashrc`. Usually OpenFOAM just uses whatever the user creates, defines, and puts there, independent of the OpenFOAM installation. So it's kind of like a runtime library linking list type thing.
So maybe there need to be some commands to copy and paste these directory contents, already pre-built, to whatever is using them? I kind of assume something like that would already be happening with OpenFOAM and WindNinja, but you probably have to manually send over the `$FOAM_RUN/../applications` folder, well, technically the `$FOAM_RUN/../platforms` folder. (I'm surprised there isn't some kind of `$FOAM_USER_DIR`; guess that's why we use `$FOAM_RUN/../` instead.)
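For reference, the build-and-locate flow described above looks roughly like this; the exact directory under `$FOAM_RUN/../` is my assumption based on the comment, not a confirmed layout:

```bash
# sets FOAM_USER_LIBBIN, FOAM_USER_APPBIN, FOAM_RUN, FOAM_LIBBIN, FOAM_APPBIN, ...
source /opt/openfoam8/etc/bashrc

# build the custom pieces the same way as any wmake project
cd "$FOAM_RUN/../applications"
wmake libso   # shared libraries end up in $FOAM_USER_LIBBIN
wmake         # executables end up in $FOAM_USER_APPBIN

# these user-side directories are what OpenFOAM searches at run time,
# separate from the installation's $FOAM_LIBBIN / $FOAM_APPBIN
echo "$FOAM_USER_LIBBIN"
echo "$FOAM_USER_APPBIN"
```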
@sathwikreddy56 What is the status on this? @bnordgren's suggestion is correct: "Or since it looks like there's all of two files, you could maybe add a couple of install or cp commands yourself?" You need to copy the custom built binaries (…). Also, I see there is a problem in the Dockerfile: our custom OpenFOAM applications are being built twice, once with … Can you please make these changes to the Dockerfile and commit them once things are working?
So I just tried out, on my local machine, the fix that we are trying to do on the server, and no dice: I get all the fun missing libWindNinja.so problems that Sathwik is having on the server. Two things that I tried:
Note that with 2), I had originally posted that it WASN'T working; apparently I had messed up the paths. Using … Also note that my idea for 2) came from looking at …

Why 1) doesn't work, I'm not sure; maybe OpenFOAM got more strict with the changes from foam 2.2.0 to foam 8? But usually just putting .dll/.so files into the appropriate place is good enough. I'll look more into it, because that fix would be much easier than editing paths each time after checking out the docker image. I also wish that OpenFOAM provided an additional variable …
I just tested on my machine and copying applyInit and libWindNinja.so to /opt/openfoam8/platforms/linux64GccDPInt32Opt/bin and /opt/openfoam8/platforms/linux64GccDPInt32Opt/lib works for me, as expected.
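In Dockerfile terms that amounts to a couple of copy commands along these lines (the source locations are assumed to be the user build directories discussed above; the destinations are the paths that worked in the test just described):

```bash
# make the custom-built pieces visible system-wide, next to the stock OpenFOAM binaries,
# so any user in the container finds them without extra environment setup
cp "$FOAM_USER_APPBIN/applyInit"       /opt/openfoam8/platforms/linux64GccDPInt32Opt/bin/
cp "$FOAM_USER_LIBBIN/libWindNinja.so" /opt/openfoam8/platforms/linux64GccDPInt32Opt/lib/
```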
Quick update on the issue: I have made the necessary changes to the docker image, which are …
Did some more testing on my local machine, and it looks like 1) from the above comment actually works after all? But I got confused because it still throws warnings about not being able to find libNinjaSampling.so when running. But it DOES run; I got …
I am a little confused by @sathwikreddy56's statement. From my understanding, when we containerize something, the host's environment should not affect whether it runs or not. The container itself should be separate, so when @sathwikreddy56 says "copy the required file to systemwide available locations" I feel this goes against the principles of why we containerize projects. I could be misunderstanding, so feel free to correct me, but I wanted to add this as more of an outsider looking in.
@masonwillman, it seems that the container binds the host's home directory to its own home so that it can access the files there. What I meant by systemwide available locations is that the root locations, such as the /opt, /usr, and /bin kinds of directories in the container, are self-contained, i.e. they are not bound from the host machine; only /home is bound, for some reason. When we copy the shared object files from /home to /opt, it allows any user to access them, rather than a particular user, which is the standard concept in packaging containers.
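A quick way to see that behavior from the host (the image name and bind paths below are just the ones from the sbatch script later in this thread, used as examples):

```bash
# /opt, /usr, /bin come from the image itself; by default Singularity binds only
# $HOME (plus a few paths like /tmp) from the host
singularity exec wn_latest6.sif ls /opt/openfoam8

# anything outside the default binds has to be mounted explicitly
singularity exec \
    -B /mnt/ohpc/WN_sims/59:/output \
    -B /mnt/fsim/windninja/src:/data \
    wn_latest6.sif ls /output /data
```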
If I'm understanding correctly, part of the issue is changing folder names when going to the containers. So the variable … Though I agree with Mason, I'm confused why the container changes …
So I just talked to Bryce, and in the process of understanding stuff, he had me rerun the fix attempt on my local machine one more time. And I found that I was misreading the warnings I was getting: I had accidentally been running a case that was defined such that it required an additional … Dropping that requirement from my case dropped all the warnings that I was getting. So yes, fix 1) from the above comment should work great. Man, next time I need to squint harder, and be more careful not to mix up work projects when debugging/testing.
If I'm understanding correctly, the reason …
@sathwikreddy56 Where are we at on this? Have you been able to modify the Dockerfile to properly install …
@nwagenbrenner I have tested the code and checked the logs. WindNinja is now able to find the required files in the container, but the logs show no details about why reconstructPar() is failing. I am still trying to debug the issue of why multicore runs are failing.
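One way to get more detail is to rerun the failing step by hand inside the leftover case directory (the path below is a placeholder). In OpenFOAM, reconstructPar complaining that no times were selected usually means the processor directories contain no time directories to merge, i.e. the parallel stages never wrote results:

```bash
source /opt/openfoam8/etc/bashrc
cd /path/to/NINJAFOAM/case        # placeholder: the case directory kept from a debug run

ls -d processor*/                 # were the decomposed processor directories created?
ls processor0/                    # do they contain any time directories besides 0/?

reconstructPar -latestTime        # rerun by hand to see the full error output
```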
Bryce and I met with @sathwikreddy56 last Friday and did some more troubleshooting. Running the case by hand, we found out that the moveDynamicMesh executable was failing because …
Meeting Notes 12/18/2024:
@latwood feel free to add to these items.
If I'm understanding correctly:
We've moved on to where Sathwik is rebuilding the docker image with the latest WindNinja code, the vtk fix, as well as the slope_aspect_grid and flow_separation_grid utility scripts. Once we confirm that the new docker image works as we saw in our meeting, we will then see if the WindNinja runs go all the way to completion; then we should be ready for the slurm script tests.
Sathwik got a new docker image to run to completion on the head node, with 8 threads, though it's still having problems running with slurm. The new docker image has all the latest WindNinja master branch code merged into it, as well as his latest changes to run on the server. We're planning on meeting again tomorrow to see if we can make more headway. In the meantime, Sathwik is going to do some cleanup and testing to see if he can get it to work on slurm before our meeting. Also, it seems like a good time to make a copy of the most up-to-date files and commands needed to get stuff to run up to this point.
I have updated the singularity container with the updated WindNinja code from master. Coming to MPI runs, the singularity container runs properly with 8 threads when run using `singularity exec`. The container with the new changes works fine when executed manually with the `singularity exec` command, but when I use slurm to initialize the container for a scheduled run it fails, and I am trying to find the reason for that. The main issue I face is the reconstructPar error "No Times Selected". Here is a copy of the files I used to run the singularity container.
It looks like the first mpiexec call (moveDynamicMesh) is failing, probably because of the error you reported yesterday where MPI thinks you are requesting more processors than are available. @latwood could help confirm that. Can you look into why this is happening under slurm? (A quick check for this is sketched after the quoted message below.)
On Thu, Dec 19, 2024, 13:38 Sathwik Reddy wrote:
sims3.sbatch:

```bash
#!/bin/bash
#SBATCH --job-name=windninja_simulations
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=96:00:00
#SBATCH --output=windninja_output.log
#SBATCH --error=windninja_output.err
# Print job info for debugging
echo "Job started on $(date)" >> windninja_output.log
echo "Job ID: $SLURM_JOB_ID" >> windninja_output.log
echo "Allocated Nodes: $SLURM_NODELIST" >> windninja_output.log
echo "Total Tasks: $SLURM_NTASKS" >> windninja_output.log
echo "CPUs per Task: $SLURM_CPUS_PER_TASK" >> windninja_output.log
# Proceed with the main script if checks pass
echo "Starting Windninja runs on each clip..." >> windninja_output.log
# Debug print: Check if directory is accessible
echo "Checking if /mnt/ohpc/WN_sims is accessible..." >> windninja_output.log
ls -ld /mnt/ohpc/WN_sims >> windninja_output.log 2>&1
# Debug print: Check if the Singularity image exists
echo "Checking if Singularity image exists at /mnt/ohpc/WN_src/WN_updated.sif..." >> windninja_output.log
if [ ! -f /mnt/fsim/windninja/src/wn_latest6.sif ]; then
echo "Singularity image not found!" >> windninja_output.log
exit 1
fi
# Debug print: Start the parallel container launches
echo "Launching 200 tasks in parallel..." >> windninja_output.log
# Launch 200 containers in parallel using srun
#srun --ntasks=1 --cpus-per-task=4 --exclusive bash -c
echo "Starting parallel container launches..." >> windninja_output.log
srun --ntasks=1 singularity exec -B /mnt/ohpc/WN_sims/59:/output -B /mnt/fsim/windninja/src:/data /mnt/fsim/windninja/src/wn_latest6.sif /mnt/fsim/windninja/src/scripts/run.sh
#for simulation in {0..200}; do
# echo "Starting simulation $simulation..." >> windninja_output.log
# srun --ntasks=1 singularity exec -B /mnt/ohpc/WN_sims/$simulation:/output /mnt/ohpc/WN_src/WN_updated.sif python3 /mnt/ohpc/WN_src/scripts/run.sh &
# srun --ntasks=1 singularity exec -B /mnt/fsim/windninja/sims/$simulation:/output -B /mnt/fsim/windninja/src:/data /mnt/fsim/windninja/src/wn_latest6.sif /mnt/fsim/windninja/src/scripts/run.sh &
# singularity exec -B /mnt/ohpc/WN_sims/200:/output -B /home:/home -B /mnt:/mnt /mnt/ohpc/WN_src/wn_latest1.sif /mnt/ohpc/WN_src/scripts/run.sh
#done
#wait
# Debug print: End of script
echo "All simulations completed. Job finished at $(date)" >> windninja_output.log
# Final wait to ensure all background tasks are completed
wait
```
run.sh:

```bash
#!/bin/bash
export CPL_DEBUG=NINJAFOAM
source /opt/openfoam8/etc/bashrc
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export FOAM_USER_LIBBIN=/usr/local/lib/
# /usr/local/bin/WindNinja_cli $*
#WindNinja_cli /output/dems_folder/dem0/momentum/grass/0o0deg/cli.cfg
OUTPUT_FOLDER="/output"
LOG_FILE="${OUTPUT_FOLDER}/simulation.log"
python3 /data/scripts/run_varyWnCfg3.py > "${LOG_FILE}" 2>&1
```
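To check the processor-count mismatch Natalie mentions above, one quick diagnostic from inside the slurm job is to compare what slurm actually granted to what the solver asks for (a sketch; the exact mpiexec command WindNinja issues is not shown in this thread):

```bash
# what slurm actually handed to this job step
echo "SLURM_NTASKS=$SLURM_NTASKS  SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"
nproc    # CPUs visible inside the allocation / container

# WindNinja's momentum solver launches something like
#   mpiexec -np <number_of_processors> moveDynamicMesh ... -parallel
# if <number_of_processors> exceeds the slots slurm granted to the step,
# OpenMPI aborts before the solver starts and the first mpiexec call fails
```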
Hey Natalie, yes, that's what is happening. Slurm is giving the required processors, but there is some kind of communication issue with the containers. I am still working out the root cause of the issue.
Hi, I am trying to run the example file
cli_momentumSolver_diurnal.cfg
using the Dockerfile in the repository (v3.9), but it fails when using more than 1 thread. All I have done is 1) clone the repository, 2) build the Dockerfile, and 3) run the docker container interactively to access WindNinja_cli. Surprisingly, it gives different errors when creating and using the image in Ubuntu 20.04 through WSL2 in Windows 11, and directly in Ubuntu 18.04 without WSL.
PS: I know this was a known issue in a previous WindNinja version, but I just wanted to give it a try on the new release and check whether there is a workaround for it yet. Thanks!!
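For reference, those three steps amount to roughly the following (the repository URL and image tag are assumptions; the config file is the WindNinja example named above):

```bash
# 1) clone the repository
git clone https://github.com/firelab/windninja.git
cd windninja

# 2) build the image from the Dockerfile in the repository (v3.9)
docker build -t windninja:3.9 .

# 3) run the container interactively and invoke WindNinja_cli on the example config
docker run -it windninja:3.9 /bin/bash
# inside the container, with more than one thread requested in the cfg:
WindNinja_cli cli_momentumSolver_diurnal.cfg
```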