[BUG] -- Running Containerized Workloads with Multi-Instance Tasks on Azure Batch #15777
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @mksuni, @bgklein, @mscurrell.
Thank you for your feedback. Tagging and routing to the team best able to assist.
I tried a similar test using Gromacs. Apparently, the issue is about reading files from
To give you an overview: I create a Blob Storage container and upload my local case to it. Then I return a list of resource files, which I reference when creating the task. You can find this portion of my code above.
Similar to OpenFOAM, Gromacs also fails during the MPI execution in a multi-node setup, and when I use just one node, everything runs successfully, including MPI. The primary difference between the multi-node and single-node tests is the working directory I'm using when I create a Cloud Task. For multi-node runs, I'm running under
Please see the following error output from my Gromacs test:
However, the file This is how job folder (
There is an important piece of information in the error messages, though. The MPI ranks that error out are 1 and 3. I'm mapping by node, so I believe those ranks are on the compute node, not on the master node. That means the compute node somehow cannot access that file. It shouldn't be a permission issue either, because I confirmed that I'm running with root privileges. So either there is something wrong with the way that the files are stored when
Are you specifying the working directory for the task to run in? Running MPI tasks is documented at https://docs.microsoft.com/en-us/azure/batch/batch-mpi#application-command which provides a sample for a working MPI task and execution.
Yes, I specify it with So it goes like:
The error is:
from ranks 1 and 3 again. This is the tree view of the node:
Since this information comes from the master node, I believe it might be misleading. The problem is that the compute node cannot access the file I've highlighted. The following lines are from the Resource Files section in the URL you sent, @bgklein:
It says
No, files generated on one node will only exist on that node.
I think that clarifies the issue for me. Thank you - closing.
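As the reply above notes, files generated on one node exist only on that node, so anything the serial preparation steps write on the master is not automatically visible to the ranks on the compute node. One way to surface this before the solver starts is to have each node check for the inputs it needs. The following is only a minimal sketch and not part of the original thread: the function name missing_case_files and the example paths in the comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the issue): print every required file that is
# absent from a case directory on the node where this runs, and return a
# non-zero status if anything is missing.
# Usage: missing_case_files CASE_DIR RELATIVE_PATH...
# Example (paths are placeholders): missing_case_files "$PWD" constant/some_input
missing_case_files() {
    case_dir=$1
    shift
    missing=0
    for rel in "$@"; do
        if [ ! -e "$case_dir/$rel" ]; then
            # uname -n prints the node name, so the log shows which node lacks the file
            printf 'missing on %s: %s\n' "$(uname -n)" "$case_dir/$rel"
            missing=1
        fi
    done
    return "$missing"
}
```

Launched once per node ahead of the MPI command, a check like this prints the node name next to each missing path, which helps tell a file that is absent on the compute node apart from one that exists only on the master.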
Enable SDK automation for track-2 SDKs (Azure#15777) * Add blockchain to latest profile * Add additional types * Enable SDK automation for track-2 SDKs * Revert extra changes * Revert * Revert * Revert Co-authored-by: Mark Cowlishaw <markcowl@microsoft.com>
Describe the bug
I'm developing a command-line application that uses the Azure Batch SDK.
I can run a single-node MPI application successfully, but multi-node executions fail during the MPI execution. Error messages start with:
Then it fails. But it doesn't fail with mpirun: unknown command or something similar. It acts as if the input files don't exist. However, they do exist.
I've got two nodes, [10.0.0.4] and [10.0.0.5]. The latter should be the compute node's internal IP, so I'm guessing that the slave cannot pull the docker container properly. <-- See update at the end of this message
OpenMPI is installed inside the container, and I'm not installing an additional OpenMPI on the hosts using StartTask or CoordinationCommand.
Command that I'm using:
I've explained this in more detail below, but the serial command runs fine.
Expected behavior
It should work the same way the single-node MPI jobs do.
Actual behavior (include Exception or Stack Trace)
Fails with the error described in the first section.
To Reproduce
Steps to reproduce the behavior (include a code snippet, screenshot, or any additional information that might help us reproduce the issue)
microsoft-azure-batch. In my case, I choose centos-container version 7-7.
ContainerConfiguration to pull the container image from ACR.
ContainerConfiguration.
MaxTasksPerComputeNode to 1. Needed for multi-node executions.
pool.CommitAsync().Wait(); - not using start task.
CloudTask = multiNodeTask: I'm basically following the documentation here - https://docs.microsoft.com/en-us/azure/batch/batch-mpi
CloudTask.
TaskContainerSettings which uses the same container image used during the pool creation.
MultiInstanceSettings and expose port 23 for sshd as explained here - https://batch-shipyard.readthedocs.io/en/latest/80-batch-shipyard-multi-instance-tasks/
CommonResourceFiles because I need input files to run commands with MPI.
$$AZ_BATCH_TASK_SHARED_DIR/caseDirectory
The commands I'm running form a simulation workflow. The first few commands are non-MPI executions that prepare my case for the simulation. These serial commands run successfully. But when the MPI command starts, it fails because it cannot find the MPI environment. By the way, it doesn't give something like mpirun: unknown command, mpirun being what I'm using to execute my MPI commands. So I handled this issue by passing the --prefix option to mpirun. Ref: https://www.open-mpi.org/faq/?category=running#mpirun-prefix
I need some guidance since the documentation I follow is a bit different: it uses an application binary uploaded via CommonResourceFiles instead of containers. I don't want to go down that path because the applications I'll be using, such as OpenFOAM, consist of hundreds of libraries and binaries. Therefore I believe using containers is the right thing to do for my use case. Also, the documentation about container workloads explains a single-node example without much detail, so I'm kind of stuck at this point.
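For readers who want to see what the --prefix workaround might look like in a task command line, here is a rough sketch. It is an illustration under assumptions, not the command used in this issue: the install location /opt/openmpi, the helper name build_mpirun_cmd, and the solver name in the comment are all hypothetical. AZ_BATCH_HOST_LIST is the comma-separated host list that Azure Batch provides to multi-instance tasks, and --prefix, -np and --host are standard Open MPI mpirun options.

```shell
#!/bin/sh
# Assumption (not from the issue): Open MPI is installed under /opt/openmpi
# inside the container. Override OMPI_PREFIX if it lives elsewhere.
OMPI_PREFIX=${OMPI_PREFIX:-/opt/openmpi}

# Hypothetical helper: print the mpirun command line instead of executing it,
# so the assembled command can be inspected in the task's stdout first.
# Usage: build_mpirun_cmd NUM_RANKS SOLVER [SOLVER_ARGS...]
# Example (solver name is a placeholder): build_mpirun_cmd 2 my_solver -parallel
build_mpirun_cmd() {
    np=$1
    shift
    # --prefix tells mpirun where Open MPI is installed on the remote nodes;
    # --host takes the node list that Batch exports for the multi-instance task.
    printf 'mpirun --prefix %s -np %s --host %s %s\n' \
        "$OMPI_PREFIX" "$np" "$AZ_BATCH_HOST_LIST" "$*"
}
```

Printing the command rather than running it keeps the sketch side-effect free; once the output looks right, the same line can be passed to the shell. Depending on the Open MPI version, the number of ranks per host may also need to be stated explicitly.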
Another thing that concerns me is that I'm running MPI commands as root. I'd like to know if there is any workaround to avoid executing as root.
Environment:
Please let me know if you need more information.
Update:
My initial thought was that this was about the docker pull mechanism of the compute nodes. Now I'm thinking it might be related to how the primary task and subtasks are scheduled. See the code snippet I'm actually using:
The actual error is below (this is an OpenFOAM tutorial case):
I've added a find . -name points to the task command. See its results: