[BUG] -- Running Containerized Workloads with Multi-Instance Tasks on Azure Batch #15777

Closed
fertinaz opened this issue Oct 7, 2020 · 7 comments
Labels: Batch Client, customer-reported, needs-team-attention, question, Service Attention

Comments


fertinaz commented Oct 7, 2020

Describe the bug
I'm developing a command-line application that uses the Azure Batch SDK.

I can run a single-node MPI application successfully, but multi-node executions fail during the MPI execution. Error messages start with:

Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts.
/FullPathToSource/mpi: line 46: mpicc: command not found

Then it fails. But it doesn't fail with mpirun: unknown command or something similar; it acts as if the input files don't exist. However, they do exist.

I've got two nodes, [10.0.0.4] and [10.0.0.5]. The latter should be the compute node's internal IP, so I'm guessing the second node cannot pull the Docker container properly. <-- See the update at the end of this message

OpenMPI is installed inside the container, and I'm not installing an additional OpenMPI on the hosts using StartTask or CoordinationCommand.

Command that I'm using:

  . /FullPathToApplicationFileInDocker/bashrc ; 
  export OMPI_MCA_btl_vader_single_copy_mechanism=none ; 
  export OMPI_ALLOW_RUN_AS_ROOT=1 ; 
  export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1  ;
  cd $AZ_BATCH_TASK_SHARED_DIR/myCase ;  
  serialCommand ; 
  mpirun -np 4 --host $AZ_BATCH_HOST_LIST \
    --mca btl_tcp_if_include eth0  \
    --map-by node \
    --prefix /opt/openmpi-4.0.4 \
    --oversubscribe \
    parallelCommand"

I've explained this below in more detail, but the serial commands run fine.

Expected behavior
It should work like the single-node MPI jobs do.

Actual behavior (include Exception or Stack Trace)
Fails with the error described in the first section.

To Reproduce
Steps to reproduce the behavior (include a code snippet, screenshot, or any additional information that might help us reproduce the issue)

  1. Create a pool with a container from ACR (see the sketch after this list):
  • Use a container-supported OS from microsoft-azure-batch. In my case, I chose centos-container version 7-7.
  • Set ContainerConfiguration to pull the container image from ACR.
  • Configure the VM using this ContainerConfiguration.
  • Enable inter-compute-node communication and set MaxTasksPerComputeNode to 1. Needed for multi-node executions.
  • Create the pool with pool.CommitAsync().Wait(); - not using a start task.
  2. Create a job:
                    var batchJob = batchClient.JobOperations.CreateJob();
                    batchJob.Id = batchJobId;
                    batchJob.PoolInformation = new PoolInformation { PoolId = batchPoolId };
                    batchJob.Commit();
  3. Create a CloudTask = multiNodeTask:
    I'm basically following the documentation here - https://docs.microsoft.com/en-us/azure/batch/batch-mpi
  • Start by setting up a multiNodeTask derived from CloudTask.
  • Create TaskContainerSettings that use the same container image used during pool creation.
  • Create MultiInstanceSettings and expose port 23 for sshd as explained here - https://batch-shipyard.readthedocs.io/en/latest/80-batch-shipyard-multi-instance-tasks/
  • Set CommonResourceFiles because I need input files to run commands with MPI.
  • Run all commands under $AZ_BATCH_TASK_SHARED_DIR/caseDirectory
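
For completeness, step 1 in code looks roughly like this. This is a minimal sketch against Microsoft.Azure.Batch 13.x, reusing batchClient and batchPoolId from the job snippet above; the image name, registry credentials, VM size, and node count are placeholders:

    using System.Collections.Generic;
    using Microsoft.Azure.Batch;

    // Marketplace image that supports containers (matches the pool described above).
    var imageReference = new ImageReference(
        publisher: "microsoft-azure-batch",
        offer: "centos-container",
        sku: "7-7",
        version: "latest");

    var containerConfig = new ContainerConfiguration
    {
        // Pre-pull the ACR image onto every node at allocation time.
        ContainerImageNames = new List<string> { "myacr.azurecr.io/mympiapp:latest" },
        ContainerRegistries = new List<ContainerRegistry>
        {
            new ContainerRegistry(
                userName: "acrUser",
                password: "acrPassword",
                registryServer: "myacr.azurecr.io")
        }
    };

    var vmConfig = new VirtualMachineConfiguration(imageReference, "batch.node.centos 7")
    {
        ContainerConfiguration = containerConfig
    };

    CloudPool pool = batchClient.PoolOperations.CreatePool(
        poolId: batchPoolId,
        virtualMachineSize: "Standard_D2s_v3",
        virtualMachineConfiguration: vmConfig,
        targetDedicatedComputeNodes: 2);

    pool.InterComputeNodeCommunicationEnabled = true; // required for multi-instance tasks
    pool.MaxTasksPerComputeNode = 1;                  // one subtask per node
    pool.CommitAsync().Wait();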

The commands I'm running form a simulation workflow. The first few commands are non-MPI executions that basically prepare my case for the simulation. These serial commands run successfully, but when the MPI command starts, it fails because it cannot find the MPI environment. By the way, it doesn't give something like mpirun: unknown command (mpirun is what I'm using to execute my MPI commands), so I basically handled this part by providing the --prefix option to mpirun. Ref: https://www.open-mpi.org/faq/?category=running#mpirun-prefix

I need some guidance since the documentation I follow is a bit different - it uses an application binary uploaded via CommonResourceFiles instead of containers. I don't want to go down that path because the applications I'll be using are things like OpenFOAM, which consist of hundreds of libraries and binaries. Therefore using containers is, I believe, the right thing to do for my use case. Also, the documentation about container workloads explains a single-node example without too many details, so I'm kind of stuck at this point.

Another thing that concerns me is that I'm running MPI commands as root. I'd like to know if there is any workaround to avoid root executions.
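
(For what it's worth, the variant I could try is requesting a non-admin auto user instead of admin; a minimal, untested sketch, since I don't know yet whether sshd and OpenMPI inside the container behave without elevation:)

    // Untested: request a non-admin auto user instead of ElevationLevel.Admin.
    multiNodeTask.UserIdentity = new UserIdentity(
        new AutoUserSpecification(ElevationLevel.NonAdmin, AutoUserScope.Pool));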

Environment:

  • Microsoft.Azure.Batch Version="13.0.0"
  • .NET Core SDK (3.1.402)

Please let me know if you need more information.

Update:

My initial thought was that this was about the docker pull mechanism of the compute nodes. Now I'm thinking it might be related to how the primary task and subtasks are scheduled. See the code snippet I'm actually using:

        // Compose the application command line: cd to the shared directory,
        // build the hosts file, then run the serial + MPI workflow.
        string taskCommandLine = "/bin/sh -c '" + taskSharedDir + " ; " + createHostsFile + " ; " + jobCommand + "'";
        CloudTask multiNodeTask = new CloudTask(batchTaskId, taskCommandLine)
        {
            ResourceFiles = caseResourceFiles
        };

        // Same image the pool's ContainerConfiguration was created with.
        var tcs = new TaskContainerSettings(containerImage, containerRunOptions, containerRegistry);
        multiNodeTask.ContainerSettings = tcs;

        // Common resource files are downloaded to every node's shared directory.
        var multiInstanceSettings = new MultiInstanceSettings(coordinationCmd, myJob.Nodes)
        {
            CommonResourceFiles = caseResourceFiles
        };

        multiNodeTask.MultiInstanceSettings = multiInstanceSettings;
        multiNodeTask.UserIdentity = new UserIdentity(new AutoUserSpecification(ElevationLevel.Admin, AutoUserScope.Pool));

        batchClient.JobOperations.AddTaskAsync(batchJobId, multiNodeTask).Wait();

The actual error (this is an OpenFOAM tutorial case) is:

[1] --> FOAM FATAL ERROR: 
[1] Cannot find file "points" in directory "polyMesh" in times "0" down to constant

I've added a find . -name points to the task command. See its results:

./processor3/constant/polyMesh/points
./constant/polyMesh/points
./processor1/constant/polyMesh/points
./processor0/constant/polyMesh/points
./processor2/constant/polyMesh/points

ghost commented Oct 8, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @mksuni, @bgklein, @mscurrell.


jsquire commented Oct 8, 2020

Thank you for your feedback. Tagging and routing to the team best able to assist.


fertinaz commented Oct 10, 2020

I tried a similar test using Gromacs. Apparently, the issue is about reading files from AZ_BATCH_TASK_SHARED_DIR when an MPI execution starts.

To give you an overview: I create a blob storage container and upload my local case to it. Then I basically return a list of resource files, which I reference in the task creation. You can find this portion of my code above.

Similar to OpenFOAM, Gromacs also fails during the MPI execution in a multi-node setup, while with just one node everything runs successfully, including MPI. The primary difference between the multi-node and single-node tests is the working directory I use when I create the CloudTask: for multi-node runs I run under AZ_BATCH_TASK_SHARED_DIR, whereas on a single-node test I use AZ_BATCH_TASK_WORKING_DIR.

Please see the following error output from my Gromacs test:

-------------------------------------------------------
Program:     gmx mdrun, version 2016.5
Source file: src/gromacs/commandline/cmdlineparser.cpp (line 235)
Function:    void gmx::CommandLineParser::parse(int*, char**)
MPI rank:    3 (out of 4)

Error in user input:
Invalid command-line options

-------------------------------------------------------
Program:     gmx mdrun, version 2016.5
Source file: src/gromacs/commandline/cmdlineparser.cpp (line 235)
Function:    void gmx::CommandLineParser::parse(int*, char**)
MPI rank:    1 (out of 4)

Error in user input:
Invalid command-line options
  In command-line option -s
    File
    '/mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/topol_pme.tpr'

However, the file topol_pme.tpr definitely exists. The command before this one is a serial execution that generates it. I've also added a find . -name topol_pme.tpr -type f, which prints that it exists, and I can confirm that I see it on the portal.

This is how the job folder (AZ_BATCH_TASK_SHARED_DIR/testCaseDirectory) looks right before mpirun is executed. I used the full path to make sure it's checking the right directory:

./topol_pme.tpr    ---> Returned from find command
.:
-rw-r--r--. 1 root root   11960 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/#mdout.mdp.1#
-rwxrwx---. 1 1000 1000 3312043 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/conf.gro
-rwxrwx---. 1 1000 1000     918 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/job-mpi-multinode.yaml
-rwxrwx---. 1 1000 1000     508 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/job.yaml
-rw-r--r--. 1 root root   11969 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/mdout.mdp
-rwxrwx---. 1 1000 1000     941 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/pme.mdp
-rwxrwx---. 1 1000 1000     952 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/rf.mdp
-rwxrwx---. 1 1000 1000     648 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/topol.top
-rw-r--r--. 1 root root 1168388 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/topol_pme.tpr  <<< File that cannot be found
-rw-r--r--. 1 root root 1168388 Oct 10 17:32 /mnt/resource/batch/tasks/workitems/batchJob202010101728/job-1/batchTask202010101728/watergmx50bare0048/topol_rf.tpr

There is a piece of important information in the error messages though. The MPI ranks that error out are 1 and 3. I'm mapping by node, so I believe those are from the compute node, not from the master node. That means the compute node somehow cannot access that file. It shouldn't be a permission issue either, because I've confirmed that I'm running with root privileges.

So either there is something wrong with the way the files are stored when AZ_BATCH_TASK_SHARED_DIR is used, or I'm completely misinterpreting how it should be used.


bgklein commented Oct 12, 2020

Are you specifying the working directory for the task to run in? Running MPI tasks is documented at https://docs.microsoft.com/en-us/azure/batch/batch-mpi#application-command which provides a sample for a working MPI task and execution.

@fertinaz
Author

Yes, I specify it with --wdir in the MPI command (by the way, I'm using openmpi-4.0.4). Also, in the taskCommandLine I first cd to $AZ_BATCH_TASK_SHARED_DIR/CaseDirectory and then call the mpirun command.

So it goes like:

  • string taskCommandLine: setMpiEnvVars + cdTaskSharedDir + jobCommand;
  • jobCommand:
mpirun -np 4 \
  --host $AZ_BATCH_HOST_LIST \
  --map-by node \
  --mca btl_tcp_if_include eth0 \
  --oversubscribe \
  --prefix /opt/openmpi-4.0.4 \
  --wdir $AZ_BATCH_TASK_SHARED_DIR/watergmx50bare0048 \
  $GMX_BIN/gmx_mpi mdrun -npme 0 -notunepme -ntomp 1 -dlb yes -v -nsteps 1000 -resethway -noconfout -s $AZ_BATCH_TASK_SHARED_DIR/watergmx50bare0048/topol_pme.tpr

The error is:

'/mnt/resource/batch/tasks/workitems/batchJob202010121920/job-1/batchTask202010121920/watergmx50bare0048/topol_pme.tpr'
    does not exist or is not accessible.
    The file could not be opened.

from ranks 1 and 3 again.

This is the tree view of the node:

/mnt/resource/batch/tasks
|-- applications
|-- fsmounts
|-- shared
|-- startup
|-- volatile
|   `-- startup
`-- workitems
    `-- batchJob202010121920
        `-- job-1
            `-- batchTask202010121920
                |-- certs
                |-- stderr.txt
                |-- stdout.txt
                |-- watergmx50bare0048
                |   |-- #mdout.mdp.1#
                |   |-- conf.gro
                |   |-- job-mpi-multinode.yaml
                |   |-- job.yaml
                |   |-- mdout.mdp
                |   |-- pme.mdp
                |   |-- rf.mdp
                |   |-- topol.top
                |   |-- topol_pme.tpr   <<< This is the file that cannot be accessed by ranks 1 and 3
                |   `-- topol_rf.tpr
                `-- wd
                    |-- mtcagent.log
                    `-- watergmx50bare0048
                        |-- conf.gro
                        |-- job-mpi-multinode.yaml
                        |-- job.yaml
                        |-- pme.mdp
                        |-- rf.mdp
                        `-- topol.top

Since this information comes from the master node, I believe it might be misleading. The problem is that the compute node cannot access the file I've highlighted.

The following lines are from the Resource Files section of the URL you sent, @bgklein:

You can specify one or more common resource files in the multi-instance settings for a task. These common resource files are downloaded from Azure Storage into each node's task shared directory by the primary and all subtasks. You can access the task shared directory from application and coordination command lines by using the AZ_BATCH_TASK_SHARED_DIR environment variable. The AZ_BATCH_TASK_SHARED_DIR path is identical on every node allocated to the multi-instance task, thus you can share a single coordination command between the primary and all subtasks. Batch does not "share" the directory in a remote access sense, but you can use it as a mount or share point as mentioned earlier in the tip on environment variables.

It says Batch does not share the directory in a remote-access sense. Does that mean that if a serial command before MPI generates a file as part of the taskCommandLine (probably executed by the master node), other compute nodes will have that file in their AZ_BATCH_TASK_SHARED_DIR as well?


bgklein commented Oct 12, 2020

No, files generated on one node will only exist on that node. commonResourceFiles will exist on every node, and any actions you take in the coordinationCommandLine will run on every node, but actions taken as part of the task's command line are not dynamically replicated by Batch.
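
A minimal sketch of what that split could look like here, reusing the variable names from the snippet above (serialCommand and parallelCommand are placeholders, and any sshd startup the existing coordination command performs would stay in front of serialCommand):

    // The serial preparation moves into the coordination command, which runs
    // on the primary and every subtask, so its outputs land in each node's
    // own AZ_BATCH_TASK_SHARED_DIR before the application command starts.
    string coordinationCmd = "/bin/sh -c '" +
        "cd $AZ_BATCH_TASK_SHARED_DIR/myCase ; " +
        "serialCommand'";

    // The application command line then only runs the MPI step.
    string jobCommand = "/bin/sh -c '" +
        "cd $AZ_BATCH_TASK_SHARED_DIR/myCase ; " +
        "mpirun -np 4 --host $AZ_BATCH_HOST_LIST --map-by node " +
        "--mca btl_tcp_if_include eth0 --oversubscribe --prefix /opt/openmpi-4.0.4 " +
        "parallelCommand'";

    var multiInstanceSettings = new MultiInstanceSettings(coordinationCmd, 2)
    {
        CommonResourceFiles = caseResourceFiles // inputs still download to every node
    };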

@fertinaz
Author

I think that clarifies the issue for me. Thank you - closing.
