Batch Service Guidance for Azure Maintenance Enabling SR-IOV #73

alfpark · 2019-10-15T15:07:04Z

Impact Overview

The Azure platform will perform a maintenance operation on select hosts with IB/RDMA-capable hardware. The operation upgrades the InfiniBand RDMA components to enable SR-IOV, which allows full use of all MPI stacks along with IPoIB. Unfortunately, this is a breaking change for the Network Direct-based RDMA components in place today. You may have received an email describing the upcoming maintenance; the contents of the notification can be read below.

Note that even if you do not use an IB/RDMA-capable VM size, e.g., STANDARD_H16R, but use a VM size that is hosted on an IB/RDMA-capable host, e.g., STANDARD_H16, you will be impacted.

Maintenance Schedule

Impacted VM Family	Notification	Maintenance Schedule	Approximate Maintenance Window Length	Maintenance Completed
STANDARD_NCv3 (Phase 1)	Link	November 2019	3 hours	Completed
STANDARD_NCv3 (Phase 2)	Link	TBD 2020	3 hours	Not Started
STANDARD_NCv2	Link	August 2020	3 hours	Completed
STANDARD_NDv1	Link	August 2020	3 hours	Completed

Other VM families will be added to this timetable as the maintenance period approaches.

Batch Service Guidance

Because the Batch service hosts many of these high-performance workloads on GPUs and/or IB/RDMA hardware, we are providing guidance on how to minimize the impact of the upcoming maintenance windows for business continuity. Note that if you follow a recommendation to execute in an alternate region, you will need to ensure that all of the appropriate core and service quotas are in place well in advance of the maintenance window, so that you do not experience disruptions due to quota limitations.

There are three possible scenarios:

Scenario 1: Batch Pools using IB/RDMA-capable VMs

Applicable VM sizes: STANDARD_NC24RS_V3, STANDARD_NC24RS_V2, STANDARD_NC24R, STANDARD_ND24R, STANDARD_H16R, STANDARD_H16MR

The following will apply if no action is taken during the maintenance window:

Compute nodes will lose contact with the Batch service and move to the unusable state.
Tasks that were executing on the compute node will be interrupted and requeued.
All existing task data on the compute node will be lost.
Once the maintenance window completes, the following will occur depending upon the pool OS VM configuration:
- Windows: Batch should recover the unusable nodes and eventually transition them to idle.
- Linux Marketplace Images with built-in IB/RDMA support: Batch will not be able to recover the unusable nodes, because the deployed OS configurations that worked with Network Direct are incompatible with the new SR-IOV IB/RDMA hardware.
- Linux Custom Images: Batch may or may not recover the nodes, depending on whether the image used supports both Network Direct and SR-IOV IB/RDMA.
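The three Scenario 1 outcomes above can be captured as a small lookup, which may be handy when auditing many pools at once. This is a minimal sketch: the enum and function names are hypothetical and are not part of any Batch SDK; only the outcomes themselves come from the list above.

```python
from enum import Enum


class PoolOS(Enum):
    """Pool OS VM configuration categories named in the list above."""
    WINDOWS = "Windows"
    LINUX_MARKETPLACE_RDMA = "Linux Marketplace image with built-in IB/RDMA support"
    LINUX_CUSTOM = "Linux custom image"


class Recovery(Enum):
    """Expected node state after the maintenance window (Scenario 1)."""
    EXPECTED = "Batch should recover the nodes to idle"
    NOT_POSSIBLE = "Batch cannot recover the nodes"
    IMAGE_DEPENDENT = "recovery depends on the image"


# Hypothetical lookup table; the values mirror the three bullets above.
SCENARIO_1_RECOVERY = {
    PoolOS.WINDOWS: Recovery.EXPECTED,
    PoolOS.LINUX_MARKETPLACE_RDMA: Recovery.NOT_POSSIBLE,
    PoolOS.LINUX_CUSTOM: Recovery.IMAGE_DEPENDENT,
}


def needs_new_os_configuration(pool_os: PoolOS) -> bool:
    """True when the pool cannot be assumed to recover without changes."""
    return SCENARIO_1_RECOVERY[pool_os] is not Recovery.EXPECTED
```

For example, `needs_new_os_configuration(PoolOS.LINUX_CUSTOM)` returns `True`, which flags the pool for a manual check of its image rather than asserting that it will fail.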

To mitigate interruptions in your workload, please action the following recommendations:

For GPU-based pools:

Create an execution environment on a different GPU-based VM size prior to the maintenance window for the impacted VM family. For example, if you are deployed on STANDARD_NC24RS_V3, you would create a new execution environment on STANDARD_NC24RS_V2, assuming that family is not targeted for maintenance near the same date. Alternatively, you can create a mirrored execution environment in an alternate region. For Linux pools, if the alternate execution environment has already completed maintenance, you will need to ensure that your pool OS VM configuration and tasks are compatible with the new SR-IOV IB/RDMA environment. Also keep in mind the performance implications of moving between VM families with different generations of GPU hardware.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is ongoing. We recommend switching over well in advance of the maintenance window and, for Linux pools, deleting the original execution environment, so that you are not billed for unusable compute nodes once the maintenance window completes for the original execution environment.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes. For Linux pools, you may need to re-create your original environment with an updated pool OS VM configuration and ensure that tasks are compatible with the new SR-IOV IB/RDMA environment.
Tear down the mirrored execution environment in the alternate environment.

For STANDARD_H16R and STANDARD_H16MR pools:

There are two alternate execution environment options:
- Create an execution environment on STANDARD_HB or STANDARD_HC VM families. Because these VM families already support SR-IOV IB/RDMA, you will need to modify the pool OS VM configuration to a compatible setting for Linux pools. Note that this change must be done in any case if migrating back to the original VM size after the maintenance window completes.
- Create an execution environment on STANDARD_A9. Keep in mind that performance may be reduced significantly in certain workloads due to hardware differences. The pool OS VM configuration can most likely be kept the same.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is ongoing. We recommend switching over well in advance of the maintenance window and, for Linux pools, deleting the original execution environment, so that you are not billed for unusable compute nodes once the maintenance window completes for the original execution environment.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes. For Linux pools, you may need to re-create your original environment with an updated pool OS VM configuration and ensure that tasks are compatible with the new SR-IOV IB/RDMA environment.
Tear down the mirrored execution environment in the alternate environment.

Scenario 2: Batch Pools using non-IB/RDMA VM sizes, but on IB/RDMA-capable hosts

Applicable VM sizes: STANDARD_NCv3, STANDARD_NCv2, STANDARD_NC, STANDARD_ND and STANDARD_H with the exclusion of VM sizes from Scenario 1
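To check which scenario a given pool falls under, the two "Applicable VM sizes" lists can be turned into a small classifier. The sketch below is illustrative only: the function name is hypothetical, the Scenario 1 sizes are copied verbatim from this notice, and the regular expression used to derive a VM family from a size name is an assumption about the naming pattern rather than an official rule. Verify the result against the sizes actually deployed in your account.

```python
import re
from typing import Optional

# Scenario 1: IB/RDMA-capable sizes, copied from the list in this notice.
SCENARIO_1_SIZES = frozenset({
    "STANDARD_NC24RS_V3", "STANDARD_NC24RS_V2", "STANDARD_NC24R",
    "STANDARD_ND24R", "STANDARD_H16R", "STANDARD_H16MR",
})

# Scenario 2 families as (series, version): NCv3, NCv2, NC, ND and H.
_IN_SCOPE_FAMILIES = frozenset({
    ("NC", "3"), ("NC", "2"), ("NC", None), ("ND", None), ("H", None),
})

# Assumed naming pattern: STANDARD_<series><cores><suffix>[_V<n>].
# HB and HC sizes do not match because a digit must follow the series.
_SIZE_PATTERN = re.compile(r"^STANDARD_(NC|ND|H)\d+[A-Z]*(?:_V(\d+))?$")


def maintenance_scenario(vm_size: str) -> Optional[int]:
    """Return 1 or 2 for the matching scenario, or None if not listed."""
    size = vm_size.strip().upper()
    if size in SCENARIO_1_SIZES:
        return 1
    match = _SIZE_PATTERN.match(size)
    if match and (match.group(1), match.group(2)) in _IN_SCOPE_FAMILIES:
        return 2
    return None
```

With these assumptions, `maintenance_scenario("Standard_H16r")` yields 1, `maintenance_scenario("STANDARD_H16")` yields 2, and a size outside the listed families such as `STANDARD_HB60RS` yields `None`.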

The following will apply if no action is taken during the maintenance window:

Compute nodes will lose contact with the Batch service and move to the unusable state.
Tasks that were executing on the compute node will be interrupted and requeued.
All existing task data on the compute node will be lost.
Once the maintenance window completes, the following will occur depending upon the pool OS VM configuration:
- Windows: Batch should recover the unusable nodes and eventually transition them to idle.
- Linux: Batch should recover the unusable nodes and eventually transition them to idle.

To mitigate interruptions in your workload, please action the following recommendations:

For GPU-based pools:

Create an execution environment on a different GPU-based VM size prior to the maintenance window for the impacted VM family. For example, if you are deployed on STANDARD_NC6S_V3, you would create a new execution environment on STANDARD_NC6S_V2, assuming that family is not targeted for maintenance near the same date. Alternatively, you can create a mirrored execution environment in an alternate region. Also keep in mind the performance implications of moving between VM families with different generations of GPU hardware.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is ongoing.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes.
Tear down the mirrored execution environment in the alternate environment.

For STANDARD_H pools, excluding IB/RDMA VM sizes:

Create an alternative execution environment with an acceptable performance and/or price profile for your scenario. Potential suitable VM family substitutes are HB, HC, Fv2, Dv2, Dv3, Ev2, Ev3, STANDARD_A10, and STANDARD_A11.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is ongoing.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes.
Tear down the mirrored execution environment in the alternate environment.

Linux OS Migration Recommendations for SR-IOV

Certain Marketplace images will no longer be compatible once hosts have been upgraded to SR-IOV. In some of the scenarios above, taking no action will leave compute nodes in the unusable state after the maintenance window.

The following table describes recommended upgrade paths for select Marketplace images:

Original Image (Network Direct)	Recommended Target Image (SR-IOV)	Notes
microsoft-azure-batch centos-container-rdma 7-4	microsoft-azure-batch centos-container-rdma 7-7	Docker-compatible runtime, OFED and popular MPI runtimes pre-installed
microsoft-azure-batch ubuntu-server-container-rdma 16-04-lts	microsoft-azure-batch centos-container-rdma 7-7	The OS will change from Ubuntu to CentOS [1]
OpenLogic CentOS-HPC 7.1	OpenLogic CentOS-HPC 7.7	OFED and popular MPI runtimes pre-installed
OpenLogic CentOS-HPC 7.3	OpenLogic CentOS-HPC 7.7	OFED and popular MPI runtimes pre-installed
OpenLogic CentOS-HPC 7.4	OpenLogic CentOS-HPC 7.7	OFED and popular MPI runtimes pre-installed
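The upgrade paths in the table can be encoded as a lookup keyed on the image reference, for instance to flag pools that still use a Network Direct image. This is a sketch under stated assumptions: splitting each table entry into (publisher, offer, sku) in that order is an interpretation of the table, and the function and constant names are hypothetical. It only looks up the image; other pool settings that depend on the OS are not handled here.

```python
from typing import Dict, Optional, Tuple

ImageKey = Tuple[str, str, str]  # (publisher, offer, sku), assumed ordering

_CENTOS_CONTAINER_77: ImageKey = (
    "microsoft-azure-batch", "centos-container-rdma", "7-7")
_CENTOS_HPC_77: ImageKey = ("OpenLogic", "CentOS-HPC", "7.7")

# Network Direct image -> recommended SR-IOV image, from the table above.
SRIOV_UPGRADE_PATHS: Dict[ImageKey, ImageKey] = {
    ("microsoft-azure-batch", "centos-container-rdma", "7-4"):
        _CENTOS_CONTAINER_77,
    ("microsoft-azure-batch", "ubuntu-server-container-rdma", "16-04-lts"):
        _CENTOS_CONTAINER_77,  # the OS changes from Ubuntu to CentOS
    ("OpenLogic", "CentOS-HPC", "7.1"): _CENTOS_HPC_77,
    ("OpenLogic", "CentOS-HPC", "7.3"): _CENTOS_HPC_77,
    ("OpenLogic", "CentOS-HPC", "7.4"): _CENTOS_HPC_77,
}

# Case-insensitive index so "openlogic" and "OpenLogic" both match.
_LOOKUP = {
    tuple(part.lower() for part in source): target
    for source, target in SRIOV_UPGRADE_PATHS.items()
}


def recommended_sriov_image(
    publisher: str, offer: str, sku: str
) -> Optional[ImageKey]:
    """Return the recommended target image, or None if not in the table."""
    return _LOOKUP.get((publisher.lower(), offer.lower(), sku.lower()))
```

A `None` result means only that the image is not listed in the table, not that the image is already SR-IOV compatible.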

[1] Note that if you migrate from microsoft-azure-batch ubuntu-server-container-rdma 16-04-lts to microsoft-azure-batch centos-container-rdma 7-7, the OS will change significantly. Because this image is intended for container-based workloads, such a migration may not cause many issues. If your workload requires Ubuntu as the VM OS, please continue reading below.

For Ubuntu-based images, you should create a custom image and publish it to a Shared Image Gallery. We recommend installing the Mellanox OFED distribution that is compatible with your OS, both to get the latest drivers and to enable the most functionality out of the IB/RDMA device. The following table shows the compatibility matrix:

Canonical UbuntuServer SKU	Minimum Required Kernel for SR-IOV compatible Inbox IB Driver	Minimum Compatible Mellanox OFED Package
16.04-LTS	4.15	4.6
18.04-LTS	5.0	4.7

Please follow this guide to apply the appropriate steps in addition to any MPI runtimes your workload requires.

Other Notes

Please stay tuned to your Azure communications for further notifications regarding upcoming maintenance or any changes. You can also refresh this issue as it will be updated with the latest schedules or changes to guidance.

More Information


alfpark · 2020-08-07T18:52:58Z

Update: New information added regarding NCv2 and NDv1 maintenance.

alfpark changed the title ~~Placeholder~~ Batch Service Guidance for Azure Maintenance Enabling SR-IOV Oct 18, 2019

alfpark pinned this issue Oct 18, 2019

alfpark added the notice Notice label Oct 18, 2019

alfpark closed this as completed Oct 17, 2022

alfpark unpinned this issue Oct 17, 2022
