Impact Overview
The Azure platform will be performing a maintenance operation on select hosts with IB/RDMA-capable hardware. This will upgrade the InfiniBand RDMA components to enable SR-IOV, which will allow the full use of all MPI stacks along with IPoIB. Unfortunately, this is a breaking change for the Network Direct-based RDMA components in place today. You may have received a communication email describing the upcoming maintenance; the contents of the notification can be read below.
Note that even if you do not use an IB/RDMA-capable VM size, e.g., STANDARD_H16R, but use a VM size that is hosted on an IB/RDMA-capable host, e.g., STANDARD_H16, you will be impacted.
Maintenance Schedule
Other VM families will be added to this timetable as the maintenance period approaches.
Batch Service Guidance
As the Batch service hosts many of these high performance workloads on GPUs and/or IB/RDMA hardware, we are providing guidance on how to minimize the impact of the upcoming maintenance windows for business continuity. Note that if you follow a recommendation for executing in an alternate region, you will need to ensure that you have all of the appropriate core and service quotas in-place well in advance of the maintenance window so you do not experience disruptions due to quota limitations.
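The quota note above can be turned into a quick arithmetic check before you create an alternate environment. This is a minimal sketch only: the function names and the idea of passing the quota limit and current usage as plain integers are assumptions for illustration, and you would look up the real numbers for your Batch account and region yourself.

```python
def required_cores(node_count: int, cores_per_node: int) -> int:
    """Total cores the alternate pool would consume."""
    return node_count * cores_per_node


def quota_shortfall(required: int, quota_limit: int, cores_in_use: int = 0) -> int:
    """Cores still missing from the quota (0 means the pool fits).

    quota_limit and cores_in_use are values you read from your own
    account; this function does not query Azure.
    """
    available = quota_limit - cores_in_use
    return max(0, required - available)


if __name__ == "__main__":
    needed = required_cores(node_count=10, cores_per_node=24)
    missing = quota_shortfall(needed, quota_limit=200)
    print(f"cores needed: {needed}, quota shortfall: {missing}")
```

A non-zero shortfall means a quota increase should be requested well ahead of the maintenance window.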
There are three possible scenarios:
Scenario 1: Batch Pools using IB/RDMA-capable VMs
Applicable VM sizes: STANDARD_NC24RS_V3, STANDARD_NC24RS_V2, STANDARD_NC24R, STANDARD_ND24R, STANDARD_H16R, STANDARD_H16MR
The following will apply if no action is taken during the maintenance window:
Compute nodes will lose contact with the Batch service and move to unusable state.
Tasks which were executing on the compute node will be interrupted and requeued.
All existing task data on the compute node will be lost.
Once the maintenance window completes, the following will occur depending upon the pool OS VM configuration:
Windows: Batch should recover the unusable nodes and eventually transition to idle.
Linux Marketplace Images with built-in IB/RDMA support: Batch will not be able to recover the unusable nodes as the deployed OS configurations that did work with Network Direct are incompatible with the new SR-IOV IB/RDMA hardware.
Linux Custom Images: Batch may or may not recover the unusable nodes, depending on whether the image used supports both Network Direct and SR-IOV IB/RDMA.
To mitigate interruptions in your workload, please action the following recommendations:
For GPU-based pools:
Create an execution environment on a different GPU-based VM size prior to the maintenance window for the impacted VM family. For example, if you are deployed on STANDARD_NC24RS_V3, you would create a new execution environment in STANDARD_NC24RS_V2, assuming that family is not targeted for a maintenance window near the same date. You can, instead, opt to create a mirrored execution environment in an alternate region. Additionally, for Linux pools, if the alternate execution environment for the region has already completed maintenance, then you will need to ensure that your pool OS VM configuration and tasks are compatible with the new SR-IOV IB/RDMA environment. Moreover, keep in mind the performance implications of moving between VM families with different generations of GPU hardware.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is on-going. It is recommended that this is performed well in advance of the maintenance window and, for Linux pools, that the original execution environment is deleted. This is to ensure that you are not billed for unusable compute nodes once the maintenance window completes for the original execution environment.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes. For Linux pools, you may need to re-create your original environment with updated pool OS VM configuration and ensure tasks are compatible with the new SR-IOV IB/RDMA environment.
Tear down the mirrored execution environment in the alternate environment.
For STANDARD_H16R and STANDARD_H16MR pools:
There are two alternate execution environment options:
Create an execution environment on STANDARD_HB or STANDARD_HC VM families. Because these VM families already support SR-IOV IB/RDMA, you will need to modify the pool OS VM configuration to a compatible setting for Linux pools. Note that this change must be done in any case if migrating back to the original VM size after the maintenance window completes.
Create an execution environment on STANDARD_A9. Keep in mind that performance may be reduced significantly in certain workloads due to hardware differences. The pool OS VM configuration can most likely be kept the same.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is on-going. It is recommended that this is performed well in advance of the maintenance window and, for Linux pools, that the original execution environment is deleted. This is to ensure that you are not billed for unusable compute nodes once the maintenance window completes for the original execution environment.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes. For Linux pools, you may need to re-create your original environment with updated pool OS VM configuration and ensure tasks are compatible with the new SR-IOV IB/RDMA environment.
Tear down the mirrored execution environment in the alternate environment.
Scenario 2: Batch Pools using non-IB/RDMA VM sizes, but on IB/RDMA-capable hosts
Applicable VM sizes: STANDARD_NCv3, STANDARD_NCv2, STANDARD_NC, STANDARD_ND and STANDARD_H with the exclusion of VM sizes from Scenario 1
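To see which scenario a pool falls under, the two "Applicable VM sizes" lists can be sketched as a small classifier. The Scenario 1 set is copied from this issue as written. The Scenario 2 pattern is an assumption: it treats any size named `STANDARD_NC<n>...`, `STANDARD_ND<n>...` or `STANDARD_H<n>...` as belonging to the listed families, which is a rough heuristic and may misclassify VM families not covered by this issue.

```python
import re

# VM sizes listed under Scenario 1 (IB/RDMA-capable), copied from this issue.
SCENARIO_1_SIZES = {
    "STANDARD_NC24RS_V3",
    "STANDARD_NC24RS_V2",
    "STANDARD_NC24R",
    "STANDARD_ND24R",
    "STANDARD_H16R",
    "STANDARD_H16MR",
}

# Assumption: sizes in the NCv3, NCv2, NC, ND and H families start with
# STANDARD_NC, STANDARD_ND or STANDARD_H followed directly by a digit.
_SCENARIO_2_PATTERN = re.compile(r"^STANDARD_(NC|ND|H)\d+")


def classify_vm_size(vm_size: str) -> str:
    """Return 'scenario-1', 'scenario-2' or 'not-listed' for a VM size name."""
    size = vm_size.strip().upper()
    if size in SCENARIO_1_SIZES:
        return "scenario-1"
    if _SCENARIO_2_PATTERN.match(size):
        return "scenario-2"
    return "not-listed"


if __name__ == "__main__":
    for name in ("STANDARD_H16R", "STANDARD_H16", "STANDARD_NC6S_V3", "STANDARD_D2_V3"):
        print(name, "->", classify_vm_size(name))
```

The Scenario 1 check runs first so that the exclusion "with the exclusion of VM sizes from Scenario 1" is respected.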
The following will apply if no action is taken during the maintenance window:
Compute nodes will lose contact with the Batch service and move to unusable state.
Tasks which were executing on the compute node will be interrupted and requeued.
All existing task data on the compute node will be lost.
Once the maintenance window completes, the following will occur depending upon the pool OS VM configuration:
Windows: Batch should recover the unusable nodes and eventually transition to idle.
Linux: Batch should recover the unusable nodes and eventually transition to idle.
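The expected post-maintenance behaviour for both scenarios can be summarised as a lookup table. The sketch below only transcribes the outcomes stated in this issue; the key names (`linux-marketplace-rdma`, `linux-custom`, and so on) and the function names are labels invented for illustration.

```python
# Outcomes after the maintenance window, transcribed from the lists above.
RECOVERY = {
    ("scenario-1", "windows"): "recovers",
    ("scenario-1", "linux-marketplace-rdma"): "does-not-recover",
    ("scenario-1", "linux-custom"): "depends-on-image",
    ("scenario-2", "windows"): "recovers",
    ("scenario-2", "linux"): "recovers",
}


def expected_recovery(scenario: str, os_config: str) -> str:
    """Look up what happens to unusable nodes once maintenance completes."""
    key = (scenario.lower(), os_config.lower())
    if key not in RECOVERY:
        raise KeyError(f"no guidance recorded for {key}")
    return RECOVERY[key]


def may_need_recreation(scenario: str, os_config: str) -> bool:
    """True when the pool may have to be re-created with an SR-IOV-compatible OS configuration."""
    return expected_recovery(scenario, os_config) != "recovers"


if __name__ == "__main__":
    print(expected_recovery("scenario-1", "linux-marketplace-rdma"))
    print(may_need_recreation("scenario-2", "linux"))
```

Pools for which `may_need_recreation` is true are the ones where deleting the original environment before the window avoids being billed for unusable nodes.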
To mitigate interruptions in your workload, please action the following recommendations:
For GPU-based pools:
Create an execution environment on a different GPU-based VM size prior to the maintenance window for the impacted VM family. For example, if you are deployed on STANDARD_NC6S_V3, you would create a new execution environment in STANDARD_NC6S_V2, assuming that family is not targeted for a maintenance window near the same date. You can, instead, opt to create a mirrored execution environment in an alternate region. Moreover, keep in mind the performance implications of moving between VM families with different generations of GPU hardware.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is on-going.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes.
Tear down the mirrored execution environment in the alternate environment.
For STANDARD_H pools, excluding IB/RDMA VM sizes:
Create an alternative execution environment with an acceptable performance and/or price profile for your scenario. Potential suitable VM family substitutes are HB, HC, Fv2, Dv2, Dv3, Ev2, Ev3, STANDARD_A10, and STANDARD_A11.
Plan to have your workload execute in the alternate environment while the affected region's maintenance window is on-going.
Migrate your workload from the alternate environment back to the original environment after the maintenance window completes.
Tear down the mirrored execution environment in the alternate environment.
Linux OS Migration Recommendations for SR-IOV
Certain Marketplace images will no longer be compatible once hosts have been upgraded to SR-IOV.
In some of the scenarios above, if no action is taken, this will leave compute nodes in the unusable state after the maintenance window.
The following table describes recommended upgrade paths for select Marketplace images:
| Original Image (Network Direct) | Recommended Target Image (SR-IOV) | Notes |
| --- | --- | --- |
| microsoft-azure-batch centos-container-rdma 7-4 | microsoft-azure-batch centos-container-rdma 7-7 | Docker-compatible runtime, OFED and popular MPI runtimes pre-installed |
Note that if you migrate from microsoft-azure-batch ubuntu-server-container-rdma 16-04-lts to microsoft-azure-batch centos-container-rdma 7-6, the OS will change significantly. Because this image is intended for container-based workloads, such a migration may not cause many issues. If your workload requires Ubuntu as the VM OS, then please continue reading below.
For Ubuntu-based images, you should create a custom image that is published to a Shared Image Gallery with the following recommendations. It is recommended to install the Mellanox OFED distribution that is compatible with your OS for the latest drivers and to enable the most functionality out of the IB/RDMA device. The following table shows the compatibility matrix:
| Canonical UbuntuServer SKU | Minimum Required Kernel for SR-IOV compatible Inbox IB Driver | Minimum Compatible Mellanox OFED Package |
| --- | --- | --- |
| 16.04-LTS | 4.15 | 4.6 |
| 18.04-LTS | 5.0 | 4.7 |
Please follow this guide to apply the appropriate steps in addition to any MPI runtimes your workload requires.
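When building a custom Ubuntu image, the compatibility matrix above can be checked programmatically. The following is a minimal sketch: the minimum versions are taken from the table, while the function names and the assumption that kernel and OFED version strings begin with `major.minor` (for example the output of `uname -r`) are illustrative choices, not part of any Azure or Mellanox tooling.

```python
import re

# Minimum versions per Canonical UbuntuServer SKU, from the table above.
MIN_VERSIONS = {
    "16.04-LTS": {"kernel": (4, 15), "ofed": (4, 6)},
    "18.04-LTS": {"kernel": (5, 0), "ofed": (4, 7)},
}


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '5.0.0-1032-azure'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports_inbox_driver(sku: str, kernel_release: str) -> bool:
    """True if the kernel meets the minimum for the SR-IOV compatible inbox IB driver."""
    return parse_major_minor(kernel_release) >= MIN_VERSIONS[sku]["kernel"]


def ofed_is_compatible(sku: str, ofed_version: str) -> bool:
    """True if the Mellanox OFED package meets the minimum for the SKU."""
    return parse_major_minor(ofed_version) >= MIN_VERSIONS[sku]["ofed"]


if __name__ == "__main__":
    print(kernel_supports_inbox_driver("18.04-LTS", "5.0.0-1032-azure"))
    print(ofed_is_compatible("18.04-LTS", "4.6-1.0.1.1"))
```

Versions are compared as integer tuples rather than strings so that, for example, 4.15 is correctly treated as newer than 4.6.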
Other Notes
Please stay tuned to your Azure communications for further notifications regarding upcoming maintenance or any changes. You can also refresh this issue as it will be updated with the latest schedules or changes to guidance.
More Information