With a simple AzureML extension deployment on an AKS cluster or any Azure Arc enabled Kubernetes cluster, you can instantly onboard data science professionals to submit ML workloads to the Kubernetes cluster using existing Azure ML tools and service capabilities. You can deploy the AzureML extension and enable a Kubernetes cluster for the following machine learning needs:
- Deploy AzureML extension for model training and batch inference
- Deploy AzureML extension for real-time inferencing only
- Deploy AzureML extension for both model training and inferencing
For a quick trial and basic test scenario, follow the installation instructions in Happy path.
Once the AzureML extension deployment completes, it creates the following resources in the Azure cloud:
Resource name | Resource type | Description |
---|---|---|
Azure Service Bus | Azure resource | Used to sync nodes and cluster resource information to Azure Machine Learning services regularly. |
Azure Relay | Azure resource | Route traffic between Azure Machine Learning services and the Kubernetes cluster. |
Once the AzureML extension deployment completes, it creates the following resources in the Kubernetes cluster, depending on the AzureML extension deployment scenario:
Resource name | Resource type | Training | Inference | Training and Inference | Description | Communication with cloud service |
---|---|---|---|---|---|---|
relayserver | Kubernetes deployment | ✓ | ✓ | ✓ | The entry component to receive and sync the message with cloud. | Receive the request of job creation, model deployment from cloud service; sync the job status with cloud service. |
gateway | Kubernetes deployment | ✓ | ✓ | ✓ | The gateway to communicate and send data back and forth. | Send nodes and cluster resource information to cloud services. |
aml-operator | Kubernetes deployment | ✓ | N/A | ✓ | Manage the lifecycle of training jobs. | Token exchange with cloud token service for authentication and authorization of Azure Container Registry used by training job. |
metrics-controller-manager | Kubernetes deployment | ✓ | ✓ | ✓ | Manage the configuration for Prometheus | N/A |
{EXTENSION-NAME}-kube-state-metrics | Kubernetes deployment | ✓ | ✓ | ✓ | Export the cluster-related metrics to Prometheus. | N/A |
{EXTENSION-NAME}-prometheus-operator | Kubernetes deployment | ✓ | ✓ | ✓ | Provide Kubernetes native deployment and management of Prometheus and related monitoring components. | N/A |
amlarc-identity-controller | Kubernetes deployment | N/A | ✓ | ✓ | Request and renew Azure Blob/Azure Container Registry token through managed identity. | Token exchange with cloud token service for authentication and authorization of Azure Container Registry and Azure Blob used by inference/model deployment. |
amlarc-identity-proxy | Kubernetes deployment | N/A | ✓ | ✓ | Request and renew Azure Blob/Azure Container Registry token through managed identity. | Token exchange with cloud token service for authentication and authorization of Azure Container Registry and Azure Blob used by inference/model deployment. |
azureml-fe | Kubernetes deployment | N/A | ✓ | ✓ | The front-end component that routes incoming inference requests to deployed services. | azureml-fe service logs are sent to Azure Blob. |
inference-operator-controller-manager | Kubernetes deployment | N/A | ✓ | ✓ | Manage the lifecycle of inference endpoints. | N/A |
cluster-status-reporter | Kubernetes deployment | ✓ | ✓ | ✓ | Gather cluster information, like cpu/gpu/memory usage and cluster health. | N/A |
nfd-master | Kubernetes deployment | ✓ | N/A | ✓ | Node Feature Discovery is a Kubernetes add-on. | N/A |
nfd-worker | Kubernetes daemonset | ✓ | N/A | ✓ | Node Feature Discovery is a Kubernetes add-on. | N/A |
csi-blob-controller | Kubernetes deployment | ✓ | N/A | ✓ | Azure Blob Storage Container Storage Interface(CSI) driver. | N/A |
csi-blob-node | Kubernetes daemonset | ✓ | N/A | ✓ | Azure Blob Storage Container Storage Interface(CSI) driver. | N/A |
fluent-bit | Kubernetes daemonset | ✓ | ✓ | ✓ | Gather the components' system log. | Upload the components' system log to cloud. |
k8s-host-device-plugin-daemonset | Kubernetes daemonset | ✓ | ✓ | ✓ | Expose fuse to pods on each node. | N/A |
prometheus-prom-prometheus | Kubernetes statefulset | ✓ | ✓ | ✓ | Gather and send job metrics to cloud. | Send job metrics like cpu/gpu/memory utilization to cloud. |
volcano-admission | Kubernetes deployment | ✓ | N/A | ✓ | Volcano admission webhook. | N/A |
volcano-controllers | Kubernetes deployment | ✓ | N/A | ✓ | Manage the lifecycle of Azure Machine Learning training job pods. | N/A |
volcano-scheduler | Kubernetes deployment | ✓ | N/A | ✓ | Used to do in cluster job scheduling. | N/A |
alertmanager | Kubernetes statefulset | ✓ | N/A | ✓ | Handle alerts sent by client applications such as the Prometheus server. | N/A |
Important:
- Azure Service Bus and Azure Relay resources are under the same resource group as the Arc cluster resource. These resources are used to communicate with the Kubernetes cluster, and modifying them will break attached compute targets.
- By default, the Kubernetes deployment resources are randomly deployed to one or more nodes of the cluster, and daemonset resources are deployed to ALL nodes. If you want to restrict the extension deployment to specific nodes, use the `nodeSelector` configuration setting described below.
Notes:
- `{EXTENSION-NAME}`: This is the extension name specified with the `az k8s-extension create --name` CLI command.
Use the `k8s-extension create` CLI command to deploy the AzureML extension; review the list of required and optional parameters for the `k8s-extension create` CLI command here. For AzureML extension deployment configurations, use `--config` or `--config-protected` to specify a list of `key=value` pairs. Following is the list of configuration settings available for the different AzureML extension deployment scenarios.
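For scripted deployments, the `key=value` pairs above can be assembled programmatically. The sketch below uses a hypothetical helper (`config_args` is not part of the az CLI) to render a settings dict into `--config` arguments:

```python
def config_args(settings):
    """Render AzureML extension settings as `--config key=value` CLI arguments.
    Hypothetical helper for scripting deployments; not part of the az CLI itself."""
    if not settings:
        return []
    return ["--config"] + [f"{key}={value}" for key, value in settings.items()]

args = config_args({
    "enableTraining": "True",
    "nodeSelector.node-purpose": "worker",  # multiple selectors are separate pairs
})
print(" ".join(args))  # → --config enableTraining=True nodeSelector.node-purpose=worker
```

The rendered arguments can be appended to an `az k8s-extension create` invocation, for example via `subprocess.run(["az", "k8s-extension", "create", ...] + args)`.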
Configuration Setting Key Name | Description | Training | Inference | Training and Inference |
---|---|---|---|---|
`enableTraining` | `True` or `False`, default `False`. Must be set to `True` for AzureML extension deployment with Machine Learning model training support. | ✓ | N/A | ✓ |
`enableInference` | `True` or `False`, default `False`. Must be set to `True` for AzureML extension deployment with Machine Learning inference support. | N/A | ✓ | ✓ |
`allowInsecureConnections` | `True` or `False`, default `False`. Must be set to `True` for AzureML extension deployment with HTTP endpoints support for inference, when `sslCertPemFile` and `sslKeyPemFile` are not provided. | N/A | Optional | Optional |
`inferenceRouterServiceType` | `loadBalancer` or `nodePort`. Must be set for `enableInference=True`. | N/A | ✓ | ✓ |
`internalLoadBalancerProvider` | This config is only applicable for Azure Kubernetes Service (AKS) clusters now. Must be set to `azure` to allow the inference router to use an internal load balancer. | N/A | Optional | Optional |
`sslSecret` | The Kubernetes secret name under the `azureml` namespace storing `cert.pem` (PEM-encoded SSL cert) and `key.pem` (PEM-encoded SSL key). Required for AzureML extension deployment with HTTPS endpoint support for inference, when `allowInsecureConnections` is set to `False`. Use this config or give static cert and key file paths in configuration protected settings. See the sample secret yaml file. | N/A | Optional | Optional |
`sslCname` | An SSL CNAME to use if enabling SSL validation on the cluster. | N/A | N/A | Required when using an HTTPS endpoint |
`inferenceLoadBalancerHA` | `True` or `False`, default `True`. By default, the AzureML extension deploys three ingress controller replicas for high availability, which requires at least three workers in the cluster. Set this to `False` if you have fewer than three workers and want to deploy the AzureML extension for development and testing only; in this case it deploys only one ingress controller replica. | N/A | Optional | Optional |
`openshift` | `True` or `False`, default `False`. Set to `True` if you deploy the AzureML extension on an ARO or OCP cluster. The deployment process automatically compiles a policy package and loads it on each node so AzureML service operations can function properly. | Optional | Optional | Optional |
`nodeSelector` | Set the node selector so the extension components and the training/inference workloads are only deployed to the nodes with all specified selectors. Usage: `nodeSelector.key=value`; multiple selectors are supported. Example: `nodeSelector.node-purpose=worker nodeSelector.node-region=eastus` | Optional | Optional | Optional |
`installNvidiaDevicePlugin` | `True` or `False`, default `False`. The Nvidia Device Plugin is required for ML workloads on Nvidia GPU hardware. By default, the AzureML extension deployment does not install the Nvidia Device Plugin, regardless of whether the Kubernetes cluster has GPU hardware. Set this to `True` to have the extension install the Nvidia Device Plugin, but make sure the prerequisites are ready beforehand. | Optional | Optional | Optional |
`blobCsiDriverEnabled` | `True` or `False`, default `True`. The Blob CSI driver is required for ML workloads. Set this to `False` if it is already installed. | Optional | Optional | Optional |
`reuseExistingPromOp` | `True` or `False`, default `False`. The AzureML extension needs a Prometheus operator to manage Prometheus. Set to `True` to reuse the existing Prometheus operator. Compatible kube-prometheus-stack helm chart versions are 9.3.4 through 30.0.1. | Optional | Optional | Optional |
`volcanoScheduler.enable` | `True` or `False`, default `True`. The AzureML extension needs the Volcano scheduler to schedule jobs. Set to `False` to reuse the existing Volcano scheduler. Supported Volcano scheduler versions are 1.4 and 1.5. | Optional | N/A | Optional |
`logAnalyticsWS` | `True` or `False`, default `False`. The AzureML extension integrates with Azure LogAnalytics Workspace to provide log viewing and analysis capability through a LogAnalytics Workspace. This setting must be explicitly set to `True` if you want to use this capability. LogAnalytics Workspace costs may apply. | N/A | Optional | Optional |
Configuration Protected Setting Key Name | Description | Training | Inference | Training and Inference |
---|---|---|---|---|
`sslCertPemFile`, `sslKeyPemFile` | Path to the SSL certificate and key file (PEM-encoded). Required for AzureML extension deployment with HTTPS endpoint support for inference, when `allowInsecureConnections` is set to `False`. | N/A | Optional | Optional |
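For the `sslSecret` option in the table above, the secret layout might look like the following. This is a minimal sketch: the secret name is your choice (pass it via `sslSecret`), and both `data` values must be base64-encoded per standard Kubernetes Secret conventions:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <your-ssl-secret-name>   # the value you pass as sslSecret
  namespace: azureml
type: Opaque
data:
  cert.pem: <base64-encoded PEM certificate>
  key.pem: <base64-encoded PEM key>
```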
- For AzureML extension deployment on an ARO or OCP cluster, grant privileged access to the AzureML service accounts: run the `oc edit scc privileged` command and add the following service accounts under "users:":
system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
system:serviceaccount:azureml:{EXTENSION-NAME}-kube-state-metrics
system:serviceaccount:azureml:cluster-status-reporter
system:serviceaccount:azureml:prom-admission
system:serviceaccount:azureml:default
system:serviceaccount:azureml:prom-operator
system:serviceaccount:azureml:csi-blob-node-sa
system:serviceaccount:azureml:csi-blob-controller-sa
system:serviceaccount:azureml:load-amlarc-selinux-policy-sa
system:serviceaccount:azureml:azureml-fe
system:serviceaccount:azureml:prom-prometheus
system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default
system:serviceaccount:azureml:azureml-ingress-nginx
system:serviceaccount:azureml:azureml-ingress-nginx-admission
Notes
- `{EXTENSION-NAME}`: the extension name specified with the `az k8s-extension create --name` CLI command.
- `{KUBERNETES-COMPUTE-NAMESPACE}`: the namespace of the Kubernetes compute specified with the `az ml compute attach --namespace` CLI command. Skip configuring `system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default` if no namespace is specified with the `az ml compute attach` CLI command.
- If you use an Azure Kubernetes Service (AKS) cluster and it's not connected to Azure Arc, register the following feature for your subscription:
az feature register --namespace Microsoft.ContainerService -n AKS-ExtensionManager
The following CLI command will deploy the AzureML extension and enable the Kubernetes cluster for model training and batch inference workloads:
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
Notes:
- If you deploy the AzureML extension on AKS directly without an Azure Arc connection, change the `--cluster-type` parameter value to `managedClusters`.
Depending on your network setup, Kubernetes distribution variant, and where your Kubernetes cluster is hosted (in the cloud or on-premises), choose one of the following options to deploy the AzureML extension.
Notes:
- If you deploy the AzureML extension on AKS directly without an Azure Arc connection, change the `--cluster-type` parameter value to `managedClusters`.
- Public HTTPS endpoints support with public load balancer
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True inferenceRouterServiceType=loadBalancer sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
- Private HTTPS endpoints support with internal load balancer
az k8s-extension create --name amlarc-compute --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True inferenceRouterServiceType=loadBalancer internalLoadBalancerProvider=azure sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
- HTTPS endpoints support with NodePort
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group> --scope cluster --config enableInference=True inferenceRouterServiceType=nodePort sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --auto-upgrade-minor-version False
Note:
- Using a NodePort gives you the freedom to set up your own load-balancing solution, to configure environments that are not fully supported by Kubernetes, or even to expose one or more nodes' IPs directly.
- When you deploy with a NodePort service, the scoring URL (or swagger URL) is returned with one of the node IPs (for example, `https://<NodeIP>:<NodePort>/<scoring_path>`) and remains unchanged even if that node becomes unavailable. You can replace it with any other node IP.
- Private HTTP endpoints support with internal load balancer
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True inferenceRouterServiceType=loadBalancer internalLoadBalancerProvider=azure allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
- HTTP endpoints support with NodePort
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True inferenceRouterServiceType=nodePort allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
- Public HTTP endpoints support with public load balancer - the least secure way, NOT recommended
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableInference=True inferenceRouterServiceType=loadBalancer allowInsecureConnections=True --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
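The NodePort note above says the scoring URL keeps its original node IP but that any other node's IP can be substituted. A minimal sketch of that substitution, using a hypothetical helper built on the Python standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def swap_node_ip(scoring_url, new_ip):
    """Replace the host portion of a NodePort scoring URL while keeping the
    scheme, port, and scoring path intact. Hypothetical helper, not an
    AzureML SDK function."""
    parts = urlsplit(scoring_url)
    netloc = f"{new_ip}:{parts.port}" if parts.port else new_ip
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# Point the scoring URL at a different, healthy node:
print(swap_node_ip("https://10.0.0.4:30500/api/v1/service/score", "10.0.0.7"))
# → https://10.0.0.7:30500/api/v1/service/score
```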
To enable the Kubernetes cluster for all kinds of ML workloads, choose one of the above inference deployment options and append the config settings for training and batch inference. The following CLI command enables a cluster with real-time inference HTTPS endpoint support, training, and batch inference workloads:
az k8s-extension create --name arcml-extension --extension-type Microsoft.AzureML.Kubernetes --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --config enableTraining=True enableInference=True inferenceRouterServiceType=loadBalancer sslCname=<cname> --config-protected sslCertPemFile=<path-to-the-SSL-cert-PEM-file> sslKeyPemFile=<path-to-the-SSL-key-PEM-file> --resource-group <resource-group> --scope cluster --auto-upgrade-minor-version False
Notes:
- If you deploy the AzureML extension on AKS directly without an Azure Arc connection, change the `--cluster-type` parameter value to `managedClusters`.
- Run the following CLI command to check the AzureML extension details:
az k8s-extension show --name arcml-extension --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>
- In the response, look for "name": "arcml-extension" and "provisioningState": "Succeeded". Note that it might show "provisioningState": "Pending" for the first few minutes.
- If the provisioningState shows Succeeded, run the following command on your machine with the kubeconfig file pointed to your cluster to check that all pods under the "azureml" namespace are in the 'Running' state:
kubectl get pods -n azureml
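The provisioning check above can be scripted. The sketch below parses the JSON that `az k8s-extension show` prints to stdout; `extension_succeeded` is a hypothetical helper, and the canned `raw` string stands in for the real command output:

```python
import json

def extension_succeeded(show_output, name="arcml-extension"):
    """Return True when the `az k8s-extension show` JSON reports a finished
    install. Hypothetical helper; the field names match the response fields
    described above ("name", "provisioningState")."""
    ext = json.loads(show_output)
    return ext.get("name") == name and ext.get("provisioningState") == "Succeeded"

# Canned output for illustration; in practice capture the real command's stdout,
# e.g. subprocess.run(["az", "k8s-extension", "show", ...], capture_output=True).
raw = '{"name": "arcml-extension", "provisioningState": "Succeeded"}'
print(extension_succeeded(raw))  # → True
```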
Use the `k8s-extension update` CLI command to update the mutable properties of the AzureML extension; review the list of required and optional parameters for the `k8s-extension update` CLI command here.
- Azure Arc supports updates of `--auto-upgrade-minor-version`, `--version`, `--config`, and `--config-protected`.
- For configurationSettings, only the settings that require an update need to be provided. If the user provides all settings, they are merged/overwritten with the provided values.
- For configurationProtectedSettings, ALL settings should be provided. If some settings are omitted, those settings are considered obsolete and deleted.
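The merge-versus-replace behavior above can be modeled as follows. This is a simplified sketch that mimics the documented semantics, not the extension's actual implementation:

```python
def apply_update(settings, protected, new_settings=None, new_protected=None):
    """Model of the documented update semantics (assumption: simplified model).
    - configurationSettings: provided keys are merged over the existing ones.
    - configurationProtectedSettings: the provided set replaces the old set
      wholesale, so any omitted key is dropped."""
    merged = {**settings, **(new_settings or {})}
    replaced = dict(new_protected) if new_protected is not None else dict(protected)
    return merged, replaced

s, p = apply_update(
    {"enableInference": "True"},
    {"sslCertPemFile": "/certs/cert.pem", "sslKeyPemFile": "/certs/key.pem"},
    new_settings={"logAnalyticsWS": "False"},
    new_protected={"sslCertPemFile": "/certs/cert.pem"},  # key file omitted → deleted
)
print(s)  # → {'enableInference': 'True', 'logAnalyticsWS': 'False'}
print(p)  # → {'sslCertPemFile': '/certs/cert.pem'}
```

This is why the `logAnalyticsWS` note below tells you to re-supply all original protected settings on update.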
Important
Don't update the following configs if you have active training workloads or real-time inference endpoints; otherwise, the training jobs will be impacted and the endpoints made unavailable.
- `enableTraining` from `True` to `False`
- `installNvidiaDevicePlugin` from `True` to `False` when using GPU
- `nodeSelector`. The update operation can't remove existing nodeSelectors; it can only update existing ones or add new ones.
Don't update the following configs if you have active real-time inference endpoints; otherwise, the endpoints will be unavailable.
- `allowInsecureConnections`
- `inferenceRouterServiceType`
- `internalLoadBalancerProvider`
- To update `logAnalyticsWS` from `True` to `False`, provide all original `configurationProtectedSettings`. Otherwise, those settings are considered obsolete and deleted.
Use the `k8s-extension delete` CLI command to delete the installed AzureML extension.
It takes around 10 minutes to delete all components deployed to the Kubernetes cluster. Run `kubectl get pods -n azureml` to check whether all components were deleted.