This document describes how to set up and manage the Oasis platform in Azure.
- Oasis on Azure
- 1 Requirements
- 2 Setup environment
- 3 Use the platform
- 6 Secure the platform
- 8 Questions about design
Before you begin, make sure you check the requirements.
Basic requirements:
- Azure subscription
- Azure account with enough privileges to create resources and assign roles
For DevOps pipeline:
- Azure DevOps (a free account works well to begin with)
In case you would like to deploy from your local system and/or debug Azure/Kubernetes:
An Azure DevOps pipeline is based on a repository, so the first step is to clone or fork this repository and put it somewhere Azure DevOps can access it, such as GitHub or Bitbucket (check Azure DevOps for more alternatives). This gives us a repository with everything we need to deploy Oasis and store our configurations.
git clone https://github.com/OasisLMF/OasisAzureDeployment.git
# Set origin and push to your repository
Then we need to set a few required settings for our Azure environment. Going through all configuration files is highly recommended (especially to secure the platform), but to deploy the environment you must set the following as a minimum:
File | Setting | Description |
---|---|---|
settings/settings.sh | DNS_LABEL_NAME | A unique name used to build the DNS name for this platform |
settings/settings.sh | LETSENCRYPT_EMAIL | Email used to generate the Let's Encrypt certificate. Will receive notifications if the certificate fails to be renewed. |
settings/azure/parameters.json | allowedCidrRanges | The IP address ranges that can access the platform. |
settings/azure/parameters.json | oasisServerAdminPassword | The database administrator password |
WARNING: Changing the workerNodesVm setting is not supported through the automatic deployment at the moment. Instead, you need to change it manually in the Azure Portal. |
---|
More details about each setting are found below or in the files.
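Before the first pipeline run it can help to sanity-check the two values in settings/settings.sh. The sketch below is not part of the repository: valid_dns_label and valid_email are hypothetical helpers, the example values are placeholders, and the label pattern assumes the rule Azure applies to public IP DNS labels (lowercase letters, digits and hyphens, 3 to 63 characters, starting with a letter).

```shell
# Hypothetical pre-flight checks for values in settings/settings.sh.
# Assumption: Azure DNS labels must match ^[a-z][a-z0-9-]{1,61}[a-z0-9]$
valid_dns_label() {
    printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{1,61}[a-z0-9]$'
}

# Very loose e-mail shape check: something@something.tld
valid_email() {
    printf '%s\n' "$1" | grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'
}

# Placeholder values; replace with your own.
DNS_LABEL_NAME="my-oasis-platform"
LETSENCRYPT_EMAIL="ops@example.com"

valid_dns_label "$DNS_LABEL_NAME" || echo "DNS_LABEL_NAME is not a valid DNS label" >&2
valid_email "$LETSENCRYPT_EMAIL" || echo "LETSENCRYPT_EMAIL does not look like an e-mail address" >&2
```

The checks only catch obvious typos; whether the label is actually unique in your Azure location is only known once the deployment runs.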
There are 3 types of settings files for configuring the environment further. Default settings usually select resources of low/standard levels and can be updated to improve performance.
The file settings/settings.sh contains variables used by the deploy.sh script. Open the file for more details about each variable and its values.
The file settings/azure/parameters.json contains Azure-specific parameters like CIDR ranges, node types etc.
A short summary of the most interesting ones:
Name | Description |
---|---|
allowedCidrRanges | Whitelisted CIDR ranges - only these will be able to connect over HTTPS. Make sure to include the range/IP from which you will access the Azure domain. The address reported by whatsmyip might work, but not always, depending on how your network reaches Azure. |
oasisServerAdminPassword | Password for database administrator account |
platformNodeVm | Type of Virtual Machine to use for the AKS platform node (runs everything except worker pods) |
workerNodesVm | Type of Virtual Machine to use for AKS worker nodes. Warning: Please note this can't be changed after deployment due to lack of support in Azure deployments. You can still change it manually in the Azure Portal. |
workerNodesMaxCount | The maximum number of worker nodes to scale up to. The cluster will never exceed this number. |
oasisStorageAccountSKU | Storage account disk class. Use premium class to increase file share I/O speed. |
clusterName | AKS cluster name |
tags | Tags to attach to all resources created |
openHttpForAll | When set to true this will open up HTTP access without any filtering and redirect all requests to HTTPS. There are 2 reasons to have this open: 1. It is required for letsencrypt to create a valid TLS certificate. 2. It helps some browsers find the HTTPS service. |
availabilityZones | List of availability zones to use in this Azure location. |
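Since allowedCidrRanges decides who can reach the platform over HTTPS, it is worth checking that your own address really falls inside one of the ranges before deploying. The following is a minimal sketch in plain shell arithmetic for IPv4 only; ip_to_int and ip_in_cidr are hypothetical helpers and the addresses are documentation placeholders.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on "."
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 falls inside CIDR range $2 (for example 203.0.113.0/24).
ip_in_cidr() {
    net=${2%/*}
    bits=${2#*/}
    if [ "$bits" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    fi
    [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

# Placeholder addresses; use your public IP and a range from allowedCidrRanges.
if ip_in_cidr 203.0.113.7 203.0.113.0/24; then
    echo "203.0.113.7 is allowed"
fi
```

Remember that the address Azure sees may differ from your local one if you are behind a proxy or VPN.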
Helm settings are used to configure our Helm charts, more specifically settings related to Kubernetes resources and deployed models. There is no need to change them to deploy the platform; the defaults bring up a platform with a PiWind model. You do need to change them later to change default passwords and add your models.
You can set chart values in settings/helm.
Pipelines can't be set up automatically. We need to set up a project, a pipeline and two service connections to be able to reach your repository and Azure. We also need to create our resource group and give the pipeline permission to change it, but this only takes a few minutes.
Once you have logged into Azure DevOps:
- Create a project:
  - Click New project.
  - Give the project a name, such as Oasis platform, and click Create.
  - Select your repository location and click Next. If this is the first time you add this location, it will start a guide to add it and authenticate against it.
  - Select your repository and click Next.
  - On the Review your pipeline YAML page, just select Save from the dropdown menu next to the Run button. We do not want to run it yet.
- Set up Azure Service Connection:
  - Click Project settings in the bottom left corner.
  - Click Service connections.
  - Click New service connection.
  - Select Azure Resource Manager and click Next.
  - Select Service principal (automatic) and click Next.
  - Select your subscription, name the service connection Azure Connection, check Grant access permission to all pipelines and click Save.
  - Repeat the same process to create a GitHub connection and name it "OasisLMF".
- Set up the pipeline service principal access:
  - Open the default pipeline by clicking Pipelines in the menu to the left and then the first pipeline in the list.
  - Click Run pipeline.
  - Select the branch in case you are using another one than master.
  - Select resource-groups from the dropdown as Deploy.
  - Click Run. This will create the resource groups set in your settings.sh file. In case you get a permission denied error, click the Permission needed link and then Permit. The pipeline will continue after this.
  - Wait for it to finish creating our resource group. You can check the progress and log by clicking the Job.
- Give pipeline ownership of the resource group:
  - Once again go to Project settings, Service connections and click Azure Connection.
  - Click the link Manage Service Principal, which opens a new tab to the Azure Portal. Copy the Display name.
  - In the Azure Portal go to Resource groups - either by selecting it from the menu or by searching for it at the top.
  - Open the resource group you defined in your configuration and just created.
  - Click Access control (IAM) and then Add role assignment in Grant access to this resource.
  - Select Owner and click Next.
  - Click + Select members.
  - Paste the service principal name you copied in the former step into the search field.
  - Select the name in the list and click Select.
  - Click Next and then Review + assign.
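If you prefer the command line over the portal for the role assignment, the same Owner grant can be sketched with the Azure CLI. This is an illustration rather than something the repository provides: grant_owner is a hypothetical helper, all IDs are placeholders, and it assumes you use the service principal's application (client) ID or object ID instead of its display name.

```shell
# Hypothetical helper: assign the Owner role on a resource group to a service principal.
# With DRY_RUN=1 the az command is printed instead of executed.
grant_owner() {
    sp_id=$1; subscription_id=$2; resource_group=$3
    scope="/subscriptions/${subscription_id}/resourceGroups/${resource_group}"
    set -- az role assignment create --assignee "$sp_id" --role Owner --scope "$scope"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder values; print the command without running it.
DRY_RUN=1
grant_owner "00000000-0000-0000-0000-000000000000" "<subscription-id>" "<resource group>"
```

Unset DRY_RUN (and log in with az login) to actually create the assignment.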
We do now have our resource group created and a pipeline ready!
Let's deploy the infrastructure and the platform:
- In Azure DevOps click Pipelines in the menu to the left to bring up your pipelines.
- The list should at this point only contain one pipeline; click that one.
- Click Run pipeline in the top right.
- Select base as Deploy and click Run.
This will deploy:
- Azure resources such as networks, Kubernetes cluster (AKS), key vault, databases etc.
- Cert-manager to automatically retrieve a valid TLS certificate from letsencrypt.
- Build and push oasis server and worker images from OasisPlatform/platform-2.0.
- Oasis Helm charts for the platform.
This initial deployment may take up to 50 minutes (creating Redis alone takes about 25-40 minutes).
You can follow the progress by opening the job and viewing the Deploy task output. The deployment of Azure resources can be monitored in more detail in the Azure Portal under Deployments in your resource group.
At the end it prints a summary of resource names and URLs. It might take a few minutes more before those URLs are accessible due to the time kubernetes needs to initialize the Oasis platform.
The front will be available when all pods are Running. You can check this from your local machine (if you have the Azure CLI and kubectl installed):
# Update kubectl context
./deploy.sh update-kubectl
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
broker-5754b57b78-kh8ql 1/1 Running 0 4m7s
celery-beat-696ffc58f5-4c5zr 1/1 Running 0 4m7s
celery-db-64fdbc8447-p9279 1/1 Running 0 4m7s
channel-layer-66545898f7-ccqtq 1/1 Running 0 4m7s
keycloak-6f84587755-pclr2 1/1 Running 0 4m7s
keycloak-db-5cdd89bb8-5xj56 1/1 Running 0 4m7s
oasis-server-5956dbc659-spcdd 1/1 Running 0 4m7s
oasis-task-controller-548c74597b-jxlnd 1/1 Running 0 4m7s
oasis-ui-749447fb84-4sgmt 1/1 Running 0 4m7s
oasis-worker-controller-6675bc8c6f-zwhjq 1/1 Running 0 4m7s
oasis-worker-monitor-57ff989d5-j9k6b 1/1 Running 0 4m6s
platform-ingress-nginx-controller-96d6f4c68-84jh6 1/1 Running 0 4m7s
server-db-6c6f464787-8gtzl 1/1 Running 0 4m7s
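The manual check above can also be scripted. The sketch below assumes the default kubectl get pods column layout shown in the listing (STATUS in the third column); all_pods_running is a hypothetical helper, not part of deploy.sh.

```shell
# Read `kubectl get pods` output on stdin, list pods whose STATUS is not Running
# and exit non-zero if any were found. Column 3 is STATUS in the default output.
all_pods_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 " is " $3; bad = 1 } END { exit bad }'
}

# Usage against a live cluster (requires kubectl and a valid context):
#   kubectl get pods | all_pods_running && echo "All pods are Running"
```

Note that pods belonging to finished jobs would be reported as not Running, so treat the result as a hint rather than a strict health check.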
Try to open the front URL from the summary and you should see the Oasis Web Front.
The Oasis platform is deployed to Azure!
Can't access the front? Check out the Troubleshooting section for common errors.
At this point you can either deploy the PiWind model or your own to test the platform.
The deployment option piwind
lets you install the PiWind model, upload the data for it to run and have some analyses
created ready to be run.
- Click Run pipeline.
- Select piwind as deployment.
- Click Run.
You are now ready to run your first analysis!
Three steps are required to install your own model:
- Upload model data to Azure (Azure Files).
- Upload docker image to Azure (ACR).
- Update settings/helm/models-values.yaml.
The platform has a specific file share for model data and requires a strict structure to automatically be found by worker pods:
<supplier>/<name>/<version>/model_settings.json
The file model_settings.json is the only required file, but you most likely want to put key/model data here as well.
All files in this location will be mounted in the worker container on the path defined later in the helm chart values.
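To make the required layout concrete, here is a small sketch that builds the expected path and checks a local directory against it before uploading. Both helpers (model_settings_path and has_model_layout) are hypothetical and only illustrate the supplier/name/version structure described above.

```shell
# Build the expected file-share path for a model's settings file.
model_settings_path() {
    echo "$1/$2/$3/model_settings.json"
}

# Succeed if local directory $1 already follows the <supplier>/<name>/<version>
# layout for the given model, i.e. the required model_settings.json is in place.
has_model_layout() {
    [ -f "$1/$(model_settings_path "$2" "$3" "$4")" ]
}

model_settings_path OasisLMF PiWind 1   # prints OasisLMF/PiWind/1/model_settings.json
```

A call such as has_model_layout ./upload OasisLMF PiWind 1 can then guard an upload step, where ./upload is whatever local staging directory you use.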
You can upload your files directly in the Azure Portal, with the script in this repository, or with the Azure CLI (if your IP is in the CIDR range).
The ./scripts/upload_model_data.sh script can be used to upload model data to the file share in Azure. It works even if your IP is not whitelisted, because it uploads the files through the Kubernetes cluster.
./scripts/upload_model_data.sh -C ../OasisPiWind/ OasisLMF/PiWind/1 meta-data/model_settings.json oasislmf.json model_data/ keys_data/
It will overwrite files but not remove any files.
The Azure CLI can be used to upload files to the file share but requires your IP to be whitelisted in the CIDR range.
# Set name of key vault. You can look up the name in the Azure Portal,
# or alternatively use the Azure CLI as below:
KEY_VAULT_NAME=$(az keyvault list --resource-group <YOUR_RESOURCE_GROUP_NAME> | jq -r '.[0].name')
# Get name and key for storage account
OASIS_FS_ACCOUNT_NAME="$(az keyvault secret show --vault-name "$KEY_VAULT_NAME" --name oasisfs-name --query "value" -o tsv)"
OASIS_FS_ACCOUNT_KEY="$(az keyvault secret show --vault-name "$KEY_VAULT_NAME" --name oasisfs-key --query "value" -o tsv)"
# Create your directories for your model
az storage directory create --account-name "$OASIS_FS_ACCOUNT_NAME" --account-key "$OASIS_FS_ACCOUNT_KEY" --share-name models -n "OasisLMF"
az storage directory create --account-name "$OASIS_FS_ACCOUNT_NAME" --account-key "$OASIS_FS_ACCOUNT_KEY" --share-name models -n "OasisLMF/PiWind"
az storage directory create --account-name "$OASIS_FS_ACCOUNT_NAME" --account-key "$OASIS_FS_ACCOUNT_KEY" --share-name models -n "OasisLMF/PiWind/1"
# Upload your model_settings.json file
az storage file upload --account-name "$OASIS_FS_ACCOUNT_NAME" --account-key "$OASIS_FS_ACCOUNT_KEY" --share-name models \
--source model_settings.json --path "OasisLMF/PiWind/1/model_settings.json"
Check out az storage file upload-batch for uploading directories.
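As a rough illustration of a directory upload, the sketch below wraps az storage file upload-batch. upload_model_dir is a hypothetical helper; it assumes OASIS_FS_ACCOUNT_NAME and OASIS_FS_ACCOUNT_KEY are set as in the commands above and that the share is named models. With DRY_RUN=1 it only prints the command (without credentials).

```shell
# Hypothetical helper: upload a local directory to the "models" file share.
# Expects OASIS_FS_ACCOUNT_NAME and OASIS_FS_ACCOUNT_KEY to be set beforehand.
# With DRY_RUN=1 the command is printed (without the key) instead of executed.
upload_model_dir() {
    source_dir=$1; dest_path=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "az storage file upload-batch --destination models --source $source_dir --destination-path $dest_path"
        return 0
    fi
    az storage file upload-batch \
        --account-name "$OASIS_FS_ACCOUNT_NAME" --account-key "$OASIS_FS_ACCOUNT_KEY" \
        --destination models --source "$source_dir" --destination-path "$dest_path"
}

# Print what would be run for the PiWind model data directory.
DRY_RUN=1
upload_model_dir model_data "OasisLMF/PiWind/1/model_data"
```

As with single-file uploads, this only works when your IP is inside the whitelisted CIDR range.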
Build your image and upload it directly to ACR:
# Set your ACR
ACR=$(./deploy.sh get-acr)
# Login
az acr login --name $ACR
# Tag your images - the container registry path is entirely up to you
docker tag myimage:v1 "${ACR}/workers/myimage:v1"
# And push
docker push "${ACR}/workers/myimage:v1"
The last step is to add the model to settings/helm/models-values.yaml, which keeps the list of all models we would like to have deployed to our environment.
Modify the existing PiWind model or add a new one to the workers: section. When ready, commit and push your changes to your repository.
Open your web browser, go to Azure DevOps and run your pipeline with models as Deploy. This will register your model with Oasis.
The pipeline supports a set of deployment types. Some of them, like resource-groups and base, probably only need to be run once, but others should be run whenever a setting or model needs to be updated.
Deploy | Description |
---|---|
resource-groups | Creates a resource group to store all Oasis resources within |
base | Installs the infrastructure (Azure resources), cert-manager, oasis, grafana etc. and builds and pushes images to ACR. |
piwind | Uploads model data, registers the model and creates test analyses ready to run. |
models | Installs/uninstalls models based on the models defined in settings/helm/models-values.yaml |
azure | Updates Azure resources and settings. |
oasis | Updates Oasis chart settings. |
images | Builds latest images from OasisPlatform/platform-2.0 repository and uploads them to the ACR. |
Let's try to run something - the first analysis:
# List all analyses
./deploy.sh api ls
# Run one of them
./deploy.sh api run 1
deploy.sh api is just a wrapper for the OasisPlatform/kubernetes/scripts/api/api.sh script. The run should finish with a RUN_COMPLETED within a few minutes.
Prometheus, Alert manager and Grafana are still available as monitoring tools, but Azure also gives us some additional ones based on metrics from the cluster and storage for log files.
You can view an overview of load, CPU usage, memory usage etc. by opening Monitoring Insights on the Kubernetes service:
- Open Azure Portal
- Search / go to
Kubernetes services
- Open your cluster
- Select Insights in the menu under Monitoring
The first view gives you a cluster overview, but you can narrow it down by Nodes, Pods and Containers by selecting another tab.
Azure supports two ways to investigate logs:
- Tail live logs.
- Query logs.
Live logs give you no history, only new entries logged by the container from the time you start viewing the live logs:
- In Kubernetes service for your cluster, select Workloads from the menu under Kubernetes resources.
- Find your pod from either the Deployments or Pods tab.
- On the pod page, select Live logs from the menu.
You can view historical logs by following the link Looking for historical logs? View in Log Analytics from the Live logs page. This will open up the log query tool and show you recent log entries for this pod.
You can also open the log query tool by selecting Logs from the menu under Monitoring on your Kubernetes service cluster page. This will bring up the log query tool with some examples to choose from as good starting points before building your own. As with any cloud service, you won't be able to view historical logs in plain text mode. All log lines are split and displayed separately. This is bad for reading a section of the log, but great for finding and analysing log entries.
The cluster will try to automatically recover from failures, but alerts can be used to inform you when this is not possible. Many alert conditions are available, such as CPU, memory, cluster health etc.
Alerts can be created from the Alert page on your Kubernetes Service page.
The number of workers started is controlled by the oasis-worker-controller and is based on 3 configurations:
The model's chunking_configuration and scaling_configuration control the number of chunks to split the work into and how the oasis-worker-controller should scale the number of workers for each model. A few alternatives are supported, from creating a fixed number of workers as soon as the model is needed to dynamically creating workers depending on the number of chunks available.
These settings can be set either manually through the API or automatically by creating JSON files next to the model_settings.json file. More details can be found in the OasisPlatform repository.
This configuration is found in the settings/helm/models-values.yaml file and controls the way Kubernetes schedules worker pods to nodes. By setting the nodeAffinity and podAntiAffinity you control on which nodes to schedule worker pods and whether a node supports multiple workers or a single one. When a new worker pod is created and no node is available, Azure will create a new one.
The default setting only allows workers to be scheduled to nodes with the oasislmf/node-type=worker label and limits the number of workers to one per node.
More details about pod scheduling can be found in the Kubernetes documentation.
This configuration controls the hard limit on the number of worker nodes that can be created. Even if a worker requests 100 nodes, the cluster will never create more nodes than the number specified by workerNodesMaxCount in settings/azure/parameters.json.
The default node limit is 2.
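The effect of the hard limit can be illustrated with a few lines of shell arithmetic. capped_worker_nodes is a hypothetical helper that only mirrors the rule described above (the node count never exceeds workerNodesMaxCount); it is not how the cluster autoscaler is implemented.

```shell
# Number of worker nodes that can actually be created:
# the smaller of the requested count and workerNodesMaxCount.
capped_worker_nodes() {
    requested=$1; max_count=$2
    if [ "$requested" -gt "$max_count" ]; then
        echo "$max_count"
    else
        echo "$requested"
    fi
}

capped_worker_nodes 100 2   # prints 2
capped_worker_nodes 1 2     # prints 1
```

With the default of one worker per node, the same number is also the upper bound on concurrently running workers.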
List resource groups and their status:
az group list --query '[].{name:name, state:properties.provisioningState}'
If you just want to delete one resource group:
az group delete --no-wait -yn <resource group>
# One additional resource group is also created to group AKS resources
az group delete --no-wait -yn <resource group>-aks
This will do a "soft-delete" on the key vault. To remove it permanently:
az keyvault list-deleted --query '[].name' -o tsv
az keyvault purge --name <key-vault-name>
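# Delete all resource groups tagged oasis-enterprise=True
az group list --tag oasis-enterprise=True --query '[].name' -o tsv | xargs -otl az group delete --no-wait -yn
# Delete ALL resource groups in the subscription - use with care
az group list --query '[].name' -o tsv | xargs -otl az group delete --no-wait -yn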
If you don't want to use a DevOps pipeline you can always use the deploy.sh script to deploy locally, or use it in another pipeline.
First make sure you are logged in with Azure CLI:
az login
Then run the deploy.sh script:
# Create the group to place our resources in
./deploy.sh resource-group
# Create the platform
./deploy.sh base
The argument base will:
- Create necessary azure resources
- Build and push oasis/worker images
- Install cert-manager to automatically retrieve a valid certificate for TLS
- Install Oasis platform
- Write a summary of resource names and service urls
The summary lists the URL of each service as well as the kubectl command to update your cluster context to point to the AKS cluster.
Make sure all pods are started:
# Update your kubectl context to point to your new AKS cluster:
./deploy.sh update-kubectl
# List all pods and their status
kubectl get pods
Verify that our certificate is ready:
kubectl get certificate
If the certificate isn't ready within a few minutes, please read the Troubleshooting section to investigate the issue.
When the certificate is ready you should be able to access the front by pointing your web browser to the "Front" url listed in the deployment output.
You do now have a platform running in Azure, but without any models.
Models are installed in the same way as through the pipeline, but instead of running the pipeline you run deploy.sh with either piwind or models.
To install PiWind with one portfolio and some analyses run:
./deploy.sh piwind
To install your own model, read Manage your models and, instead of running the pipeline with models as Deploy, run the script:
./deploy.sh models
The platform will by default enforce TLS, encrypt data, generate database passwords etc., but there are still things to secure to make it production ready. The level depends on your organization's needs.
If passwords are set correctly the standard deployment will be well secured and also allow you to use and manage the platform easily through pipelines and scripts. But the bar can be raised if you want it to be even more secure.
One of the first things to change is the default passwords:
- Keycloak admin user - the master password for Keycloak. Change it in the Keycloak admin console (URL path /admin/master/console/)
- Oasis admin user - the default oasis user - change the password in Keycloak.
The default deployment includes Prometheus, Alert manager and Grafana. You can change the default password for Grafana, but Prometheus and Alert manager do not have a login at all. Since Azure covers most of the functionality anyway, we could uninstall Prometheus, Alert manager and Grafana unless there is a specific feature or better graph we want from them.
How to uninstall:
- Remove the monitoring deploy from the deploy.sh script:

      case "$deploy_type" in
      "base")
        $0 azure
        $0 db-init
        $0 images
        $0 cert-manager
        $0 oasis
        $0 monitoring   <- Remove this line
        $0 summary
        ;;
      ...

  This will make sure it isn't installed.
- If the monitoring tools are already installed, you can uninstall them by running helm uninstall monitoring
The platform has a few points facing the internet:
- HTTP/HTTPS load balancer
- Key vault
- Storage account
- Container registry
Even if they are secured with encryption, IP filtering and Azure services there is always an elevated risk of having them facing the internet. The platform can be completely isolated from the internet depending on your needs but requires some changes beyond configuration in that case.
It is also worth noting that even if Azure's key vault were compromised, an attacker would still need to access the storage and databases from either the Kubernetes network or an IP accepted by the CIDR settings.
The load balancer has two internet-facing services:
- HTTP - Open for all IPs. Does not provide any content or access to Oasis but is used to verify the domain and retrieve certificates from letsencrypt. Redirects all requests to HTTPS.
- HTTPS - Secured by a valid certificate issued by letsencrypt and filters IP sources based on the allowedCidrRanges setting in settings/azure/parameters.json.
Depending on your needs and your organization's limits and policies, you can keep it as it is, close down HTTP and provide your own certificate, or remove internet access completely and use private link and private DNS to route your company network to the Azure network.
The cluster (oasis, celery, rabbitmq etc.) and databases run isolated and are not accessible from the internet. Key vault, storage account and container registry are accessible from the internet but require credentials and an encrypted connection. The storage account is also IP filtered. The reason for not having a private link to these services is that deploy.sh and the Azure CLI wouldn't be able to access and set up the resources on deployment. This could be changed after the deployment, but doing so will disable things like generating passwords and deploying new oasis images. That might not be required later, depending on how you would like to maintain your system.
Some changes are required to enable private endpoints:
- Call the private_endpoint.bicep module in key_vault.bicep, registry.bicep and storage_account.bicep. Disabling public access to these services will isolate them from the internet.
- Replace get_or_generate_secret function calls with get_secret in deploy.sh.
Make sure to verify your CIDR configuration and only accept sources you trust.
The pipeline in Azure DevOps has a time limit of 60 minutes and will be stopped if it takes longer than that.
##[error]The job running on agent Hosted Agent ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134
If this happens it is most likely due to either:
- The creation of Redis takes too long.
- Azure gets stuck on one of the deployments.
You will find more details if you go to Deployments in your resource group. Here you will find the list of deployments and their statuses.
Even if the pipeline fails due to the time limit, the deployment may still finish successfully. If no deployment has failed, leave it for a while and see if it finishes successfully. If not, try to cancel it and run the pipeline again.
You can run the pipeline again once the deployment has finished or been cancelled. It will verify the Azure deployment and continue with the next step of setting up the environment.
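The deployment statuses can also be listed from the command line instead of the portal. The sketch below assumes the output of az deployment group list in tab-separated form (name and provisioning state per line); unfinished_deployments is a hypothetical filter that keeps everything that has not succeeded.

```shell
# Read "name<TAB>state" lines on stdin and print every deployment that has not succeeded.
unfinished_deployments() {
    awk -F '\t' '$2 != "Succeeded" { print $1 ": " $2 }'
}

# Usage (requires the Azure CLI and a logged-in session):
#   az deployment group list --resource-group "<resource group>" \
#       --query '[].[name, properties.provisioningState]' -o tsv | unfinished_deployments
```

An empty result suggests all deployments in the resource group have finished and the pipeline can be run again.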
Try to recreate the ingress service to request the domain name again:
kubectl delete service platform-ingress-nginx-controller
And then deploy oasis.
Make sure the UI pod is running:
kubectl get pod -l app=oasis-ui
NAME READY STATUS RESTARTS AGE
oasis-ui-749447fb84-4sgmt 1/1 Running 0 5m
Also make sure to check your CIDR range configuration in settings/azure/parameters.json and verify your IP is set correctly.
Check the certificate status:
kubectl get certificate
The READY column should be True for oasis-ingress-tls. If it is still False after a minute or so, try to delete it and retrieve a new certificate:
kubectl delete certificate oasis-ingress-tls
Wait another 30 seconds. If it is still False, read the troubleshooting guide to investigate the request.
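Instead of re-running kubectl get certificate by hand, the waiting can be scripted with a small retry loop. This is a sketch under assumptions: wait_for and certificate_ready are hypothetical helpers, and certificate_ready relies on cert-manager exposing a Ready condition on the oasis-ingress-tls certificate.

```shell
# Retry a command until it succeeds: wait_for <attempts> <delay-seconds> <command...>
wait_for() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Succeeds when the certificate reports Ready=True (requires kubectl and cert-manager).
certificate_ready() {
    status=$(kubectl get certificate oasis-ingress-tls \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
    [ "$status" = "True" ]
}

# Usage: check every 30 seconds, up to 10 times.
#   wait_for 10 30 certificate_ready || echo "Certificate still not ready" >&2
```

The same wait_for helper can wrap any other check, for example a pod status test.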
Go to the Azure Portal -> Resource Groups -> <your group> -> Deployments to look for details about the deployment process.
Please note that the Redis deployment can take up to 40 minutes; according to Microsoft this is expected.
I have encountered errors in the AKS deployment such as Gateway timeout and Internal server error. I suspect this is some error on Azure's side and not with the bicep script itself. Try changing location if you can; that usually solves the issue.
AKS will create its own resource group. More details can be found here and here.
The deploy.sh script will create users with generated passwords, one for each database. Bicep/Azure does not support this at deployment.
Celery is still using RabbitMQ, running in the Kubernetes cluster, as the broker. Celery has support (through kombu) for Azure Service Bus, but it is not very well reviewed/tested and has known issues. Due to this and the low volume of data sent over Celery, RabbitMQ is kept for now.
More information can be found here (the Celery project site is down at the time of writing and has been for the last 2 weeks).
There are, however, RabbitMQ products available on Azure Marketplace that might be a good option, but as long as RabbitMQ runs fine in our cluster, I can't justify the additional complexity and cost of using them.