skip gpu driver installation (#91)
Johannesm299 authored Dec 19, 2023
1 parent 0bd27ef commit 5e4cb9f
Showing 3 changed files with 8 additions and 4 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -135,9 +135,13 @@ List with description of all mandatory and optional variables could be find in t
 It is recommended to restrict the access to the Kubernetes API server using authorized IP address ranges by setting the variable `apiServerAuthorizedIpRanges`.
 It is recommended to restrict the access to the Key Vault using authorized IP address ranges by setting the variable `keyVaultAuthorizedIpRanges`.

+## GPU Usage
+
+If you use AURELION with SIMPHERA then the AURELION Pods are executed in the GPU node pool. AURELION uses a specific OptiX version and thus needs specific NVIDIA drivers. NVIDIA provides the [gpu-operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html), a tool that makes it possible to use containerized drivers inside pods. This allows the required driver versions to be used independently of the default NVIDIA driver installation on the GPU node pool. That default installation can only be skipped; selecting a specific version is not possible. [Further information](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster#use-nvidia-gpu-operator-with-aks)
+
 ### Scale Down Mode

-If you use AURELION with SIMPHERA then the AURELION Pods are executed in the GPU node pool. Typically, you have autoscaling enabled for that pool so that VMs are scaled down if they are no longer needed. However, the AURELION container image is big and it takes time to download the image to the Kubernetes node. Depending on your location this can take more than 30 minutes. To shorten these times the _Scale Down Mode_ of the GPU node pool should be set to _Deallocate_. That means, that a GPU VM is not _deleted_ but only _deallocated_. So you no longer have to pay for the compute resources but only for the disk that will not be deleted when using this mode.
+Typically, you have autoscaling enabled for the GPU node pool so that VMs are scaled down if they are no longer needed. However, the AURELION container image is big and it takes time to download the image to the Kubernetes node. Depending on your location this can take more than 30 minutes. To shorten these times the _Scale Down Mode_ of the GPU node pool should be set to _Deallocate_. That means, that a GPU VM is not _deleted_ but only _deallocated_. So you no longer have to pay for the compute resources but only for the disk that will not be deleted when using this mode.

 You can enable and disable this mode using the variables `linuxExecutionNodeDeallocate` and `gpuNodeDeallocate`. That means, you can not only configure this for the GPU node pool but also for the Execution node pool. As a default _Deallocate_ is used for both node pools.

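Note: the README text in this hunk says the gpu-operator supplies containerized drivers once the default driver installation is skipped, but this commit does not show how the operator itself is deployed. As a rough sketch only, assuming the Terraform Helm provider is already configured against the cluster, it might be installed as below. The resource name `gpu_operator`, the namespace, and the lack of a pinned chart version are illustrative assumptions and are not taken from this repository.

```hcl
# Hypothetical sketch: deploy NVIDIA's gpu-operator chart with the Helm provider.
# Nothing in this block comes from the commit; names and namespace are assumptions.
resource "helm_release" "gpu_operator" {
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia" # NVIDIA's public Helm repository
  chart            = "gpu-operator"
  namespace        = "gpu-operator"
  create_namespace = true
}
```

Because the GPU node pool carries the `purpose=gpu:NoSchedule` taint, the operator's workloads would presumably also need a matching toleration, which this sketch leaves out.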
2 changes: 1 addition & 1 deletion modules/simphera_base/k8s.tf
@@ -173,7 +173,7 @@ resource "azurerm_kubernetes_cluster_node_pool" "gpu-execution-nodes" {
     "purpose=gpu:NoSchedule"
   ]

-  tags = var.tags
+  tags = merge(var.tags, { SkipGPUDriverInstall = "true" })

   lifecycle {
     ignore_changes = [
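Note: the changed line relies on Terraform's built-in `merge` function, which returns a single map containing the keys of all its arguments; if a key appears in more than one argument, the later argument wins. A minimal standalone sketch of that behavior follows. The local names and the sample `environment` tag are hypothetical and only stand in for whatever `var.tags` holds in this module.

```hcl
# Hypothetical sketch of how merge() combines the existing tags with the new one.
locals {
  # Stand-in for var.tags; the real value is supplied by the module's caller.
  base_tags = {
    environment = "dev"
  }

  # Result contains both "environment" and "SkipGPUDriverInstall".
  # If base_tags already defined SkipGPUDriverInstall, the value given here would override it.
  gpu_tags = merge(local.base_tags, { SkipGPUDriverInstall = "true" })
}

output "gpu_tags" {
  value = local.gpu_tags
}
```

With this pattern the GPU node pool keeps every tag passed in by the caller and gains the extra tag, while the other node pools are left unchanged.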
4 changes: 2 additions & 2 deletions modules/simphera_base/modules/simphera_instance/postgresql.tf
@@ -74,14 +74,14 @@ resource "azurerm_postgresql_flexible_server_database" "keycloak" {
   name = "keycloak"
   server_id = azurerm_postgresql_flexible_server.postgresql-flexible.id
   charset = "UTF8"
-  collation = "en_US.UTF8"
+  collation = "en_US.utf8"
 }

 resource "azurerm_postgresql_flexible_server_database" "simphera" {
   name = "simphera"
   server_id = azurerm_postgresql_flexible_server.postgresql-flexible.id
   charset = "UTF8"
-  collation = "en_US.UTF8"
+  collation = "en_US.utf8"
 }

 resource "azurerm_postgresql_flexible_server_configuration" "pgcrypto" {