[ISSUE] Clusters persist and are not added to state after deployment fails due to provider issue #383

Closed
jancschaefer opened this issue Oct 23, 2020 · 4 comments · Fixed by #400

jancschaefer commented Oct 23, 2020

Terraform Version

  • 0.13.3
  • 0.13.5

Affected Resource(s)

  • databricks_cluster

Environment variable names

  • ARM_ACCESS_KEY
  • ARM_CLIENT_ID
  • ARM_CLIENT_SECRET
  • ARM_SUBSCRIPTION_ID
  • ARM_TENANT_ID

Terraform Configuration Files

resource "databricks_cluster" "this" {
  cluster_name            = "name"
  spark_version           = var.spark_version
  node_type_id            = var.node_type_id 
  autotermination_minutes = var.autotermination_minutes
  autoscale {
    min_workers = var.min_workers
    max_workers = var.max_workers
  }
}

Panic Output

Error: 1021-125542-novas905 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource 'ecb71b95f3004464b5f9b6d7ebde996d_OsDisk_1_1ba1df2e8bbf48709f3e3e.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Error: 1021-125542-using919 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource '27b239d8d75b43e985391ed241464c32_OsDisk_1_faae7e36850242c7a25f32.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Error: 1021-125542-top912 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource '7cecad0812ef4a95ab374aaa0a950351_disk1_f7db33e0703144dea9dca0c63.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Expected Behavior

Either:

  • the Databricks cluster is created and added to state (despite not being able to start), or
  • the deployment is rolled back and the cluster is not visible in the workspace

Actual Behavior

The Databricks clusters were created and are visible in the workspace, but were not added to state because they failed to start up due to an Azure Policy issue. Once the policy was fixed and terraform apply succeeded, we ended up with duplicate clusters with the same name.

Steps to Reproduce

  1. terraform apply

Important Factoids

  • We are automatically deploying Databricks workspaces, clusters and notebooks
  • In one of our subscriptions, an Azure Policy prevented Databricks from provisioning disks ("Microsoft.Compute/disks")
  • When trying to start the clusters, Databricks failed because of this policy

Azure policy:

{
    "if": {
        "not": {
            "field": "type",
            "in": [
                "Microsoft.KeyVault/vaults",
                "Microsoft.Databricks/workspaces",
                "Microsoft.Databricks/storageAccounts",
                "Microsoft.DataFactory/factories",
                "Microsoft.Compute/*",
                "Microsoft.Compute",
                "Microsoft.Compute/disks",
                "Microsoft.AlertsManagement/actionRules",
                "Microsoft.AlertsManagement/register/action"
            ]
        }
    },
    "then": {
        "effect": "deny"
    }
}
nfx (Contributor) commented Oct 23, 2020

@jancschaefer what do you think should be the expected behavior here?

jancschaefer (Author) commented

I would either expect the clusters not to show up in Databricks if the deployment fails, or the clusters to nonetheless be added to the state so that the provider tries to start the same existing cluster the next time.

As it stands, if terraform apply fails twice and then succeeds, I end up with three clusters, although only one was defined in Terraform.

nfx (Contributor) commented Oct 23, 2020

Okay, then the cluster should be deleted if it cannot be started.

A change has to be added to ClustersAPI#Create to remove a cluster that was just created if it was not able to start, and to return the previous error message for context (see the sketch below).

This might qualify as a behavior change and might be put on hold until 0.3.
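
For illustration, a minimal sketch of that cleanup logic in Go, assuming hypothetical helper names (createCluster, waitForRunning, permanentDelete) rather than the provider's exact ClustersAPI methods:

package clusters

import "fmt"

// Hypothetical stand-ins for the real client calls; names and signatures
// are assumptions for illustration, not the provider's actual API.
var (
	createCluster   func(spec map[string]interface{}) (string, error) // returns cluster ID
	waitForRunning  func(clusterID string) error
	permanentDelete func(clusterID string) error
)

// createAndStart creates a cluster and, if it never reaches RUNNING,
// deletes it again so no orphan is left in the workspace, returning the
// original error for context.
func createAndStart(spec map[string]interface{}) (string, error) {
	clusterID, err := createCluster(spec)
	if err != nil {
		return "", err
	}
	if err := waitForRunning(clusterID); err != nil {
		if delErr := permanentDelete(clusterID); delErr != nil {
			return "", fmt.Errorf("cluster %s failed to start (%v) and could not be deleted: %w", clusterID, err, delErr)
		}
		return "", fmt.Errorf("cluster %s failed to start and was deleted again: %w", clusterID, err)
	}
	return clusterID, nil
}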

stikkireddy (Contributor) commented

Hey, adding my 2 cents here: this is occurring because the cluster create waits for the cluster to be in a running state before we register the ID. Unfortunately, Terraform will not be able to taint the resource if it is not aware of the ID.

Typical behavior of Terraform is:

  1. Create the object in the remote API
  2. Set the ID
  3. If the create fails, or a read fails after the ID is set, the resource becomes tainted (delete and recreate on the next apply)

Another alternative would be to register the ID right after the /create call is made, before we wait for the cluster to be in a running state (see the sketch below).
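
A minimal sketch of that alternative, using the terraform-plugin-sdk resource pattern; createCluster and waitForRunning are hypothetical helpers standing in for the real client calls:

package clusters

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical stand-ins for the real client calls.
var (
	createCluster  func(name string) (string, error) // returns cluster ID
	waitForRunning func(clusterID string) error
)

func resourceClusterCreate(d *schema.ResourceData, m interface{}) error {
	clusterID, err := createCluster(d.Get("cluster_name").(string))
	if err != nil {
		return err
	}

	// Register the ID as soon as the remote object exists. If the wait
	// below fails, the resource is already in state and gets tainted,
	// instead of leaving an orphaned cluster in the workspace.
	d.SetId(clusterID)

	if err := waitForRunning(clusterID); err != nil {
		return fmt.Errorf("cluster %s did not reach RUNNING: %w", clusterID, err)
	}
	return nil
}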

nfx added the Small Size label Nov 3, 2020
nfx self-assigned this Nov 6, 2020
nfx linked a pull request Nov 6, 2020 that will close this issue
nfx added this to the v0.2.8 milestone Nov 6, 2020
nfx closed this as completed in #400 Nov 6, 2020
nfx added a commit that referenced this issue Nov 6, 2020
* Pre-release fixing
* Added NAT to BYOVPC terraform module
* added instance profile locks
* Added sync block for instance profiles integration tests
* Fix #383 Cleaning up clusters that fail to start
* Added log delivery use case docs
* Fix #382 - ignore changes to deployment_name
* Fix test and lints
* Fix #382 by ignoring incoming prefix for deployment_name for databricks_mws_workspaces
* Improve documentation to fix #368
* fix linting issues

Co-authored-by: Serge Smertin <serge.smertin@databricks.com>