[ISSUE] Clusters persist and are not added to state after deployment fails due to provider issue #383

Closed
jancschaefer opened this issue Oct 23, 2020 · 4 comments · Fixed by #400

jancschaefer commented Oct 23, 2020

Terraform Version

  • 0.13.3
  • 0.13.5

Affected Resource(s)

  • databricks_cluster

Environment variable names

  • ARM_ACCESS_KEY
  • ARM_CLIENT_ID
  • ARM_CLIENT_SECRET
  • ARM_SUBSCRIPTION_ID
  • ARM_TENANT_ID

Terraform Configuration Files

resource "databricks_cluster" "this" {
  cluster_name            = "name"
  spark_version           = var.spark_version
  node_type_id            = var.node_type_id 
  autotermination_minutes = var.autotermination_minutes
  autoscale {
    min_workers = var.min_workers
    max_workers = var.max_workers
  }
}

Panic Output

Error: 1021-125542-novas905 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource 'ecb71b95f3004464b5f9b6d7ebde996d_OsDisk_1_1ba1df2e8bbf48709f3e3e.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Error: 1021-125542-using919 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource '27b239d8d75b43e985391ed241464c32_OsDisk_1_faae7e36850242c7a25f32.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Error: 1021-125542-top912 is not able to transition from TERMINATED to RUNNING: Could not launch cluster due to cloud provider failures. azure_error_code: RequestDisallowedByPolicy, azure_error_message: Resource '7cecad0812ef4a95ab374aaa0a950351_disk1_f7db33e0703144dea9dca0c63.... Please see https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate for more details

Expected Behavior

Either:

  • the Databricks cluster is created and added to state (despite not being able to start), or
  • the deployment is rolled back and the cluster is not visible in the workspace

Actual Behavior

The Databricks clusters were created and are visible in the workspace, but were not added to state because they failed to start up due to an Azure Policy issue. Once the policy was fixed and terraform apply succeeded, we ended up with duplicate clusters with the same name.

Steps to Reproduce

  1. terraform apply

Important Factoids

  • We are automatically deploying Databricks workspaces, clusters and notebooks
  • In one of our subscriptions, an Azure Policy prevented Databricks from provisioning disks ("Microsoft.Compute/disks")
  • When trying to start the clusters, Databricks failed because of this policy

Azure policy:

{
    "if": {
        "not": {
            "field": "type",
            "in": [
                "Microsoft.KeyVault/vaults",
                "Microsoft.Databricks/workspaces",
                "Microsoft.Databricks/storageAccounts",
                "Microsoft.DataFactory/factories",
                "Microsoft.Compute/*",
                "Microsoft.Compute",
                "Microsoft.Compute/disks",
                "Microsoft.AlertsManagement/actionRules",
                "Microsoft.AlertsManagement/register/action"
            ]
        }
    },
    "then": {
        "effect": "deny"
    }
}
nfx (Contributor) commented Oct 23, 2020

@jancschaefer what do you think should be the expected behavior here?

jancschaefer (Author) commented

I would either expect the clusters not to show up in Databricks if the deployment fails, or the clusters to nonetheless be added to the state so that the provider tries to start the same existing cluster the next time.

As it stands, if terraform apply fails twice and then succeeds, I end up with three clusters, although only one was defined in Terraform.

nfx (Contributor) commented Oct 23, 2020

Okay, then the cluster should be deleted if it cannot be started.

A change has to be added to ClustersAPI#Create to remove a cluster that was just created if it was not able to start, and to return the previous error message for context (see the sketch below).

This might qualify as a behavior change and might be put on hold until 0.3.
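
For illustration, a minimal sketch of that cleanup logic in Go, assuming hypothetical helper names (createCluster, waitForRunning, permanentDelete) rather than the provider's exact ClustersAPI methods:

package clusters

import "fmt"

// Hypothetical stand-ins for the real client calls; names and signatures
// are assumptions for illustration, not the provider's actual API.
var (
	createCluster   func(spec map[string]interface{}) (string, error) // returns cluster ID
	waitForRunning  func(clusterID string) error
	permanentDelete func(clusterID string) error
)

// createAndStart creates a cluster and, if it never reaches RUNNING,
// deletes it again so no orphan is left in the workspace, returning the
// original error for context.
func createAndStart(spec map[string]interface{}) (string, error) {
	clusterID, err := createCluster(spec)
	if err != nil {
		return "", err
	}
	if err := waitForRunning(clusterID); err != nil {
		if delErr := permanentDelete(clusterID); delErr != nil {
			return "", fmt.Errorf("cluster %s failed to start (%v) and could not be deleted: %w", clusterID, err, delErr)
		}
		return "", fmt.Errorf("cluster %s failed to start and was deleted again: %w", clusterID, err)
	}
	return clusterID, nil
}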

stikkireddy (Contributor) commented

Hey, adding my 2 cents here: this is occurring because the cluster create waits for the cluster to be in a running state before we register the ID. Unfortunately, Terraform will not be able to taint the resource if it is not aware of the ID.

Typical behavior of Terraform is:

  1. Create the object in the remote API
  2. Set the ID
  3. If the create fails, or a read fails after the ID is set, the resource becomes tainted (delete and recreate on the next apply)

Another alternative would be to register the ID right after the /create call is made, before we wait for the cluster to be in a running state (see the sketch below).
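
A minimal sketch of that alternative, using the terraform-plugin-sdk resource pattern; createCluster and waitForRunning are hypothetical helpers standing in for the real client calls:

package clusters

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical stand-ins for the real client calls.
var (
	createCluster  func(name string) (string, error) // returns cluster ID
	waitForRunning func(clusterID string) error
)

func resourceClusterCreate(d *schema.ResourceData, m interface{}) error {
	clusterID, err := createCluster(d.Get("cluster_name").(string))
	if err != nil {
		return err
	}

	// Register the ID as soon as the remote object exists. If the wait
	// below fails, the resource is already in state and gets tainted,
	// instead of leaving an orphaned cluster in the workspace.
	d.SetId(clusterID)

	if err := waitForRunning(clusterID); err != nil {
		return fmt.Errorf("cluster %s did not reach RUNNING: %w", clusterID, err)
	}
	return nil
}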

nfx added the Small Size label Nov 3, 2020
nfx self-assigned this Nov 6, 2020
nfx linked a pull request Nov 6, 2020 that will close this issue
nfx added this to the v0.2.8 milestone Nov 6, 2020
nfx closed this as completed in #400 Nov 6, 2020
nfx added a commit that referenced this issue Nov 6, 2020
* Pre-release fixing
* Added NAT to BYOVPC terraform module
* added instance profile locks
* Added sync block for instance profiles integration tests
* Fix #383 Cleaning up clusters that fail to start
* Added log delivery use case docs
* Fix #382 - ignore changes to deployment_name
* Fix test and lints
* Fix #382 by ignoring incoming prefix for deployment_name for databricks_mws_workspaces
* Improve documentation to fix #368
* fix linting issues

Co-authored-by: Serge Smertin <serge.smertin@databricks.com>