…890)
## Changes
Due to the eventually consistent nature of worker environment creation
as a part of workspace setup, certain API requests can fail when made
right after workspace creation. We have some exception handling for
these in the SDKs, but a new case has appeared: "worker env
WorkerEnvId(workerenv-XXXXX) not found" (see
databricks/terraform-provider-databricks#3452).
This PR addresses this issue. Furthermore, it moves the transient error
messages into autogeneration so that there is a single source of truth
that applies to all SDKs.
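To make the pattern concrete, here is a minimal, hypothetical Go sketch of the general idea: treat an error as transient when its message matches one of a generated list of substrings, and retry the operation until it succeeds or a deadline passes. All identifiers below (`transientErrorSubstrings`, `isTransient`, `retryTransient`) are made up for illustration and are not the actual databricks-sdk-go API.

```go
// Hypothetical sketch only: illustrates retrying on known transient error
// messages after workspace creation, not the real SDK implementation.
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// transientErrorSubstrings stands in for a generated, single-source-of-truth
// list of messages that indicate the workspace is not yet fully provisioned.
var transientErrorSubstrings = []string{
	"worker env", // e.g. "worker env WorkerEnvId(workerenv-XXXXX) not found"
}

// isTransient reports whether err looks like a transient provisioning error.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	for _, s := range transientErrorSubstrings {
		if strings.Contains(err.Error(), s) {
			return true
		}
	}
	return false
}

// retryTransient retries op while it returns a transient error, up to timeout.
func retryTransient(op func() error, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := op()
		if err == nil || !isTransient(err) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("still failing after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	err := retryTransient(func() error {
		attempts++
		if attempts < 3 {
			// Simulate the error seen right after workspace creation.
			return errors.New("worker env WorkerEnvId(workerenv-XXXXX) not found")
		}
		return nil // workspace information has propagated; the call succeeds
	}, time.Minute, 2*time.Second)
	fmt.Println(attempts, err)
}
```

Keeping the list of transient message substrings in generated code is what gives all SDKs a single source of truth for which errors should be retried.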
One other small change: I removed the unit test run from the
autogeneration flow. This removes a small amount of convenience (users
need to run `make test` after regenerating the SDK), but it speeds up
the dev loop when iterating on code generation, and it allows the release
flow to continue and open a PR before failing. That PR can then be modified
as needed to fix any test or compilation failures, as usual.
## Tests
Added a unit test to cover this.
- [ ] `make test` passing
- [ ] `make fmt` applied
- [ ] relevant integration tests applied
Configuration
# Copy-paste your Terraform configuration here
module "workspace" {
source = "./modules/workspace"
databricks_account_id = var.databricks_account_id
region = var.region
}
module "catalogs_binding" {
source = "./modules/catalog_bin"
workspace_id = module.workspace.workspace_id
depends_on = [ module.workspace ]
}
module "proxy" {
source = "./modules/proxy_cluster"
depends_on = [ module.workspace ]
}
and the cluster resource within `proxy_cluster` is:
resource "databricks_cluster" "git_proxy" {
autotermination_minutes = 0
aws_attributes {
ebs_volume_count = 1
ebs_volume_size = 32
first_on_demand = 1
}
cluster_name = var.git_proxy_name
custom_tags = {
"ResourceClass" = "SingleNode"
}
provider = databricks.workspace
spark_version = data.databricks_spark_version.latest_lts.id
node_type_id = data.databricks_node_type.smallest.id
num_workers = 0
spark_conf = {
"spark.databricks.cluster.profile" : "singleNode",
"spark.master" : "local[*]",
}
spark_env_vars = {
"GIT_PROXY_ENABLE_SSL_VERIFICATION" : "False"
"GIT_PROXY_HTTP_PROXY" : "git_URL"
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
Expected Behavior
The cluster should be created, or a more verbose error should be displayed to explain that even though the API reports the workspace as RUNNING, it is not yet fully operational.
Actual Behavior
The workspace was reported as running:
2024-04-10T19:35:44.640-0600 [DEBUG] provider.terraform-provider-databricks_v1.39.0: GET /api/2.0/accounts/XXX/workspaces/XXX
< HTTP/2.0 200 OK
< {
...
< "workspace_status": "RUNNING",
but the worker environment is not recognized, so cluster creation fails:
2024-04-10T19:36:17.232-0600 [ERROR] provider.terraform-provider-databricks_v1.39.0: Response contains error diagnostic: @module=sdk.proto diagnostic_detail="" diagnostic_severity=ERROR tf_proto_version=5.4 tf_req_id=XXX tf_rpc=ApplyResourceChange @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 diagnostic_summary="cannot create cluster: XXX is not able to transition from TERMINATED to RUNNING: worker env WorkerEnvId(workerenv-XXXXX) not found
*Note that I replaced the actual values in the above logs with XXXXX.
Steps to Reproduce
Terraform and provider versions
1.39
Is it a regression?
Debug Output
Important Factoids
Would you like to implement a fix?
The dependency flow below works around this behavior. It seems this small additional delay is enough for the workspace information to fully propagate:
module "proxy" {
source = "./modules/proxy_cluster"
depends_on = [ module.workspace, module.catalogs_binding ] # fixes issue
}