
[Fix] Remove single-node validation from interactive clusters #4222

Merged
merged 2 commits into main from remove/validataion-single-node-interactive on Nov 14, 2024

Conversation

shreyas-goenka
Contributor

@shreyas-goenka shreyas-goenka commented Nov 13, 2024

Changes

There were reports on databricks/cli#1546 about customers being unable to use cluster policies for single-node job clusters because Terraform was performing this validation. #4216 removed this validation for job clusters. This PR removes the validation for interactive clusters as well, keeping the two consistent and unblocking the use of policies for interactive clusters.

Also, the clusters team is improving the API interface to simplify creating single-node clusters, significantly reducing the chances of users misconfiguring them.

Tests

Unit tests and manual testing. Clusters are now successfully created and updated with `num_workers = 0`.
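
For illustration, here is a minimal Terraform sketch of the kind of configuration this change unblocks. It is not taken from the linked reports; the policy name, Spark version, and node type below are placeholders, and the intent is that the policy, not the cluster definition, supplies the single-node settings.

```hcl
# Hypothetical policy that fixes the single-node spark_conf / custom_tags;
# the policy name below is a placeholder.
data "databricks_cluster_policy" "single_node" {
  name = "Single Node Policy"
}

# Before this change the provider rejected num_workers = 0 unless spark_conf
# and custom_tags were also set inline; now the policy can supply them.
resource "databricks_cluster" "single_node" {
  cluster_name            = "single-node"
  spark_version           = "15.4.x-scala2.12" # placeholder runtime version
  node_type_id            = "i3.xlarge"        # placeholder node type
  policy_id               = data.databricks_cluster_policy.single_node.id
  num_workers             = 0
  autotermination_minutes = 20
}
```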

@shreyas-goenka shreyas-goenka requested review from a team as code owners November 13, 2024 13:36
@shreyas-goenka shreyas-goenka requested review from parthban-db and removed request for a team November 13, 2024 13:36
@shreyas-goenka shreyas-goenka changed the title from "Remove single-node validation from interactive clusters" to "[Fix] Remove single-node validation from interactive clusters" on Nov 13, 2024

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/terraform

Inputs:

  • PR number: 4222
  • Commit SHA: 237bb17059cc730bae96327360320ae3db8559f6

Checks will be approved automatically on success.

@eng-dev-ecosystem-bot
Collaborator

Test Details: go/deco-tests/11818584690

@alexott alexott requested a review from tanmay-db November 13, 2024 15:16
@alexott
Contributor

alexott commented Nov 13, 2024

@tanmay-db what is your opinion, given that you're porting clusters to the plugin framework?

@pietern
Contributor

pietern commented Nov 13, 2024

@shreyas-goenka The PR you reference talks about job clusters. Am I missing the ref to interactive clusters?

Contributor

@alexott alexott left a comment


ok from the code perspective. Let's wait for Tanmay's opinion

@shreyas-goenka
Contributor Author

@pietern, you are right. The issue only discusses job clusters. There's no reference to interactive clusters, but now that we support interactive clusters via DABs, we should solve this issue here as well and keep the two consistent.

Updated the PR description to say as much.

@tanmay-db
Contributor

@tanmay-db what is your opinion, given that you're porting clusters to the plugin framework?

Hi @alexott, it looks good, I will update the plugin framework implementation to not have this validation.

@shreyas-goenka shreyas-goenka added this pull request to the merge queue Nov 14, 2024
@shreyas-goenka
Contributor Author

@tanmay-db We'll be adding this validation to DABs as a warning instead of an error. Maybe that's something we should do in the plugin framework as well, if supported.

Merged via the queue into main with commit 6e7ca4c Nov 14, 2024
14 checks passed
@shreyas-goenka shreyas-goenka deleted the remove/validataion-single-node-interactive branch November 14, 2024 12:02
hectorcast-db added a commit that referenced this pull request Nov 20, 2024
### New Features and Improvements

 * Add `databricks_mws_network_connectivity_config` and `databricks_mws_network_connectivity_configs` data source ([#3665](#3665)).
 * Add support for partitions in policy data sources ([#4181](#4181)).
 * Added `databricks_registered_model_versions` data source ([#4100](#4100)).
 * Update `databricks_permissions` resource to support vector-search-endpoints ([#4209](#4209)).
 * add `databricks_serving_endpoints` data source ([#4226](#4226)).

### Bug Fixes

 * Add validation for `run_as_mode` in `databricks_query` ([#4233](#4233)).
 * Correct handling of updates for empty comments and `force_destroy` in UC catalog, schema, registered models and volumes ([#4244](#4244)).
 * Fix deletion of dashboard if it was trashed out of band ([#4235](#4235)).
 * Fix waiting for `databricks_vector_search_index` readiness ([#4243](#4243)).
 * Remove single-node validation from interactive clusters ([#4222](#4222)).
 * Remove single-node validation from jobs clusters ([#4216](#4216)).
 * Use cluster list API to determine pinned cluster status ([#4203](#4203)).
 * Fix issue caused by setting `pause_status` in update monitor ([#4242](#4242)).

### Documentation

 * Clarify workspace provider config ([#4208](#4208)).
 * Update "Databricks Workspace Creator" permissions on gcp-workspace.md ([#4201](#4201)).
 * Update `grants.md` references ([#4246](#4246)).
 * Update description of `group_id` in `databricks_mws_ncc_private_endpoint_rule` ([#4238](#4238)).
 * remove subnet sharing limitation in AWS ([#4239](#4239)).

### Internal Changes

 * Bump Go SDK to latest and generate TF structs ([#4249](#4249)).
 * Mark TestUcAccModelServingProvisionedThroughput as flaky. to be rever… ([#4232](#4232)).
 * Rename resources directory to products in pluginframework ([#4139](#4139)).
 * Revert "mark TestUcAccModelServingProvisionedThroughput as flaky. to … ([#4240](#4240)).
 * Set user agent in some resources implemented in plugin framework ([#4187](#4187)).
 * make `ApplyAndExpectData` work with nested set ([#4237](#4237)).

### Dependency Updates

 * Bump dependencies for Plugin Framework and SDK v2 ([#4215](#4215)).
 * Bump github.com/hashicorp/hcl/v2 from 2.22.0 to 2.23.0 ([#4236](#4236)).
 * Bump github.com/hashicorp/terraform-plugin-testing from 1.10.0 to 1.11.0 ([#4247](#4247)).

### Exporter

 * Add `List` operation for `users` service ([#4204](#4204)).
 * Fix interactive selection of services ([#4245](#4245)).
hectorcast-db added a commit that referenced this pull request Nov 20, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
github-merge-queue bot pushed a commit to databricks/cli that referenced this pull request Nov 22, 2024
## Changes
This PR adds a warning that validates whether the configuration of a single-node
cluster is correct, for interactive, job, job-task, and pipeline
clusters.

Note: We skip the validation if a cluster policy is configured because
the policy is likely to configure `spark_conf` / `custom_tags` itself.

Note: Terraform originally only had this validation for interactive, job, and
job-task clusters. Adding the validation for pipeline clusters as well in
this PR is new.

This PR follows the same logic as we used to have in Terraform. The
validation was removed from Terraform because we had no way to demote
the error to a warning:
databricks/terraform-provider-databricks#4222

### Background
Single-node clusters require `spark_conf` and `custom_tags` to be
correctly set in the cluster definition for them to function optimally.
The cluster will be created even if incorrectly configured, but its
performance will not be great.

For example, if both `spark_conf` and `custom_tags` are not set and
`num_workers` is 0, then only the driver process will be launched on the
cluster's compute instance, leading to sub-optimal utilization of the
available compute resources and no parallelization across worker
processes when processing a Spark query.
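
For reference, a minimal sketch of what a correctly configured single-node
cluster looks like on the Terraform side, with the `spark_conf` and
`custom_tags` values from the warning below set inline; the Spark version
and node type are placeholders.

```hcl
# Minimal sketch of a correctly configured single-node cluster in Terraform;
# the Spark version and node type are placeholders.
resource "databricks_cluster" "single_node" {
  cluster_name            = "single-node"
  spark_version           = "15.4.x-scala2.12"
  node_type_id            = "i3.xlarge"
  num_workers             = 0
  autotermination_minutes = 20

  # The two settings the warning below asks for.
  spark_conf = {
    "spark.databricks.cluster.profile" = "singleNode"
    "spark.master"                     = "local[*]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}
```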

### Issue

This PR addresses some issues reported in
#1546

## Tests
Unit tests and manually.

Example output of the warning:
```
➜  bundle-playground git:(master) ✗ cli bundle validate
Warning: Single node cluster is not correctly configured
  at resources.pipelines.bar.clusters[0]
  in databricks.yml:29:11

num_workers should be 0 only for single-node clusters. To create a
valid single node cluster please ensure that the following properties
are correctly set in the cluster specification:

  spark_conf:
    spark.databricks.cluster.profile: singleNode
    spark.master: local[*]

  custom_tags:
    ResourceClass: SingleNode
  

Name: foobar
Target: default
Workspace:
  User: shreyas.goenka@databricks.com
  Path: /Workspace/Users/shreyas.goenka@databricks.com/.bundle/foobar/default

Found 1 warning
```