
Checking pipeline cluster config and cluster policy in 'crawl_pipelines' task #864

Merged

@nfx merged 20 commits into main from feature/pipeline_conf_check on Feb 9, 2024

Conversation

Contributor

@prajin-29 commented Jan 31, 2024

Changes

Checking pipeline cluster config and cluster policy in Crawl Pipeline

Linked issues

Resolves #844

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: ...
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)


codecov bot commented Jan 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (baf3984) 86.55% compared to head (b771432) 86.55%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #864   +/-   ##
=======================================
  Coverage   86.55%   86.55%           
=======================================
  Files          41       41           
  Lines        5162     5171    +9     
  Branches      938      943    +5     
=======================================
+ Hits         4468     4476    +8     
+ Misses        481      480    -1     
- Partials      213      215    +2     


@@ -50,6 +50,14 @@ def _assess_pipelines(self, all_pipelines) -> Iterable[PipelineInfo]:
pipeline_config = pipeline_response.spec.configuration
if pipeline_config:
failures.extend(self.check_spark_conf(pipeline_config, "pipeline"))
pipeline_cluster = pipeline_response.spec.clusters[0]
Collaborator

Can you iterate over clusters instead of picking up the first?

Contributor Author

Changed the code to iterate through the clusters.
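
For reference, a minimal sketch of the iteration (based on the _assess_pipelines hunk above; the spark_conf attribute on each pipeline cluster is an assumption about the SDK model):

pipeline_config = pipeline_response.spec.configuration
if pipeline_config:
    failures.extend(self.check_spark_conf(pipeline_config, "pipeline"))
# walk every cluster in the pipeline spec instead of only clusters[0]
for pipeline_cluster in pipeline_response.spec.clusters or []:
    if pipeline_cluster.spark_conf:
        failures.extend(self.check_spark_conf(pipeline_cluster.spark_conf, "pipeline cluster"))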

@@ -21,14 +21,22 @@ def test_pipeline_assessment_with_config(mocker):
)
]

ws = Mock()
ws = MagicMock()
Collaborator

Can you create_autospec workspace client instead?

Contributor Author

Using create_autospec now
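
A minimal sketch of the switch, assuming only the standard library and the Databricks SDK:

from unittest.mock import create_autospec

from databricks.sdk import WorkspaceClient

# autospec constrains the mock to the real client's API surface: a typo such as
# ws.pipeline.get raises AttributeError instead of silently returning a child Mock
ws = create_autospec(WorkspaceClient)
ws.pipelines.get.return_value = None  # per-test stubbing still works as before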

Collaborator

@nfx left a comment

refactor to workspace_client_mock

config_dict = {
"spark.hadoop.fs.azure.account.auth.type.abcde.dfs.core.windows.net": "SAS",
"spark.hadoop.fs.azure.sas.token.provider.type.abcde.dfs."
"core.windows.net": "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
"spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}",
}
pipeline_cluster = [
Collaborator

can you move these long responses out to a separate file, so that tests are more maintainable? see https://github.com/databrickslabs/ucx/blob/main/tests/unit/assessment/__init__.py

Contributor Author

Separated the long responses out into a separate file.
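
A sketch of the shared-fixture pattern in tests/unit/assessment/__init__.py that the review points to; the helper bodies here are assumptions, with only the _load_list and workspace_client_mock names confirmed by the diffs below:

import json
import pathlib

__dir = pathlib.Path(__file__).parent

def _load_fixture(filename: str):
    # long JSON responses live beside the tests instead of inline in them
    with (__dir / filename).open("r") as f:
        return json.load(f)

def _load_list(cls, filename: str):
    # deserialize each JSON object into the given SDK dataclass
    return [cls.from_dict(raw) for raw in _load_fixture(filename)]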

ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster
ws.cluster_policies.get().definition = (

Contributor Author

replaced it with workspace_client_mock

config_dict = {}
pipeline_cluster = [

Contributor Author

replaced it with workspace_client_mock

@@ -112,9 +203,32 @@ def test_pipeline_without_owners_should_have_empty_creator_name():
)
]

ws = Mock()
ws = create_autospec(WorkspaceClient)

Contributor Author

replaced it with workspace_client_mock

)
]

ws = create_autospec(WorkspaceClient)

Contributor Author

replaced it with workspace_client_mock

ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster

crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
Collaborator

don't invoke private methods in unit tests! it's prohibited.

Contributor Author

Removed the private method call from the unit test.
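
A hedged sketch of the corrected test, driving the crawler through its public entry point; snapshot() is assumed here to be the public method that runs _assess_pipelines internally, following the usual crawler pattern in this repo:

ws = workspace_client_mock()
crawler = PipelinesCrawler(ws, MockBackend(), "ucx")
result_set = crawler.snapshot()  # public API; exercises _assess_pipelines under the hood
assert len(result_set) == 1      # expected count depends on the fixture contents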

ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster
crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
Collaborator

don't invoke private methods in unit tests! it's prohibited.

Contributor Author

Removed the private method call from the unit test.

'"hidden": true\n }\n}'
)
ws.workspace.export().content = "JXNoCmVjaG8gIj0="
ws.dbfs.read().data = "JXNoCmVjaG8gIj0="

crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
Collaborator

don't invoke private methods in unit tests! it's prohibited.

Contributor Author

Removed the private method call from the unit test.

config_dict = {
"spark.hadoop.fs.azure.account.auth.type.abcde.dfs.core.windows.net": "SAS",
"spark.hadoop.fs.azure.sas.token.provider.type.abcde.dfs."
"core.windows.net": "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
"spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}",
}
ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = mock_pipeline_cluster
ws.cluster_policies.get(policy_id="single-user-with-spn").definition = (
Collaborator

Look at "init.py" in the same folder and do it like there. Modify that file if needed

Contributor Author

Modified the code accordingly


@pytest.fixture(scope="function")
def mock_pipeline_cluster():
yield [
Collaborator

Modify workspace_client_mock to store json in a file for pipelines

Contributor Author

Modified the file accordingly.

ws = create_autospec(WorkspaceClient)
ws.clusters.list.return_value = _load_list(ClusterDetails, f"../assessment/clusters/{clusters}")
ws.cluster_policies.get = _cluster_policy
ws.pipelines.get().spec.clusters = _load_list(PipelineCluster, f"clusters/{pipeline_cluster}")
Collaborator

this is silly and makes no sense at all. mock the entire response, not just a subset - ws.pipelines.get.return_value = _load_list(...

Contributor Author

Now capturing the entire PipelineResponse instead of only mocking the cluster part.
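
The difference in one sketch; both shapes appear in this PR's diffs, and GetPipelineResponse is the SDK type returned by ws.pipelines.get:

# before: only a nested attribute is patched, so the rest of the response stays a bare Mock
ws.pipelines.get().spec.clusters = _load_list(PipelineCluster, "clusters/pipeline_cluster.json")
# after: the entire GetPipelineResponse comes from the fixture file
ws.pipelines.get.return_value = _load_list(GetPipelineResponse, "clusters/pipeline_cluster.json")[0]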

@@ -24,8 +25,9 @@ def _cluster_policy(policy_id: str):
return Policy(description=definition, policy_family_definition_overrides=overrides)


def workspace_client_mock(clusters="no-spark-conf.json"):
def workspace_client_mock(clusters="no-spark-conf.json", pipeline_cluster="pipeline_cluster.json"):
Collaborator

this makes no sense - you're not mocking the list response. in this case - argument is not required.

Contributor Author

We had this parameter to give an option to specify the pipeline cluster JSON. Removed this argument now.

Collaborator

@nfx left a comment

do you read the surrounding code before making changes?...

@@ -28,4 +29,5 @@ def workspace_client_mock(clusters="no-spark-conf.json"):
ws = create_autospec(WorkspaceClient)
ws.clusters.list.return_value = _load_list(ClusterDetails, f"../assessment/clusters/{clusters}")
ws.cluster_policies.get = _cluster_policy
ws.pipelines.get.return_value = _load_list(GetPipelineResponse, "clusters/pipeline_cluster.json")[0]
Collaborator

@prajin-29, do you read the surrounding code before changing it? 🤦

Why is ws.pipelines.get different from ws.cluster_policies.get? It's illogical!

Contributor Author

Updated ws.pipelines.get to mirror ws.cluster_policies.get. With this change I have also included the pipeline spec configuration inside the JSON, so that we can access everything in one shot.
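
A sketch of the symmetry being asked for: wire ws.pipelines.get to a loader function, exactly as ws.cluster_policies.get is wired to _cluster_policy; the _pipeline_cluster helper shown here matches the next diff in this PR:

def _pipeline_cluster(pipeline_id: str):
    # resolve the response by pipeline id, mirroring _cluster_policy's lookup by policy id
    return _load_list(GetPipelineResponse, f"clusters/{pipeline_id}.json")[0]

ws.cluster_policies.get = _cluster_policy
ws.pipelines.get = _pipeline_cluster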

Collaborator

@nfx left a comment

Test implementation is incorrect and confusing

@@ -24,8 +25,14 @@ def _cluster_policy(policy_id: str):
return Policy(description=definition, policy_family_definition_overrides=overrides)


def _pipeline_cluster(pipeline_id: str):
pipeline_response = _load_list(GetPipelineResponse, f"clusters/{pipeline_id}.json")[0]
Collaborator

Why is it _load_list and not a fixture loader? Why do you put pipeline test fixtures in a clusters folder? This will confuse the people who come after you.

Contributor Author

@nfx the code is now rebased on main and uses the pipeline test fixture from the pipelines folder.
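
The fixture move in a sketch; the tests/unit/assessment/pipelines folder is confirmed by @nkvuong's comment below, while the loader call shape follows the diff above:

# before the rebase: pipeline fixtures misfiled under the clusters folder
_load_list(GetPipelineResponse, f"clusters/{pipeline_id}.json")[0]
# after the rebase: pipeline fixtures live in their own pipelines folder
_load_list(GetPipelineResponse, f"pipelines/{pipeline_id}.json")[0]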

"spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}",
}
ws.pipelines.get().spec.configuration = config_dict
ws = workspace_client_mock(clusters="job-source-cluster.json")
Contributor

@nkvuong commented Feb 8, 2024

@prajin-29 I've extended workspace_client_mock in #923 to read the pipeline spec from a JSON file under tests/unit/assessment/pipelines, so you can rebase your code off of that.

Contributor Author

Rebased the code to use that change.

Comment on lines 144 to 145
ws.workspace.export().content = "JXNoCmVjaG8gIj0="
ws.dbfs.read().data = "JXNoCmVjaG8gIj0="
Contributor

do we need these for pipeline tests?

Contributor Author

Yes, I added these to cover the negative scenario for full coverage, but with the spec now included in the JSON we can remove them. Removed them from the test.

Collaborator

@nfx left a comment

LGTM

@nfx changed the title from "Checking pipeline cluster config and cluster policy in Crawl Pipeline" to "Checking pipeline cluster config and cluster policy in 'crawl_pipelines' task" on Feb 9, 2024
@nfx merged commit 6496b3b into main on Feb 9, 2024
6 of 7 checks passed
@nfx deleted the feature/pipeline_conf_check branch on February 9, 2024 09:46
nfx added a commit that referenced this pull request Feb 9, 2024
* Added CLI Command `databricks labs ucx save-uc-compatible-roles` ([#863](#863)).
* Added dashboard widget with table count by storage and format ([#852](#852)).
* Added verification of group permissions ([#841](#841)).
* Checking pipeline cluster config and cluster policy in 'crawl_pipelines' task ([#864](#864)).
* Created cluster policy (ucx-policy) to be used by all UCX compute. This may require customers to reinstall UCX. ([#853](#853)).
* Skip scanning objects that were removed on platform side since the last scan time, so that integration tests are less flaky ([#922](#922)).
* Updated assessment documentation ([#873](#873)).

Dependency updates:

 * Updated databricks-sdk requirement from ~=0.18.0 to ~=0.19.0 ([#930](#930)).
@nfx mentioned this pull request Feb 9, 2024
nfx added a commit that referenced this pull request Feb 9, 2024
dmoore247 pushed a commit that referenced this pull request Mar 23, 2024

Successfully merging this pull request may close these issues.

[FEATURE]: Crawl pipeline should check pipeline cluster config and cluster policy