Checking pipeline cluster config and cluster policy in 'crawl_pipelines' task #864
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@           Coverage Diff           @@
##             main     #864   +/-   ##
=======================================
  Coverage   86.55%   86.55%
=======================================
  Files          41       41
  Lines        5162     5171    +9
  Branches      938      943    +5
=======================================
+ Hits         4468     4476    +8
+ Misses        481      480    -1
- Partials      213      215    +2
☔ View full report in Codecov by Sentry.
@@ -50,6 +50,14 @@ def _assess_pipelines(self, all_pipelines) -> Iterable[PipelineInfo]:
            pipeline_config = pipeline_response.spec.configuration
            if pipeline_config:
                failures.extend(self.check_spark_conf(pipeline_config, "pipeline"))
            pipeline_cluster = pipeline_response.spec.clusters[0]
Can you iterate over clusters instead of picking up the first?
Changed the code to iterate through the clusters.
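A minimal sketch of that iteration, reusing the check_spark_conf helper visible in the diff above (the exact merged code may differ):

# Iterate over every cluster defined in the pipeline spec instead of clusters[0].
for pipeline_cluster in pipeline_response.spec.clusters or []:
    if pipeline_cluster.spark_conf:
        failures.extend(self.check_spark_conf(pipeline_cluster.spark_conf, "pipeline cluster"))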
@@ -21,14 +21,22 @@ def test_pipeline_assessment_with_config(mocker):
        )
    ]

-   ws = Mock()
+   ws = MagicMock()
Can you use create_autospec for the workspace client instead?
Using create_autospec now
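A small sketch of why the review asks for create_autospec (standard unittest.mock behavior; the placeholder return value is illustrative):

from unittest.mock import create_autospec
from databricks.sdk import WorkspaceClient

# Unlike a bare MagicMock, an autospecced client only accepts attributes that
# exist on the real WorkspaceClient, so a typo such as ws.pipeline.get raises
# AttributeError instead of silently returning another mock.
ws = create_autospec(WorkspaceClient)
ws.pipelines.get.return_value = None  # placeholder; real tests return a fixture object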
refactor to workspace_client_mock
config_dict = {
    "spark.hadoop.fs.azure.account.auth.type.abcde.dfs.core.windows.net": "SAS",
    "spark.hadoop.fs.azure.sas.token.provider.type.abcde.dfs."
    "core.windows.net": "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
    "spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}",
}
pipeline_cluster = [
Can you move these long responses out to a separate file, so that tests are more maintainable? See https://github.com/databrickslabs/ucx/blob/main/tests/unit/assessment/__init__.py
Separated the long responses out into a separate file.
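A rough sketch of the fixture-file pattern being asked for here; the helper name and path are illustrative, the real helpers live in tests/unit/assessment/__init__.py:

import json
import pathlib


def _load_fixture(filename: str) -> dict:
    # Keep long canned API responses next to the tests as JSON files instead
    # of inlining them into every test function.
    path = pathlib.Path(__file__).parent / filename
    return json.loads(path.read_text())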
ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster
ws.cluster_policies.get().definition = (
refactor to start using workspace_client_mock
- see https://github.com/databrickslabs/ucx/blame/main/tests/unit/assessment/test_clusters.py#L22-L58
replaced it with workspace_client_mock
config_dict = {}
pipeline_cluster = [
refactor to start using workspace_client_mock
replaced it with workspace_client_mock
@@ -112,9 +203,32 @@ def test_pipeline_without_owners_should_have_empty_creator_name():
        )
    ]

-   ws = Mock()
+   ws = create_autospec(WorkspaceClient)
refactor to start using workspace_client_mock
replaced it with workspace_client_mock
    )
]

ws = create_autospec(WorkspaceClient)
refactor to start using workspace_client_mock
replaced it with workspace_client_mock
ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster

crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
don't invoke private methods in unit tests! it's prohibited.
Removed the private method call from the unit test.
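A sketch of the shape the test takes without the private call; treating snapshot() as the crawler's public entry point is an assumption here:

crawler = PipelinesCrawler(ws, MockBackend(), "ucx")
result_set = crawler.snapshot()  # public API; exercises _assess_pipelines indirectly
assert len(result_set) == 1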
ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = pipeline_cluster
crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
don't invoke private methods in unit tests! it's prohibited.
Removed the private method call from the unit test.
    '"hidden": true\n }\n}'
)
ws.workspace.export().content = "JXNoCmVjaG8gIj0="
ws.dbfs.read().data = "JXNoCmVjaG8gIj0="

crawler = PipelinesCrawler(ws, MockBackend(), "ucx")._assess_pipelines(sample_pipelines)
don't invoke private methods in unit tests! it's prohibited.
Removed the private method call from the unit test.
config_dict = {
    "spark.hadoop.fs.azure.account.auth.type.abcde.dfs.core.windows.net": "SAS",
    "spark.hadoop.fs.azure.sas.token.provider.type.abcde.dfs."
    "core.windows.net": "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
    "spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}",
}
ws.pipelines.get().spec.configuration = config_dict
ws.pipelines.get().spec.clusters = mock_pipeline_cluster
ws.cluster_policies.get(policy_id="single-user-with-spn").definition = (
Look at "__init__.py" in the same folder and do it like there. Modify that file if needed.
Modified the code accordingly
tests/unit/assessment/conftest.py
@pytest.fixture(scope="function")
def mock_pipeline_cluster():
    yield [
Modify workspace_client_mock to store json in a file for pipelines
modified the file accordingly.
tests/unit/assessment/__init__.py
ws = create_autospec(WorkspaceClient)
ws.clusters.list.return_value = _load_list(ClusterDetails, f"../assessment/clusters/{clusters}")
ws.cluster_policies.get = _cluster_policy
ws.pipelines.get().spec.clusters = _load_list(PipelineCluster, f"clusters/{pipeline_cluster}")
this is silly and makes no sense at all. mock the entire response, not just a subset - ws.pipelines.get.return_value = _load_list(...
Now capturing the entire PipelineResponse instead of only mocking the cluster part.
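A minimal sketch of mocking the entire response object rather than patching spec.clusters on an auto-generated child mock. GetPipelineResponse is the SDK dataclass already referenced in this thread; PipelineSpec is assumed to be its companion spec dataclass, and the field values are illustrative:

from databricks.sdk.service.pipelines import GetPipelineResponse, PipelineSpec

# One fully-formed response: spec.configuration, spec.clusters and anything
# else the crawler reads all come from the same object.
ws.pipelines.get.return_value = GetPipelineResponse(
    pipeline_id="pipeline-1",
    spec=PipelineSpec(configuration={"spark.hadoop.fs.azure.account.auth.type": "SAS"}),
)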
tests/unit/assessment/__init__.py
@@ -24,8 +25,9 @@ def _cluster_policy(policy_id: str):
    return Policy(description=definition, policy_family_definition_overrides=overrides)


-def workspace_client_mock(clusters="no-spark-conf.json"):
+def workspace_client_mock(clusters="no-spark-conf.json", pipeline_cluster="pipeline_cluster.json"):
this makes no sense - you're not mocking the list response. in this case - argument is not required.
We kept this parameter to give an option to specify the pipeline cluster JSON. Removed this argument now.
do you read the surrounding code before making changes?...
tests/unit/assessment/__init__.py
@@ -28,4 +29,5 @@ def workspace_client_mock(clusters="no-spark-conf.json"):
    ws = create_autospec(WorkspaceClient)
    ws.clusters.list.return_value = _load_list(ClusterDetails, f"../assessment/clusters/{clusters}")
    ws.cluster_policies.get = _cluster_policy
+   ws.pipelines.get.return_value = _load_list(GetPipelineResponse, "clusters/pipeline_cluster.json")[0]
@prajin-29, do you read the surrounding code before changing it? 🤦 Why is ws.pipelines.get mocked differently from ws.cluster_policies.get? It's illogical!
Updated ws.pipelines.get to work the same way as ws.cluster_policies.get. With this I have also included the pipeline spec configuration inside the JSON, so we can access everything in one shot.
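A rough sketch of that pattern, mirroring how _cluster_policy is wired in above; the fixture path and helper names are illustrative:

def _pipeline(pipeline_id: str) -> GetPipelineResponse:
    # Return the full canned response (spec.configuration, spec.clusters, ...)
    # for the requested pipeline id, loaded from a JSON fixture file.
    fixture = pathlib.Path(__file__).parent / "pipelines" / f"{pipeline_id}.json"
    return GetPipelineResponse.from_dict(json.loads(fixture.read_text()))


ws.pipelines.get = _pipeline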
Test implementation is incorrect and confusing
tests/unit/assessment/__init__.py
@@ -24,8 +25,14 @@ def _cluster_policy(policy_id: str):
    return Policy(description=definition, policy_family_definition_overrides=overrides)


+def _pipeline_cluster(pipeline_id: str):
+    pipeline_response = _load_list(GetPipelineResponse, f"clusters/{pipeline_id}.json")[0]
Why is it _load_list and not a load-fixture helper? Why do you put pipeline test fixtures into a clusters folder? This will confuse the people after you.
@nfx the code is now rebased on main and uses the pipeline test fixture from the pipelines folder.
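For orientation, the fixture layout implied by this thread (illustrative, not a verbatim listing of the repository):

tests/unit/assessment/
    __init__.py     workspace_client_mock and fixture-loading helpers
    clusters/       cluster fixtures, e.g. job-source-cluster.json
    pipelines/      pipeline fixtures loaded into GetPipelineResponse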
"spark.hadoop.fs.azure.sas.fixed.token.abcde.dfs.core.windows.net": "{{secrets/abcde_access/sasFixedToken}}", | ||
} | ||
ws.pipelines.get().spec.configuration = config_dict | ||
ws = workspace_client_mock(clusters="job-source-cluster.json") |
@prajin-29 I've extended workspace_client_mock in #923 to read the pipeline spec from a JSON file under tests/unit/assessment/pipelines, so you can rebase your code off of that.
Rebased the code to use the change
ws.workspace.export().content = "JXNoCmVjaG8gIj0="
ws.dbfs.read().data = "JXNoCmVjaG8gIj0="
do we need these for pipeline tests?
Yup, I added these to cover the negative scenario for full coverage. But with the addition of the spec in the JSON we can remove them. Removed them from the test.
# Conflicts:
#   tests/unit/assessment/__init__.py
Lgtm
* Added CLI Command `databricks labs ucx save-uc-compatible-roles` ([#863](#863)). * Added dashboard widget with table count by storage and format ([#852](#852)). * Added verification of group permissions ([#841](#841)). * Checking pipeline cluster config and cluster policy in 'crawl_pipelines' task ([#864](#864)). * Created cluster policy (ucx-policy) to be used by all UCX compute. This may require customers to reinstall UCX. ([#853](#853)). * Skip scanning objects that were removed on platform side since the last scan time, so that integration tests are less flaky ([#922](#922)). * Updated assessment documentation ([#873](#873)). Dependency updates: * Updated databricks-sdk requirement from ~=0.18.0 to ~=0.19.0 ([#930](#930)).
Changes
Checking pipeline cluster config and cluster policy in the 'crawl_pipelines' task.
Linked issues
Resolves #844
Functionality
databricks labs ucx ...
...
...
Tests