Exclude UCX jobs from crawling #3733

JCZuurmond · 2025-02-21T13:59:41Z

Changes

Exclude UCX jobs from crawling to avoid confusing users when they see UCX jobs in their assessment report.

Linked issues

Fixes #3656
Resolves #3722
Follow up on #3732
Relates to #3731

Functionality

modified JobsCrawler
modified existing workflow: assessment
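
The filtering this PR introduces can be sketched as follows. This is a minimal illustration, not the actual `JobsCrawler` code: the `Job` dataclass and the `list_jobs` function are hypothetical stand-ins, and the rule that exclusion wins over inclusion is assumed from the commit titled "Test exclude_job_id takes preference over include_job_ids".

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for a job record; not the SDK's job type."""

    job_id: int
    name: str


def list_jobs(
    jobs: Iterable[Job],
    *,
    include_job_ids: Optional[list[int]] = None,
    exclude_job_ids: Optional[list[int]] = None,
) -> Iterator[Job]:
    """Yield jobs, applying the optional include and exclude filters.

    Assumption: a job id present in both lists is excluded.
    """
    excluded = set(exclude_job_ids or [])
    included = set(include_job_ids) if include_job_ids is not None else None
    for job in jobs:
        if job.job_id in excluded:
            continue  # for example, the ids of the UCX workflows themselves
        if included is not None and job.job_id not in included:
            continue
        yield job
```

Passing the ids of the installed UCX workflows as `exclude_job_ids` would keep them out of the crawled snapshot, and therefore out of the assessment report.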

Tests

added unit tests
added integration tests

PRs merged into this branch

Merged the following PRs into this branch in an attempt to let the CI pass. Those PRs contain fixes for integration tests.

From #3767:

Scope linted dashboards on mock runtime context. We should use make_dashboard instead of calling the underlying _make_dashboard fixture directly. Also changed one dashboard to a LakeviewDashboard so that it is linted too.

From #3759:

Add retry mechanism to wait for the grants to exist before crawling
Resolves #3758

modified integration tests: test_all_grant_types
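
A retry of this kind could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions, not the code from #3759: `wait_until`, `fetch`, and `is_ready` are hypothetical names, and the real test may rely on a retry helper from a library instead.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def wait_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    *,
    timeout_seconds: float = 60.0,
    interval_seconds: float = 1.0,
) -> T:
    """Call `fetch` until `is_ready` accepts its result or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
        time.sleep(interval_seconds)
```

In a test such as `test_all_grant_types`, `fetch` would list the grants that currently exist and `is_ready` would check that every expected object type, including `ANY FILE`, is present before the crawler snapshot is taken.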

github-actions · 2025-02-21T15:34:57Z

❌ 81/83 passed, 2 failed, 10 skipped, 6h58m48s total

❌ test_cluster_ownership: databricks.sdk.errors.platform.BadRequest: [INTERNAL_ERROR] Query could not be scheduled: HTTP Response code: 503. Please try again later. SQLSTATE: XX000 (22.878s)

databricks.sdk.errors.platform.BadRequest: [INTERNAL_ERROR] Query could not be scheduled: HTTP Response code: 503. Please try again later. SQLSTATE: XX000
[gw9] linux -- Python 3.10.16 /home/runner/work/ucx/ucx/.venv/bin/python
12:24 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sl9c5.clusters] ignoring any existing clusters inventory; refresh is forced.
12:24 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sl9c5.clusters] crawling new set of snapshot data for clusters
12:25 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sl9c5.clusters] found 994 new records for clusters

❌ test_all_grant_types: AssertionError: assert {('ANONYMOUS ...dummy_tbmy7')} == {('ANONYMOUS ..._fzvdx'), ...} (12m40.779s)

AssertionError: assert {('ANONYMOUS ...dummy_tbmy7')} == {('ANONYMOUS ..._fzvdx'), ...}
  
  Extra items in the right set:
  ('ANY FILE', None)
  
  Full diff:
    {
        (
            'ANONYMOUS FUNCTION',
  -         None,
  -     ),
  -     (
  -         'ANY FILE',
            None,
        ),
        (
            'CATALOG',
            'hive_metastore',
        ),
        (
            'DATABASE',
            'hive_metastore.dummy_sp8bj',
        ),
        (
            'TABLE',
            'hive_metastore.dummy_sp8bj.dummy_tlrdr',
        ),
        (
            'UDF',
            'hive_metastore.dummy_sp8bj.dummy_fzvdx',
        ),
        (
            'VIEW',
            'hive_metastore.dummy_sp8bj.dummy_tbmy7',
        ),
    }
12:25 INFO [databricks.labs.ucx.install] Creating ucx schemas...
[gw7] linux -- Python 3.10.16 /home/runner/work/ucx/ucx/.venv/bin/python
12:25 INFO [databricks.labs.ucx.install] Creating ucx schemas...
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.grants] fetching grants inventory
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.grants] crawling new set of snapshot data for grants
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.tables] fetching tables inventory
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.tables] crawling new set of snapshot data for tables
12:27 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_sp8bj] listing tables and views
12:27 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_sp8bj.dummy_tbmy7] fetching table metadata
12:27 DEBUG [databricks.labs.ucx.hive_metastore.tables] [hive_metastore.dummy_sp8bj.dummy_tlrdr] fetching table metadata
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.tables] found 2 new records for tables
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.udfs] fetching udfs inventory
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.udfs] crawling new set of snapshot data for udfs
12:27 DEBUG [databricks.labs.ucx.hive_metastore.udfs] [hive_metastore.dummy_sp8bj] listing udfs
12:27 DEBUG [databricks.labs.ucx.hive_metastore.udfs] [hive_metastore.dummy_sp8bj.dummy_fzvdx] fetching udf metadata
12:27 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.udfs] found 1 new records for udfs
12:38 ERROR [databricks.labs.ucx.hive_metastore.grants] Couldn't fetch grants for object ANY FILE : TEMPORARILY_UNAVAILABLE: The service at /api/2.0/sql-acl/get-permissions is taking too long to process your request. Please try again later or try a faster operation. [TraceId: 00-155b6b90de3597c14999eeb0d4a5e7f2-5071026a3ad8ee42-00]
12:38 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sh20d.grants] found 10 new records for grants

Running from acceptance #8417

FastLee

LGTM

FastLee · 2025-02-24T15:31:41Z

src/databricks/labs/ucx/assessment/jobs.py

+        """List the jobs.
+
+        If provided, excludes jobs with id in `exclude_job_ids`.
+        If provided, exclude jobs with id not in `include_job_ids`.


Suggested change

If provided, exclude jobs with id not in `include_job_ids`.

If provided, excludes jobs not in `include_job_ids`.

FastLee

LGTM a single nit

src/databricks/labs/ucx/assessment/jobs.py

pritishpai

LGTM!

JCZuurmond · 2025-02-27T07:02:04Z

Merging the following branches into this branch in an attempt to let the CI pass.

Added to GlobalContext in https://github.com/databrickslabs/ucx/pull/2778/files#diff-f5067fb24dbf36380ff6250d2274083d96f167de7f32aed9fc993ebd4d0369a6

## Changes Scope linted dashboards on mock runtime context. We should use `make_dashboard` instead of the dashboard fixture directly `_make_dashboard`. Also changed one dashboard to a `LakeviewDashboard` so that we lint that too

…3759) ## Changes Add retry mechanism to wait for the grants to exists before crawling ### Linked issues Resolves #3758 ### Tests - [x] modified integration tests: `test_all_grant_types`

* Convert UCX job ids to `int` before passing to `JobsCrawler` ([#3816](#3816)). In this release, we have addressed issue [#3722](#3722) and improved the robustness of the open-source library by modifying the `jobs_crawler` method to handle job IDs more effectively. Previously, job IDs were passed directly to the `exclude_job_ids` parameter, which could cause issues if they were not integers. To address this problem, we have updated the `jobs_crawler` method to convert all job IDs to integers using a list comprehension before passing them to the method. This change ensures that only valid integer job IDs are used, thereby enhancing the reliability of the method. The commit includes a manual test to confirm the correct behavior of this modification. In summary, this modification improves the robustness of the code by ensuring that integer job IDs are utilized correctly in the `JobsCrawler` method.
* Exclude UCX jobs from crawling ([#3733](#3733)). In this release, we have made modifications to the `JobsCrawler` and the existing `assessment` workflow to exclude UCX jobs from crawling, avoiding confusion for users when they appear in assessment reports. This change addresses issues [#3656](#3656) and [#3722](#3722), and is a follow-up to previous issue [#3732](#3732). We have also incorporated updates from pull requests [#3767](#3767) and [#3759](#3759) to improve integration tests and linting. Additionally, a retry mechanism has been added to wait for grants to exist before crawling, addressing issue [#3758](#3758). The changes include the addition of unit and integration tests to ensure the correctness of the modifications. A new `exclude_job_ids` parameter has been added to the `JobsCrawler` constructor, which is initialized with the list of UCX job IDs, ensuring that UCX jobs are not included in the assessment report. The `_list_jobs` method now excludes jobs based on the provided `exclude_job_ids` and `include_job_ids` arguments. The `_crawl` method now uses the `_list_jobs` method to list the jobs to be crawled. The `_assess_jobs` method has been updated to take into account the exclusion of specific job IDs. The `test_grant_detail` file, an integration test for the Hive Metastore grants functionality, has been updated to include a retry mechanism to wait for grants to exist before crawling and to check if the SELECT permission on ANY FILE is present in the grants.
* Let `WorkflowLinter.refresh_report` lint jobs from `JobsCrawler` ([#3732](#3732)). In this release, the `WorkflowLinter.refresh_report` method has been updated to lint jobs from the `JobsCrawler` class, ensuring that only jobs within the scope of the crawler are processed. This change resolves issue [#3662](#3662) and progresses issue [#3722](#3722). The workflow linting code, the `assessment` workflow, and the `JobsCrawler` class have been modified. The `JobsCrawler` class now includes a `snapshot` method, which is used in the `WorkflowLinter.refresh_report` method to retrieve necessary data about jobs. Unit and integration tests have been updated correspondingly, with the integration test for workflows now verifying that all rows returned from a query to the `workflow_problems` table have a valid `path` field. The `WorkflowLinter` constructor now includes an instance of `JobsCrawler`, allowing for more targeted linting of jobs. The introduction of the `JobsCrawler` class enables more efficient and precise linting of jobs, improving the overall accuracy of workflow assessment.
* Let dashboard name adhere to naming convention ([#3789](#3789)). In this release, the naming convention for dashboard names in the `ucx` library has been enforced, restricting them to alphanumeric characters, hyphens, and underscores. This change replaces any non-conforming characters in existing dashboard names with hyphens or underscores, addressing several issues ([#3761](#3761) through [#3788](#3788)). A temporary fix has been added to the `_create_dashboard` method to ensure newly created dashboard names adhere to the new naming convention, indicated by a TODO comment. This release also resolves a test failure in a specific GitHub Actions run and addresses a total of 29 issues. The specifics of the modification made to the `databricks labs install ucx` command and the changes to existing functionality are not detailed, making it difficult to assess their scope. The commit includes the deletion of a file called `02_0_owner.filter.yml`, and all changes have been manually tested. For future reference, it would be helpful to include more information about the changes made, their impact, and the reason for deleting the specified file.
* Partial revert `Let dashboard name adhere to naming convention` ([#3794](#3794)). In this release, we have partially reverted a previous change to the migration progress dashboard, reintroducing the owner filter. This change was made in response to feedback from users who found the previous modification to the dashboard less intuitive. The new owner filter has been defined in a new file, `02_0_owner.filter.yml`, which includes the title, column name, type, and width of the filter. To ensure proper functionality, this change requires the release of lsql after merging. The change has been thoroughly tested to guarantee its correct operation and to provide the best possible user experience.
* Partial revert `Let dashboard name adhere to naming convention` ([#3795](#3795)). In this release, we have partially reversed a previous change that enforced a naming convention for dashboard names, allowing the use of special characters such as spaces and brackets again. The `_create_dashboard` method in the `install.py` file and the `_name` method in the `mixins.py` file have been updated to reflect this change, affecting the migration progress dashboard. The `display_name` attribute of the `metadata` object has been updated to use the original format, which may include special characters. The `reference` variable has also been updated accordingly. The functions `created_job_tasks` and `created_job` have been updated to use the new naming convention when retrieving installation jobs with specific names. These changes have been manually tested and the tests have been verified to work correctly after the reversion. This change is related to issues [#3799](#3799), [#3789](#3789), and reverts commit 048bc8f.
* Put back dashboard names ([#3808](#3808)). In the lsql release v0.16.0, the naming convention for dashboards has been updated to support non-alphanumeric characters in the dashboard names. This change modifies the `_create_dashboard` function in `install.py` and the `_name` method in `mixins.py` to create dashboard names with a format like `[UCX] assessment (Main)`, which includes parent and child folder names. This update addresses issues reported in tickets [#3797](#3797) and [#3790](#3790), and partially reverses previous changes made in commits 4017a25 and 834ef14. The functionality of other methods remains unchanged. With this release, the `created_job_tasks` and `created_job` functions now accept dashboard names with non-alphanumeric characters as input.
* Updated databricks-labs-lsql requirement from <0.15,>=0.14.0 to >=0.14.0,<0.17 ([#3801](#3801)). In this update, we have updated the required version of the `databricks-labs-lsql` package from a version greater than or equal to 0.14.0 and less than 0.15 to a version greater than or equal to 0.14.0 and less than 0.17. This change allows for the use of the latest version of the package, which includes various bug fixes and dependency updates. The package is utilized in the acceptance tests that are run as part of the CI/CD pipeline. With this update, the acceptance tests can now be executed using the most recent version of the package, resulting in enhanced functionality and reliability.
* Updated databricks-sdk requirement from <0.42,>=0.40 to >=0.44,<0.45 ([#3686](#3686)). In this release, we have updated the version requirement for the `databricks-sdk` package to be greater than or equal to 0.44.0 and less than 0.45.0. This update allows for the use of the latest version of the `databricks-sdk`, which includes new methods, fields, and bug fixes. For instance, the `get_message_query_result_by_attachment` method has been added for the `w.genie.workspace_level_service`, and several fields such as `review_state`, `reviews`, and `runner_collaborators` have been removed for the `databricks.sdk.service.clean_rooms.CleanRoomAssetNotebook` object. Additionally, the `securable_kind` field has been removed for various objects such as `CatalogInfo` and `ConnectionInfo`. We recommend thoroughly testing this update to ensure compatibility with your project. The release notes for versions 0.44.0 and 0.43.0 can be found in the commit history. Please note that there are several backward-incompatible changes listed in the changelog for both versions.

Dependency updates:

* Updated databricks-labs-lsql requirement from <0.15,>=0.14.0 to >=0.14.0,<0.17 ([#3801](#3801)).
* Updated databricks-sdk requirement from <0.42,>=0.40 to >=0.44,<0.45 ([#3686](#3686)).
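
The first release note above, on converting UCX job ids to `int`, can be illustrated with a short sketch. It assumes the stored job ids arrive as strings, which the notes imply but do not state outright, and `to_int_job_ids` is a hypothetical helper name rather than a function from the UCX code base.

```python
from collections.abc import Iterable
from typing import Union


def to_int_job_ids(raw_job_ids: Iterable[Union[str, int]]) -> list[int]:
    """Convert job ids to `int` so they compare equal to integer job ids."""
    return [int(job_id) for job_id in raw_job_ids]


# Assumed example data: ids stored as strings.
stored_job_ids = ["123", "456"]

# A string id never equals an integer id, so an exclusion check on the raw
# values would silently let the job through.
assert 123 not in stored_job_ids
assert 123 in to_int_job_ids(stored_job_ids)
```

Converting once, before the ids reach the `exclude_job_ids` parameter, keeps the membership check inside the crawler a plain integer comparison.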

JCZuurmond added step/assessment go/uc/upgrade - Assessment Step migrate/jobs Step 5 - Upgrading Jobs for External Tables labels Feb 21, 2025

JCZuurmond requested review from FastLee and pritishpai February 21, 2025 13:59

JCZuurmond self-assigned this Feb 21, 2025

JCZuurmond requested a review from a team as a code owner February 21, 2025 13:59

JCZuurmond had a problem deploying to account-admin February 21, 2025 13:59 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin February 21, 2025 14:03 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 24, 2025 07:33 — with GitHub Actions Error

JCZuurmond temporarily deployed to account-admin February 24, 2025 07:34 — with GitHub Actions Inactive

FastLee requested changes Feb 24, 2025

JCZuurmond temporarily deployed to account-admin February 25, 2025 08:31 — with GitHub Actions Inactive

JCZuurmond requested a review from FastLee February 25, 2025 08:43

FastLee approved these changes Feb 25, 2025

JCZuurmond had a problem deploying to account-admin February 25, 2025 12:44 — with GitHub Actions Error

JCZuurmond force-pushed the fix/exclude-ucx-jobs-from-crawling branch from 8608d7a to 29b985d Compare February 25, 2025 12:45

JCZuurmond enabled auto-merge February 25, 2025 12:45

JCZuurmond temporarily deployed to account-admin February 25, 2025 12:45 — with GitHub Actions Inactive

pritishpai approved these changes Feb 25, 2025

JCZuurmond had a problem deploying to account-admin February 25, 2025 15:50 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 25, 2025 19:08 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 26, 2025 07:45 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 26, 2025 13:19 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 26, 2025 15:22 — with GitHub Actions Error

JCZuurmond had a problem deploying to account-admin February 26, 2025 16:17 — with GitHub Actions Failure

JCZuurmond had a problem deploying to account-admin February 27, 2025 07:02 — with GitHub Actions Error

JCZuurmond force-pushed the fix/exclude-ucx-jobs-from-crawling branch from 2f10c8a to 2638c0f Compare February 27, 2025 07:02

JCZuurmond had a problem deploying to account-admin February 27, 2025 07:02 — with GitHub Actions Failure

JCZuurmond added 20 commits February 27, 2025 13:23

Introduce exclude job ids

042d41d

Add integration test for skipping job ids

c043520

Add unit tests for including job ids

ac51ce2

Implement getting a jobs with include_job_ids

5e08555

Test exclude job ids

52c09ba

Exclude jobs when job id in exclude_job_id

4ead4a6

Fix mock list jobs should bve an iterator

31a3b9b

Test exclude_job_id takes preference over include_job_ids

ef48c13

Update documentation about including and excluding jobs

ceb7f36

Ignore ucx jobs

ea3643c

Remove mocking get jobs

2fc4eff

Test precedence correctly

8ad10cc

Remove unused import

3be32ae

Fix typo

950e7a1

Remove filter on include job ids

16a1715

Rewrite job listing to for-loop

a7dbfce

Add state.json to mock workspace client

e747cb3

Remove JobsCrawler on RuntimeContext

475e627

Added to GlobalContext in https://github.com/databrickslabs/ucx/pull/2778/files#diff-f5067fb24dbf36380ff6250d2274083d96f167de7f32aed9fc993ebd4d0369a6

Scope linted dashboards on mock runtime context (#3767)

46c6b6a

## Changes Scope linted dashboards on mock runtime context. We should use `make_dashboard` instead of the dashboard fixture directly `_make_dashboard`. Also changed one dashboard to a `LakeviewDashboard` so that we lint that too

Add retry mechanism to wait for the grants to exists before crawling (#…

f6de817

…3759) ## Changes Add retry mechanism to wait for the grants to exists before crawling ### Linked issues Resolves #3758 ### Tests - [x] modified integration tests: `test_all_grant_types`

JCZuurmond force-pushed the fix/exclude-ucx-jobs-from-crawling branch from 2638c0f to f6de817 Compare February 27, 2025 12:23

JCZuurmond had a problem deploying to account-admin February 27, 2025 12:23 — with GitHub Actions Failure

gueniai disabled auto-merge February 27, 2025 16:52

gueniai merged commit 5998451 into main Feb 27, 2025
6 of 7 checks passed

gueniai deleted the fix/exclude-ucx-jobs-from-crawling branch February 27, 2025 16:53

gueniai mentioned this pull request Mar 5, 2025

Release v0.57.0 #3820

Merged
