rebase
HariGS-DB committed May 26, 2024
2 parents 8892060 + f01e559 commit 5bdc37e
Showing 125 changed files with 520 additions and 2,189 deletions.
10 changes: 0 additions & 10 deletions .github/codecov.yml

This file was deleted.

3 changes: 1 addition & 2 deletions .github/workflows/acceptance.yml
@@ -40,8 +40,7 @@ jobs:
run: pip install hatch==1.9.4

- name: Run integration tests
# ...
uses: databrickslabs/sandbox/acceptance@acceptance/v0.2.2
uses: databrickslabs/sandbox/acceptance@acceptance/v0.2.1
with:
vault_uri: ${{ secrets.VAULT_URI }}
timeout: 45m
2 changes: 1 addition & 1 deletion .github/workflows/nightly.yml
@@ -35,7 +35,7 @@ jobs:
run: pip install hatch==1.9.4

- name: Run nightly tests
uses: databrickslabs/sandbox/acceptance@acceptance/v0.2.2
uses: databrickslabs/sandbox/acceptance@acceptance/v0.2.1
with:
vault_uri: ${{ secrets.VAULT_URI }}
timeout: 45m
72 changes: 0 additions & 72 deletions README.md
@@ -38,9 +38,7 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project.
* [Table Migration Workflow](#table-migration-workflow)
* [Dependency CLI commands](#dependency-cli-commands)
* [Table Migration Workflow Tasks](#table-migration-workflow-tasks)
* [Post Migration Data Reconciliation Task](#post-migration-data-reconciliation-task)
* [Other considerations](#other-considerations)
* [Jobs Static Code Analysis Workflow](#jobs-static-code-analysis-workflow)
* [Utility commands](#utility-commands)
* [`logs` command](#logs-command)
* [`ensure-assessment-run` command](#ensure-assessment-run-command)
@@ -68,7 +66,6 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project.
* [`move` command](#move-command)
* [`alias` command](#alias-command)
* [Code migration commands](#code-migration-commands)
* [`lint-local-code` command](#lint-local-code-command)
* [`migrate-local-code` command](#migrate-local-code-command)
* [`migrate-dbsql-dashboards` command](#migrate-dbsql-dashboards-command)
* [`revert-dbsql-dashboards` command](#revert-dbsql-dashboards-command)
@@ -87,7 +84,6 @@ See [contributing instructions](CONTRIBUTING.md) to help improve this project.

- Databricks CLI v0.213 or later. See [instructions](#authenticate-databricks-cli).
- Python 3.10 or later. See [Windows](https://www.python.org/downloads/windows/) instructions.
- Databricks Premium or Enterprise workspace.
- Network access to your Databricks Workspace used for the [installation process](#install-ucx).
- Network access to the Internet for [pypi.org](https://pypi.org) and [github.com](https://github.com) from the machine running the installation.
- Databricks Workspace Administrator privileges for the user that runs the installation. Running UCX as a Service Principal is not supported.
@@ -462,21 +458,6 @@ There are 3 main table migration workflows, targeting different table types. All
- Migrate tables using CTAS
- Experimentally migrate Delta and Parquet data that is found in DBFS mounts but not registered as a Hive Metastore table into UC tables.

### Post Migration Data Reconciliation Task
UCX also provides the `migrate-data-reconciliation` workflow to validate the integrity of the migrated tables:
- Compare the schema of the source and target tables. The result is stored as `schema_matches`, and a column-by-column comparison
is stored as the `column_comparison` struct.
- Compare the row counts of the source and target tables. If the row count is within the reconciliation threshold
(defaults to 5%), `data_matches` is set to True.
- Compare the content of individual rows between the source and target tables to identify any discrepancies (when the `compare_rows`
flag is enabled). This is done using hash comparison; the numbers of missing rows are stored as `source_missing_count`
and `target_missing_count`. A rough sketch of this idea follows the list.
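
The hash comparison mentioned above can be pictured with a small PySpark sketch. This is illustrative only, not UCX's actual implementation: the function name is made up, and null handling and column ordering are simplified assumptions.

```python
from pyspark.sql import functions as F


def missing_row_counts(source_df, target_df):
    # Hash every full row into a single value so rows can be matched across tables.
    src = source_df.select(F.sha2(F.concat_ws("||", *source_df.columns), 256).alias("row_hash"))
    tgt = target_df.select(F.sha2(F.concat_ws("||", *target_df.columns), 256).alias("row_hash"))
    # Rows whose hash appears on only one side are counted as missing on the other.
    target_missing_count = src.join(tgt, "row_hash", "left_anti").count()
    source_missing_count = tgt.join(src, "row_hash", "left_anti").count()
    return source_missing_count, target_missing_count
```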

Once the workflow completes, the output is stored in the `$inventory_database.reconciliation_results` view and displayed
in the Migration dashboard.

![reconciliation results](docs/recon_results.png)
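
You can also inspect the view directly. A minimal query sketch, assuming a Databricks notebook where `spark` is available, an inventory database named `ucx` (substitute your own), and the column names described above:

```python
inventory_database = "ucx"  # assumption: replace with your inventory database name

mismatches = spark.sql(
    f"""
    SELECT *
    FROM hive_metastore.{inventory_database}.reconciliation_results
    WHERE NOT schema_matches OR NOT data_matches
    """
)
mismatches.show(truncate=False)
```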

### Other considerations
- You may need to run the workflow multiple times to ensure all the tables are migrated successfully in phases.
- If your Delta tables in DBFS root have a large number of files, consider:
@@ -485,40 +466,6 @@ in the Migration dashboard.
- Consider creating an instance pool and setting its ID when prompted during the UCX installation. This instance pool will be specified in the cluster policy used by all UCX workflow job clusters.
- You may also manually edit the job cluster configuration per job or per task after the workflows are deployed.

### [EXPERIMENTAL] Scan tables in mounts Workflow
#### <b>Always run this workflow AFTER the assessment has finished</b>
- This experimental workflow attempts to find all tables inside the mount points present in your workspace.
- If you do not run this workflow, then `migrate-tables-in-mounts-experimental` won't do anything.
- It writes all results to `hive_metastore.<inventory_database>.tables`; you can query the tables it found by filtering on database values that start with `mounted_` (see the query sketch after this list).
- Each time you run this workflow, it overwrites the previously found tables in mounts (results are not accumulated incrementally).
- The following formats are currently supported: DELTA, PARQUET, CSV, and JSON.
- Partitioned DELTA and PARQUET tables are also detected.
- You can configure these workflows with the following options available in conf.yml:
  - `include_mounts`: a list of mount points to scan; by default the workflow scans all mount points
  - `exclude_paths_in_mount`: a list of paths to exclude in all mount points
  - `include_paths_in_mount`: a list of paths to include in all mount points
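
A minimal query sketch for the scan results, assuming a Databricks notebook where `spark` is available and an inventory database named `ucx` (substitute your own):

```python
inventory_database = "ucx"  # assumption: replace with your inventory database name

tables_in_mounts = spark.sql(
    f"""
    SELECT *
    FROM hive_metastore.{inventory_database}.tables
    WHERE database LIKE 'mounted_%'
    """
)
tables_in_mounts.show(truncate=False)
```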

### [EXPERIMENTAL] Migrate tables in mounts Workflow
- An experimental workflow that migrates tables in mount points using a `CREATE TABLE` command, optionally setting a default table owner if provided via the `default_table_owner` conf parameter.
- You must do the following in order to make this work:
- run the Assessment [workflow](#assessment-workflow)
- run the scan tables in mounts [workflow](#EXPERIMENTAL-scan-tables-in-mounts-workflow)
- run the [`create-table-mapping` command](#create-table-mapping-command)
- or manually create a `mapping.csv` file in Workspace -> Applications -> ucx


[[back to top](#databricks-labs-ucx)]

## Jobs Static Code Analysis Workflow

> Please note that this is an experimental workflow.
The `experimental-workflow-linter` workflow lints accessible code belonging to all workflows/jobs present in the
workspace. The linting emits problems indicating what to resolve to make the code Unity Catalog compatible.

![code compatibility problems](docs/code_compatibility_problems.png)
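
For context, the workflow first has to enumerate every job in the workspace before it can lint the code those jobs reference. A rough sketch of that enumeration step using the Databricks SDK (assumes `databricks-sdk` is installed and credentials are configured; this is not the workflow's actual code):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up credentials from the environment or a CLI profile
for job in w.jobs.list():
    name = job.settings.name if job.settings else "<unnamed>"
    print(job.job_id, name)
```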

[[back to top](#databricks-labs-ucx)]

# Utility commands
@@ -974,25 +921,6 @@ clusters to be UC compatible.

[[back to top](#databricks-labs-ucx)]

## `lint-local-code` command

```text
databricks labs ucx lint-local-code
```

At any time, you can run this command to assess all migrations required in a local directory or file. It takes only seconds to run and
gives you an initial overview of what needs to be migrated without actually performing any migration. A great way to start a migration!

This command detects all dependencies and analyzes them. It is still experimental and currently supports only Python and SQL files.
We expect it to run within a minute on code bases of up to 50,000 lines of code.
Future versions of `ucx` will add support for more source types and more migration details.
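
The command is a thin wrapper around the local code linter. A rough sketch of the equivalent programmatic call, based on the `lint_local_code` entry point in `src/databricks/labs/ucx/cli.py` further down in this diff (treat these names as illustrative rather than a stable public API):

```python
from pathlib import Path

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.tui import Prompts

from databricks.labs.ucx.contexts.workspace_cli import LocalCheckoutContext

ctx = LocalCheckoutContext(WorkspaceClient())
ctx.local_code_linter.lint(Prompts(), Path.cwd())  # lint the current working directory
```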

When run from an IDE terminal, this command generates output as follows:
![img.png](docs/lint-local-code-output.png)
With modern IDEs, clicking on the file link opens the file at the problematic line.

[[back to top](#databricks-labs-ucx)]

## `migrate-local-code` command

```text
Binary file removed docs/code_compatibility_problems.png
Binary file removed docs/lint-local-code-output.png
Binary file removed docs/recon_results.png
15 changes: 3 additions & 12 deletions labs.yml
@@ -123,19 +123,13 @@ commands:

- name: create-missing-principals
description: For AWS, this command identifies all the S3 locations that are missing a UC compatible role and
creates them. It accepts a number of optional parameters, i.e. KMS Key, Role Name, Policy Name, and whether to
create a single role for all the S3 locations.
creates them. It takes an optional single-role parameter.
If set to True, it will create a single role for all the S3 locations.
flags:
- name: aws-profile
description: AWS Profile to use for authentication
- name: kms-key
description: (Optional) KMS Key to be specified for the UC roles.
- name: role-name
description: (Optional) IAM Role name to be specified for the UC roles. (default:UC_ROLE)
- name: policy-name
description: (Optional) IAM policy Name to be specified for the UC roles. (default:UC_POLICY)
- name: single-role
description: (Optional) Create a single role for all S3 locations. (default:False)
description: (Optional) Create a single role for all the S3 locations (default:True)

- name: create-uber-principal
description: For azure cloud, creates a service principal and gives STORAGE BLOB READER access on all the storage account
@@ -192,9 +186,6 @@ commands:
- name: migrate-local-code
description: (Experimental) Migrate files in the current directory to be more compatible with Unity Catalog.

- name: lint-local-code
description: (Experimental) Lint files in the current directory to highlight incompatibilities with Unity Catalog.

- name: show-all-metastores
is_account_level: true
description: Show all metastores available in the same region as the specified workspace
8 changes: 4 additions & 4 deletions pyproject.toml
@@ -44,11 +44,11 @@ classifiers = [
"Topic :: Utilities",
]

dependencies = ["databricks-sdk>=0.27,<0.29",
dependencies = ["databricks-sdk~=0.27.0",
"databricks-labs-lsql~=0.4.0",
"databricks-labs-blueprint>=0.6.0",
"PyYAML>=6.0.0,<7.0.0",
"sqlglot>=23.9,<24.1"]
"sqlglot>=23.9,<23.16"]

[project.entry-points.databricks]
runtime = "databricks.labs.ucx.runtime:main"
@@ -87,11 +87,11 @@ path = ".venv"
test = "pytest -n 4 --cov src --cov-report=xml --timeout 30 tests/unit --durations 20"
coverage = "pytest -n auto --cov src tests/unit --timeout 30 --cov-report=html --durations 20"
integration = "pytest -n 10 --cov src tests/integration --durations 20"
fmt = ["black . --extend-exclude 'tests/unit/source_code/samples/*'",
fmt = ["black .",
"ruff check . --fix",
"mypy --disable-error-code 'annotation-unchecked' --exclude 'tests/unit/source_code/samples/*' .",
"pylint --output-format=colorized -j 0 src tests"]
verify = ["black --check . --extend-exclude 'tests/unit/source_code/samples/*'",
verify = ["black --check .",
"ruff .",
"mypy --exclude 'tests/unit/source_code/samples/*' .",
"pylint --output-format=colorized -j 0 src tests"]
7 changes: 1 addition & 6 deletions src/databricks/labs/ucx/assessment/aws.py
@@ -330,12 +330,7 @@ def update_uc_trust_role(self, role_name: str, external_id: str = "0000") -> str
return update_role["Role"]["Arn"]

def put_role_policy(
self,
role_name: str,
policy_name: str,
s3_prefixes: set[str],
account_id: str,
kms_key=None,
self, role_name: str, policy_name: str, s3_prefixes: set[str], account_id: str, kms_key=None
) -> bool:
if not self._run_command(
f"iam put-role-policy --role-name {role_name} --policy-name {policy_name} "
3 changes: 2 additions & 1 deletion src/databricks/labs/ucx/aws/access.py
@@ -33,13 +33,14 @@ def __init__(
ws: WorkspaceClient,
aws_resources: AWSResources,
external_locations: ExternalLocations,
aws_account_id=None,
kms_key=None,
):
self._installation = installation
self._aws_resources = aws_resources
self._ws = ws
self._locations = external_locations
self._aws_account_id = aws_resources.validate_connection().get("Account")
self._aws_account_id = aws_account_id
self._kms_key = kms_key

def list_uc_roles(self, *, single_role=True, role_name="UC_ROLE", policy_name="UC_POLICY"):
6 changes: 2 additions & 4 deletions src/databricks/labs/ucx/aws/credentials.py
@@ -205,12 +205,10 @@ def _print_action_plan(iam_list: list[AWSUCRoleCandidate]):
def save(self, migration_results: list[CredentialValidationResult]) -> str:
return self._installation.save(migration_results, filename=self._output_file)

def run(self, prompts: Prompts, *, single_role=False, role_name="UC_ROLE", policy_name="UC_POLICY"):
def run(self, prompts: Prompts, *, single_role=True, role_name="UC_ROLE", policy_name="UC_POLICY"):

iam_list = self._resource_permissions.list_uc_roles(
single_role=single_role,
role_name=role_name,
policy_name=policy_name,
single_role=single_role, role_name=role_name, policy_name=policy_name
)
if not iam_list:
logger.info("No IAM Role created")
22 changes: 5 additions & 17 deletions src/databricks/labs/ucx/cli.py
@@ -11,7 +11,7 @@

from databricks.labs.ucx.config import WorkspaceConfig
from databricks.labs.ucx.contexts.account_cli import AccountContext
from databricks.labs.ucx.contexts.workspace_cli import WorkspaceContext, LocalCheckoutContext
from databricks.labs.ucx.contexts.workspace_cli import WorkspaceContext
from databricks.labs.ucx.hive_metastore.tables import What

ucx = App(__file__)
@@ -293,19 +293,17 @@ def create_missing_principals(
w: WorkspaceClient,
prompts: Prompts,
ctx: WorkspaceContext | None = None,
single_role: bool = False,
role_name="UC_ROLE",
policy_name="UC_POLICY",
single_role: bool = True,
**named_parameters,
):
"""Not supported for Azure.
For AWS, this command identifies all the S3 locations that are missing a UC compatible role and creates them.
By default, it will create a role per S3 location. Set the optional single_role parameter to True to create a single role for all S3 locations.
By default, it will create a single role for all S3. Set the optional single_role parameter to False, to create one role per S3 location.
"""
if not ctx:
ctx = WorkspaceContext(w, named_parameters)
if ctx.is_aws:
return ctx.iam_role_creation.run(prompts, single_role=single_role, role_name=role_name, policy_name=policy_name)
return ctx.iam_role_creation.run(prompts, single_role=single_role)
raise ValueError("Unsupported cloud provider")


@@ -398,7 +396,7 @@ def revert_cluster_remap(w: WorkspaceClient, prompts: Prompts):
@ucx.command
def migrate_local_code(w: WorkspaceClient, prompts: Prompts):
"""Fix the code files based on their language."""
ctx = LocalCheckoutContext(w)
ctx = WorkspaceContext(w)
working_directory = Path.cwd()
if not prompts.confirm("Do you want to apply UC migration to all files in the current directory?"):
return
@@ -471,15 +469,5 @@ def revert_dbsql_dashboards(w: WorkspaceClient, dashboard_id: str | None = None):
ctx.redash.revert_dashboards(dashboard_id)


@ucx.command
def lint_local_code(
w: WorkspaceClient, prompts: Prompts, path: str | None = None, ctx: LocalCheckoutContext | None = None
):
"""Lint local code files looking for problems."""
if ctx is None:
ctx = LocalCheckoutContext(w)
ctx.local_code_linter.lint(prompts, None if path is None else Path(path))


if __name__ == "__main__":
ucx()
4 changes: 2 additions & 2 deletions src/databricks/labs/ucx/config.py
@@ -58,8 +58,8 @@ class WorkspaceConfig: # pylint: disable=too-many-instance-attributes
# List of workspace ids ucx is installed on, only applied to account-level installation
installed_workspace_ids: list[int] | None = None

# Threshold for row count comparison during data reconciliation, in percentage
recon_tolerance_percent: int = 5
# Whether to upload dependent libraries to the workspace
upload_dependencies: bool = False

# Whether to upload dependent libraries to the workspace
upload_dependencies: bool = False
