diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000000..127bf0a5f5
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,76 @@
+# Based on https://imfing.github.io/hextra/docs/guide/deploy-site/
+name: Deploy docs
+
+on:
+  # Runs on pushes targeting the default branch
+  push:
+    branches: ["main"]
+
+  # Allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
+
+# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
+# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+# Default to bash
+defaults:
+  run:
+    shell: bash
+
+jobs:
+  # Build job
+  build:
+    runs-on: ubuntu-latest
+    env:
+      HUGO_VERSION: 0.138.0
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # fetch all history for .GitInfo and .Lastmod
+          submodules: recursive
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.22'
+      - name: Setup Pages
+        id: pages
+        uses: actions/configure-pages@v4
+      - name: Setup Hugo
+        run: |
+          wget -O ${{ runner.temp }}/hugo.deb https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_extended_${HUGO_VERSION}_linux-amd64.deb \
+          && sudo dpkg -i ${{ runner.temp }}/hugo.deb
+      - name: Build with Hugo
+        env:
+          # For maximum backward compatibility with Hugo modules
+          HUGO_ENVIRONMENT: production
+          HUGO_ENV: production
+        run: |
+          cd docs && hugo \
+            --gc --minify \
+            --baseURL "${{ steps.pages.outputs.base_url }}/"
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: docs/public
+
+  # Deployment job
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 8a030e881a..a62b36e9f0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -152,5 +152,4 @@ dev/cleanup.py
 .python-version
 .databricks-login.json
 
-*.out
-foo
+*.out
\ No newline at end of file
diff --git a/Makefile b/Makefile
index 4b62910e46..caff5f58ea 100644
--- a/Makefile
+++ b/Makefile
@@ -31,4 +31,12 @@ known:
 solacc:
 	hatch run python tests/integration/source_code/solacc.py
 
-.PHONY: all clean dev lint fmt test integration coverage known solacc
+docs:
+	cd docs && hugo server --buildDrafts --disableFastRender
+
+docs-test:
+	@echo 'Make sure to run `make docs` in another terminal'
+	cd docs && yarn linkinator http://localhost:1313/ucx/
+
+
+.PHONY: all clean dev lint fmt test integration coverage known solacc docs docs-test
diff --git a/README.md b/README.md
index c43c3f4c62..0f06d0b42d 100644
--- a/README.md
+++ b/README.md
@@ -1,2362 +1,42 @@
-Databricks Labs UCX
-===
-![UCX by Databricks Labs](docs/logo-no-background.png)
+# UCX by Databricks Labs
-The companion for upgrading to Unity Catalog (UC).
+
+
+
+
+
+  🚀 UCX - Unity Catalog Migration Assistant
+
-The [README notebook](#readme-notebook), which can be found in the installation folder contains further instructions and explanations of the different ucx workflows & dashboards. -Once the migration is scoped, you can start with the [table migration process](#Table-Migration). +UCX is a command line tool that helps you migrate your Databricks workspace to Unity Catalog. UCX provides a set of commands to help you migrate your tables, dashboards, and notebooks to be UC compatible. -More workflows, like notebook code migration are coming in future releases. - - -UCX also provides a number of command line utilities accessible via `databricks labs ucx`. - -For questions, troubleshooting or bug fixes, please see our [troubleshooting guide](docs/troubleshooting.md) or submit [an issue](https://github.com/databrickslabs/ucx/issues). -See [contributing instructions](CONTRIBUTING.md) to help improve this project. - [![build](https://github.com/databrickslabs/ucx/actions/workflows/push.yml/badge.svg)](https://github.com/databrickslabs/ucx/actions/workflows/push.yml) [![codecov](https://codecov.io/github/databrickslabs/ucx/graph/badge.svg?token=p0WKAfW5HQ)](https://codecov.io/github/databrickslabs/ucx) ![linesofcode](https://aschey.tech/tokei/github/databrickslabs/ucx?category=code) - -* [Databricks Labs UCX](#databricks-labs-ucx) -* [Installation](#installation) - * [Installation requirements](#installation-requirements) - * [Authenticate Databricks CLI](#authenticate-databricks-cli) - * [Install UCX](#install-ucx) - * [Installation resources](#installation-resources) - * [Installation folder](#installation-folder) - * [Readme notebook](#readme-notebook) - * [Debug notebook](#debug-notebook) - * [Debug logs](#debug-logs) - * [Installation configuration](#installation-configuration) - * [Advanced installation options](#advanced-installation-options) - * [Force install over existing UCX](#force-install-over-existing-ucx) - * [Installing UCX on all workspaces within a 
Databricks account](#installing-ucx-on-all-workspaces-within-a-databricks-account) - * [Installing UCX with company hosted PYPI mirror](#installing-ucx-with-company-hosted-pypi-mirror) - * [Upgrading UCX for newer versions](#upgrading-ucx-for-newer-versions) - * [Uninstall UCX](#uninstall-ucx) -* [Migration process](#migration-process) - * [Table migration process](#table-migration-process) - * [Table mapping](#table-mapping) - * [Step 1 : Create the mapping file](#step-1--create-the-mapping-file) - * [Step 2: Update the mapping file](#step-2-update-the-mapping-file) - * [Data access](#data-access) - * [Step 1 : Map cloud principals to cloud storage locations](#step-1--map-cloud-principals-to-cloud-storage-locations) - * [Step 2 : Create or modify cloud principals and credentials](#step-2--create-or-modify-cloud-principals-and-credentials) - * [Step 3: Create the "uber" Principal](#step-3-create-the-uber-principal) - * [New Unity Catalog resources](#new-unity-catalog-resources) - * [Step 0: Attach a metastore](#step-0-attach-a-metastore) - * [Step 1: Create external Locations](#step-1-create-external-locations) - * [Step 2: Create Catalogs and Schemas](#step-2-create-catalogs-and-schemas) - * [Migrate Hive metastore data objects](#migrate-hive-metastore-data-objects) - * [Odds and Ends](#odds-and-ends) - * [Skip migrating schemas, tables or views](#skip-migrating-schemas-tables-or-views) - * [Move data objects](#move-data-objects) - * [Alias data objects](#alias-data-objects) - * [Revert migrated data objects](#revert-migrated-data-objects) -* [Workflows](#workflows) - * [Assessment workflow](#assessment-workflow) - * [Group migration workflow](#group-migration-workflow) - * [Table migration workflows](#table-migration-workflows) - * [Migrate tables](#migrate-tables) - * [Migrate external Hive SerDe tables](#migrate-external-hive-serde-tables) - * [Migrate external tables CTAS](#migrate-external-tables-ctas) - * [Post-migration data reconciliation 
workflow](#post-migration-data-reconciliation-workflow) - * [[LEGACY] Scan tables in mounts Workflow](#legacy-scan-tables-in-mounts-workflow) - * [Always run this workflow AFTER the assessment has finished](#balways-run-this-workflow-after-the-assessment-has-finishedb) - * [[LEGACY] Migrate tables in mounts Workflow](#legacy-migrate-tables-in-mounts-workflow) - * [[EXPERIMENTAL] Migration Progress Workflow](#experimental-migration-progress-workflow) -* [Dashboards](#dashboards) -* [Linter message codes](#linter-message-codes) - * [`cannot-autofix-table-reference`](#cannot-autofix-table-reference) - * [`catalog-api-in-shared-clusters`](#catalog-api-in-shared-clusters) - * [`changed-result-format-in-uc`](#changed-result-format-in-uc) - * [`direct-filesystem-access-in-sql-query`](#direct-filesystem-access-in-sql-query) - * [`direct-filesystem-access`](#direct-filesystem-access) - * [`dependency-not-found`](#dependency-not-found) - * [`jvm-access-in-shared-clusters`](#jvm-access-in-shared-clusters) - * [`legacy-context-in-shared-clusters`](#legacy-context-in-shared-clusters) - * [`not-supported`](#not-supported) - * [`notebook-run-cannot-compute-value`](#notebook-run-cannot-compute-value) - * [`python-udf-in-shared-clusters`](#python-udf-in-shared-clusters) - * [`rdd-in-shared-clusters`](#rdd-in-shared-clusters) - * [`spark-logging-in-shared-clusters`](#spark-logging-in-shared-clusters) - * [`sql-parse-error`](#sql-parse-error) - * [`sys-path-cannot-compute-value`](#sys-path-cannot-compute-value) - * [`table-migrated-to-uc`](#table-migrated-to-uc) - * [`to-json-in-shared-clusters`](#to-json-in-shared-clusters) - * [`unsupported-magic-line`](#unsupported-magic-line) -* [Utility commands](#utility-commands) - * [`logs` command](#logs-command) - * [`ensure-assessment-run` command](#ensure-assessment-run-command) - * [`update-migration-progress` command](#update-migration-progress-command) - * [`repair-run` command](#repair-run-command) - * [`workflows` 
command](#workflows-command) - * [`open-remote-config` command](#open-remote-config-command) - * [`installations` command](#installations-command) - * [`report-account-compatibility` command](#report-account-compatibility-command) - * [`export-assessment` command](#export-assessment-command) -* [Metastore related commands](#metastore-related-commands) - * [`show-all-metastores` command](#show-all-metastores-command) - * [`assign-metastore` command](#assign-metastore-command) - * [`create-ucx-catalog` command](#create-ucx-catalog-command) -* [Table migration commands](#table-migration-commands) - * [`principal-prefix-access` command](#principal-prefix-access-command) - * [Access for AWS S3 Buckets](#access-for-aws-s3-buckets) - * [Access for Azure Storage Accounts](#access-for-azure-storage-accounts) - * [`create-missing-principals` command (AWS Only)](#create-missing-principals-command-aws-only) - * [`delete-missing-principals` command (AWS Only)](#delete-missing-principals-command-aws-only) - * [`create-uber-principal` command](#create-uber-principal-command) - * [`migrate-credentials` command](#migrate-credentials-command) - * [`validate-external-locations` command](#validate-external-locations-command) - * [`migrate-locations` command](#migrate-locations-command) - * [`create-table-mapping` command](#create-table-mapping-command) - * [`skip` command](#skip-command) - * [`unskip` command](#unskip-command) - * [`create-catalogs-schemas` command](#create-catalogs-schemas-command) - * [`assign-owner-group` command](#assign-owner-group-command) - * [`migrate-tables` command](#migrate-tables-command) - * [`revert-migrated-tables` command](#revert-migrated-tables-command) - * [`move` command](#move-command) - * [`alias` command](#alias-command) -* [Code migration commands](#code-migration-commands) - * [`lint-local-code` command](#lint-local-code-command) - * [`migrate-local-code` command](#migrate-local-code-command) - * [`migrate-dbsql-dashboards` 
command](#migrate-dbsql-dashboards-command) - * [`revert-dbsql-dashboards` command](#revert-dbsql-dashboards-command) -* [Cross-workspace installations](#cross-workspace-installations) - * [`sync-workspace-info` command](#sync-workspace-info-command) - * [`manual-workspace-info` command](#manual-workspace-info-command) - * [`create-account-groups` command](#create-account-groups-command) - * [`validate-groups-membership` command](#validate-groups-membership-command) - * [`validate-table-locations` command](#validate-table-locations-command) - * [`cluster-remap` command](#cluster-remap-command) - * [`revert-cluster-remap` command](#revert-cluster-remap-command) - * [`upload` command](#upload-command) - * [`download` command](#download-command) - * [`join-collection` command](#join-collection-command) - * [collection eligible command](#collection-eligible-command) -* [Common Challenges and the Solutions](#common-challenges-and-the-solutions) - * [Network Connectivity Issues](#network-connectivity-issues) - * [Insufficient Privileges](#insufficient-privileges) - * [Version Issues](#version-issues) - * [Authentication Issues](#authentication-issues) - * [Multiple Profiles in Databricks CLI](#multiple-profiles-in-databricks-cli) - * [Workspace has an external Hive Metastore (HMS)](#workspace-has-an-external-hive-metastore-hms) - * [Verify the Installation](#verify-the-installation) -* [Star History](#star-history) -* [Project Support](#project-support) - - -# Installation - -UCX installation is covered by this section. - -## Installation requirements - -UCX has the following installation requirements: -- Databricks CLI v0.213 or later. See [instructions](#authenticate-databricks-cli). -- Python 3.10 or later. See [Windows](https://www.python.org/downloads/windows/) instructions. -- Databricks Premium or Enterprise workspace. -- Network access to your Databricks Workspace used for the [installation process](#install-ucx). 
-- Network access to the Internet for [pypi.org](https://pypi.org) and [github.com](https://github.com) from machine running the installation.
-- Databricks Workspace Administrator privileges for the user, that runs the installation. Running UCX as a Service Principal is not supported.
-- Account level Identity Setup. See instructions for [AWS](https://docs.databricks.com/en/administration-guide/users-groups/best-practices.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/administration-guide/users-groups/best-practices), and [GCP](https://docs.gcp.databricks.com/administration-guide/users-groups/best-practices.html).
-- Unity Catalog Metastore Created (per region). See instructions for [AWS](https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/create-metastore), and [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/create-metastore.html).
-- If your Databricks Workspace relies on an external Hive Metastore (such as AWS Glue), make sure to read [this guide](docs/external_hms_glue.md).
-- A PRO or Serverless SQL Warehouse to render the [report](docs/assessment.md) for the [assessment workflow](#assessment-workflow).
-
-Once you [install UCX](#install-ucx), you can proceed to the [assessment workflow](#assessment-workflow) to ensure
-the compatibility of your workspace with Unity Catalog.
-
-[[back to top](#databricks-labs-ucx)]
-
-## Authenticate Databricks CLI
-
-We only support installations and upgrades through [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html), as UCX requires an installation script run
-to make sure all the necessary and correct configurations are in place.
-Install Databricks CLI on macOS:
-![macos_install_databricks](docs/macos_1_databrickslabsmac_installdatabricks.gif)
-
-Install Databricks CLI on Windows:
-![windows_install_databricks.png](docs/windows_install_databricks.png)
-
-Once you install Databricks CLI, authenticate your current machine to a Databricks Workspace:
-
-```commandline
-databricks auth login --host WORKSPACE_HOST
-```
-
-To enable debug logs, simply add `--debug` flag to any command.
-
-[[back to top](#databricks-labs-ucx)]
-
-## Install UCX
-
-Install UCX via Databricks CLI:
-
-```commandline
-databricks labs install ucx
-```
-
-You'll be prompted to select a [configuration profile](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication) created by `databricks auth login` command.
-
-Once you install, proceed to the [assessment workflow](#assessment-workflow) to ensure the compatibility of your workspace with UCX.
-
-The `WorkspaceInstaller` class is used to create a new configuration for Unity Catalog migration in a Databricks workspace.
-It guides the user through a series of prompts to gather necessary information, such as selecting an inventory database, choosing
-a PRO or SERVERLESS SQL warehouse, specifying a log level and number of threads, and setting up an external Hive Metastore if necessary.
-Upon the first installation, you're prompted for a workspace local [group migration strategy](docs/group_name_conflict.md).
-Based on user input, the class creates a new cluster policy with the specified configuration. The user can review and confirm the configuration,
-which is saved to the workspace and can be opened in a web browser.
-
-The [`WorkspaceInstallation`](src/databricks/labs/ucx/install.py) manages the installation and uninstallation of UCX in a workspace. It handles
-the configuration and exception management during the installation process. The installation process creates dashboards, databases, and jobs.
-It also includes the creation of a database with given configuration and the deployment of workflows with specific settings. The installation
-process can handle exceptions and infer errors from job runs and task runs. The workspace installation uploads wheels, creates cluster policies,
-and wheel runners to the workspace. It can also handle the creation of job tasks for a given task, such as job dashboard tasks, job notebook tasks,
-and job wheel tasks. The class handles the installation of UCX, including configuring the workspace, installing necessary libraries, and verifying
-the installation, making it easier for users to migrate their workspaces to UCX.
-At the end of the installation, the user will be prompted if the current installation needs to join an existing collection (create new collection if none present).
-For large organization with many workspaces, grouping workspaces into collection helps in managing UCX migration at collection level (instead of workspaces level)
-User should be an account admin to be able to join a collection.
-
-After this, UCX will be installed locally and a number of assets will be deployed in the selected workspace.
-These assets are available under the installation folder, i.e. `/Applications/ucx` is the default installation folder. Please check [here](#advanced-force-install-over-existing-ucx) for more details.
-
-You can also install a specific version by specifying it like `@v0.13.2` - `databricks labs install ucx@v0.13.2`.
-
-![macos_install_ucx](docs/macos_2_databrickslabsmac_installucx.gif)
-
-[[back to top](#databricks-labs-ucx)]
-
-## Installation resources
-
-The following resources are installed by UCX:
-
-| Installed UCX resources                           | Description                                                                                      |
-|---------------------------------------------------|--------------------------------------------------------------------------------------------------|
-| [Inventory database](./docs/table_persistence.md) | A Hive metastore database/schema in which UCX persist inventory required for the upgrade process |
-| [Workflows](#workflows)                           | Workflows to execute UCX                                                                         |
-| [Dashboards](#dashboards)                         | Dashboards to visualize UCX outcomes                                                             |
-| [Installation folder](#installation-folder)       | A workspace folder containing UCX files in `/Applications/ucx/`.                                 |
-
-## Installation folder
-
-UCX is in installed in the workspace folder `/Applications/ucx/`. This folder contains UCX's code resources, like the
-[source code](./src) from this GitHub repository and the [dashboard](#dashboards). Generally, these resources are not
-*directly* used by UCX users. Resources that can be of importance to users are detailed in the subsections below.
-
-### Readme notebook
-
-![readme](docs/readme-notebook.png)
-
-Every installation creates a `README` notebook with a detailed description of all deployed workflows and their tasks,
-providing quick links to the relevant workflows and dashboards.
-
-[[back to top](#databricks-labs-ucx)]
-
-### Debug notebook
-
-![debug](docs/debug-notebook.png)
-
-Every installation creates a `DEBUG` notebook, that initializes UCX as a library for you to execute interactively.
-
-[[back to top](#databricks-labs-ucx)]
-
-### Debug logs
-
-![debug](docs/debug-logs.png)
-
-The [workflow](#workflows) runs store debug logs in the `logs` folder of the installation folder. The logs are flushed
-every minute in a separate file.
-Debug logs for [the command-line interface](#authenticate-databricks-cli) are shown
-by adding the `--debug` flag:
-
-```commandline
-databricks --debug labs ucx