- A notebook linter to detect DBFS references within notebook cells (#1393). A new linter has been implemented in the open-source library to identify references to Databricks File System (DBFS) mount points or folders within SQL and Python cells of notebooks, raising Advisory or Deprecated alerts when detected. This feature, resolving issue #1108, enhances code maintainability by discouraging DBFS usage and improves security by avoiding hard-coded DBFS paths. The linter parses the code and searches for Table elements within statements, raising warnings when DBFS references are found. Implementation changes include updates to the `NotebookLinter` class, a new `from_source` class method, and an `original_offset` argument in the `Cell` class. The linter now also supports the `databricks` dialect for SQL code parsing.
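  As a rough illustration of the pattern such a linter looks for, here is a minimal sketch (not the UCX implementation) that scans string literals in Python source for DBFS-style paths; the prefix list is an assumption for the example:

  ```python
  import ast

  DEPRECATED_PREFIXES = ("/dbfs/", "dbfs:/", "/mnt/")  # assumed, for illustration only

  def find_dbfs_references(source: str) -> list[tuple[int, str]]:
      """Return (line, literal) pairs for string constants that look like DBFS paths."""
      findings = []
      for node in ast.walk(ast.parse(source)):
          if isinstance(node, ast.Constant) and isinstance(node.value, str):
              if node.value.startswith(DEPRECATED_PREFIXES):
                  findings.append((node.lineno, node.value))
      return findings

  print(find_dbfs_references('df = spark.read.csv("/dbfs/mnt/landing/raw.csv")'))
  # -> [(1, '/dbfs/mnt/landing/raw.csv')]
  ```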
- Added CLI commands to trigger table migration workflow (#1511). A new `migrate_tables` command has been added to the `databricks.labs.ucx.cli` module, which triggers the `migrate-tables` workflow and, optionally, the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate-tables` workflow manages table migrations, while the `migrate-external-hiveserde-tables-in-place-experimental` workflow handles migrations for external hiveserde tables. The new `What` class from the `databricks.labs.ucx.hive_metastore.tables` module is used to identify hiveserde tables. If hiveserde tables are detected, the user is prompted to confirm running the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate_tables` command requires WorkspaceClient and Prompts objects and accepts an optional WorkspaceContext object, which defaults to the WorkspaceContext of the WorkspaceClient if not provided. Additionally, a new `migrate_external_hiveserde_tables_in_place` command has been added that runs the `migrate-external-hiveserde-tables-in-place-experimental` workflow if it finds any hiveserde tables, making it easier to manage table migrations from the command line.
- Added CSV, JSON and include path in mounts (#1329). In this release, the TablesInMounts function has been enhanced to support CSV and JSON file formats, along with the existing Parquet and Delta table formats. The new `include_paths_in_mount` parameter has been introduced, enabling users to specify a list of paths to crawl within all mounts. The WorkspaceConfig class in the config.py file has been updated to accommodate these changes. Additionally, a new `_assess_path` method has been introduced to assess the format of a given file and return a `TableInMount` object accordingly. Several existing methods, such as `_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and `_path_is_delta`, have been updated to reflect these improvements. Furthermore, two new unit tests, `test_mount_include_paths` and `test_mount_listing_csv_json`, have been added to ensure the proper functioning of the TablesInMounts function with the new file formats and the `include_paths_in_mount` parameter. These changes improve the functionality and flexibility of the TablesInMounts crawler, allowing for more precise crawling and identification of tables based on specific file formats and paths.
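  The format assessment can be pictured as follows; this is a simplified sketch under assumed naming, not the actual `_assess_path` code:

  ```python
  from pathlib import Path

  def assess_format(folder: Path) -> str | None:
      """Guess a table format from a folder's contents (illustrative heuristic)."""
      entries = list(folder.iterdir())
      if any(e.name == "_delta_log" for e in entries):
          return "DELTA"
      suffixes = {e.suffix for e in entries if e.is_file()}
      for fmt, suffix in (("PARQUET", ".parquet"), ("CSV", ".csv"), ("JSON", ".json")):
          if suffix in suffixes:
              return fmt
      return None
  ```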
- Added CTAS migration workflow for external tables that cannot be migrated in place (#1510). In this release, we have added a new CTAS (Create Table As Select) migration workflow for external tables that cannot be migrated in-place. This feature includes a `MigrateExternalTablesCTAS` class with three tasks: migrating non-SYNC-supported and non-HiveSerde external tables, migrating HiveSerde tables, and migrating views from the Hive metastore to the Unity Catalog. We have also added new methods for managed and external table migration, deprecated old methods, and added a new test function to ensure proper CTAS migration for external tables using HiveSerDe. This change also introduces a new JSON file for external table configurations and a mock backend to simulate the Hive metastore and test the migration process. Overall, these changes improve the migration capabilities for external tables and ensure a more flexible and reliable migration process.
- Added Python linter for table creation with implicit format (#1435). A new linter has been added to advise on implicit table formats when the `writeTo`, `table`, `insertInto`, or `saveAsTable` methods are invoked without an explicit format specified in the same chain of calls. This is relevant for software engineers working with Databricks Runtime (DBR) v8.0 and later, where the default table format changed from `parquet` to `delta`. The linter, implemented in `table_creation.py`, utilizes reusable AST utilities from `python_ast_util.py` and is not automated, providing advice instead of fixing the code. The linter skips linting when a DBR version of 8.0 or higher is passed, since those versions already default to Delta. Unit tests have been added for both files as part of the code migration workflow.
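  A toy version of the check, walking a writer call chain and reporting creator methods with no `format()` call in the same chain (the real linter works on full ASTs and DBR version information):

  ```python
  import ast

  CREATORS = {"writeTo", "table", "insertInto", "saveAsTable"}

  def has_implicit_format(expr: str) -> bool:
      """True if a writer chain creates a table without an explicit .format(...) call."""
      node = ast.parse(expr, mode="eval").body
      saw_creator = saw_format = False
      while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
          saw_creator |= node.func.attr in CREATORS
          saw_format |= node.func.attr == "format"
          node = node.func.value  # walk down the call chain
      return saw_creator and not saw_format

  print(has_implicit_format('df.write.saveAsTable("t")'))                  # True
  print(has_implicit_format('df.write.format("delta").saveAsTable("t")'))  # False
  ```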
- Added support for migrating Table ACL of interactive clusters using SPN (#1077). This change introduces support for migrating table Access Control Lists (ACLs) of interactive clusters using a Service Principal Name (SPN) for Azure Databricks environments in the UCX project. It includes modifications to the `hive_metastore` and `workspace_access` modules, as well as the addition of new classes, methods, and import statements for handling ACLs and grants. This feature enables more secure and granular control over table permissions when using SPN authentication for interactive clusters in Azure, benefiting software engineers who work with interactive clusters in Azure Databricks.
- Added support for migrating Schema/Catalog ACL for interactive clusters (#1413). This commit adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, with partial fixes for issues #1192 and #1193. The changes identify and filter database ACL grants, create mappings from Hive metastore schema to Unity Catalog schema and catalog, and replace Hive metastore actions with equivalent Unity Catalog actions for both schema and catalog. External location permissions are not included in this commit and will be addressed separately. New methods for creating mappings, updating principal ACLs, and getting catalog schema grants have been added, and existing functionality has been modified to handle both AWS and Azure. The code has undergone manual testing and passed unit and integration tests.
- Added `databricks labs ucx logs` command (#1350). A new command, `databricks labs ucx logs`, has been added to enhance logging and debugging capabilities. It allows developers and administrators to view logs from the latest job run or specify a particular workflow name to display its logs. By default, logs with levels INFO, WARNING, and ERROR are shown, but the `--debug` flag can be used for more detailed DEBUG logs. This feature utilizes the `relay_logs` method from the `deployed_workflows` object in the WorkspaceContext class and addresses issue #1282. The addition of this command improves the usability and maintainability of the framework, making it easier for users to diagnose and resolve issues.
- Added check for DBFS mounts in SQL code (#1351). A new feature has been introduced to check for Databricks File System (DBFS) mounts within SQL code, enhancing data management and accessibility in the Databricks environment. The `dbfsqueries.py` file in the `databricks/labs/ucx/source_code` directory now includes a function that verifies the presence of DBFS mounts in SQL queries and returns appropriate messages. The `Languages` class in the `__init__` method has been updated to incorporate a new class, `FromDbfsFolder`, which replaces the existing `from_table` linter with a new linter, `DBFSUsageLinter`, for handling DBFS usage in SQL code. New methods check for deprecated DBFS mounts in SQL code and return deprecation warnings as needed. These enhancements ensure more robust handling of DBFS mounts throughout the system, allowing for better integration and management of DBFS-related issues in SQL-based operations.
- Added check for circular view dependency (#1502). A circular view dependency check has been implemented to prevent issues caused by circular dependencies in views. This includes a new test for chained circular dependencies (A->B, B->C, C->A) and an update to the existing circular view dependency test. The checks have been implemented through modifications to the tests in `test_views_sequencer.py`, including a new test method and an update to an existing one. If any circular dependencies are encountered during migration, a ValueError with an error message will be raised. The changes include updates to the `tables_and_views.json` file, with the addition of a new view `v12` that depends on `v11`, creating a circular dependency. The existing `_next_batch` method has been changed and two new methods, `_check_circular_dependency` and `_get_view_instance`, have been introduced; the changes are covered by unit tests.
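  The underlying check can be sketched as a depth-first walk that raises once a view reappears on the current path; names and graph shape below are invented for illustration:

  ```python
  def check_no_cycles(deps: dict[str, set[str]]) -> None:
      """Raise ValueError if the view dependency graph contains a cycle."""
      def visit(view: str, path: tuple[str, ...]) -> None:
          if view in path:
              raise ValueError(f"Circular view dependency: {' -> '.join(path + (view,))}")
          for child in deps.get(view, ()):
              visit(child, path + (view,))
      for view in deps:
          visit(view, ())

  check_no_cycles({"a": {"b"}, "b": {"c"}})                    # fine
  # check_no_cycles({"a": {"b"}, "b": {"c"}, "c": {"a"}})      # raises ValueError
  ```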
- Added commands for metastores listing & assignment (#1489). This commit introduces new commands for handling metastores in the Databricks Labs Unity Catalog (UCX) tool, enabling more efficient management of metastores. The `databricks labs ucx assign-metastore` command automatically assigns a metastore to a specified workspace when possible, while the `databricks labs ucx show-all-metastores` command displays all possible metastores that can be assigned to a workspace. These changes include new methods for handling metastores in the account and workspace classes, as well as new user documentation, manual testing, and unit tests. Additional information on the UCX metastore commands is provided in the README.md file.
- Added functionality to migrate external tables using Create Table (No Sync) (#1432). A new feature has been implemented for migrating external tables in Databricks' Hive metastore using the "Create Table (No Sync)" method. This feature includes two new methods, `_migrate_non_sync_table` and `_get_create_in_place_sql`, for handling migration and SQL query generation. The existing methods `_migrate_dbfs_root_table` and `_migrate_acl` have also been updated. The SQL parsing library sqlglot is used to replace the current table name with the updated catalog and to change the CREATE statement to CREATE IF NOT EXISTS. A test case has been added to demonstrate migration of external tables while preserving their location and properties. This new functionality provides more flexibility in managing migrations for specific use cases.
- Added initial version of account-level installer (#1339). A new account-level installer has been added to the UCX library, allowing account administrators to install UCX on all workspaces within an account in a single operation. The installer authenticates to the account, prompts the user for configuration of the first workspace, runs the installation, and offers to repeat the process for all remaining workspaces. This is achieved through a new `prompt_for_new_installation` method which saves user responses to a new `InstallationConfig` data class, allowing for reuse in other workspaces. The existing `databricks labs install ucx` command now supports account-level installation when the `UCX_FORCE_INSTALL` environment variable is set to `account`. The changes have been manually tested and include updates to documentation and error handling for `PermissionDenied`, `NotFound`, and `ValueError` exceptions. Additionally, a new `AccountInstaller` class has been added to manage the installation process at the account level.
- Added linting for DBFS usage (#1341). A new linter, `DBFSUsageLinter`, has been added to check for deprecated file system paths in Python code, specifically for Databricks File System (DBFS) usage. Implemented as part of the `databricks.labs.ucx.source_code` package in the `languages.py` file, this linter defines a visitor, `DetectDbfsVisitor`, that detects file system paths in the code and checks them against a list of known deprecated paths. If a match is found, it creates a Deprecation or Advisory object with information about the deprecated code, including the line number and column offset, and adds it to a list. This feature assists in identifying and removing deprecated file system paths from the codebase, ensuring consistent and proper use of DBFS within the project.
- Added log task to parse logs and store the logs in the ucx database (#1272). A new log task has been added to parse logs and store them in the ucx database; it is appended as a log crawler task to all workflows after other tasks have completed. The LogRecord has been updated to include all necessary fields, and logs below a certain minimum level are no longer stored. A new CLI command to retrieve errors and warnings from the latest workflow run has been added, while existing commands and workflows have been modified. User documentation has been updated, and new methods have been added for log parsing and storage. A new table called `logs` has been added to the database, and unit and integration tests have been added to ensure functionality. This change also resolves issues #1148 and #1283, with modifications to existing classes such as RuntimeContext, TaskRunWarningRecorder, and LogRecord, and the addition of new classes and methods including HiveMetastoreLineageEnabler and LogRecord in the logs.py file. The deploy_schema function has been updated to include the new table, and the existing command `databricks labs ucx` has been modified to accommodate the new log functionality. Existing workflows have been updated and a new workflow has been added, all of which are tested through unit tests, integration tests, and manual testing. The `TaskLogger` class and `TaskRunWarningRecorder` class are used to log and record task run data, with the `parse_logs` method used to parse log files into partial log records, which are then used to create snapshot rows in the `logs` table.
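  Conceptually, the parsing step looks something like the sketch below; the actual `LogRecord` fields and log-line format in UCX differ:

  ```python
  import re
  from dataclasses import dataclass

  LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) \[(?P<name>[^\]]+)\] (?P<msg>.*)$")
  LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
  MINIMUM_LEVEL = "WARNING"  # records below this level are dropped

  @dataclass
  class LogRecord:
      timestamp: str
      level: str
      component: str
      message: str

  def parse_logs(lines):
      for line in lines:
          m = LINE.match(line)
          if m and LEVELS.index(m["level"]) >= LEVELS.index(MINIMUM_LEVEL):
              yield LogRecord(m["ts"], m["level"], m["name"], m["msg"])

  for record in parse_logs(["2024-05-01 12:00:00 ERROR [migrate] table not found"]):
      print(record)
  ```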
- Added migration for non delta dbfs tables using Create Table As Select (CTAS). Convert such tables to Delta tables (#1434). In this release, we've developed new methods to migrate non-Delta DBFS root tables to managed Delta tables, enhancing compatibility with various table formats and configurations. We've added support for safer SQL statement generation in our Create Table As Select (CTAS) functionality and incorporated new creation methods. Additionally, we've introduced grant assignments during the migration process and updated integration tests. The changes include the addition of a `TablesMigrator` class with an updated `migrate_tables` method, a new `PrincipalACL` parameter, and the `test_dbfs_non_delta_tables_should_produce_proper_queries` function to test the migration of non-Delta DBFS tables to managed Delta tables. These improvements promote safer CTAS functionality and expanded compatibility for non-Delta DBFS root tables.
- Added support for %pip cells (#1401). A new cell type, %pip, has been introduced to the notebook interface, allowing for the execution of pip commands within the notebook. The new class, PipCell, has been added with several methods, including is_runnable, build_dependency_graph, and migrate_notebook_path, enabling the notebook interface to recognize and handle pip cells differently from other cell types. This allows for the installation of Python packages directly within a notebook setting, enhancing the notebook environment and providing users with the ability to dynamically install necessary packages as they work. The new sample notebook file demonstrates the installation of a package using the %pip install command. The implementation includes modifying the notebook runtime to recognize and execute %pip cells, and installing packages in a manner consistent with standard pip installation processes. Additionally, a new tuple, PIP_NOTEBOOK_SAMPLE, has been added to the existing test notebook sample tuple list, enabling testing the handling of %pip cells during notebook splitting.
- Added support for %sh cells (#1400). A new `SHELL` CellLanguage has been implemented to support %sh cells, enabling the execution of shell commands directly within the notebook interface. This enhancement, addressing issue #1400 and linked to #1399 and #1202, streamlines the process of running shell scripts in the notebook, eliminating the need for external tools. The new SHELL_NOTEBOOK_SAMPLE tuple, part of the updated test suite, demonstrates the feature's functionality with a shell cell, while new methods manage the underlying mechanics of executing these shell commands. These changes extend the platform's capabilities with built-in support for shell commands and improve productivity for teams relying on shell commands in their data processing and analysis pipelines.
- Added support for migrating Table ACL for interactive cluster in AWS using Instance Profile (#1285). This change adds support for migrating table access control lists (ACLs) for interactive clusters in AWS using an Instance Profile. A new method `get_iam_role_from_cluster_policy` has been introduced in the `AwsACL` class, replacing the static method `_get_iam_role_from_cluster_policy`. The `create_uber_principal` method now uses this new method to obtain the IAM role name from the cluster policy. Additionally, the project now includes AWS Role Action and AWS Resource Permissions to handle permissions for migrating table ACLs for interactive clusters in AWS. New methods and classes have been added to support AWS-specific functionality and handle AWS instance profile information. Two new tests have been added to tests/unit/test_cli.py to test various scenarios for interactive clusters with and without ACL in AWS. A new argument `is_gcp` has been added to WorkspaceContext to differentiate between Google Cloud Platform and other cloud providers.
- Added support for views in `table-migration` workflow (#1325). A new `MigrationStatus` class has been added to track the migration status of tables and views in a Hive metastore, and a `MigrationIndex` class has been added to check whether a table or view has been migrated. The `MigrationStatusRefresher` class has been updated to use a new approach for migrating tables and views, and is now responsible for refreshing the migration status of tables and indexing it using the `MigrationIndex` class. A `ViewsMigrationSequencer` class has also been introduced to sequence the migration of views based on their dependencies. These changes improve the migration process for tables and views in the `table-migration` workflow.
- Added workflow for in-place migrating external Parquet, Orc, Avro hiveserde tables (#1412). This change introduces a new workflow, `MigrateHiveSerdeTablesInPlace`, for in-place upgrading external Parquet, Orc, and Avro hiveserde tables to the Unity Catalog. The workflow includes new functions to describe the table and extract hiveserde details, update the DDL from `show create table`, and replace the old table name with the migration target and the DBFS mount table location if any. A new function `_migrate_external_table_hiveserde` has been added to `table_migrate.py`, and two new arguments, `mounts` and `hiveserde_in_place_migrate`, have been added to the `TablesMigrator` class. These arguments control which hiveserde to migrate and replace the DBFS mount table location if any, enabling multiple tasks to run in parallel, each migrating only one type of hiveserde at a time. This feature does not include user documentation, new CLI commands, or changes to existing commands, but it adds a new workflow and modifies the existing `migrate_tables` function in `table_migrate.py`. The changes have been manually tested, but no unit tests, integration tests, or staging environment verification have been provided.
- Build dependency graph for local files (#1462). This commit refactors dependency classes to distinguish between resolution and loading, and introduces new classes to handle different types of dependencies. A new method, `LocalFileMigrator.build_dependency_graph`, is implemented, following the pattern of `NotebookMigrator`, to build a dependency graph for local files. This resolves issue #1202 and addresses issue #1360. While the refactoring and new methods improve the accuracy of dependency graphs and ensure that dependencies are correctly registered based on the file's language, there are no user-facing changes, such as new or modified CLI commands, tables, or workflows. Unit tests are added to ensure that the new changes function as expected.
- Build dependency graph for site packages (#1504). This commit introduces changes to the dependency graph building process for site packages within the ucx project. When a package is not recognized, package files are added as dependencies to prevent errors during import dependency determination, thereby fixing an infinite loop issue when encountering cyclical graphs. This resolves issue #1427 and is related to #1202. The changes include adding new methods for handling package files as dependencies and preventing infinite loops when visiting cyclical graphs. The `SitePackage` class in the `site_packages.py` file has been updated to handle package files more accurately, with the `__init__` method now accepting `module_paths` as a list of Path objects instead of a list of strings. A new method, `module_paths`, has also been introduced. Unit tests have been added to ensure the correct functionality of these changes, and a hack in the PR will be removed once issue #1421 is implemented.
- Build notebook dependency graph for `%run` cells (#1279). A new `Notebook` class has been developed to parse source code and split it into cells, and a `NotebookDependencyGraph` class with related utilities has been added to discover dependencies in `%run` cells, addressing issue #1201. The new functionality enhances the management and tracking of dependencies within notebooks, improving code organization and efficiency. The commit includes updates to existing notebooks to utilize the new classes and methods, with no impact on existing functionality outside of the `%run` context.
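  The information the graph needs from each notebook is essentially the list of `%run` targets; a stripped-down extraction could look like this (the real `Notebook` class does full cell splitting):

  ```python
  def run_targets(notebook_source: str) -> list[str]:
      """Collect the notebook paths referenced by %run magic lines."""
      targets = []
      for line in notebook_source.splitlines():
          stripped = line.strip()
          if stripped.startswith("%run "):
              targets.append(stripped.removeprefix("%run ").strip())
      return targets

  print(run_targets("%run ./utils\nprint('hi')\n%run /Shared/common"))
  # -> ['./utils', '/Shared/common']
  ```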
- Create UC External Location, Schema, and Table Grants based on workspace-wide Azure SPN mount points (#1374). This change adds new functionality to create Unity Catalog (UC) external location, schema, and table grants based on workspace-wide Azure Service Principal Name (SPN) mount points. The majority of the work was completed in a previous pull request. The main change in this pull request is the addition of a new test function, `test_migrate_external_tables_with_principal_acl_azure`, which tests the migration of tables with principal ACLs in an Azure environment. This function includes the creation of a new user with cluster access, another user without cluster access, and a new group with cluster access to validate the migration of table grants to these entities. The `make_cluster_permissions` method now accepts a `service_principal_name` parameter, and after migrating the tables with the `acl_strategy` set to `PRINCIPAL`, the function checks whether the appropriate grants have been assigned to the Azure SPN. This change is part of an effort to improve the integration of Unity Catalog with Azure SPNs and is accessible through the UCX CLI command. The changes have been tested through manual testing, unit tests, and integration tests and have been verified in a staging environment.
- Detect DBFS use in SQL statements in notebooks (#1372). A new linter has been added to detect and discourage the use of DBFS (Databricks File System) in SQL statements within notebooks. This linter raises deprecated advisories for any identified DBFS folder or mount point references in SQL statements, encouraging the use of alternative storage options. The change is implemented in the `NotebookLinter` class of the `notebook_linter.py` file, and is tested through unit tests to ensure proper functionality. This helps users transition away from DBFS in their SQL statements and adopt alternative storage methods.
- Detect `sys.path` manipulation (#1380). A change has been introduced to the Python linter to detect manipulation of `sys.path`. New classes, AbsolutePath and RelativePath, have been added as subclasses of SysPath. The SysPathVisitor class has been implemented to track additions to `sys.path`, and the `visit_Call` method in SysPathVisitor checks for `sys.path.append` and `os.path.abspath` calls. The new functionality includes a new method, collect_appended_sys_paths in PythonLinter, and a static method, list_appended_sys_paths, to retrieve the appended paths. Additionally, new tests have been added to the PythonLinter to detect manipulation of the `sys.path` variable, specifically via the `list_appended_sys_paths` method. The new test cases include using aliases for `sys`, `os`, and `os.path`, and using both absolute and relative paths. This improvement enhances the linter's ability to detect potential issues related to manipulation of the `sys.path` variable. The change resolves issue #1379 and is linked to issue #1202. No user documentation or CLI commands have been added or modified, and no manual testing has been performed; unit tests for the new functionality have been added.
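  A reduced version of the idea behind SysPathVisitor, ignoring the aliasing and `os.path.abspath` handling that the real visitor supports:

  ```python
  import ast

  def appended_sys_paths(source: str) -> list[str]:
      """Record string arguments of sys.path.append(...) calls in a module's source."""
      paths = []
      for node in ast.walk(ast.parse(source)):
          if (isinstance(node, ast.Call)
                  and isinstance(node.func, ast.Attribute)
                  and node.func.attr == "append"
                  and ast.unparse(node.func.value) == "sys.path"
                  and node.args and isinstance(node.args[0], ast.Constant)):
              paths.append(node.args[0].value)
      return paths

  print(appended_sys_paths("import sys\nsys.path.append('/Workspace/libs')"))
  # -> ['/Workspace/libs']
  ```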
- Detect direct access to cloud storage and raise a deprecation warning (#1506). In this release, the Pyspark linter has been enhanced to detect and issue deprecation warnings for direct access to cloud storage. This change, which resolves issue #1133, introduces new classes `AstHelper` and `TableNameMatcher` to determine the fully-qualified name of functions and replace instances of direct cloud storage access with migration index table names. Instances of direct access using `dbfs:/`, `dbfs://`, and default `dbfs:` references will now be detected and flagged with a deprecation warning. The test file `test_pyspark.py` has been updated to include new tests for detecting direct cloud storage access. Users should be aware of these changes when updating their code to avoid deprecation warnings.
- Detect imported files and packages (#1362). This commit introduces functionality to parse Python code for `import` and `import from` processing instructions, enabling the detection and management of imported files and packages. It includes a new CLI command, modifications to existing commands, new and updated workflows, and additional tables. The code modifications include new methods for visiting Import and ImportFrom nodes, and the addition of unit tests to ensure correctness. Relevant user documentation has been added, and the new functionality has been tested through manual testing, unit tests, and verification on a staging environment. This update enhances dependency management and code organization for a more streamlined user experience.
- Enhanced migrate views task to support views created with explicit column list (#1375). The commit enhances the migrate views task to better support handling of views with an explicit column list, improving overall compatibility. A new lookup based on `SHOW CREATE TABLE` has been added to extract the column list from the create script, ensuring accurate migration. The `_migrate_view_table` method has been refactored, and a new `_sql_migrate_view` method is added to fetch the create statement of the view. The `ViewToMigrate` class has been updated with a new `_view_dependencies` method to determine view dependencies in the new SQL text. Additionally, new methods `safe_sql_key` and `add_table` have been introduced, and the `sqlglot.parse` method is used to parse the code with `databricks` as the read argument. A new test for migrating views with an explicit column list has been added, along with the `upgraded_from` and `upgraded_to` table properties, and the migration status is updated to reflect successful migration. New test functions have also been added to test the migration of views with columns and ACLs. The sqlglot dependency has been updated to version ~=23.9.0, enhancing the overall functionality and compatibility of the migrate views task.
- Ensure that USE statements are recognized and apply to table references without a qualifying schema in SQL and PySpark (#1433). This commit enhances the handling of `USE` statements in both SQL and PySpark by ensuring they are recognized and applied to table references without a qualifying schema. A new `CurrentSessionState` class is introduced to manage the current schema of a session, and existing classes such as `FromTable` and `TableNameMatcher` are updated to use this new class. Additionally, the `lint` and `apply` methods have been updated to handle `USE` statements and improve the precision of table reference handling. These changes are particularly useful when working with tables in different schemas, ensuring the library can manage table references more accurately in SQL and PySpark. A new fixture, `extended_test_index`, has been added to support unit tests, and the test file `test_notebook.py` has been updated to better reflect the intended schema for each table reference.
- Expand documentation for end to end workflows with external HMS (#1458). The UCX toolkit has been updated to support integration with an external Hive Metastore (HMS), in addition to the default workspace HMS. This feature allows users to easily set up UCX to work with an existing external HMS, providing greater flexibility in managing and accessing data. During installation, UCX will scan for evidence of an external HMS in the cluster policies and Spark configurations. If found, UCX will prompt the user to connect to the external HMS, create a new policy with the necessary Spark and data access configurations, and set up job clusters accordingly. However, users will need to manually update the data access configuration for SQL Warehouses that are not configured for external HMS. Users can also create a cluster policy with appropriate Spark configurations and data access for external HMS, or edit existing policies in specified UCX workflows. Once set up, the assessment workflow will scan tables and views from the external HMS, and the table migration workflow will upgrade tables and views from the external HMS to the Unity Catalog. Users should note that if the external HMS is shared between multiple workspaces, a different inventory database name should be specified for each UCX installation. It is important to plan carefully when setting up a workspace with multiple external HMS, as the assessment dashboard will fail if the SQL warehouse is not configured correctly. Users can have multiple UCX installations in a workspace, each set up with a different external HMS, or manually modify the cluster policy and SQL data access configuration to point to the correct external HMS after UCX has been installed.
- Extend service principal migration with option to create access connectors with managed identity for each storage account (#1417). This commit extends the service principal migration feature to create access connectors with managed identities for each storage account, enhancing security and isolation by preventing cross-account access. A new CLI command has been added, and an existing command has been modified. The `create_access_connectors_for_storage_accounts` method creates access connectors with the required permissions for each storage account used in external tables, and the `_apply_storage_permission` method has also been updated. New unit and integration tests have been included, covering scenarios such as secret value decoding, secret read exceptions, and single storage account testing. The necessary permissions for these connectors will be set in a subsequent pull request. Additionally, new methods `azure_resources_list_access_connectors` and `azure_resources_get_access_connector` have been introduced to ensure access connectors are returned as expected. This change has been tested manually and through automated tests, ensuring backward compatibility while providing improved security features.
- Fixed UCX policy creation when instance pool is specified (#1457). In this release, we have improved the handling of instance pools in UCX policy creation. The `policy.py` file has been updated to properly handle the case when an instance pool is specified, by setting the `instance_pool_id` attribute and removing the `node_type_id` attribute in the policy definition. Additionally, the availability attribute has been removed for all cloud providers, including AWS, Azure, and GCP, when an instance pool ID is provided; a new `pop` method call removes the `gcp_attributes.availability` attribute in that case. These changes ensure consistency in the policy definition across all cloud providers. Furthermore, tests for this functionality have been updated in the `test_policy.py` file, specifically the `test_cluster_policy_instance_pool` function, to check the correct addition of the instance pool to the cluster policy. The purpose of these changes is to improve the reliability and functionality of UCX policy creation when an instance pool is specified.
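  The adjustment amounts to the following dictionary surgery on the policy definition; the keys follow the cluster-policy schema, but this sketch is not the UCX code itself:

  ```python
  def apply_instance_pool(policy: dict, instance_pool_id: str | None) -> dict:
      """Prefer an instance pool over an explicit node type in a policy definition."""
      if instance_pool_id:
          policy["instance_pool_id"] = {"type": "fixed", "value": instance_pool_id}
          policy.pop("node_type_id", None)
          # availability attributes conflict with pools, so drop them per cloud
          for key in ("aws_attributes.availability",
                      "azure_attributes.availability",
                      "gcp_attributes.availability"):
              policy.pop(key, None)
      return policy
  ```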
- Fixed `migrate-credentials` command on AWS (#1501). In this release, the `migrate-credentials` command in the `labs.yml` configuration file has been updated to include new flags for specifying a subscription ID and AWS profile. This allows users to scan a specific storage account and authenticate using a particular AWS profile when migrating credentials for storage access to UC storage credentials. The `create-account-groups` command remains unchanged. Additionally, several issues related to the `migrate-credentials` command for AWS have been addressed, such as hallucinating the presence of a `--profile` flag, using a monotonically increasing role ID, and not handling cases where there are no IAM roles to migrate. The `run` method of the `AwsUcStorageCredentials` class has been updated to handle these cases, and several test functions have been added or updated to ensure proper functionality. These changes improve the functionality and robustness of the `migrate-credentials` command for AWS.
- Fixed edge case for `RegexSubStrategy` (#1561). In this release, we have implemented fixes for the `RegexSubStrategy` class within the `GroupMigrationStrategy`, addressing an issue where matching account groups could not be found using the display name. The `generate_migrated_groups` function has been updated to include a check for account groups with matching external IDs when either the display name or the regex substitution of the display name fails to yield a match. Additionally, we have expanded testing for the `GroupManager` class, which handles group management. This includes new tests using regular expressions to match groups, ensuring that the `GroupManager` class can correctly identify and manage groups based on criteria such as the group's ID, display name, or external ID. These changes improve the robustness of the `GroupMigrationStrategy` and ensure the proper functioning of the `GroupManager` class when using regular expression substitution and matching.
- Fixed table in mount partition scans for JSON and CSV (#1437). This release introduces a fix for an issue where table scans on partitioned CSV and JSON files were not being correctly identified. The `TablesInMounts` scan function has been updated to accurately detect these files, addressing the problem reported in issue #1389 and linked issue #1437. To ensure functionality, a new private method `_find_partition_file_format` has been introduced, and `_assess_path` has been updated to handle partitioned directories. Additionally, unit tests have been added to test partitioned CSVs and JSONs, simulating the file system's response to various calls. These changes provide enhanced detection and handling of partitioned CSVs and JSONs in the `TablesInMounts` scan function.
- Forward remote logs on `run_workflow` and removed `destroy-schema` workflow in favour of `databricks labs uninstall ucx` (#1349). In this release, the `destroy-schema` workflow has been removed and replaced with the `databricks labs uninstall ucx` command, addressing issue #1186. The `run_workflow` function has been updated to forward remote logs, and the `run_task` function now accepts a new argument `sql_backend`. The `Task` class includes a new method `is_testing()` and has been updated to use `RuntimeBackend` before `SqlBackend` in the `databricks.labs.lsql.backends` module. The `TaskLogger` class has been modified to include a new argument `attempt` and a new class method `log_path()`. The `verify_metastore` method in the `verification.py` file has been updated to handle `PermissionDenied` exceptions more gracefully. The `destroySchema` class and its `destroy_schema` method have been removed. The `workflow_task.py` file has been updated to include a new argument `attempt` in the `task_run_warning_recorder` method. These changes aim to improve the system's efficiency, error handling, and functionality.
- Give all access connectors `Storage Blob Data Contributor` role (#1425). A new change has been introduced to grant the `Storage Blob Data Contributor` role, which provides the highest level of data access, to all access connectors for each storage account in the system. This adjustment, part of issue #142
- Grant uber principal write permissions so that SYNC command will succeed (#1505). A change has been implemented to modify the `databricks labs ucx create-uber-principal` command, granting the uber principal write permissions on Azure Blob Storage. This aligns with the existing implementation on AWS, where the uber principal has write access to all S3 buckets. The modification includes the addition of a new role, `STORAGE_BLOB_DATA_CONTRIBUTOR`, to the `_ROLES` dictionary in the `resources.py` file. A new method, `clean_up_spn`, has also been added to clear ucx uber service principals. This change resolves issue #939 and ensures consistent behavior with AWS, enabling the uber principal to have write permissions on all Azure blob containers and ensuring the success of the `SYNC` command. The changes have been manually tested but not yet verified on a staging environment.
- Handled new output format of `SHOW TBLPROPERTIES` command (#1381). A recent commit addresses an issue with the `test_revert_migrated_table` test failing due to the new output format of the `SHOW TBLPROPERTIES` command. Previously, the output was blank if a table property was missing, but now it shows a message indicating that the table does not have the specified property. The commit updates the `is_migrated` method in the `migration_status.py` file to handle this new output format: the method now uses the `fetch` method to retrieve the `upgraded_to` property for a given schema and table, and if the property is missing, it continues to the next table. The commit also updates tests for the changes, including a manual test that has not been verified on a staging environment. Changes have been made in the `test_table_migrate.py` file, where rows with table properties have been updated to return new data, and the `timestamp` function now sets the `datetime.datetime` to a `FakeDate`. No new methods have been added, and existing functionality related to `SHOW TBLPROPERTIES` output handling has been changed in scope.
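  Handling both output shapes boils down to something like this sketch; the column names and the message text are assumptions based on the description:

  ```python
  def extract_upgraded_to(rows: list[dict]) -> str | None:
      """Read the upgraded_to property, tolerating both TBLPROPERTIES output formats."""
      for row in rows:
          value = row.get("value", "")
          if "does not have property" in value:
              return None  # new format: explicit "missing property" message row
          if row.get("key") == "upgraded_to":
              return value  # old format: the property row is simply present
      return None
  ```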
- Ignore whitelisted imports (#1367). This commit introduces a new class `DependencyResolver` that filters Python import dependencies based on a whitelist, and updates the `DependencyGraph` class to support this new resolver. A new optional parameter `resolver` has been added to the `NotebookMigrator` class constructor and the `DependencyGraph` constructor. A new file `whitelist.py` has been added, introducing classes and functions for defining and managing a whitelist of Python packages based on their name and version. These changes aim to improve control over which dependencies are included in the dependency graph, contributing to a more modular and maintainable codebase.
- Increased memory for ucx clusters (#1366). This release updates the memory configuration for UCX clusters, addressing issue #1366. The main change is a new method for selecting a node type with a minimum of 16GB of memory and local disk enabled, implemented in the policy.py file of the installer module. As a result, the `node_type_id` parameter for creating clusters, instance pools, and pipelines now requires a minimum memory of 16 GB. This change is reflected in the fixtures.py file, in the `ws.clusters.select_node_type()`, `ws.instance_pools.create()`, and `pipelines.PipelineCluster` method calls, ensuring that any newly created clusters, instance pools, and pipelines benefit from the increased memory allocation. This update aims to improve user experience by offering higher memory configurations out-of-the-box for UCX-related workloads.
- Integrate detection of notebook dependencies (#1338). In this release, the NotebookMigrator has been updated to integrate dependency graph construction for detecting notebook dependencies, addressing issues #1204, #1286, and #1326. The changes include modifying the NotebookMigrator class to include the dependency graph and updating relevant tests. A new file, python_linter.py, has been added for linting Python code, which now detects calls to "dbutils.notebook.run" with dynamic paths. The linter uses the ast module to parse the code and locate nodes matching the specified criteria. The NotebookMigrator's apply method has been updated to check for ObjectType.NOTEBOOK, loading the notebook using the new _load_notebook method, and incorporating a new _apply method for modifying the code in the notebook based on applicable fixes. A new DependencyGraph class has been introduced to build a graph of dependencies within the notebook, and several new methods have been added, including _load_object, _load_notebook_from_path, and revert. This release is co-authored by Cor and aims to improve dependency management in the notebook system.
- Isolate grants computation when migrating tables (#1233). In this release, we have implemented a change to improve the reliability of table migrations. Previously, grants to migrate were computed and snapshotted outside the loop that iterates through tables to migrate, which could lead to inconsistencies if the grants or migrated groups changed during migration. Now, grants are re-computed for each table, reducing the chance of such issues. We have introduced a new method `_compute_grants` that takes the table to migrate, the ACL strategy, and snapshots of all grants to migrate, migrated groups, and principal grants. If `acl_strategy` is `None`, it defaults to an empty list. The method checks each strategy in the ACL strategy list, extending the `grants` list if the strategy is `AclMigrationWhat.LEGACY_TACL` or `AclMigrationWhat.PRINCIPAL`. The `migrate_tables` method has been updated to use this new method: it first checks whether `acl_strategy` is `None` and, if so, sets it to an empty list; it then calls `_compute_grants` with the current table, `acl_strategy`, and the snapshots of all grants to migrate, migrated groups, and principal grants, and the computed grants are then used to migrate the table. This change enhances the robustness of the migration process by isolating grants computation for each table.
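  In outline, the per-table computation behaves like this sketch; the real signatures and grant objects in UCX differ:

  ```python
  def compute_grants(table, acl_strategy, legacy_grants, principal_grants):
      """Recompute the grants for one table from snapshotted sources (illustrative)."""
      grants = []
      for strategy in acl_strategy or []:       # None defaults to an empty list
          if strategy == "LEGACY_TACL":
              grants.extend(g for g in legacy_grants if g["table"] == table)
          elif strategy == "PRINCIPAL":
              grants.extend(g for g in principal_grants if g["table"] == table)
      return grants
  ```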
- Log more often from workflows (#1348). In this update, the log formatting for the debug log file in the `tasks.py` file of the `databricks/labs/ucx/framework` module has been modified. The `TimedRotatingFileHandler` function has been adjusted to rotate the log file every minute, increasing the frequency of log file rotation from every 10 minutes. Furthermore, the logging format has been enhanced to include the time, level name, name, thread name, and message. These improvements are in response to issue #1171 and implement more frequent logging as per issue #1348, ensuring more detailed and up-to-date logs for debugging and analysis purposes.
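  Reconstructed from the description, the setup is roughly the standard-library configuration below (the exact format string in `tasks.py` may differ):

  ```python
  import logging
  from logging.handlers import TimedRotatingFileHandler

  handler = TimedRotatingFileHandler("debug.log", when="M", interval=1)  # rotate every minute
  handler.setFormatter(logging.Formatter(
      "%(asctime)s %(levelname)s [%(name)s][%(threadName)s] %(message)s"))
  logging.getLogger("databricks.labs.ucx").addHandler(handler)
  ```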
- Make `databricks labs ucx assign-metastore` prompt for workspace if no workspace id provided (#1500). The `databricks labs ucx assign-metastore` command has been updated to allow for an optional `workspace_id` parameter, with a prompt for the workspace ID displayed if it is not provided. Both the `assign-metastore` and `show-all-metastores` commands have been made account-level only. The functionality of the `migrate_local_code` function remains unchanged. Error handling for etag issues related to default catalog settings has been implemented. Unit tests and manual testing have been conducted on a staging environment to verify the changes. The `show_all_metastores` and `assign_metastore` commands have been updated to accept an optional `workspace_id` parameter. The unit tests cover various scenarios, including cases where a user has multiple metastores and needs to select one, and cases where a default catalog name is provided and needs to be selected. If no metastore is found, a `ValueError` will be raised. The `metastore_id` and `workspace_id` flags in the yml file have been renamed to `metastore-id` and `workspace-id`, respectively, and a new `default-catalog` flag has been added.
- Modified update existing role to amend the AssumeRole statement rather than rewriting it (#1423). The `_aws_role_trust_doc` method of the `aws.py` file has been updated to return a dictionary object instead of a JSON string for the AWS IAM role trust policy document. This change allows for more fine-grained control when updating the trust relationships of an existing role in AWS IAM. The `create_uc_role` method has been updated to pass the role trust document to the `_create_role` method using the `_get_json_for_cli` method. The `update_uc_trust_role` method has been refactored to retrieve the existing role's trust policy document, modify its `Statement` field, and replace it with the returned value of the `_aws_role_trust_doc` method with the specified `external_id`. Additionally, the `test_update_uc_trust_role` function in the `test_aws.py` file has been updated to provide more detailed and realistic mocked responses for the `command_call` function, including handling the case where the `iam update-assume-role-policy` command is called and returning a mocked response with a modified assume role policy document that includes a new principal with an external ID condition. These changes improve the testing capabilities of the `test_update_uc_trust_role` function and provide more comprehensive testing of the assume role statement and role update functionality.
- Modifies dependency resolution logic to detect deprecated use of s3fs package (#1395). In this release, the dependency resolution logic has been enhanced to detect and handle deprecated usage of the s3fs package. A new function, `_download_side_effect`, has been implemented to mock the download behavior of the `workspace_client_mock` function, allowing for more precise control during testing. The `DependencyResolver` class now includes a list of `Advice` objects to inform developers about the use of deprecated dependencies, without modifying the `DependencyGraph` class. The change flags import statements for the s3fs package, encouraging the adoption of up-to-date packages and practices for improved system compatibility and maintainability. Additionally, a unit test file, test_s3fs.py, has been added with test cases for various import scenarios of s3fs to ensure proper detection and issuance of deprecation warnings.
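  Detecting a deprecated package in imports reduces to checking the root module name of every `import`/`from ... import` node, as in this trimmed illustration:

  ```python
  import ast

  DEPRECATED = {"s3fs"}  # example blocklist, not the UCX whitelist data

  def deprecated_imports(source: str) -> list[str]:
      """List imported modules whose root package is deprecated."""
      found = []
      for node in ast.walk(ast.parse(source)):
          if isinstance(node, ast.Import):
              found += [a.name for a in node.names if a.name.split(".")[0] in DEPRECATED]
          elif isinstance(node, ast.ImportFrom) and node.module:
              if node.module.split(".")[0] in DEPRECATED:
                  found.append(node.module)
      return found

  print(deprecated_imports("import s3fs\nfrom s3fs.core import S3FileSystem"))
  # -> ['s3fs', 's3fs.core']
  ```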
- Prompt for warehouse choice in uninstall if the original chosen warehouse does not exist anymore (#1484). In this release, we have added a new method `_check_and_fix_if_warehouse_does_not_exists()` to the `WorkspaceInstaller` class, which checks whether the specified warehouse in the configuration still exists. If it doesn't, the method generates a new configuration using a new `WorkspaceInstaller` object, saves it, and updates the `_sql_backend` attribute with the new warehouse ID. This change ensures that if the originally chosen warehouse no longer exists, the user will be prompted to choose a new one during uninstallation. Additionally, we have added a new import statement for the `ResourceDoesNotExist` exception and introduced a new function `test_uninstallation_after_warehouse_is_deleted`, which simulates a scenario where a warehouse has been manually deleted and checks that the uninstallation process correctly resets the warehouse. The `StatementExecutionBackend` object is initialized with a non-existent warehouse ID, and the configuration and sql_backend objects are updated accordingly, ensuring that the uninstallation process handles this scenario.
- Propagate source location information within the import package dependency graph (#1431). This change modifies the dependency graph build logic within several modules of the `databricks.labs.ucx` package to propagate source location information within the import package dependency graph. A new `ImportDependency` class now represents import sources, and a `list_import_sources` method returns a list of `ImportDependency` objects, which include the import string and the original source code file path. A new `IncompatiblePackage` class is added to the `Whitelist` class, returning `UCCompatibility.NONE` when checking for compatibility. The `ImportChecker` class checks for deprecated imports and returns `Advice` or `Deprecation` objects with location information. Unit tests have been added to ensure the correct behavior of these changes. Additionally, the `Location` class and a new test function for invalid processors have been introduced.
- Scan `site-packages` (#1411). A SitePackages scanner has been implemented, enhancing the linkage of module root names with the actual Python code within installed packages using metadata. This development addresses issue #1410 and is connected to #1202. New functionalities include user documentation, a CLI command, a workflow, and a table, accompanied by modifications to an existing command and workflow, as well as alterations to another table. Unit tests have been added to ensure the feature's proper functionality. In the diff, a new unit test file for `site_packages.py` has been added, checking `databrix` for compatibility, which is reported as incompatible. This enhancement aims to provide more detailed insights into installed packages.
- Select DISTINCT job_run_id (#1352). A modification has been implemented to optimize the SQL query for accessing log data: the query now retrieves distinct job_run_ids, nested in a subquery, instead of a single one. The enhanced query selects the message field from the inventory.logs table, filtering based on job_run_id matches with the latest timestamp within the same table. This change enables multiple job_run_ids to correlate with the same timestamp, delivering a more holistic view of logs at a given moment. By accommodating multiple job run IDs, this improvement ensures more precise and detailed retrieval of log data.
- Support table migration to Unity Catalog in Python code (#1210). This release introduces changes to the Python codebase that enhance the SparkSql linter/fixer to support migrating Spark SQL table references to Unity Catalog. The release includes modifications to the existing `databricks labs ucx migrate_local_code` command and the addition of unit tests. The `SparkSql` class has been updated to support a new `index` parameter, allowing for migration support. New classes including `QueryMatcher`, `TableNameMatcher`, `ReturnValueMatcher`, and `SparkMatchers` have been added to hold matchers for different Spark methods. The release also includes modifications to existing methods for caching, creating, getting, refreshing, and un-caching tables, as well as updates to the `listTables` method to reflect the new format. The `saveAsTable` and `register` methods have been updated to handle variable and f-string arguments for the table name. The `databricks labs ucx migrate_local_code` command has been modified to handle `spark.sql` function calls that include a table name as a parameter and to suggest necessary changes to migrate to the new Unity Catalog format. Integration tests are still needed.
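  The rewrite itself can be demonstrated with sqlglot directly; the mapping dictionary here is hypothetical, standing in for the migration index:

  ```python
  import sqlglot
  from sqlglot import expressions as exp

  # hypothetical mapping: (hive schema, table) -> (catalog, schema, table)
  MAPPING = {("old_db", "events"): ("main", "old_db", "events")}

  def migrate(query: str) -> str:
      """Replace Hive-metastore table references with Unity Catalog names."""
      tree = sqlglot.parse_one(query, read="databricks")
      for table in tree.find_all(exp.Table):
          key = (table.db, table.name)
          if key in MAPPING:
              catalog, schema, name = MAPPING[key]
              table.set("catalog", exp.to_identifier(catalog))
              table.set("db", exp.to_identifier(schema))
      return tree.sql(dialect="databricks")

  print(migrate("SELECT * FROM old_db.events"))
  # -> SELECT * FROM main.old_db.events
  ```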
- When building dependency graph, raise problems with problematic dependencies (#1529). A new `DependencyProblem` class has been added to the databricks.labs.ucx.source_code.dependencies module to handle issues encountered during dependency graph construction. This class is used to raise issues when problematic dependencies are encountered during the build of the dependency graph. The `build_dependency_graph` method of the `SourceContainer` abstract class now accepts a `problem_collector` parameter, a callable that collects and handles dependency problems. Instead of raising `ValueError` exceptions, the `DependencyProblem` class is used to collect and store information about the issues. This change improves error handling and diagnostic information during dependency graph construction. Relevant user documentation, a new CLI command, and a new workflow have been added, along with modifications to existing commands and workflows. Unit tests have been added to verify the new functionality.
- WorkspacePath to implement `pathlib.Path` API (#1509). A new file, `wspath.py`, has been added to the `mixins` directory of the `databricks.labs.ucx` package, implementing the custom Path object `WorkspacePath`. This subclass of `pathlib.Path` provides additional methods and functionality for the Databricks Workspace, including `cwd()`, `home()`, `scandir()`, and `listdir()`. `WorkspacePath` interacts with the Databricks Workspace API for operations such as checking if a file or directory exists, creating and deleting directories, and downloading files. The `WorkspacePath` class implements the `pathlib.Path` API for a more intuitive and consistent interface when working with file and directory paths, including methods like `absolute()`, `exists()`, `joinpath()`, and `parent`, and supports the `with` statement for thread-safe code. A new test file, `test_wspath.py`, has been added for the WorkspacePath mixin. New methods like `expanduser()`, `as_fuse()`, `as_uri()`, `replace()`, `write_text()`, `write_bytes()`, `read_text()`, and `read_bytes()` have also been added. `mkdir()` and `rmdir()` now raise errors when called on non-absolute paths and non-empty directories, respectively.
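  A hedged usage sketch: since `WorkspacePath` follows the `pathlib.Path` API, familiar operations should apply; the constructor shape (client plus path) is assumed from the description rather than quoted from the source:

  ```python
  from databricks.sdk import WorkspaceClient
  from databricks.labs.ucx.mixins.wspath import WorkspacePath  # assumed import path

  ws = WorkspaceClient()
  folder = WorkspacePath(ws, "/Users/me@example.com/reports")  # assumed constructor
  folder.mkdir()                        # creates the workspace directory
  note = folder / "summary.txt"
  note.write_text("migration complete")
  print(note.exists(), note.read_text())
  ```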
Dependency updates:
- Bump actions/checkout from 3 to 4 (#1191).
- Bump actions/setup-python from 4 to 5 (#1189).
- Bump codecov/codecov-action from 1 to 4 (#1190).
- Bump softprops/action-gh-release from 1 to 2 (#1188).
- Bump databricks-sdk from 0.23.0 to 0.24.0 (#1223).
- Updated databricks-labs-lsql requirement from ~=0.3.0 to >=0.3,<0.5 (#1387).
- Updated sqlglot requirement from ~=23.9.0 to >=23.9,<23.11 (#1409).
- Updated sqlglot requirement from <23.11,>=23.9 to >=23.9,<23.12 (#1486).
- Ensure proper sequencing of view migrations (#1157). In this release, we have introduced a `views_migrator` module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue #1132. The module contains two main classes: `ViewToMigrate` and `ViewsMigrator`. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, `__hash__`, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file `tables_and_views.json` for unit testing; the `views_migrator` module takes a `TablesCrawler` object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references.
- Experimental support for scanning Delta Tables inside Mount Points (#1095). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new `TablesInMounts` crawler. Users can now scan specific mount points using the `--include-mounts` flag and include Parquet files in the scan results with the `--include-parquet-files` flag. Additionally, the `--filter-paths` flag allows for filtering paths in a mount point, and the `--max-depth` flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use `databricks-labs-lsql~=0.3.0`. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases.
- Fixed `NULL` values in `ucx.views.table_format` to have `UNKNOWN` value instead (#1156). This commit includes a fix for handling NULL values in the `table_format` column of views in the `ucx.views.table_format` module. Previously, NULL values were displayed as-is; now they are replaced with the string "UNKNOWN". This change is part of the fix for issue #115
- Fixing `run_workflow` functionality for better error handling (#1159). In this release, the `run_workflow` method in the `workflows.py` file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, `job_initial_run`, has been added to initiate a job run and return the run ID, raising a `NotFound` exception if the job run is not found. The `run_workflow` functionality in the `WorkflowsInstall` module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an `OperationFailed` exception is raised during the installation process. These changes improve the robustness and stability of the system.
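  A sketch of the wait-then-raise pattern this entry describes, assuming a `WorkspaceClient` named `ws` and a placeholder job ID; the error message fields are illustrative:

  ```python
  # Wait for the run to terminate (or be skipped) before raising, so the error
  # message can include the run's final state; a sketch, not the actual code.
  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.jobs import RunResultState

  ws = WorkspaceClient()
  run = ws.jobs.run_now(job_id=1234).result()  # blocks until a terminal state
  if run.state.result_state is not RunResultState.SUCCESS:
      raise RuntimeError(f"workflow failed: {run.state.state_message}")
  ```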
- Use experimental Permissions Migration API also for Legacy Table ACLs (#1161). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now used for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The `TableAclSupport` import and class have been removed, as they are no longer needed. A new `apply_to_renamed_groups` method has been added for production usage, and an `apply_to_groups_with_different_names` method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The `permission_manager` parameter has been removed from several test functions in the `test_generic.py` file and replaced with the `MigrationState` class, which is used directly with the `WorkspaceClient` object to apply permissions to groups with different names. The `test_some_entitlements` function in the `test_scim.py` file has also been updated to use the `MigratedGroup` class and the `MigrationState` class's `apply_to_groups_with_different_names` method. Finally, new tests for the Permissions Migration API have been added to the `test_tacl.py` file in the `tests/integration/workspace_access` directory to verify the behavior of the Permissions Migration API when migrating different grants.
- Added ACL migration to `migrate-tables` workflow (#1135).
- Added AVRO to supported format to be upgraded by SYNC (#1134). In this release, the `hive_metastore` package's `tables.py` file has been updated to add AVRO as a supported format for the SYNC upgrade functionality. This change includes AVRO in the list of supported table formats in the `is_format_supported_for_sync` method, which checks if the table format is not `None` and if the format's uppercase value is one of the supported formats. The addition of AVRO enables it to be upgraded using the SYNC functionality. Moreover, a new format called BINARYFILE has been introduced, which is not supported for SYNC upgrade. This release is part of the implementation of issue #1134, improving the compatibility of the SYNC upgrade functionality with various data formats.
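  A sketch of what such a check looks like; the set of formats shown here is an illustrative subset, not the project's definitive list:

  ```python
  # Illustrative format check in the spirit of is_format_supported_for_sync;
  # only DELTA/PARQUET/AVRO are shown, the real list may differ.
  def is_format_supported_for_sync(table_format: str | None) -> bool:
      if table_format is None:
          return False
      return table_format.upper() in {"DELTA", "PARQUET", "AVRO"}  # not BINARYFILE

  assert is_format_supported_for_sync("avro")
  assert not is_format_supported_for_sync("BINARYFILE")
  ```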
- Added `is_partitioned` column (#1130). A new column, `is_partitioned`, has been added to the `ucx.tables` table in the assessment module, indicating whether the table is partitioned or not with values `Yes` or `No`. This change addresses issue #871 and has been manually tested. The commit also includes updated documentation for the modified table. No new methods, CLI commands, workflows, or tests (unit, integration) have been introduced as part of this change.
- Added assessment of interactive cluster usage compared to UC compute limitations (#1123).
- Added external location validation when creating catalogs with `create-catalogs-schemas` command (#1110).
- Added flag to Job to identify Job submitted by jar (#1088). The open-source library has been updated with several new features aimed at enhancing user functionality and convenience. These updates include the addition of a new sorting algorithm, which provides users with an efficient and customizable method for organizing data. Additionally, a new caching mechanism has been implemented, improving the library's performance and reducing the amount of time required to access frequently used data. Furthermore, the library now supports multi-threading, enabling users to perform multiple operations simultaneously and increase overall productivity. Lastly, a new error handling system has been developed, providing users with more informative and actionable feedback when unexpected issues arise. These changes are a significant step forward in improving the library's performance, functionality, and usability for all users.
- Bump databricks-sdk from 0.22.0 to 0.23.0 (#1121). In this version update, `databricks-sdk` is upgraded from 0.22.0 to 0.23.0, introducing significant changes to the handling of AWS and Azure identities. The `AwsIamRole` class is replaced with `AwsIamRoleRequest` in the `databricks.sdk.service.catalog` module, affecting the creation of AWS storage credentials using IAM roles. The `create` function in `src/databricks/labs/ucx/aws/credentials.py` is updated to accommodate this modification. Additionally, the `AwsIamRole` argument in the `create` function of `fixtures.py` in the `databricks/labs/ucx/mixins` directory is replaced with `AwsIamRoleRequest`, and the tests in `tests/integration/aws/test_access.py` are also updated to utilize `AwsIamRoleRequest`; `StorageCredentialInfo` in `tests/unit/azure/test_credentials.py` now uses `AwsIamRoleResponse` instead of `AwsIamRole`. The new classes, `AwsIamRoleRequest` and `AwsIamRoleResponse`, likely include new features or bug fixes for AWS IAM roles. These changes require software engineers to thoroughly assess their codebase and adjust any relevant functions accordingly.
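  A sketch of the renamed request class in use, assuming a `WorkspaceClient` named `ws`; the role ARN and credential name are placeholders:

  ```python
  # databricks-sdk 0.23 replaces AwsIamRole with AwsIamRoleRequest when
  # creating storage credentials; a sketch with placeholder values.
  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.catalog import AwsIamRoleRequest

  ws = WorkspaceClient()
  ws.storage_credentials.create(
      name="uc-migration-credential",
      aws_iam_role=AwsIamRoleRequest(role_arn="arn:aws:iam::123456789012:role/uc-access"),
  )
  ```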
- Deploy static views needed by #1123 interactive dashboard (#1139). In this update, we have added two new views, `misc_patterns_vw` and `code_patterns_vw`, to the `install.py` script in the `databricks/labs/ucx` directory. These views were originally intended to be deployed with a previous update (#1123) but were inadvertently overlooked. The addition of these views addresses issues with queries in the `interactive` dashboard. The `deploy_schema` function has been updated with two new lines, `deployer.deploy_view("misc_patterns", "queries/views/misc_patterns.sql")` and `deployer.deploy_view("code_patterns", "queries/views/code_patterns.sql")`, to deploy the new views using their respective SQL files from the `queries/views` directory. No other modifications have been made to the file.
- Fixed Table ACL migration logic (#1149). The open-source library has been updated with several new features, providing enhanced functionality for software engineers. A new utility class has been added to simplify the process of working with collections, offering methods to filter, map, and reduce elements in a performant manner. Additionally, a new configuration system has been implemented, allowing users to easily customize library behavior through a simple JSON format. Finally, we have added support for asynchronous processing, enabling efficient handling of I/O-bound tasks and improving overall application performance. These features have been thoroughly tested and are ready for use in your projects.
- Fixed `AssertionError: assert '14.3.x-scala2.12' == '15.0.x-scala2.12'` from nightly integration tests (#1120). In this release, the open-source library has been updated with several new features to enhance functionality and provide more options to users. The library now supports multi-threading, allowing for more efficient processing of large datasets. Additionally, a new algorithm for data compression has been implemented, resulting in reduced memory usage and faster data transfer. The library API has also been expanded, with new methods for sorting and filtering data, as well as improved error handling. These changes aim to provide a more robust and performant library, making it an even more valuable tool for software engineers.
- Increase code coverage by 1 percent (#1125).
- Skip installation if remote and local version is the same, provide prompt to override (#1084). In this release, the `new_installation` workflow in the open-source library has been enhanced to include a new use case for handling identical remote and local versions of UCX. When the remote and local versions are the same, the user is now prompted and, if no override is requested, a `RuntimeWarning` is raised. Additionally, users are now prompted to update the existing installation and, if confirmed, the installation proceeds. These modifications include manual testing and new unit tests to ensure functionality. These changes provide users with more control over their installation process and address a specific use case for handling identical UCX versions.
- Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137). The open-source library has been updated with several new features to enhance usability and functionality. Firstly, we have added support for asynchronous processing, allowing for more efficient handling of large data sets and improving overall performance. Additionally, a new configuration system has been implemented, which simplifies the setup process for users and increases customization options. We have also included a new error handling mechanism that provides more detailed and actionable information, making it easier to diagnose and resolve issues. Lastly, we have made significant improvements to the library's documentation, including updated examples, guides, and an expanded API reference. These changes are part of our ongoing commitment to improving the library and providing the best possible user experience.
- [Experimental] Add support for permission migration API (#1080).
Dependency updates:
- Updated databricks-labs-lsql requirement from ~=0.2.2 to >=0.2.2,<0.4.0 (#1137).
- Added instance pool id to WorkspaceConfig (#1087). In this release, the `create` method of the `_policy_installer` object has been updated to return an additional value, `instance_pool_id`, which is then assigned and passed as an argument to the `WorkspaceConfig` object in the `_configure_new_installation` method. The `ClusterPolicyInstaller` class in the `v0.15.0_added_cluster_policy.py` file has also been updated to return a fourth value, `instance_pool_id`, from the `create` method, allowing for more flexibility in future enhancements. Additionally, the test function `test_table_migration_job` in the `test_installation.py` file has been updated to skip when the script is not being run as part of a nightly test job or in debug mode, and the test functions in the `test_policy.py` file have been updated to reflect the new return value in the `create` method. These changes enable better management and scaling of resources through instance pools, provide more granular control in the WorkspaceConfig, and improve testing efficiency.
- Added more cross-linking between CLI commands (#1091). In this release, we have introduced several enhancements to our open-source library's Command Line Interface (CLI) and documentation. Specifically, we have added more cross-linking between CLI commands to improve navigation and usability. The documentation has been updated to include a new step in the UCX installation process, where users are required to run the assessment workflow after installing UCX. This workflow is the first step in the migration process and checks the compatibility of the user's workspace with Unity Catalog. Additionally, we have added new commands for `principal-prefix-access`, `migrate-credentials`, and `migrate-locations`, which are part of the table migration process. These new commands require the assessment workflow and group migration workflow to be completed before they can be executed. Overall, these changes aim to provide a more streamlined and detailed installation and migration process, improving the user experience for software engineers.
- Fixed command references in README.md (#1093). In this release, we have made improvements to the command references in the README.md file to enhance the overall readability and usability of the documentation for software engineers. Specifically, we have updated the links for the `migrate-locations` and `validate_external_locations` commands to use the correct syntax, enclosing them in backticks to denote code. This change ensures that the links are correctly interpreted as commands and addresses any issues that may have arisen with their previous formatting. No new methods have been added in this release, and the existing functionality of the commands has not changed.
- Fixing the issue in workspace id flag in create-account-group command (#1094). In this update, we have improved the `create_account_group` command related to the `workspace_ids` flag in our open-source library. The `workspace_ids` flag's type has been changed from `list[int] | None` to `str | None`, allowing for easier input of multiple workspace IDs as a string of comma-separated integers. The `create_account_level_groups` function in the `AccountWorkspaces` class has been updated to accept this string and convert it to a list of integers before proceeding. To ensure proper functioning, we added a new test case, `test_create_account_groups_with_id()`, to check if the command handles the case when no workspace IDs are provided in the configuration. The `create_account_groups()` method now checks for this condition and raises a `ValueError`. Furthermore, the `manual_workspace_info()` method has been updated to handle workspace name input by the user, receiving the `ws` object along with prompts that contain the user input for the workspace name and the next workspace ID.
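  A sketch of the comma-separated parsing this implies; the helper name is hypothetical:

  ```python
  # Hypothetical helper showing the str -> list[int] conversion described above.
  def parse_workspace_ids(workspace_ids: str | None) -> list[int] | None:
      if workspace_ids is None:
          return None
      return [int(part.strip()) for part in workspace_ids.split(",") if part.strip()]

  print(parse_workspace_ids("123, 456"))  # [123, 456]
  ```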
- Rely UCX on the latest 14.3 LTS DBR instead of 15.x (#1097). In this release, we have implemented a quick fix to rely on the Long Term Support (LTS) version 14.3 of the Databricks Runtime (DBR) instead of 15.x for UCX, addressing issue #1096. This change affects the `_definition` function, which has been modified to use the latest LTS DBR instead of the latest Spark version. The `latest_lts_dbr` variable is now assigned the value returned by the `select_spark_version` method with the `latest=True` and `long_term_support=True` parameters. The `spark_version` key in the `policy_definition` dictionary is set to the value returned by the `_policy_config` method with `latest_lts_dbr` as the argument. Additionally, in the `tests/unit/installer/test_policy.py` file, the `select_spark_version` method of the `clusters` object has been updated to accept any number of arguments and consistently return the string "14.2.x-scala2.12", allowing for greater flexibility. This is a temporary solution, with a more comprehensive fix being tracked in issue #1098. Developers should be aware of how the `clusters` object is used in the codebase when adopting this project.
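  A sketch of the selection call, assuming a `WorkspaceClient` named `ws`; `select_spark_version` is the SDK helper on the clusters API, and the policy snippet is illustrative:

  ```python
  # Pin workflow clusters to the latest LTS DBR rather than the absolute
  # latest runtime; a sketch of the policy change described above.
  from databricks.sdk import WorkspaceClient

  ws = WorkspaceClient()
  latest_lts_dbr = ws.clusters.select_spark_version(latest=True, long_term_support=True)
  policy_definition = {
      "spark_version": {"type": "fixed", "value": latest_lts_dbr},
  }
  ```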
- Added Legacy Table ACL grants migration (#1054). This commit introduces a legacy table ACL grants migration to the `migrate-tables` workflow, resolving issue #340 and paving the way for follow-up PRs #887 and #907. A new `GrantsCrawler` class is added for crawling grants, along with a `GroupManager` class to manage groups during migration. The `TablesMigrate` class is updated to accept an instance of `GrantsCrawler` and `GroupManager` in its constructor. The migration process has been thoroughly tested with unit tests, integration tests, and manual testing on a staging environment. The changes include the addition of a new Enum class `AclMigrationWhat` and updates to the `Table` dataclass, and affect the way tables are selected for migration based on rules. The logging and error handling have been improved in the `skip_schema` function.
- Added `databricks labs ucx cluster-remap` command to remap legacy cluster configurations to UC-compatible (#994). In this open-source library update, we have developed and added the `databricks labs ucx cluster-remap` command, which facilitates the remapping of legacy cluster configurations to UC-compatible ones. This new CLI command comes with user documentation to guide the cluster remapping process. Additionally, we have expanded the functionality of creating and managing UC external catalogs and schemas with the inclusion of `create-catalogs-schemas` and `revert-cluster-remap` commands. This change does not modify existing commands or workflows and does not introduce new tables. The `databricks labs ucx cluster-remap` command allows users to re-map and revert the re-mapping of clusters for Unity Catalog (UC) using the CLI, ensuring compatibility and streamlining the migration process. The new command and associated functions have been manually tested for functionality.
- Added `migrate-tables` workflow (#1051). The `migrate-tables` workflow has been added, which allows for more fine-grained control over the resources allocated to the workspace. This workflow includes two new instance variables, `min_workers` and `max_workers`, in the `WorkspaceConfig` class, with default values of 1 and 10 respectively. A new `trigger` function has also been introduced, which initializes a configuration, SQL backend, and WorkspaceClient based on the provided configuration file. The `run_task` function has been added, which looks up the specified task, logs relevant information, and runs the task's function with the provided arguments. The `Task` class's `fn` attribute now includes an `Installation` object as a parameter. The workflow migrates tables from the Hive Metastore to the Unity Catalog, with new classes and methods for table mapping, migration status refreshing, and migrating tables. The `migrate_dbfs_root_delta_tables` and `migrate_external_tables_sync` methods migrate Delta tables located in the DBFS root and synchronize external tables, respectively. These functions use the workspace client to access the catalogs and ensure proper migration. Integration tests have also been added for these new methods to ensure their correct operation.
- Added handling for `SYNC` command failures (#1073). This pull request introduces changes to improve handling of `SYNC` command failures during external table migrations in the Hive metastore. Previously, the `SYNC` command's result was not checked, and failures were not logged. Now, the `_migrate_external_table` method in `table_migrate.py` fetches the result of the `SYNC` command execution, logs a warning message for failures, and returns `False` if the command fails. A new integration test has been added to simulate a failed `SYNC` command due to a non-existent catalog and schema, ensuring the migration tool handles such failures, and a new test case verifies the handling of `SYNC` command failures using a mock backend to simulate failures and checking for appropriate log messages. These changes enhance the reliability and robustness of the migration process, providing clearer error diagnosis and handling for potential `SYNC` command failures.
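  A sketch of checking the `SYNC` result; the backend object, its `fetch` method, and the result column names are assumptions for this sketch, not the project's confirmed internals:

  ```python
  # Fetch and inspect the SYNC command's result instead of ignoring it;
  # `backend` and the table names are placeholders.
  import logging

  logger = logging.getLogger(__name__)

  def sync_external_table(backend, src: str, dst: str) -> bool:
      for row in backend.fetch(f"SYNC TABLE {dst} FROM {src}"):
          if row.status_code != "SUCCESS":
              logger.warning(f"SYNC failed for {src}: {row.description}")
              return False
      return True
  ```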
- Added initial version of `databricks labs ucx migrate-local-code` command (#1067). A new `databricks labs ucx migrate-local-code` command has been added to facilitate migration of local code to a Databricks environment, specifically targeting Python and SQL files. This initial version is experimental and aims to help users and administrators manage code migration, maintain consistency across workspaces, and enhance compatibility with the Unity Catalog, a component of Databricks' data and AI offerings. The command introduces a new `Files` class for applying migrations to code files, considering their language. It also updates the `.gitignore` file and the `pyproject.toml` file to ensure appropriate version control management. Additionally, new classes and methods have been added to support code analysis, transformation, and linting for various programming languages. These improvements will aid in streamlining the migration process and ensuring compatibility with Databricks' environment.
- Added instance pool to cluster policy (#1078). A new field, `instance_pool_id`, has been added to the cluster policy configuration in `policy.py`, allowing users to specify the ID of an instance pool to be applied to all workflow clusters in the policy. This ID can be manually set or automatically retrieved by the system. A new private method, `_get_instance_pool_id()`, has been added to handle the retrieval of the instance pool ID. Additionally, a new test for table migration jobs has been added to `test_installation.py` to ensure the migration job is correctly configured with the specified parallelism, minimum and maximum number of workers, and instance pool ID. A new test case for creating a cluster policy with an instance pool has also been added to `tests/unit/installer/test_policy.py` to ensure the instance pool is added to the cluster policy during creation. These changes provide users with more control over instance pools and cluster policies, and improve the overall functionality of the library.
- Fixed `ucx move` logic for `MANAGED` & `EXTERNAL` tables (#1062). The `ucx move` command has been updated to allow for the movement of UC tables/views after the table upgrade process, providing flexibility in managing catalog structure. The command now supports moving multiple tables simultaneously, dropping managed tables/views upon confirmation, and deep-cloning managed tables while dropping and recreating external tables. A refactoring of the `TableMove` class has improved code organization and readability, and the associated unit tests have been updated to reflect these changes. This feature is targeted towards developers and administrators seeking to adjust their catalog structure after table upgrades, with the added ability to manage exceptional conditions gracefully.
- Fixed integration testing with random product names (#1074). In the recent update, the `trigger` function in the `tasks.py` module of the `ucx` framework has been modified to incorporate a new argument, `install_folder`, within the `Installation` object. This object is now generated locally within the `trigger` function and subsequently passed to the `run_task` function. The `install_folder` is determined by obtaining the parent directory of the `config_path` variable, transforming it into a POSIX-style path, and eliminating the leading "/Workspace" prefix. This ensures that the `run_task` function acquires the correct installation folder for the `ucx` framework. Furthermore, the `Installation.current` method has been supplanted by the newly formed `Installation` object, which now encompasses the `install_folder` argument.
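  A sketch of the path derivation described above; the variable names mirror the entry, the values are made up:

  ```python
  # Derive install_folder from config_path: parent directory, POSIX style,
  # without the leading "/Workspace" prefix.
  from pathlib import PurePosixPath

  config_path = "/Workspace/Applications/ucx/config.yml"  # example value
  install_folder = PurePosixPath(config_path).parent.as_posix().removeprefix("/Workspace")
  print(install_folder)  # /Applications/ucx
  ```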
- Refactor installer to separate workflows methods from the installer class (#1055). In this release, the installer in the `cli.py` file has been refactored to improve modularity and maintainability. The installation and workflow functionalities have been separated by importing a new class called `WorkflowsInstallation` from `databricks.labs.ucx.installer.workflows`. The `WorkspaceInstallation` class is no longer used in various functions, and the new `WorkflowsInstallation` class is used instead. Additionally, a new mixin class called `InstallationMixin` has been introduced, which includes methods for uninstalling UCX, removing jobs, and validating installation steps. The `WorkflowsInstallation` class now inherits from this mixin class. A new file, `workflows.py`, has been added to the `databricks/labs/ucx/installer` directory, which contains methods for managing Databricks jobs. The new `WorkflowsInstallation` class is responsible for deploying workflows, uploading wheels to DBFS or WSFS, and creating debug notebooks. The refactoring also includes the addition of new methods for handling specific workflows, such as `run_workflow`, `validate_step`, and `repair_run`, which are now contained in the `WorkflowsInstallation` class. The `test_install.py` file in the `tests/unit` directory has also been updated to include new imports and test functions to accommodate these changes.
- Skip unsupported locations while migrating to external location in Azure (#1066). In this release, we have updated the functionality of migrating to an external location in Azure. A new private method, `_filter_unsupported_location`, has been added to the `locations.py` file, which checks if the location URLs are supported and removes the unsupported ones from the list. Only locations starting with "abfss://" are considered supported; unsupported locations are logged with a warning message. Additionally, a new test, `test_skip_unsupported_location`, has been introduced to verify that the `location_migration` function correctly skips unsupported locations during migration to external locations in Azure. The test checks if the correct log messages are generated for skipped unsupported locations, and it mocks various scenarios such as crawled HMS external locations, storage credentials, UC external locations, and installation with permission mapping. The mock crawled HMS external locations contain two unsupported locations: `adl://` and `wasbs://`. This ensures that the function handles unsupported locations correctly, avoiding unnecessary errors or exceptions during migration.
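  A sketch of such a filter; the signature follows this entry, the logger setup is illustrative:

  ```python
  # Keep only abfss:// locations; warn about and drop anything else
  # (e.g. adl:// or wasbs://). A sketch of the described filtering.
  import logging

  logger = logging.getLogger(__name__)

  def _filter_unsupported_location(location_urls: list[str]) -> list[str]:
      supported = []
      for url in location_urls:
          if url.startswith("abfss://"):
              supported.append(url)
          else:
              logger.warning(f"Skipping unsupported location: {url}")
      return supported
  ```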
- Triggering Assessment Workflow from Installer based on User Prompt (#1007). New functionality has been added to the installer that allows users to trigger an assessment workflow based on a prompt during the installation process. The `_trigger_workflow` method has been implemented, which can be initiated with a step string argument. This method retrieves the job ID for the specified step from the `_state.jobs` dictionary, generates the job URL, and triggers the job using the `run_now` method from the `jobs` class of the Workspace object. Users will be asked to confirm triggering the assessment workflow and will have the option to open the job URL in a web browser after triggering it. A new unit test, `test_triggering_assessment_wf`, has been introduced in the `test_install.py` file to verify this behavior. This test uses existing classes and functions, such as `MockBackend`, `MockPrompts`, `WorkspaceConfig`, and `WorkspaceInstallation`, to run the `WorkspaceInstallation.run` method with a mocked `WorkspaceConfig` object and a mock installation, and includes user prompts to confirm triggering the assessment job and opening the assessment job URL. The new functionality and test improve the installation process by enabling users to easily trigger the assessment workflow based on their specific needs.
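  A sketch of that prompt-and-trigger flow, assuming a `WorkspaceClient` `ws`, a blueprint-style `prompts` object, and an installer state mapping step names to job IDs (all placeholders):

  ```python
  # Prompt, optionally open the job page, then trigger the run; a sketch of
  # the _trigger_workflow behavior with placeholder state and prompts.
  import webbrowser

  def trigger_workflow(ws, prompts, state_jobs: dict[str, int], step: str = "assessment"):
      job_id = state_jobs[step]
      job_url = f"{ws.config.host}/#job/{job_id}"  # assumed URL shape
      if not prompts.confirm(f"Trigger the {step} workflow?"):
          return
      if prompts.confirm(f"Open the job in a browser? {job_url}"):
          webbrowser.open(job_url)
      ws.jobs.run_now(job_id=job_id)
  ```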
- Updated README.md for Service Principal Installation Limit (#1076). This release includes an update to the README.md file to clarify that installing UCX with a Service Principal is not supported. Previously, the file indicated that Databricks Workspace Administrator privileges were required for the user running the installation, but did not explicitly state that Service Principal installation is not supported. The updated text now includes this information, ensuring that users have a clear understanding of the requirements and limitations of the installation process. The rest of the file remains unchanged and continues to provide instructions for installing UCX, including required software and network access. No new methods or functionality have been added, and no existing functionality has been changed beyond this clarification. The changes in this release have been manually tested to ensure they are functioning as intended.
- Added AWS IAM role support to `databricks labs ucx create-uber-principal` command (#993). The `databricks labs ucx create-uber-principal` command now supports AWS Identity and Access Management (IAM) roles for external table migration. This new feature introduces a CLI command to create an `uber-IAM` profile, which checks for the UCX migration cluster policy and updates or adds the migration policy to provide access to the relevant table locations. If no IAM instance profile or role is specified in the cluster policy, a new one is created and the new migration policy is added. This change includes new methods and functions to handle AWS IAM roles, instance profiles, and related trust policies. Additionally, new unit and integration tests have been added and verified on the staging environment. The implementation also identifies all S3 buckets used by the instance profiles configured in the workspace.
- Added Dashboard widget to show the list of cluster policies along with DBR version (#1013). In this code revision, the `assessment` module of the `databricks/labs/ucx` package has been updated to include a new `PoliciesCrawler` class, which fetches, assesses, and snapshots cluster policies. This class extends `CrawlerBase` and `CheckClusterMixin` and introduces the `_crawl`, `_assess_policies`, `_try_fetch`, and `snapshot` methods. The `PolicyInfo` dataclass has been added to hold policy information, with a structure similar to the `ClusterInfo` dataclass. The `ClusterInfo` dataclass has been updated to include `spark_version` and `policy_id` attributes. A new table for policies has been added, and cluster policies along with the DBR version are loaded into this table. Relevant user documentation, tests, and a dashboard widget have been added to support this feature. The `create` function in `fixtures.py` has been updated to enable a Delta preview feature in Spark configurations, and a new SQL file has been included for querying cluster policies. Additionally, a new `crawl_cluster_policies` method has been added to scan and store cluster policies with matching configurations.
- Added `migration_status` table to capture a snapshot of migrated tables (#1041). A `migration_status` table has been added to track the status of migrated tables in the database, enabling improved management and tracking of migrations. The new `MigrationStatus` class, a dataclass that holds the source and destination schema, table, and updated timestamp, is added. The `TablesMigrate` class now has a new `_migration_status_refresher` attribute that is an instance of the new `MigrationStatusRefresher` class, which crawls the `migration_status` table and returns a snapshot of the migration status; this snapshot is used to refresh the migration status and check whether a table is upgraded. Additionally, the `_init_seen_tables` method is updated to get the seen tables from the `_migration_status_refresher` instead of fetching them from the table properties. This change also adds new test functions in the test file for the Hive metastore, covering scenarios such as migrating managed tables with and without caching, migrating external tables, and reverting migrated tables.
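  A sketch of what such a record could look like; field names beyond "source/destination schema, table, and updated timestamp" are assumptions:

  ```python
  # Hypothetical shape of the MigrationStatus dataclass described above.
  from dataclasses import dataclass

  @dataclass
  class MigrationStatus:
      src_schema: str
      src_table: str
      dst_catalog: str | None = None
      dst_schema: str | None = None
      dst_table: str | None = None
      update_ts: str | None = None  # when the status was last refreshed
  ```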
- Added a check for existing inventory database to avoid losing existing data, inject installation objects in tests, and try fetching an existing installation before setting the global one as default (#1043). In this release, we have added a new method, `_check_inventory_database_exists`, to the `WorkspaceInstallation` class, which checks if an inventory database with a given name already exists in the Workspace. This prevents accidental overwriting of existing data and improves the robustness of handling inventory databases. The `validate_and_run` method has been updated to call `app.current_installation(workspace_client)`, allowing for more flexible handling of installations. The `Installation` class import has been updated to include `SerdeError`, and the test suite has been updated to inject installation objects and check for existing installations before setting the global installation as default. A new argument, `inventory_schema_suffix`, has been added to the `factory` method for customization of the inventory schema name. We have also added a new method, `check_inventory_database_exists`, to the `WorkspaceInstaller` class, which checks if an inventory database already exists for a given installation type and raises an `AlreadyExists` error if it does. The behavior of the `download` method in the `WorkspaceClient` class has been mocked, and the `get_status` method has been updated to return `NotFound` in certain tests. These changes aim to improve the robustness, flexibility, and safety of the installation process in the Workspace.
- Added a check for external metastore in SQL warehouse configuration (#1046). In this release, we have added new functionality to the Unity Catalog (UCX) installation process to enable checking for and connecting to an external Hive metastore configuration. A new method, `_get_warehouse_config_with_external_hive_metastore`, has been introduced to retrieve the workspace warehouse config and identify whether it is set up for an external Hive metastore. If so, and the user confirms the prompt, UCX will be configured to connect to the external metastore. Additionally, new methods `_extract_external_hive_metastore_sql_conf` and `test_cluster_policy_definition_<cloud_provider>_hms_warehouse()` have been added to handle the external metastore configuration for Azure, AWS, and GCP, and to handle the case when the `data_access_config` is empty. These changes provide more flexibility and ease of use when installing UCX with external Hive metastore configurations. The new imports `EndpointConfPair` and `GetWorkspaceWarehouseConfigResponse` from the `databricks.sdk.service.sql` package are used to handle the endpoint configuration of the SQL warehouse.
- Added integration tests for AWS - create locations (#1026). In this release, we have added comprehensive integration tests for AWS resources and their management in the `tests/unit/assessment/test_aws.py` file. The `AWSResources` class has been updated with new methods (`AwsIamRole`, `add_uc_role`, `add_uc_role_policy`, and `validate_connection`), and the regular expression for matching S3 resource ARNs has been modified. The `create_external_locations` method now allows for creating external locations without validating them, and the `_identify_missing_external_locations` function has been enhanced to match roles with a wildcard pattern. The new tests include validating the integration of AWS services with the system, testing the CLI's behavior when it is missing, and introducing new configuration scenarios with the addition of a Key Management Service (KMS) key during the creation of IAM roles and policies. These changes improve the robustness and reliability of AWS resource integration and handling in our system.
- Bump Databricks SDK to v0.22.0 (#1059). In this release, we are bumping the Databricks SDK version to 0.22.0 and upgrading the `databricks-labs-lsql` package to ~0.2.2. The new dependencies for this release include `databricks-sdk==0.22.0`, `databricks-labs-lsql~=0.2.2`, `databricks-labs-blueprint~=0.4.3`, and `PyYAML>=6.0.0,<7.0.0`. In the `fixtures.py` file, we have added `PermissionLevel.CAN_QUERY` to the `CAN_VIEW` and `CAN_MANAGE` permissions in the `_path` function, allowing users to query the endpoint. Additionally, we have updated the `test_endpoints` function in the `test_generic.py` file as part of the integration tests for workspace access. This change updates the permission level for creating a serving endpoint from `CAN_MANAGE` to `CAN_QUERY`, meaning that the assigned group can now only query the endpoint. We have also included the `test_feature_tables` function in the commit, which tests the behavior of feature tables in the Databricks workspace. This change only affects the `test_endpoints` function and its assert statements, and does not impact the functionality of the `test_feature_tables` function.
- Changed default UCX installation folder to `/Applications/ucx` from `/Users/<me>/.ucx` to allow multiple users to utilise the same installation (#854). In this release, we've added a new advanced feature that allows users to force the installation of UCX over an existing installation using the `UCX_FORCE_INSTALL` environment variable. This variable can take two values, `global` and `user`, providing more control and flexibility in installing UCX. The default UCX installation folder has been changed to `/Applications/ucx` from `/Users/<me>/.ucx` to enable multiple users to utilize the same installation. A table detailing the expected install location, `install_folder`, and mode for each combination of global and user values has been added to the README file. We've also added user prompts to confirm the installation if UCX is already installed and the `UCX_FORCE_INSTALL` variable is set to `user`. This feature is useful when users want to install UCX in a specific location or force the installation over an existing one. However, it is recommended to use this feature with caution, as it can potentially break existing installations if not used correctly. Additionally, several changes to the implementation of the UCX installation process have been made, as well as new tests to ensure that the installation process works correctly in various scenarios.
- Fix: Recover lost fix for `webbrowser.open` mock (#1052). A fix has been implemented to address an issue related to the mock for `webbrowser.open` in the tests `test_repair_run` and `test_get_existing_installation_global`. This change prevents the `webbrowser.open` function from being called during these tests, which helps improve test stability and consistency. No new methods have been added, and the existing functionality of these tests has only been modified to include the `webbrowser.open` mock. This modification aims to enhance the reliability and predictability of these specific tests, ensuring accurate and consistent results.
- Improved table migrations logic (#1050). This change introduces improvements to table migrations logic by refactoring unit tests to load table mappings from JSON instead of inline structs, adding an `escape_sql_identifier` function where missing, and preparing for ACLs migration. The `uc_grant_sql` method in `grants.py` has been updated to accept optional `object_type` and `object_key` parameters, and the hive-to-UC mapping has been expanded to include mappings for views. Additionally, new JSON files for external source table configuration have been added, and new functions have been introduced for loading fixture data from JSON files and creating mocked `WorkspaceClient` and `TableMapping` objects for testing. The changes improve the maintainability and security of the codebase, prepare it for future migration tasks, and make the code more adaptable and robust. The changes have been manually tested and verified on the staging environment.
- Moved `SqlBackend` implementation to `databricks-labs-lsql` dependency (#1042). In this change, the `SqlBackend` implementation, including classes such as `StatementExecutionBackend` and `RuntimeBackend`, has been moved to a separate library, `databricks-labs-lsql`, which is managed at https://github.com/databrickslabs/lsql. This refactoring simplifies the current repository, promotes code reuse, and improves modularity by leveraging an external dependency. The modification includes adding a new line in the `.gitignore` file to exclude `*.out` files from version control.
- Prepare for a PyPI release (#1038). In preparation for a PyPI release, this change introduces a new GitHub Actions workflow that automates the package release process and ensures the integrity of the released packages by signing them with Sigstore. When a new git tag starting with `v` is pushed, this workflow is triggered, building wheels using hatch, drafting a new GitHub release, publishing the package distributions to PyPI, and signing the artifacts with Sigstore. The `pyproject.toml` file is now used for metadata, replacing `setup.cfg` and `setup.py`, and is cached to improve build performance. In addition, the `pyproject.toml` file has been updated with recent metadata in preparation for the release, including updates to the package's authors, development status, classifiers, and dependencies.
- Prevent fragile `mock.patch('databricks...')` in the test code (#1037). This change introduces a custom `pylint` checker to improve code flexibility and maintainability by preventing fragile `mock.patch` designs in test code. The new checker discourages the use of `MagicMock` and encourages the use of `create_autospec` to ensure that mocks have the same attributes and methods as the original class. This change has been implemented in multiple test files, including `test_cli.py`, `test_locations.py`, `test_mapping.py`, `test_table_migrate.py`, `test_table_move.py`, `test_workspace_access.py`, `test_redash.py`, `test_scim.py`, and `test_verification.py`, to improve the robustness and maintainability of the test code. Additionally, the commit removes the `verification.py` file, which contained a `VerificationManager` class for verifying applied permissions, scope ACLs, roles, and entitlements for various objects in a Databricks workspace.
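  An example of the pattern the checker encourages; `WorkspaceClient` stands in for any class under test:

  ```python
  # create_autospec mirrors the real class's attributes, so misspelled or
  # nonexistent attributes fail fast; a plain MagicMock would accept anything.
  from unittest.mock import create_autospec

  from databricks.sdk import WorkspaceClient

  ws = create_autospec(WorkspaceClient)
  ws.current_user.me()   # allowed: the attribute exists on the real class
  # ws.does_not_exist    # AttributeError: unlike MagicMock, typos fail fast
  ```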
mocker.patch("databricks...)
fromtest_cli
(#1047). In this release, we have made significant updates to the library's handling of Azure and AWS workspaces. We have added new parametersazure_resource_permissions
andaws_permissions
to the_execute_for_cloud
function incli.py
, which are passed to thefunc_azure
andfunc_aws
functions respectively. Thecreate_uber_principal
andprincipal_prefix_access
commands have also been updated to include these new parameters. Additionally, the_azure_setup_uber_principal
and_aws_setup_uber_principal
functions have been updated to accept the newazure_resource_permissions
andaws_resource_permissions
parameters. The_azure_principal_prefix_access
and_aws_principal_prefix_access
functions have also been updated similarly. We have also introduced a newaws_resources
parameter in themigrate_credentials
command, which is used to migrate Azure Service Principals in ADLS Gen2 locations to UC storage credentials. In terms of testing, we have replaced themocker.patch
calls with the creation ofAzureResourcePermissions
andAWSResourcePermissions
objects, improving the code's readability and maintainability. Overall, these changes significantly enhance the library's functionality and maintainability in handling Azure and AWS workspaces. - Require Hatch v1.9.4 on build machines (#1049). In this release, we have updated the Hatch package version to 1.9.4 on build machines, addressing issue #1049. The changes include updating the toolchain dependencies and setup in the
.codegen.json
file, which simplifies the setup process and now relies on a pre-existing Hatch environment and Python 3. The acceptance workflow has also been updated to use the latest version of Hatch and thedatabrickslabs/sandbox/acceptance
GitHub action versionv0.1.4
. Hatch is a Python package manager that simplifies package development and management, and this update provides new features and bug fixes that can help improve the reliability and performance of the acceptance workflow. This change requires version 1.9.4 of the Hatch package on build machines, and it will affect the build process for the project but will not have any impact on the functionality of the project itself. As a software engineer adopting this project, it's important to note this change to ensure that the build process runs smoothly and takes advantage of any new features or improvements in Hatch 1.9.4. - Set acceptance tests to timeout after 45 minutes (#1036). As part of issue #1036, the acceptance tests in this open-source library now have a 45-minute timeout configured, improving the reliability and stability of the testing environment. This change has been implemented in the
.github/workflows/acceptance.yml
file by adding thetimeout
parameter to the step where thedatabrickslabs/sandbox/acceptance
action is called. This ensures that the acceptance tests will not run indefinitely and prevents any potential issues caused by long-running tests. By adopting this project, software engineers can now benefit from a more stable and reliable testing environment, with acceptance tests that are guaranteed to complete within a maximum of 45 minutes. - Updated databricks-labs-blueprint requirement from ~0.4.1 to ~0.4.3 (#1058). In this release, the version requirement for the
databricks-labs-blueprint
library has been updated from ~0.4.1 to ~0.4.3 in the pyproject.toml file. This change is necessary to support issues #1056 and #1057. The code has been manually tested and is ready for further testing to ensure the compatibility and smooth functioning of the software. It is essential to thoroughly test the latest version of thedatabricks-labs-blueprint
library with the existing codebase before deploying it to production. This includes running a comprehensive suite of tests such as unit tests, integration tests, and verification on the staging environment. This modification allows the software to use the latest version of the library, improving its functionality and overall performance. - Use
MockPrompts.extend()
functionality in test_install to supply multiple prompts (#1057). This diff introduces theMockPrompts.extend()
functionality in thetest_install
module to enable the supplying of multiple prompts for testing purposes. A newbase_prompts
dictionary with default prompts has been added and is extended with additional prompts for specific test cases. This allows for the testing of various scenarios, such as when UCX is already installed on the workspace and the user is prompted to choose between global or user installation. Additionally, newforce_user_environ
andforce_global_env
dictionaries have been added to simulate different installation environments. The functionality of theWorkspaceInstaller
class and mocking ofwebbrowser.open
are also utilized in the test cases. These changes aim to ensure the proper functioning of the configuration process for different installation scenarios.
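  A sketch of the pattern, assuming `MockPrompts` from `databricks.labs.blueprint.tui`; the prompt patterns themselves are illustrative:

  ```python
  # Extend a base set of mocked prompts for one specific test scenario.
  from databricks.labs.blueprint.tui import MockPrompts

  base_prompts = MockPrompts({
      r"Open config file in.*": "no",
      r".*": "",  # default answer for anything unmatched
  })
  prompts = base_prompts.extend({
      r".*UCX is already installed.*": "yes",  # scenario-specific override
  })
  ```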
- Added AWS IAM roles support to `databricks labs ucx migrate-credentials` command (#973). This commit adds AWS Identity and Access Management (IAM) roles support to the `databricks labs ucx migrate-credentials` command, resolving issue #862 and relating to pull request #874. It includes the addition of a `load` function to `AWSResourcePermissions` to return identified instance profiles, and the creation of an `IamRoleMigration` class under `aws/credentials.py` to migrate identified AWS instance profiles. Additionally, user documentation and a new CLI command `databricks labs ucx migrate-credentials` have been added, and the changes have been thoroughly tested with manual, unit, and integration tests. The functionality additions include new methods such as `add_uc_role_policy` and `update_uc_trust_role`, among others, designed to facilitate the migration process for AWS IAM roles.
- Added `create-catalogs-schemas` command to prepare destination catalogs and schemas before table migration (#1028). The Databricks Labs Unity Catalog (UCX) tool has been updated with a new `create-catalogs-schemas` command to facilitate the creation of destination catalogs and schemas prior to table migration. This command should be executed after the `create-table-mapping` command and is designed to prepare the workspace for migrating tables to UC. Additionally, a new `CatalogSchema` class has been added to the `hive_metastore` package to manage the creation of catalogs and schemas. This new functionality simplifies the process of preparing the destination for table migration, reducing the likelihood of user errors and ensuring that the metastore is properly configured. Unit tests have been added to the `tests/unit/hive_metastore` directory to verify the behavior of the `CatalogSchema` class and the new `create-catalogs-schemas` command. This command is intended for use in contexts where GCP is not supported.
- Added automated upgrade option to set up cluster policy (#1024). This commit introduces an automated upgrade option for setting up a cluster policy for older versions of UCX, separating the cluster creation policy from `install.py` to `installer.policy.py` and adding an upgrade script for older UCX versions. A new class, `ClusterPolicyInstaller`, is added to the `policy.py` file in the `installer` package to manage the creation and update of a Databricks cluster policy for Unity Catalog migration. This class handles creating a new cluster policy with specific configurations, extracting external Hive Metastore configurations, and updating job policies. Additionally, the commit includes refactoring, removal of library references, and a new script, `v0.15.0_added_cluster_policy.py`, which contains the upgrade function. The changes are tested through manual and automated testing with unit tests and integration tests.
- Added crawling for init scripts on local files to assessment workflow (#960). This commit introduces the ability to crawl init scripts stored on local files and S3 as part of the assessment workflow, resolving issue #9
- Added database filter for the `assessment` workflow (#989). In this release, we have added a new configuration option, `include_databases`, to the assessment workflow, which allows users to specify a list of databases to include for migration rather than crawling all the databases in the Hive Metastore. This feature is implemented in the `TablesCrawler`, `UdfsCrawler`, and `GrantsCrawler` classes and the associated functions such as `_all_databases`, `getIncludeDatabases`, and `_select_databases`. These changes aim to improve efficiency and reduce unnecessary crawling, and are accompanied by modifications to existing functionality, as well as the addition of unit and integration tests. The changes have been manually tested and verified on a staging environment.
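  A sketch of the configuration knob, assuming `WorkspaceConfig` accepts `include_databases` as described in this entry; the other field values are illustrative:

  ```python
  # Restrict the assessment crawl to a known set of databases;
  # a sketch, the surrounding config fields are placeholders.
  from databricks.labs.ucx.config import WorkspaceConfig

  config = WorkspaceConfig(
      inventory_database="ucx",
      include_databases=["sales", "finance"],  # crawl only these
  )
  ```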
- Estimate migration effort based on assessment database (#1008). In this release, new functionality has been added to estimate the migration effort for each asset in the assessment database. The estimation is presented in days and is displayed on a new estimates dashboard with a summary widget for a global estimate per object type, along with assumptions and scope for each object type. A new `query` parameter has been added to the `SimpleQuery` class to support this feature. Additional changes include updates to the `_install_viz` and `_install_query` methods, the inclusion of the `data_source_id` in the query metadata, and the addition of tests to ensure the proper functioning of the new feature. A new fixture, `mock_installation_with_jobs`, has been added to support testing of the assessment estimates dashboard.
- Explicitly write to `hive_metastore` from `crawl_tables` task (#1021). In this release, we have improved the clarity and specificity of our handling of the `hive_metastore` in the `crawl_tables` task. Previously, the `df.write.saveAsTable` method was used without explicitly specifying the `hive_metastore` catalog, which could result in ambiguity. To address this, the `saveAsTable` call now includes the `hive_metastore` qualifier, ensuring that tables are written to the correct location in the Hive metastore. These changes are confined to the `src/databricks/labs/ucx/hive_metastore/tables.scala` file and affect only the `crawl_tables` task; the existing `saveAsTable` call has been modified to make the interaction with the Hive metastore accurate and predictable.
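  A PySpark illustration of the fully qualified write (the project's change itself lives in `tables.scala`); the schema and table names are placeholders:

  ```python
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([("hive_metastore", "default", "t1")],
                             ["catalog", "database", "name"])
  # Fully qualify the target with the hive_metastore catalog so the write
  # is unambiguous; the schema name "ucx" is a placeholder.
  df.write.mode("overwrite").saveAsTable("hive_metastore.ucx.tables")
  ```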
- Improved documentation for `databricks labs ucx move` command (#1025). The `databricks labs ucx move` command has been updated with improvements to its documentation, providing enhanced clarity and ease of use for developers and administrators. This command facilitates the movement of UC table(s) from one schema to another, in the same or a different catalog, during the table upgrade process. A significant enhancement is the preservation of the source table's permissions when moving to a new schema or catalog: the original table's access controls are maintained, simplifying the management of table permissions and streamlining the migration process. These improvements aim to provide a more efficient table migration experience, ensuring that developers and administrators can effectively manage their UC tables while maintaining the desired level of access control and security.
- Updated databricks-sdk requirement from ~=0.20.0 to ~=0.21.0 (#1030). In this update, the `databricks-sdk` package requirement has been updated to version `~=0.21.0` from `~=0.20.0`. This new version addresses several bugs and provides enhancements, including a fix for the `get_workspace_client` method in GCP, the use of the `all-apis` scope with the external browser, and an attempt to initialize all Databricks globals. Moreover, the API's settings nesting approach has changed, which may cause compatibility issues with previous versions. Several new services and dataclasses have been added to the API, and documentation and examples have been updated accordingly. There are no updates to the `databricks-labs-blueprint` and `PyYAML` dependencies in this commit.
- Added AWS S3 support for
migrate-locations
command (#1009). In this release, the open-source library has been enhanced with AWS S3 support for themigrate-locations
command, enabling efficient and secure management of S3 data. The new functionality includes the identification of missing S3 prefixes and the creation of corresponding roles and policies through the addition of methods_identify_missing_paths
,_get_existing_credentials_dict
, andcreate_external_locations
. The library now also includes new classesAwsIamRole
,ExternalLocationInfo
, andStorageCredentialInfo
for better handling of AWS-related functionality. Additionally, two new tests,test_create_external_locations
andtest_create_external_locations_skip_existing
, have been added to ensure the correct behavior of the new AWS-related functionality. The new test functiontest_migrate_locations_aws
checks the AWS-specific implementation of themigrate-locations
command, whiletest_missing_aws_cli
verifies the correct error message is displayed when the AWS CLI is not found in the system path. These changes enhance the library's capabilities, improving data security, privacy, and overall performance for users working with AWS S3. - Added
databricks labs ucx create-uber-principal
command to create Azure Service Principal for migration (#976). The new CLI command,databricks labs ucx create-uber-principal
, has been introduced to create an Azure Service Principal (SPN) and grant it STORAGE BLOB READER access on all the storage accounts used by the tables in the workspace. The SPN information is then stored in the UCX cluster policy. A new class, AzureApiClient, has been added to isolate Azure API calls, and unit and integration tests have been included to verify the functionality. This development enhances migration capabilities for Azure workspaces, providing a more streamlined and automated way to create and manage Service Principals, and improves the functionality and usability of the UCX tool. The changes are well-documented and follow the project's coding standards. - Added
migrate-locations
command (#1016). In this release, we've added a new CLI command,migrate_locations
, to create Unity Catalog (UC) external locations. This command extracts candidates for location creation from theguess_external_locations
assessment task and checks if corresponding UC Storage Credentials exist before creating the locations. Currently, the command only supports Azure, with plans to add support for AWS and GCP in the future. Themigrate_locations
function is marked with theucx.command
decorator and is available as a command-line interface (CLI) command. The pull request also includes unit tests for this new command, which check the environment (Azure, AWS, or GCP) before executing the migration and log a message if the environment is AWS or GCP, indicating that the migration is not yet supported on those platforms. No changes have been made to existing workflows, commands, or tables. - Added handling for widget delete on upgrade platform bug (#1011). In this release, the
- Added handling for widget delete on upgrade platform bug (#1011). In this release, the `_install_dashboard` method in `dashboards.py` has been updated to handle a platform bug that occurred during the deletion of dashboard widgets during an upgrade process (issue #1011). Previously, the method attempted to delete each widget using the `self._ws.dashboard_widgets.delete(widget.id)` command, which resulted in a `TypeError` when attempting to delete a widget. The updated method now includes a try/except block that catches this `TypeError` and logs a warning message, while also tracking the issue under bug ES-1061370. The rest of the method remains unchanged, creating a dashboard with the given name, role, and parent folder ID if no widgets are present. This enhancement improves the robustness of the `_install_dashboard` method by adding error handling for the SDK API response when deleting dashboard widgets, ensuring a smoother upgrade process. The error-handling pattern is sketched below.
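
A minimal sketch of that try/except pattern, assuming a `WorkspaceClient` named `ws` and a dashboard object with a `widgets` attribute (names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def delete_widgets(ws, dashboard):
    for widget in dashboard.widgets or []:
        try:
            ws.dashboard_widgets.delete(widget.id)
        except TypeError:
            # Platform bug tracked as ES-1061370: the delete response cannot be
            # deserialized; log a warning and carry on with the upgrade.
            logger.warning(f"Cannot delete widget {widget.id}, skipping")
```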
- Create UC external locations in Azure based on migrated storage credentials (#992). The `locations.py` file in the `databricks.labs.ucx.azure` package has been updated to include a new class `ExternalLocationsMigration`, which creates UC external locations in Azure based on migrated storage credentials. This class takes various arguments, including `WorkspaceClient`, `HiveMetastoreLocations`, `AzureResourcePermissions`, and `AzureResources`. It has a `run()` method that lists any missing external locations in UC, extracts their location URLs, and attempts to create a UC external location with a mapped storage credential name if the missing external location is in the mapping. The class also includes helper methods for generating credential name mappings. Additionally, the `resources.py` file in the same package has been modified to include a new method `managed_identity_client_id`, which retrieves the client ID of a managed identity associated with a given access connector. Test functions for the `ExternalLocationsMigration` class and Azure external locations functionality have been added in the new file `test_locations.py`. The `test_resources.py` file has been updated to include tests for the `managed_identity_client_id` method. A new `mappings.json` file has also been added for tests related to Azure external location mappings based on migrated storage credentials. A skeleton of the `run()` flow is sketched below.
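
A simplified skeleton of the `run()` flow described above; the helper methods and the exact mapping format are illustrative stand-ins for the real UCX internals:

```python
class ExternalLocationsMigration:
    def __init__(self, ws, hms_locations, resource_permissions, azure_resources):
        self._ws = ws  # databricks.sdk WorkspaceClient
        self._hms_locations = hms_locations
        self._resource_permissions = resource_permissions
        self._azure_resources = azure_resources

    def _missing_external_location_urls(self) -> list[str]:
        return []  # placeholder: HMS locations minus existing UC external locations

    def _prefix_to_credential_name_mapping(self) -> dict[str, str]:
        return {}  # placeholder: container URL prefix -> migrated credential name

    def run(self) -> list[str]:
        migrated = []
        mapping = self._prefix_to_credential_name_mapping()
        for url in self._missing_external_location_urls():
            for prefix, credential_name in mapping.items():
                if url.startswith(prefix):
                    # create a UC external location backed by the migrated credential
                    self._ws.external_locations.create(url, url, credential_name)
                    migrated.append(url)
                    break
        return migrated
```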
- Deprecate legacy installer (#1014). In this release, we have deprecated the legacy installer for the UCX project, which was previously implemented as a bash script. A warning message has been added to inform users about the deprecation and direct them to the UCX installation instructions. The functionality of the script remains unchanged, and it still performs tasks such as installing Python dependencies and building Python bindings. The script will eventually be replaced with the `databricks labs install ucx` command. This change is part of issue #1014 and is intended to streamline the installation process and improve the overall user experience. We recommend that users update their installation process to the new recommended method as soon as possible to avoid any issues with the legacy installer in the future.
- Prompt user if Terraform utilised for deploying infrastructure (#1004). In this update, the `config.py` file has been modified to include a new attribute, `is_terraform_used`, in the `WorkspaceConfig` class. This boolean flag indicates whether Terraform has been used for deploying certain entities in the workspace. Issue #393 has been addressed with this change. The `WorkspaceInstaller` configuration has also been updated to take advantage of this new attribute, allowing developers to determine if Terraform was used for infrastructure deployment, thereby increasing visibility into the deployment process. Additionally, a new prompt has been added to the `warehouse_type` function to ascertain if Terraform is being utilized for infrastructure deployment, setting the `is_terraform_used` variable to True if it is. This improvement is intended for software engineers adopting this open-source library.
- Updated CONTRIBUTING.md (#1005). In this contribution to the open-source library, the CONTRIBUTING.md file has been significantly updated with clearer instructions on how to effectively contribute to the project. The previous command to print the Python path has been removed, as the IDE is now advised to be configured to use the Python interpreter from the virtual environment. A new step has been added, recommending the use of a consistent styleguide and formatting of the code before every commit. Moreover, it is now encouraged to run tests before committing to minimize potential issues during the review process. The steps on how to make a Fork from the ucx repo and create a PR have been updated with links to official documentation. Lastly, the commit now includes information on handling dependency errors that may occur after `git pull`.
- Updated databricks-labs-blueprint requirement from ~=0.2.4 to ~=0.3.0 (#1001). In this pull request update, the requirements file, pyproject.toml, has been modified to upgrade the databricks-labs-blueprint package from version ~0.2.4 to ~0.3.0. This update integrates the latest features and bug fixes of the package, including an automated upgrade framework, a brute-forcing approach for handling SerdeError, and enhancements for running nightly integration tests with service principals. These improvements increase the testability and functionality of the software, ensuring its stable operation with service principals during nightly integration tests. Furthermore, the reliability of the test for detecting existing installations has been reinforced by adding a new test function that checks for the correct detection of existing installations and retries the test for up to 15 seconds if they are not.
Dependency updates:
- Updated databricks-labs-blueprint requirement from ~=0.2.4 to ~=0.3.0 (#1001).
- Added `upgraded_from_workspace_id` property to migrated tables to indicate the source workspace (#987). In this release, updates have been made to the `_migrate_external_table`, `_migrate_dbfs_root_table`, and `_migrate_view` methods in the `table_migrate.py` file to include a new parameter `upgraded_from_ws` in the SQL commands used to alter tables, views, or managed tables. This parameter is used to store the source workspace ID in the migrated tables, indicating the migration origin. A new utility method `sql_alter_from` has been added to the `Table` class in `tables.py` to generate the SQL command with the new parameter. Additionally, a new class-level attribute `UPGRADED_FROM_WS_PARAM` has been added to the `Table` class in `tables.py` to indicate the source workspace. A new property `upgraded_from_workspace_id` has been added to migrated tables to store the source workspace ID. These changes resolve issue #899 and are tested through manual testing, unit tests, and integration tests. No new CLI commands, workflows, or tables have been added or modified, and there are no changes to user documentation. A sketch of the generated SQL follows below.
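
A sketch of what the generated statement could look like, assuming a simplified `Table` dataclass; the field names and SQL shape are illustrative, and UCX's actual `sql_alter_from` differs in detail:

```python
from dataclasses import dataclass

@dataclass
class Table:
    catalog: str
    database: str
    name: str

    UPGRADED_FROM_WS_PARAM = "upgraded_from_workspace_id"

    def sql_alter_from(self, target_table_key: str, ws_id: int) -> str:
        # Record the source table and the source workspace id as properties
        # on the migrated table, so the migration origin can be traced back.
        return (
            f"ALTER TABLE {target_table_key} SET TBLPROPERTIES "
            f"('upgraded_from' = '{self.catalog}.{self.database}.{self.name}', "
            f"'{self.UPGRADED_FROM_WS_PARAM}' = '{ws_id}');"
        )

print(Table("hive_metastore", "db", "t").sql_alter_from("main.db.t", 123456))
```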
- Added a command to create account level groups if they do not exist (#763). This commit introduces a new feature that enables the creation of account-level groups if they do not already exist in the account. A new command, `create-account-groups`, has been added to the `databricks labs ucx` tool, which crawls all workspaces in the account and creates account-level groups if a corresponding workspace-local group is not found. The feature supports various scenarios, including creating account-level groups that exist in some workspaces but not in others, and creating multiple account-level groups with the same name but different members. Several new methods have been added to the `account.py` file to support the new feature, and the `test_account.py` file has been updated with new tests to ensure the correct behavior of the `create_account_level_groups` method. Additionally, the `cli.py` file has been updated to include the new `create-account-groups` command. With these changes, users can easily manage account-level groups and ensure that they are consistent across all workspaces in the account, improving the overall user experience. The crawl-and-create idea is sketched below.
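
The crawl-and-create idea, reduced to a minimal sketch using the Databricks SDK account client; the real command also handles name conflicts, group members, and user prompts, all omitted here, and assumes `AccountClient.get_workspace_client` is available in the installed SDK version:

```python
from databricks.sdk import AccountClient

def create_account_groups(ac: AccountClient) -> None:
    # Account-level groups that already exist, keyed by display name.
    existing = {g.display_name for g in ac.groups.list()}
    for workspace in ac.workspaces.list():
        ws = ac.get_workspace_client(workspace)
        for group in ws.groups.list():
            # Create an account-level twin for every workspace-local group
            # that has no account-level counterpart yet.
            if group.display_name not in existing:
                ac.groups.create(display_name=group.display_name)
                existing.add(group.display_name)
```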
- Added assessment for the incompatible `RunSubmit` API usages (#849). In this release, the assessment functionality for incompatible `RunSubmit` API usages has been significantly enhanced through various changes. The `clusters.py` file has seen improvements in clarity and consistency with the renaming of the private methods `check_spark_conf` to `_check_spark_conf` and `check_cluster_failures` to `_check_cluster_failures`. The `_assess_clusters` method has been updated to call the renamed `_check_cluster_failures` method for thorough checks of cluster configurations, resulting in better assessment functionality. A new `SubmitRunsCrawler` class has been added to the `databricks.labs.ucx.assessment.jobs` module, implementing the `CrawlerBase`, `JobsMixin`, and `CheckClusterMixin` classes. This class crawls and assesses job runs based on their submitted runs, ensuring compatibility and identifying failure issues. Additionally, a new configuration attribute, `num_days_submit_runs_history`, has been introduced in the `WorkspaceConfig` class of the `config.py` module, controlling the number of days for which submission history of `RunSubmit` API calls is retained. Lastly, various new JSON files have been added for unit testing, assessing the `RunSubmit` API usages related to different scenarios like dbt task runs, Git source-based job runs, JAR file runs, and more. These tests will aid in identifying and addressing potential compatibility issues with the `RunSubmit` API.
- Added group members difference to the output of `validate-groups-membership` cli command (#995). The `validate-groups-membership` command has been updated to include a comparison of group memberships at both the account and workspace levels. This enhancement is implemented through the `validate_group_membership` function, which has been updated to calculate the difference in members between the two levels and display it in a new `group_members_difference` column. This allows for a more detailed analysis of group memberships and easily identifies any discrepancies between the account and workspace levels. The corresponding unit test file, `test_groups.py`, has been updated to include a new test case that verifies the calculation of the `group_members_difference` value. The functionality of the other commands remains unchanged. The new `group_members_difference` value is calculated as the difference in the number of members in the workspace group and the account group, with a positive value indicating more members in the workspace group and a negative value indicating more members in the account group (see the sketch below). The table template in the labs.yml file has also been updated to include the new column for the group membership difference.
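
In essence, the new column boils down to a signed count difference; a minimal sketch of that calculation:

```python
def group_members_difference(workspace_members: set[str], account_members: set[str]) -> int:
    # Positive: the workspace group has more members than the account group;
    # negative: the account group has more members than the workspace group.
    return len(workspace_members) - len(account_members)

assert group_members_difference({"a", "b", "c"}, {"a", "b"}) == 1
assert group_members_difference({"a"}, {"a", "b"}) == -1
```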
- Added handling for empty `directory_id` if managed identity encountered during the crawling of StoragePermissionMapping (#986). This PR adds a `type` field to the `StoragePermissionMapping` and `Principal` dataclasses to differentiate between service principals and managed identities, allowing `None` for the `directory_id` field if the principal is not a service principal. During the migration to UC storage credentials, managed identities are currently ignored. These changes improve handling of managed identities during the crawling of `StoragePermissionMapping`, prevent errors when creating storage credentials with managed identities, and address issue #339. The changes are tested through unit tests, manual testing, and integration tests, and only affect the `StoragePermissionMapping` class and related methods, without introducing new commands, workflows, or tables. An illustrative dataclass shape follows below.
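
An illustrative shape of the updated dataclass; the field set is simplified and the `type` values are examples, not the exact UCX definitions:

```python
from dataclasses import dataclass

@dataclass
class Principal:
    client_id: str
    display_name: str
    object_id: str
    # Distinguishes service principals from managed identities,
    # e.g. "Application" vs. "ManagedIdentity" (illustrative values).
    type: str
    # None when the principal is a managed identity rather than a service principal.
    directory_id: str | None = None

managed_identity = Principal("cid", "my-access-connector", "oid", "ManagedIdentity")
assert managed_identity.directory_id is None
```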
- Added migration for Azure Service Principals with secrets stored in Databricks Secret to UC Storage Credentials (#874). In this release, we have made significant updates to migrate Azure Service Principals with their secrets stored in Databricks Secret to UC Storage Credentials, enhancing security and management of storage access. The changes include: addition of a new `migrate_credentials` command in the `labs.yml` file to migrate credentials for storage access to UC storage credential; modification of `secrets.py` to handle the case where a secret has been removed from the backend and to log warning messages for secrets with invalid Base64 bytes; introduction of the `StorageCredentialManager` and `ServicePrincipalMigration` classes in `credentials.py` to manage Azure Service Principals and their associated client secrets, and to migrate them to UC Storage Credentials; addition of a new `directory_id` attribute in the `Principal` class and its associated dataclass in `resources.py` to store the directory ID for creating UC storage credentials using a service principal; creation of a new pytest fixture, `make_storage_credential_spn`, in `fixtures.py` to simplify writing tests requiring Databricks Storage Credentials with Azure Service Principal auth; and addition of a new test file for the Azure integration of the project, including new classes, methods, and test cases for testing the migration of Azure Service Principals to UC Storage Credentials. These improvements will ensure better security and management of storage access using Azure Service Principals, while providing more efficient and robust testing capabilities.
- Added permission migration support for feature tables and the root permissions for models and feature tables (#997). This commit introduces support for migration of permissions related to feature tables and sets root permissions for models and feature tables. New functions such as `feature_store_listing`, `feature_tables_root_page`, `models_root_page`, and `tokens_and_passwords` have been added to facilitate population of a workspace access page with necessary permissions information. The `factory` function in `manager.py` has been updated to include new listings for models' root page, feature tables' root page, and the feature store for enhanced management and access control of models and feature tables. New classes and methods have been implemented to handle permissions for these resources, utilizing the `GenericPermissionsSupport`, `AccessControlRequest`, and `MigratedGroup` classes. Additionally, new test methods have been included to verify feature tables listing functionality and root page listing functionality for feature tables and registered models. The test manager method has been updated to include `feature-tables` in the list of items to be checked for permissions, ensuring comprehensive testing of permission functionality related to these new feature tables.
- Added support for serving endpoints (#990). In this release, we have made significant enhancements to support serving endpoints in our open-source library. The `fixtures.py` file in the `databricks.labs.ucx.mixins` module has been updated with new classes and functions to create and manage serving endpoints, accompanied by integration tests to verify their functionality. We have added a new listing for serving endpoints in the assessment's permissions crawling, using the `ws.serving_endpoints.list` function and the `serving-endpoints` category (a sketch follows below, after the next entry). A new integration test, `test_endpoints`, has been added to verify that assessments now crawl permissions for serving endpoints. This test demonstrates the ability to migrate permissions from one group to another. The test suite has been updated to ensure the proper functioning of the new feature and improve the assessment of permissions for serving endpoints, ensuring compatibility with the updated `test_manager.py` file.
- Expanded end-user documentation with detailed descriptions for workflows and commands (#999). The Databricks Labs UCX project has been updated with several new features to assist in upgrading to Unity Catalog, including an assessment workflow that generates a detailed compatibility report for workspace entities, a group migration workflow for upgrading all Databricks workspace assets, and utility commands for managing cross-workspace installations. The Assessment Report now includes a more detailed summary of the assessment findings, table counts, database summaries, and external locations. Additional improvements include expanded workspace group migration to handle potential conflicts with locally scoped group names, enhanced documentation for external Hive Metastore integration, a new debugging notebook, and detailed descriptions of table upgrade considerations, data access permissions, external storage, and table crawler.
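
For the serving-endpoints entry above, a hedged sketch of what such a permissions listing can look like; the registration helpers in UCX's `factory` are more involved, and the function name below is illustrative:

```python
from databricks.sdk import WorkspaceClient

def serving_endpoints_listing(ws: WorkspaceClient):
    # Yield (object_id, request_object_type) pairs that a generic
    # permissions crawler can feed into the permissions API.
    for endpoint in ws.serving_endpoints.list():
        yield endpoint.id, "serving-endpoints"
```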
- Fixed `config.yml` upgrade from very old versions (#984). In this release, we've introduced enhancements to the configuration upgrading process for `config.yml` in our open-source library. We've replaced the previous `v1_migrate` class method with a new implementation that specifically handles migration from version 1. The new method retrieves the `groups` field, extracts the `selected` value, and assigns it to the `include_group_names` key in the configuration. The `backup_group_prefix` value from the `groups` field is assigned to the `renamed_group_prefix` key, and the `groups` field is removed, with the version number updated to 2 (sketched below). These changes simplify the code and improve readability, enabling users to upgrade smoothly from version 1 of the configuration. Furthermore, we've added new unit tests to the `test_config.py` file to ensure backward compatibility. Two new tests, `test_v1_migrate_zeroconf` and `test_v1_migrate_some_conf`, have been added, utilizing the `MockInstallation` class and loading the configuration using `WorkspaceConfig`. These tests enhance the robustness and reliability of the migration process for `config.yml`.
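
The version 1 to version 2 transformation described above can be summarized with a small sketch; this is a simplification of the actual migration method:

```python
def migrate_v1_to_v2(raw: dict) -> dict:
    # Pull the old nested "groups" field apart into the flat v2 keys.
    groups = raw.pop("groups", {})
    if "selected" in groups:
        raw["include_group_names"] = groups["selected"]
    if "backup_group_prefix" in groups:
        raw["renamed_group_prefix"] = groups["backup_group_prefix"]
    raw["version"] = 2
    return raw

conf = {"version": 1, "groups": {"selected": ["g1"], "backup_group_prefix": "db-temp-"}}
assert migrate_v1_to_v2(conf) == {
    "version": 2,
    "include_group_names": ["g1"],
    "renamed_group_prefix": "db-temp-",
}
```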
- Renamed columns in assessment SQL queries to use actual names, not aliases (#983). In this update, we have resolved an issue where aliases used for column references in SQL queries caused errors in certain setups by renaming them to use actual names. Specifically, for assessment SQL queries, we have modified the definition of the `is_delta` column to use the actual `table_format` name instead of the alias `format`. This change improves compatibility and enhances the reliability of query execution, ensuring consistent interpretation of column references across various setups and avoiding potential errors caused by aliases. This change does not introduce any new methods, but instead modifies existing functionality to use actual column names, ensuring a more reliable and consistent SQL query for the `05_0_all_tables` assessment.
- Updated groups permissions validation to use Table ACL cluster (#979). In this update, the `validate_groups_permissions` task has been modified to utilize the Table ACL cluster, as indicated by the inclusion of `job_cluster="tacl"`. This task is responsible for ensuring that all crawled permissions are accurately applied to the destination groups by calling the `permission_manager.apply_group_permissions` method during the migration state. This modification enhances the validation of group permissions by performing it on the Table ACL cluster, potentially improving performance or functionality. If you are implementing this project, it is crucial to understand the consequences of this change on your permissions validation process and adjust your workflows appropriately.
- Fixed `AnalysisException` in `crawl_tables` task by ignoring the database that is not found (#970).
- Fixed `Unknown: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException` in `crawl_grants` task by ignoring the database that is not found (#967).
- Fixed ruff config for ruff==2.0 (#969).
- Made groups integration tests less flaky (#965).
- Added secret detection logic to Azure service principal crawler (#950).
- Create storage credentials based on instance profiles and existing roles (#869).
- Enforced `protected-access` pylint rule (#956).
- Enforced `pylint` on unit and integration test code (#953).
- Enforcing `invalid-name` pylint rule (#957).
- Fixed `AzureResourcePermissions.load` to call `Installation.load` (#962).
- Fixed installer script to reuse an existing UCX Cluster policy if present (#964).
- More `pylint` tuning (#958).
- Refactor `workspace_client_mock` to have combined fixtures stored in separate JSON files (#955).
Dependency updates:
- Updated databricks-sdk requirement from ~=0.19.0 to ~=0.20.0 (#961).
- Added CLI Command `databricks labs ucx principal-prefix-access` (#949).
- Added a widget with all jobs to track migration progress (#940).
- Added legacy cluster types to the assessment result (#932).
- Cleanup of install documentation (#951, #947).
- Fixed `WorkspaceConfig` initialization for `DEBUG` notebook (#934).
- Fixed installer not opening config file during the installation (#945).
- Fixed groups in config file not considered for group migration job (#943).
- Fixed bug where `tenant_id` inside secret scope is not detected (#942).
- Added CLI Command `databricks labs ucx save-uc-compatible-roles` (#863).
- Added dashboard widget with table count by storage and format (#852).
- Added verification of group permissions (#841).
- Checking pipeline cluster config and cluster policy in `crawl_pipelines` task (#864).
- Created cluster policy (ucx-policy) to be used by all UCX compute. This may require customers to reinstall UCX. (#853).
- Skip scanning objects that were removed on the platform side since the last scan time, so that integration tests are less flaky (#922).
- Updated assessment documentation (#873).
Dependency updates:
- Updated databricks-sdk requirement from ~=0.18.0 to ~=0.19.0 (#930).
- Added "what" property for migration to scope down table migrations (#856).
- Added job count in the assessment dashboard (#858).
- Adopted `installation` package from `databricks-labs-blueprint` (#860).
- Debug logs to print only the first 96 bytes of SQL query by default, tunable by the `debug_truncate_bytes` SDK configuration property (#859); see the sketch after this list.
- Extract command codes and unify the checks for `spark_conf`, `cluster_policy`, `init_scripts` (#855).
- Improved installation failure with actionable message (#840).
- Improved validating groups membership cli command (#816).
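
The truncation behaviour from the debug-logs entry above amounts to clamping the logged statement to a byte budget; a minimal sketch, assuming simple ASCII statements:

```python
def truncate_for_debug(sql: str, debug_truncate_bytes: int = 96) -> str:
    # Only the first N bytes of a statement end up in debug logs.
    if len(sql) <= debug_truncate_bytes:
        return sql
    return sql[:debug_truncate_bytes] + "... (truncated)"

print(truncate_for_debug("SELECT a_long_column_list FROM hive_metastore.db.tbl WHERE " + "x = 1 AND " * 20 + "1 = 1"))
```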
Dependency updates:
- Updated databricks-labs-blueprint requirement from ~=0.1.0 to ~=0.2.4 (#867).
- Added `databricks labs ucx alias` command to create a view of tables from one schema/catalog in another schema/catalog (#837).
- Added `databricks labs ucx save-aws-iam-profiles` command to scan instance profiles, identify AWS S3 access, and save a CSV with permissions (#817).
- Added total view counts in the assessment dashboard (#834).
- Cleaned up `assess_jobs` and `assess_clusters` tasks in the `assessment` workflow to improve testing and reduce redundancy (#825).
- Added documentation for the assessment report (#806).
- Fixed escaping for SQL object names (#836).
Dependency updates:
- Updated databricks-sdk requirement from ~=0.17.0 to ~=0.18.0 (#832).
- Added `databricks labs ucx validate-groups-membership` command to validate groups to see if they have the same membership across account and workspace levels (#772).
- Added baseline for getting Azure Resource Role Assignments (#764).
- Added issue and pull request templates (#791).
- Added linked issues to PR template (#793).
- Added optional `debug_truncate_bytes` parameter to the config and extend the default log truncation limit (#782).
- Added support for crawling grants and applying Hive Metastore UDF ACLs (#812).
- Changed Python requirement from 3.10.6 to 3.10 (#805).
- Extend error handling of delta issues in crawlers and hive metastore (#795).
- Fixed `databricks labs ucx repair-run` command to execute correctly (#801).
- Fixed handling of `DELTASHARING` table format (#802).
- Fixed listing of workflows via CLI (#811).
- Fixed logger import path for DEBUG notebook (#792).
- Fixed move table command to delete table/view regardless of whether permissions are present, skipping corrupted tables when crawling table size and making existing tests more stable (#777).
- Fixed the issue of `databricks labs ucx installations` and `databricks labs ucx manual-workspace-info` (#814).
- Increase the unit test coverage for cli.py (#800).
- Mount Point crawler lists /Volume with four variations which is confusing (#779).
- Updated README.md to remove mention of deprecated install.sh (#781).
- Updated `bug` issue template (#797).
- Fixed writing log readme in multiprocess-safe way (#794).
- Added assessment step to estimate the size of DBFS root tables (#741).
- Added `TableMapping` functionality to table migrate (#752).
- Added `databricks labs ucx move` command to move tables and schemas between catalogs (#756).
- Added functionality to determine migration method based on DBFS Root (#759).
- Added `get_tables_to_migrate` functionality in the mapping module (#755).
- Added retry and rate limit to rename workspace group operation and corrected rate limit for reflecting account groups to workspace (#751).
- Adopted `databricks-labs-blueprint` library for common utilities to be reused in the other projects (#758).
- Converted `RuntimeBackend` query execution exceptions to SDK exceptions (#769).
- Fixed issue with missing users and temp groups after workspace-local groups migration and skip table when crawling table size if it does not exist anymore (#770).
- Improved error handling by not failing group rename step if a group was removed from account before reflecting it to workspace (#762).
- Improved error message inference from failed workflow runs (#753).
- Moved `TablesMigrate` to a separate module (#747).
- Reorganized assessment dashboard to increase readability (#738).
- Updated databricks-sdk requirement from ~=0.16.0 to ~=0.17.0 (#773).
- Verify metastore exists in current workspace (#735).
- Added `databricks labs ucx repair-run --step ...` CLI command for repair run of any failed workflows, like `assessment`, `migrate-groups` etc. (#724).
- Added `databricks labs ucx revert-migrated-table` command (#729).
- Allow specifying a group list when group match options are used (#725).
- Fixed installation issue when upgrading from an older version of the tool and improve logs (#740).
- Renamed summary panel from Failure Summary to Assessment Summary (#733).
- Retry internal error when getting permissions and update legacy table ACL documentation (#728).
- Speedup installer execution (#727).
- Added `databricks labs ucx create-table-mapping` and `databricks labs ucx manual-workspace-info` commands for CLI (#682).
- Added `databricks labs ucx ensure-assessment-run` to CLI commands (#708).
- Added `databricks labs ucx installations` command (#679).
- Added `databricks labs ucx skip --schema ... --table ...` command to mark table/schema for skipping in the table migration process (#680).
- Added `databricks labs ucx validate-external-locations` command for CLI (#715).
- Added capturing `ANY FILE` and `ANONYMOUS FUNCTION` grants (#653).
- Added cluster override and handle case of write protected DBFS (#610).
- Added cluster policy selector in the installer (#655).
- Added detailed UCX pre-requisite information to README.md (#689).
- Added interactive wizard for `databricks labs uninstall ucx` command (#657).
- Added more granular error retry logic (#704).
- Added parallel fetching of registered model identifiers to speed-up assessment workflow (#691).
- Added retry on workspace listing (#659).
- Added support for mapping workspace group to account group by prefix/suffix/regex/external id (#650).
- Changed cluster security mode from NONE to LEGACY_SINGLE_USER, as `crawl_tables` was failing when run on a non-UC Workspace in No Isolation mode with an "unable to access the config file" error (#661).
- Changed the fields of the table "Tables" to lower case (#684).
- Enabled integration tests for `EXTERNAL` table migrations (#677).
- Enforced `mypy` validation (#713).
- Filtering out inventory database from loading into tables and filtering out the same from grant detail view (#705).
- Fixed documentation for `reflect_account_groups_on_workspace` task and updated `CONTRIBUTING.md` guide (#654).
- Fixed secret scope apply task to raise ValueError (#683).
- Fixed legacy table ACL ownership migration and other integration testing issues (#722).
- Fixed some flaky integration tests (#700).
- New CLI command for workspace mapping (#678).
- Reduce server load for getting workspace groups and their members (#666).
- Throwing ManyError on migrate-groups tasks (#710).
- Updated installation documentation to use Databricks CLI (#686).
Dependency updates:
- Updated databricks-sdk requirement from ~=0.13.0 to ~=0.14.0 (#651).
- Updated databricks-sdk requirement from ~=0.14.0 to ~=0.15.0 (#687).
- Updated databricks-sdk requirement from ~=0.15.0 to ~=0.16.0 (#712).
- Added current version of UCX to task logs (#566).
- Fixed `'str' object has no attribute 'value'` failure on apply backup group permissions task (#574).
- Fixed `crawl_cluster` failure over custom runtimes (#602).
- Fixed `databricks labs ucx workflows` command (#608).
- Fixed problematic integration test fixture `make_ucx_group` (#613).
- Fixed internal API request retry logic by relying on concrete exception types (#637).
- Fixed `tables.scala` notebook to read inventory database from `~/.ucx/config.yml` file (#614).
- Introduced `StaticTablesCrawler` for integration tests (#632).
- Reduced runtime of `test_set_owner_permission` from 15 minutes to 44 seconds (#636).
- Updated `LICENSE` (#643).
- Updated documentation (#611, #646).
Breaking changes (existing installations need to remove the `ucx` database, reinstall UCX and re-run assessment jobs)
- Fixed external locations widget to return hostname for `jdbc:`-sourced tables (#621).
- Added a logo for UCX (#605).
- Check if `hatch` is already installed, and install only if it isn't installed yet (#603).
- Fixed installation check for git pre-release versions (#600).
- Temporarily remove SQL warehouse requirement from `labs.yml` (#604).
Breaking changes (existing installations need to reinstall UCX and re-run assessment jobs)
- Switched local group migration component to rename groups instead of creating backup groups (#450).
- Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together (#512).
New features
- Added support for the experimental Databricks CLI launcher (#517).
- Added support for external Hive Metastores including AWS Glue (#400).
- Added more views to assessment dashboard (#474).
- Added rate limit for creating backup group to increase stability (#500).
- Added deduplication for mount point list (#569).
- Added documentation to describe interaction with external Hive Metastores (#473).
- Added failure injection for job failure message propagation (#591).
- Added uniqueness in the new warehouse name to avoid conflicts on installation (#542).
- Added a global init script to collect Hive Metastore lineage (#513).
- Added retry set/update permissions when possible and assess the changes in the workspace (#519).
- Use `~/.ucx/state.json` to store the state of both dashboards and jobs (#561).
Bug fixes
- Fixed handling for `OWN` table permissions (#571).
- Fixed handling of keys with and without values (#514).
- Fixed integration test failures related to concurrent group delete (#584).
- Fixed issue with workspace listing process on None type `object_type` (#481).
- Fixed missing group entitlement migration bug (#583).
- Fixed entitlement application for account-level groups (#529).
- Fixed assessment throwing an error when the owner of an object is empty (#485).
- Fixed installer to migrate between different configuration file versions (#596).
- Fixed cluster policy crawler to be aware of deleted policies (#486).
- Improved error message for not null constraints violated (#532).
- Improved integration test resiliency (#597, #594, #586).
- Introduced Safer access to workspace objects' properties. (#530).
- Mitigated permissions loss in Table ACLs by running appliers with single thread (#518).
- Running apply permission task before assessment should display message (#487).
- Split integration tests from blocking the merge queue (#496).
- Support more than one dashboard per step (#472).
- Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0 (#505).
- Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0 (#575).
- Added `make install-dev` and a stronger `make clean` for easier dev on-boarding and release upgrades (#458).
- Added failure summary in the assessment dashboard (#455).
- Added test for checking grants in default schema (#470).
- Added unit tests for generic permissions (#457).
- Enabled integration tests via OIDC for every pull request (#378).
- Added check if permissions are up to date (#421).
- Fixed casing in `all_tables.sql` query (#464).
- Fixed missed scans for empty databases and views in `crawl_grants` (#469).
- Improved logging colors for dark terminal backgrounds (#468).
- Improved local group migration state handling and made log files flush every 10 minutes (#449).
- Moved workspace listing as a separate task for an assessment workflow (#437).
- Removed rate limit for get or create backup group to speed up the prepare environment (#453).
- Updated databricks-sdk requirement from ~=0.10.0 to ~=0.11.0 (#448).
- Added exception handling for secret scope not found. (#418).
- Added a crawler for creating an inventory of Azure Service Principals (#326).
- Added check if account group already exists during failure recovery (#446).
- Added checking for index out of range. (#429).
- Added hyperlink to UCX releases in the main readme (#408).
- Added integration test to check backup groups get deleted (#387).
- Added logging of errors during threadpool operations. (#376).
- Added recovery mode for workspace-local groups from temporary groups (#435).
- Added support for migrating Legacy Table ACLs from workspace-local to account-level groups (#412).
- Added detection for installations of unreleased versions (#399).
- Decoupled `PermissionsManager` from `GroupMigrationToolkit` (#407).
- Enabled debug logging for every job task run through a file, which is accessible from both workspace UI and Databricks CLI (#426).
- Ensured that table exists, even when crawlers produce zero records (#373).
- Extended test suite for HMS->HMS TACL migration (#439).
- Fixed handling of secret scope responses (#431).
- Fixed `crawl_permissions` task to respect `workspace_start_path` config (#444).
- Fixed broken logic in `parallel` module and applied hardened error handling design for parallel code (#405).
- Fixed codecov.io reporting (#403).
- Fixed integration tests for crawlers (#379).
- Improved README.py and logging messages (#433).
- Improved cleanup for workspace backup groups by adding more retries on errors (#375).
- Improved dashboard queries to show unsupported storage types. (#398).
- Improved documentation for readme notebook (#257).
- Improved test coverage for installer (#371).
- Introduced deterministic `env_or_skip` fixture for integration tests (#396); a sketch follows below.
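
The fixture's behaviour is a small, deterministic wrapper around `os.environ`; a sketch of the common pattern, which the UCX version may differ from in detail:

```python
import os
import pytest

@pytest.fixture
def env_or_skip():
    def inner(name: str) -> str:
        # Deterministic: either the variable is set and its value is returned,
        # or the test is skipped with a clear reason.
        if name not in os.environ:
            pytest.skip(f"{name} environment variable is not set")
        return os.environ[name]
    return inner
```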
- Made HMS & UC fixtures return `CatalogInfo`, `SchemaInfo`, and `TableInfo` (#409).
- Merge `workspace_access.Crawler` and `workspace_access.Applier` interfaces to `workspace_access.AclSupport` (#436).
- Moved examples to docs (#404).
- Properly isolated integration testing for workflows on an existing shared cluster (#414).
- Removed thread pool for any IAM Group removals and additions (#394).
- Replace plus char with minus in version tag for GCP dev installation of UCX (#420).
- Run integration tests on shared clusters for a faster devloop (#397).
- Show difference between serverless and PRO warehouses during installation (#385).
- Split `migrate-groups` workflow into three different stages for reliability (#442).
- Use groups instead of usernames in code owners file (#389).
- Added `inventory_database` name check during installation (#275).
- Added a column to `$inventory.tables` to specify if a table might have been synchronised to Unity Catalog already or not (#306).
- Added a migration state to skip already migrated tables (#325).
- Fixed appending to tables by adding filtering of `None` rows (#356).
- Fixed handling of missing but linked cluster policies (#361).
- Ignore errors for Redash widgets and queries redeployment during installation (#367).
- Remove exception and added proper logging for groups in the list that… (#357).
- Skip group migration when no groups are available after preparation step. (#363).
- Update databricks-sdk requirement from ~=0.9.0 to ~=0.10.0 (#362).
- Added retrieving for all account-level groups with matching names to workspace-level groups in case no explicit configuration (#277).
- Added crawler for Azure Service principals used for direct storage access (#305).
- Added more SQL queries to the assessment step dashboard (#269).
- Added filtering out for job clusters in the clusters crawler (#298).
- Added recording errors from `crawl_tables` step in `$inventory.table_failures` table and display counter on the dashboard (#300).
- Added comprehensive introduction user manual (#273).
- Added interactive tutorial for local group migration readme (#291).
- Added tutorial links to the landing page of documentation (#290).
- Added (internal) support for account-level configuration and multi-cloud workspace list (#264).
- Improved order of tasks in the README notebook (#286).
- Improved installation script to run in a Windows Git Bash terminal (#282).
- Improved installation script by setting log level to uppercase by default (#271).
- Improved installation finish messages within installer script (#267).
- Improved automation for `MANAGED` table migration and continued building tables migration component (#295).
- Fixed debug notebook code with refactored package structure (#250) (#265).
- Fixed replacement of custom configured database to replicate in the report for external locations (#296).
- Removed redundant `notebooks` top-level folder (#263).
- Split checking for test failures and linting errors into independent GitHub Actions checks (#287).
- Verify query metadata for assessment dashboards during unit tests (#294).
- Added batched iteration for `INSERT INTO` queries in `StatementExecutionBackend` with default `max_records_per_batch=1000` (#237); the batching pattern is sketched below.
- Added crawler for mount points (#209).
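
The batching entry above boils down to chunking rows before rendering each `INSERT INTO` statement; a minimal sketch, with names simplified from the actual `StatementExecutionBackend`:

```python
from itertools import islice
from collections.abc import Iterable, Iterator

def batched(rows: Iterable, max_records_per_batch: int = 1000) -> Iterator[list]:
    it = iter(rows)
    while batch := list(islice(it, max_records_per_batch)):
        yield batch  # each batch becomes one INSERT INTO ... VALUES statement

for batch in batched(range(2500)):
    values = ", ".join(f"({v})" for v in batch)
    statement = f"INSERT INTO demo_table VALUES {values}"  # executed per batch
```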
- Added crawlers for compatibility of jobs and clusters, along with basic recommendations for external locations (#244).
- Added safe return on grants (#246).
- Added ability to specify empty group filter in the installer script (#216) (#217).
- Added ability to install application by multiple different users on the same workspace (#235).
- Added dashboard creation on installation and a requirement for `warehouse_id` in config, so that the assessment dashboards are refreshed automatically after job runs (#214).
- Added reliance on rate limiting from Databricks SDK for listing workspace (#258).
- Fixed errors in corner cases where Azure Service Principal Credentials were not available in Spark context (#254).
- Fixed `DESCRIBE TABLE` throwing errors when listing Legacy Table ACLs (#238).
- Fixed `file already exists` error in the installer script (#219) (#222).
- Fixed `guess_external_locations` failure with `AttributeError: as_dict` and added an integration test (#259).
- Fixed error handling edge cases in `crawl_tables` task (#243) (#251).
- Fixed `crawl_permissions` task failure on folder names containing a forward slash (#234).
- Improved `README` notebook documentation (#260, #228, #252, #223, #225).
- Removed redundant `.python-version` file (#221).
- Removed discovery of account groups from `crawl_permissions` task (#240).
- Updated databricks-sdk requirement from ~=0.8.0 to ~=0.9.0 (#245).
Features
- Added interactive installation wizard (#184, #117).
- Added schedule of jobs as part of `install.sh` flow and created some documentation (#187).
- Added debug notebook companion to troubleshoot the installation (#191).
- Added support for Hive Metastore Table ACLs inventory from all databases (#78, #122, #151).
- Created `$inventory.tables` from Scala notebook (#207).
- Added local group migration support for ML-related objects (#56).
- Added local group migration support for SQL warehouses (#57).
- Added local group migration support for all compute-related resources (#53).
- Added local group migration support for security-related objects (#58).
- Added local group migration support for workflows (#54).
- Added local group migration support for workspace-level objects (#59).
- Added local group migration support for dashboards, queries, and alerts (#144).
Stability
- Added `codecov.io` publishing (#204).
- Added more tests to group.py (#148).
- Added tests for group state (#133).
- Added tests for inventorizer and typed (#125).
- Added tests for WorkspaceListing (#110).
- Added `make_*_permissions` fixtures (#159).
- Added reusable fixtures module (#119).
- Added testing for permissions (#126).
- Added inventory table manager tests (#153).
- Added `product_info` to track as SDK integration (#76).
- Added failsafe permission get operations (#65).
- Always install the latest `pip` version in `./install.sh` (#201).
- Always store inventory in `hive_metastore` and make only `inventory_database` configurable (#178).
- Changed default logging level from `TRACE` to `DEBUG` log level (#124).
- Consistently use `WorkspaceClient` from `databricks.sdk` (#120).
- Convert pipeline code to use fixtures (#166).
- Exclude mixins from coverage (#130).
- Fixed codecov.io reporting (#212).
- Fixed configuration path in job task install code (#210).
- Fixed a bug with dependency definitions (#70).
- Fixed failing `test_jobs` (#140).
- Fixed the issues with experiment listing (#64).
- Fixed integration testing configuration (#77).
- Make project runnable on nightly testing infrastructure (#75).
- Migrated cluster policies to new fixtures (#174).
- Migrated clusters to the new fixture framework (#162).
- Migrated instance pool to the new fixture framework (#161).
- Migrated to `databricks.labs.ucx` package (#90).
- Migrated token authorization to new fixtures (#175).
- Migrated experiment fixture to standard one (#168).
- Migrated jobs test to fixture based one. (#167).
- Migrated model fixture to the standard fixtures (#169).
- Migrated warehouse fixture to standard one (#170).
- Organise modules by domain (#197).
- Prefetch all account-level and workspace-level groups (#192).
- Programmatically create a dashboard (#121).
- Properly integrate Python `logging` facility (#118).
- Refactored code to use Databricks SDK for Python (#27).
- Refactored configuration and remove global provider state (#71).
- Removed `pydantic` dependency (#138).
- Removed redundant `pyspark`, `databricks-connect`, `delta-spark`, and `pandas` dependencies (#193).
- Removed redundant `typer[all]` dependency and its usages (#194).
- Renamed `MigrationGroupsProvider` to `GroupMigrationState` (#81).
- Replaced `ratelimit` and `tenacity` dependencies with simpler implementations (#195).
- Reorganised integration tests to align more with unit tests (#206).
- Run `build` workflow also on `main` branch (#211).
- Run integration test with a single group (#152).
- Simplify `SqlBackend` and table creation logic (#203).
- Updated `migration_config.yml` (#179).
- Updated legal information (#196).
- Use `make_secret_scope` fixture (#163).
- Use fixture factory for `make_table`, `make_schema`, and `make_catalog` (#189).
- Use new fixtures for notebooks and folders (#176).
- Validate toolkit notebook test (#183).
Contributing
- Added a note on external dependencies (#139).
- Added ability to run SQL queries on Spark when in Databricks Runtime (#108).
- Added some ground rules for contributing (#82).
- Added contributing instructions link from main readme (#109).
- Added info about environment refreshes (#155).
- Clarified documentation (#137).
- Enabled merge queue (#146).
- Improved `CONTRIBUTING.md` guide (#135, #145).