Releases: databrickslabs/ucx
v0.57.0
- Convert UCX job ids to `int` before passing to `JobsCrawler` (#3816). In this release, we have addressed issue #3722 and improved the robustness of the library by modifying the `jobs_crawler` method to handle job IDs more effectively. Previously, job IDs were passed directly to the `exclude_job_ids` parameter, which could cause issues if they were not integers. The `jobs_crawler` method now converts all job IDs to integers using a list comprehension before passing them on, ensuring that only valid integer job IDs are used and thereby enhancing the reliability of the method (see the first sketch at the end of this section). The commit includes a manual test to confirm the correct behavior of this modification.
- Exclude UCX jobs from crawling (#3733). In this release, we have modified the `JobsCrawler` and the existing `assessment` workflow to exclude UCX's own jobs from crawling, avoiding confusion for users when they appear in assessment reports. This change addresses issues #3656 and #3722, and is a follow-up to #3732. We have also incorporated updates from pull requests #3767 and #3759 to improve integration tests and linting. Additionally, a retry mechanism has been added to wait for grants to exist before crawling, addressing issue #3758. The changes include unit and integration tests to ensure correctness. A new `exclude_job_ids` parameter has been added to the `JobsCrawler` constructor and is initialized with the list of UCX job IDs, ensuring that UCX jobs are not included in the assessment report. The `_list_jobs` method now excludes jobs based on the provided `exclude_job_ids` and `include_job_ids` arguments, and the `_crawl` method uses `_list_jobs` to list the jobs to be crawled. The `_assess_jobs` method has been updated to take the excluded job IDs into account. The `test_grant_detail` file, an integration test for the Hive Metastore grants functionality, has been updated to retry until grants exist before crawling and to check that the SELECT permission on ANY FILE is present in the grants.
- Let `WorkflowLinter.refresh_report` lint jobs from `JobsCrawler` (#3732). In this release, the `WorkflowLinter.refresh_report` method has been updated to lint jobs from the `JobsCrawler` class, ensuring that only jobs within the scope of the crawler are processed. This change resolves issue #3662 and progresses issue #3722. The workflow linting code, the `assessment` workflow, and the `JobsCrawler` class have been modified. The `JobsCrawler` class now includes a `snapshot` method, which the `WorkflowLinter.refresh_report` method uses to retrieve the necessary data about jobs, and the `WorkflowLinter` constructor now takes an instance of `JobsCrawler`, allowing for more targeted linting of jobs. Unit and integration tests have been updated correspondingly; the integration test for workflows now verifies that all rows returned from a query to the `workflow_problems` table have a valid `path` field. The introduction of `JobsCrawler` here enables more efficient and precise linting of jobs, improving the overall accuracy of workflow assessment.
- Let dashboard name adhere to naming convention (#3789). In this release, the naming convention for dashboard names in the `ucx` library has been enforced, restricting them to alphanumeric characters, hyphens, and underscores; any non-conforming characters in existing dashboard names are replaced with hyphens or underscores (see the second sketch at the end of this section). This change addresses several issues (#3761 through #3788). A temporary fix, marked with a TODO comment, has been added to the `_create_dashboard` method to ensure newly created dashboard names adhere to the new naming convention. This release also resolves a test failure in a specific GitHub Actions run and addresses a total of 29 issues. The specifics of the modification made to the `databricks labs install ucx` command and the changes to existing functionality are not detailed, making it difficult to assess their scope. The commit includes the deletion of a file called `02_0_owner.filter.yml`, and all changes have been manually tested. For future reference, it would be helpful to include more information about the changes made, their impact, and the reason for deleting the specified file.
- Partial revert "Let dashboard name adhere to naming convention" (#3794). In this release, we have partially reverted a previous change to the migration progress dashboard, reintroducing the owner filter. This change was made in response to feedback from users who found the previous modification to the dashboard less intuitive. The new owner filter has been defined in a new file, `02_0_owner.filter.yml`, which includes the title, column name, type, and width of the filter. To ensure proper functionality, this change requires a release of lsql after merging. The change has been thoroughly tested to guarantee correct operation and the best possible user experience.
- Partial revert "Let dashboard name adhere to naming convention" (#3795). In this release, we have partially reverted a previous change that enforced a naming convention for dashboard names, allowing the use of special characters such as spaces and brackets again. The `_create_dashboard` method in the `install.py` file and the `_name` method in the `mixins.py` file have been updated to reflect this change, affecting the migration progress dashboard. The `display_name` attribute of the `metadata` object has been updated to use the original format, which may include special characters, and the `reference` variable has been updated accordingly. The functions `created_job_tasks` and `created_job` have been updated to use the new naming convention when retrieving installation jobs with specific names. These changes have been manually tested and verified to work correctly after the reversion. This change is related to issues #3799 and #3789, and reverts commit 048bc8f.
- Put back dashboard names (#3808). In the lsql release v0.16.0, the naming convention for dashboards has been updated to support non-alphanumeric characters in dashboard names. This change modifies the `_create_dashboard` function in `install.py` and the `_name` method in `mixins.py` to create dashboard names in a format like `[UCX] assessment (Main)`, which includes parent and child folder names. This update addresses issues reported in tickets #3797 and #3790, and partially reverses previous changes made in commits 4017a25 and 834ef14. The functionality of other methods remains unchanged. With this release, the `created_job_tasks` and `created_job` functions now accept dashboard names with non-alphanumeric characters as input.
- Updated databricks-labs-lsql requirement from <0.15,>=0.14.0 to >=0.14.0,<0.17 (#3801). In this update, we have widened the allowed version range for the `databricks-labs-lsql` package from `>=0.14.0,<0.15` to `>=0.14.0,<0.17`. This change allows the use of the latest version of the package, which includes various bug fixes and dependency updates. The package is used in the acceptance tests that run as part of the CI/CD pipeline; with this update, those tests can be executed against the most recent version of the package.
- Updated databricks-sdk requirement from <0.42,>=0.40 to >=0.44,<0.45 (#3686). In this release, we have updated the version requirement for the `databricks-sdk` package to be greater than or equal to 0.44.0 and less than 0.45.0. This update allows the use of the latest version of the `databricks-sdk`, which includes new methods, fields, and bug fixes. For instance, the `get_message_query_result_by_attachment` method has been added for `w.genie.workspace_level_service`, and several fields such as `review_state`, `reviews`, and `runner_collaborators` have been removed from the `databricks.sdk.service.clean_rooms.CleanRoomAssetNotebook` object. Additionally, the `securable_kind` field has been removed from various objects such as `CatalogInfo` and `ConnectionInfo`. We recommend thoroughly testing this update to ensure compatibility with your project. The release notes for versions 0.44.0 and 0.43.0 can be found in the commit history. Please note that there are several backward-incompatible changes listed in the changelog for bot...
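
For illustration, here is a minimal, self-contained sketch of the job-ID coercion described in #3816 and the `exclude_job_ids` exclusion from #3733. The `JobsCrawler` stand-in below is hypothetical; the real class in `databricks.labs.ucx.assessment.jobs` takes more dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class JobsCrawler:
    """Hypothetical stand-in for the real crawler, which takes more arguments."""

    exclude_job_ids: list[int] = field(default_factory=list)

    def should_skip(self, job_id: int) -> bool:
        # UCX's own jobs are excluded so they do not show up in assessment reports.
        return job_id in self.exclude_job_ids


def jobs_crawler(exclude_job_ids: list) -> JobsCrawler:
    # Job IDs loaded from install state or CLI flags may arrive as strings,
    # so coerce every ID to int before handing the list to the crawler (#3816).
    return JobsCrawler(exclude_job_ids=[int(job_id) for job_id in exclude_job_ids])


crawler = jobs_crawler(["123", 456])
assert crawler.should_skip(123) and not crawler.should_skip(789)
```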
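
And a sketch of the name sanitization described in #3789. The exact regex and the helper name are assumptions; the changelog only states that non-conforming characters are replaced with hyphens or underscores.

```python
import re


def safe_dashboard_name(name: str) -> str:
    # Keep alphanumerics, hyphens and underscores; replace everything else
    # with a hyphen so the name satisfies the naming convention.
    return re.sub(r"[^a-zA-Z0-9_-]", "-", name)


print(safe_dashboard_name("[UCX] assessment (Main)"))  # -UCX--assessment--Main-
```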
v0.56.0
- Added documentation to use Delta Live Tables migration (#3587). In this documentation update, we introduce a new section for migrating Delta Live Table pipelines to Unity Catalog as part of the migration process. This workflow allows the original and cloned pipelines to run independently after the cloned pipeline reaches the `RUNNING` state. The update includes an example of stopping and renaming an existing HMS DLT pipeline and creating a new cloned pipeline. Known issues and limitations are also outlined, such as supported streaming sources, maintenance pausing, and querying by timestamp. To streamline the migration process, the `migrate-dlt-pipelines` command is introduced with optional parameters for including or excluding specific pipeline IDs. This feature is intended for developers and administrators managing data pipelines and handling table aliasing issues. Relevant user documentation has been added and the changes have been manually tested.
- Added support for MSSQL and POSTGRESQL to HMS Federation (#3701). The library now supports Microsoft SQL Server (MSSQL) and PostgreSQL databases in the Hive Metastore Federation (HMS Federation) feature. This update introduces classes for handling external Hive Metastore instances and their versions, and refactors a regex pattern to better support various JDBC URL formats. A new `supported_databases_port` class variable maps supported databases to their default ports, allowing the code to handle SQL Server's distinct default port, and a new `supported_hms_versions` class variable outlines the supported Hive Metastore versions. The `_external_hms` method is updated to extract HMS version information more accurately, and the `_split_jdbc_url` method is refactored for better URL format compatibility and parameter extraction (see the first sketch at the end of this section). The test file `test_federation.py` has been updated with new unit tests for external catalog creation with MSSQL and PostgreSQL, further enhancing compatibility with various databases and expanding HMS Federation's capabilities.
- Added the CLI command for migrating DLT pipelines (#3579). A new CLI command, `migrate-dlt-pipelines`, has been added for migrating DLT pipelines from HMS to UC using the DLT Migration API. This command allows users to include or exclude specific pipeline IDs during migration using the `--include-pipeline-ids` and `--exclude-pipeline-ids` flags, respectively. The change impacts the `PipelinesMigrator` class, which has been updated to accept and use these new parameters. Currently, there is no information available about testing, but the changes are expected to be manually tested and accompanied by corresponding unit and integration tests in the future. The changes are isolated to the `PipelinesMigrator` class and related functionality, with no impact on existing methods or functionality.
- Addressed Bug with Dashboard migration (#3663). In this release, the `_crawl` method in `dashboards.py` has been enhanced to exclude SDK dashboards that lack IDs during the dashboard migration process, avoiding unnecessary processing of incomplete dashboards. Additionally, the `_list_dashboards` method now checks for dashboards with no IDs while iterating through the `dashboards_iterator`; if one is found, the method fetches the dashboard details using the `_get_dashboard` method and adds them to the `dashboards` list, ensuring proper processing. Furthermore, a fix for issue #3663 has been implemented for the `RedashDashboardCrawler` class in `assessment/test_dashboards.py`: a `get` side effect has been added to the `WorkspaceClient` mock's `dashboards` attribute, enabling retrieval of individual dashboard objects by their IDs, so that the `RedashDashboardCrawler` can correctly retrieve and process dashboard objects from the mock without errors due to missing dashboard objects.
- Broaden safe read text caught exception scope (#3705). In this release, the `safe_read_text` function has been enhanced to handle a broader range of exceptions that may occur while reading a text file, including `OSError` and `UnicodeError`, making it more robust and safe. The function previously caught only the specific exceptions `FileNotFoundError`, `UnicodeDecodeError`, and `PermissionError` (see the second sketch at the end of this section). The codebase has been improved with updated unit tests to ensure the new functionality works correctly, and the linting parts of the code have also been updated, enhancing the readability and maintainability of the project. A new method, `safe_read_text`, has been added to the `source_code` module, with several new test cases designed to ensure that the method handles edge cases correctly, such as when the file does not exist, when the path is a directory, or when an `OSError` occurs.
- Case sensitive/insensitive table validation (#3580). In this release, the library has been updated to enable more flexible and customizable metadata comparison for tables. A case sensitivity flag has been introduced for metadata comparison, allowing column name case to be considered or ignored during validation. The `TableMetadataRetriever` abstract base class now includes a new parameter `column_name_transformer` in the `get_metadata` method, a callable that transforms column names as needed for comparison, and a new `case_sensitive` parameter has been added to the `StandardSchemaComparator` constructor to determine whether column names should be compared case sensitively (see the third sketch at the end of this section). A new parametrized test function `test_schema_comparison_case` ensures that this functionality works as expected. These changes give users more control over the metadata comparison process and improve handling of cases where column names in the source and target tables differ only in case.
- Catch `AttributeError` in `InferredValue._safe_infer_internal` (#3684). In this release, we have updated the `_safe_infer_internal` method in the `InferredValue` class to catch `AttributeError`. This change addresses an issue in the Astroid library reported in their GitHub repository (pylint-dev/astroid#2683) and resolves issue #3659 in our project. By handling `AttributeError` during the inference process, the code is more robust and safer: when the exception occurs, an error message is logged at debug level, and the method yields the `Uninferable` sentinel value to indicate that inference failed for the node. This enhancement strengthens the source code linting through value inference in our library.
- Document to run `validate-groups-membership` before groups migration, not after (#3631). In this release, we have updated the order of executing the `validate-groups-membership` command in the group migration process. Previously, the command was recommended to be run after the groups migration; it should now be executed before the migration. This ensures that the groups have the correct membership and that the numbers of groups and users in the workspace and account match before migration, providing an extra level of safety. Additionally, we have updated the `remove-workspace-local-backup-groups` command to remove workspace-level backup groups and their permissions only after confirming the successful migration of all groups, and we have corrected the spelling of the `validate-group-membership` command to `validate-groups-membership` in a documentation file. This release is aimed at software engineers who are adopting the project and looking to migrate their groups to the account level.
- Extend code migration progress documentation (#3588). In this documentation update, we have added two new sections, "Code Migration" and "Final details", to the migration process documentation. The "Code Migration" section provides a detailed walkthrough of the steps to migrate code after completing table migration and data reconciliation, including using the linter to investigate compatibility issues in linted workspace resources. The linter advices provide codes and messages on detected issues and resolution methods. The migrated code can then be prioritized and tracked using the `migration-progress` dashboard, and migrated using the `migrate-` commands. The "Final details" section outlines the steps to take once code migration is complete, including running the `cluster-remap` command to remap clusters to be Unity Catalog compatible. This update resolves issue #2231 and includes updated user documentation, with new methods for linting and migrating local code, managing dashboard migrations, and syncing workspace information. Additional commands for creating and validating table mappings, migrating locations, and assigning metastores are also included, with the aim of improving the code migration process through more detailed documentation and new commands for managing the migration.
- Fixed Skip/Unskip sch...
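
Below is a rough sketch of the kind of JDBC URL splitting and default-port mapping described in #3701. The regex, function name, and URL shapes are assumptions; the real `_split_jdbc_url` lives in the HMS Federation code and may differ.

```python
import re

# Assumed URL shapes; the real pattern in ucx may differ.
JDBC_URL = re.compile(
    r"jdbc:(?P<scheme>mysql|sqlserver|postgresql)://"
    r"(?P<host>[^:/;]+)(?::(?P<port>\d+))?[/;]?(?P<rest>.*)"
)

# SQL Server has a distinct default port, hence a per-database mapping
# along the lines of the supported_databases_port class variable.
DEFAULT_PORTS = {"mysql": "3306", "postgresql": "5432", "sqlserver": "1433"}


def split_jdbc_url(url: str) -> tuple[str, str, str]:
    match = JDBC_URL.match(url)
    if not match:
        raise ValueError(f"unsupported JDBC URL: {url}")
    scheme, host, port = match.group("scheme", "host", "port")
    return scheme, host, port or DEFAULT_PORTS[scheme]


print(split_jdbc_url("jdbc:postgresql://db.example.com/metastore"))
# ('postgresql', 'db.example.com', '5432')
```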
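
The broadened exception scope from #3705 can be pictured like this. The body is a sketch, but the key point from the changelog is real: `OSError` subsumes `FileNotFoundError` and `PermissionError`, and `UnicodeError` subsumes `UnicodeDecodeError`, so the two base classes cover everything the old clauses caught and more.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def safe_read_text(path: Path) -> str | None:
    """Read a text file, returning None instead of raising on I/O or decoding errors."""
    try:
        return path.read_text(encoding="utf-8")
    except (OSError, UnicodeError) as e:  # covers missing files, directories, bad encodings
        logger.warning(f"Could not read file: {path}", exc_info=e)
        return None
```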
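
And a toy version of the case handling from #3580. The parameter names follow the changelog (`case_sensitive`, `column_name_transformer`); the comparison logic around them is assumed.

```python
from collections.abc import Callable


def columns_match(
    source: list[str],
    target: list[str],
    *,
    case_sensitive: bool = True,
    column_name_transformer: Callable[[str], str] | None = None,
) -> bool:
    # The transformer normalises column names before comparison; the identity
    # transform keeps the check strict, str.lower makes it case-insensitive.
    transform = column_name_transformer or (str if case_sensitive else str.lower)
    return [transform(c) for c in source] == [transform(c) for c in target]


assert columns_match(["ID", "Name"], ["id", "name"], case_sensitive=False)
assert not columns_match(["ID", "Name"], ["id", "name"])
```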
v0.55.0
- Introducing UCX docs! (#3458). In this release, we introduce the new documentation site for UCX, available at https://databrickslabs.github.io/ucx/
- Hosted Runner for release (#3532). In this release, we have improved the security and control of the release job by moving the `release.yml` workflow to a protected, hosted runner group labeled "linux-ubuntu-latest". The job's environment remains set to "release," and it retains the same authentication and artifact-signing permissions as before the move, ensuring a seamless transition while improving the reliability of the release process.
Contributors: @sundarshankar89, @renardeinside
v0.54.0
- Implement disposition field in SQL backend (#3477). This commit adds a `query_statement_disposition` configuration option for the SQL backend in the UCX tool, allowing users to specify the disposition of SQL statements during assessment-results export and preventing failures when dealing with large workspaces and a large number of findings. The new configuration option is added to the `config.yml` file and used by the `SqlBackend` definition. The `databricks labs install ucx` and `databricks labs ucx export-assessment` commands have been modified to support this new functionality. A new `Disposition` enum has been added to the `databricks.sdk.service.sql` module. This change resolves issue #3447 and is related to pull request #3455. The functionality has been manually tested.
- AWS role issue with external locations pointing to the root of a storage account (#3510). The `AWSResources` class in the `aws.py` file has been updated to enhance the regular expression pattern for matching S3 bucket names, now including an optional group for a trailing slash and any subsequent characters. This allows recognition of external locations pointing to the root of a storage account, addressing issue #3505 (see the first sketch at the end of this section). The `access.py` file within the AWS module has also been updated, introducing a new `path` variable and updating a for-loop condition to accurately identify missing paths in external locations referencing the root of a storage account. New unit tests have been added to `tests/unit/aws/test_access.py`, including a `test_uc_roles_create_all_roles` method that checks the creation of all possible UC roles when none exist, covering external locations with and without folders. Additionally, the `backend` fixture has been updated to include a new external location `s3://BUCKET4`, and various tests have been updated to incorporate this location and handle errors appropriately.
- Added assert to make sure installation is finished before re-installation (#3546). In this release, we have added an assertion to ensure that the installation process is completed before attempting to reinstall, addressing a previous issue where the reinstallation started before the first installation was finished, causing a warning not to be raised and resulting in a test failure. We have introduced a new function `wait_for_installation_to_finish()`, which retries loading the installation if it is not found, with a timeout of 2 minutes (see the second sketch at the end of this section). This function is used in the `test_compare_remote_local_install_versions` test to ensure that the installation is finished before proceeding. Furthermore, we have extracted the warning message to a variable `error_message` for better readability. This change enhances the reliability of the installation process.
- Added dashboards to migration progress dashboard (#3314). This commit introduces significant updates to the migration progress dashboard, adding dashboards, linting resources, and modifying existing components. The changes include a new dashboard displaying the number of dashboards pending migration, with the data sourced from the `ucx_catalog.multiworkspace.objects_snapshot` table. The existing 'Migration [main]' dashboard has been updated, and unit and integration tests have been adapted accordingly. The commit also renames several SQL files; updates the percentage UDF and the grant, job, cluster, table, and pipeline migration progress queries; and resolves linting compatibility issues related to Unity Catalog. The changes depend on issue #3424, progress issue #3045, and break up issue #3112. The new dashboard aims to enhance the migration process and ensure a smooth transition to Unity Catalog.
- Added history log encoder for dashboards (#3424). A new history log encoder for dashboards has been added, addressing issues #3368 and #3369 and modifying the existing `experimental-migration-progress` workflow. This update adds the `DashboardOwnership` class, used to generate ownership information for dashboards, and the `DashboardProgressEncoder` class, responsible for encoding progress data related to dashboards. The new functionality is covered by manual, unit, and integration testing. In the `Table` class, the `from_table_info` and `from_historical_data` methods have been added, allowing `Table` instances to be created from `TableInfo` objects and historical data dictionaries with more flexibility and safety. The `test_tables.py` file in the `integration/progress` directory has also been updated to include a new test function for checking table failures. These changes improve the tracking and management of dashboard IDs, enhance user name retrieval, and ensure the accurate determination of object ownership.
- Create specific failure for Python syntax error while parsing with Astroid (#3498). This commit enhances the Python linting functionality by introducing a specific failure message, `python-parse-error`, for syntax errors encountered during code parsing with Astroid. Previously, a generic `system-error` message was used; the new name is consistent with the existing `sql-parse-error` message. This change provides clearer failure indicators and includes more detailed information about the error location. Additionally, Python linting-related code has been modified, unit tests have been added, and the README now guides users on handling these new error types. A new method, `Tree.maybe_parse()`, has been introduced to parse Python code and detect syntax errors, ensuring more precise error handling for users (see the third sketch at the end of this section).
- DBR 16 and later support (#3481). This pull request introduces support for Databricks Runtime (DBR) 16 and later in the code that converts Hive Metastore (HMS) tables to external tables within the `migrate-tables` workflow. The changes include a new static method `_get_entity_storage_locations` to handle the new `entityStorageLocations` property in DBR 16 and a modification of the `_convert_hms_table_to_external` method to account for this property. Additionally, the `run_workflow` function in the `assessment` workflow now has the `skip_job_wait` parameter set to `True`, which allows the workflow to continue running even if a job within it fails. The changes have been manually tested for DBR 16 and verified in a staging environment, and existing integration tests have been run for DBR 15. The diff also updates the `test_table_migration_convert_manged_to_external` method to skip job waiting during testing, enabling the test to run successfully on DBR 16.
- Delete stale code: `NotebookLinter._load_source_from_run_cell` (#3529). In this update, we have removed the stale method `NotebookLinter._load_source_from_run_cell`, which was responsible for loading the source code from a run cell in a notebook and was part of deprecated functionality. This change is part of the ongoing effort to address issue #3514 and enhances the overall codebase. Additionally, we have modified the existing `databricks labs ucx lint-local-code` command to update the code linting functionality, conducted manual testing to ensure that the changes function as intended, and added and modified several unit tests. The modifications to the `databricks labs ucx lint-local-code` command change the way code linting is performed, ultimately improving the efficiency and maintainability of the codebase.
- Exclude ucx dashboards from Lakeview dashboard crawler (#3450). In this release, we have enhanced the `lakeview_crawler` method to exclude UCX dashboards and prevent false positives. This has been achieved by adding a new optional argument, `exclude_dashboard_ids`, to the `__init__` method, which takes a list of dashboard IDs to exclude from the crawler. The `_crawl` method has been updated to skip dashboards whose IDs match the ones in the `exclude_dashboard_ids` list. The change includes unit tests and manual testing to ensure proper functionality and has been verified in the staging environment. These updates improve the accuracy and reliability of the dashboard crawler.
- Fixed issue in installing UCX on UC-enabled workspace (#3501). This PR updates the `ClusterPolicyInstaller` class, changing the `spark_version` policy definition from a fixed value to an allowlist with a default value. This resolves an issue where, when UC is enabled on a workspace, the cluster definition takes on `single_user` and `user_isolation` values instead of `Legacy_Single_User` and `Legacy_Table_ACL`. The job definition has also been updated to use the default value when one is not explicitly provided. These changes improve compatibility with UC-enabled workspaces, ensuring the correct values for `spark_version` in the cluster definition. The PR includes updates to unit tests and installation tests, addressing issue #3420.
- Fixed typo in workflow name (in error message) (#3491). This PR addresses a minor typo in the error message displayed by the `validate_groups_permissions` method in the `workflows.py` file: the workflow name in the message was missing the plural `s` in `groups`, and the corrected spelling is now `validate-groups-permissions`. This change does not introduce any new methods or modify any existing functionality, but instead focuses on enhancing the...
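
A sketch of the widened S3 location pattern from #3510 above. The changelog only says an optional group for a trailing slash and subsequent characters was added so that bucket roots match, so the exact regex here is an assumption.

```python
import re

# The optional trailing group lets bare bucket roots like s3://BUCKET4 match too.
S3_LOCATION = re.compile(r"^s3a?://(?P<bucket>[^/]+)(?:/(?P<path>.*))?$")

for url in ("s3://BUCKET4", "s3://BUCKET4/", "s3://BUCKET4/folder/x"):
    match = S3_LOCATION.match(url)
    print(match.group("bucket"), repr(match.group("path")))
# BUCKET4 None
# BUCKET4 ''
# BUCKET4 'folder/x'
```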
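
The retry helper from #3546 can be sketched with the Databricks SDK's `retried` decorator; the loader callable below is a placeholder for however the test actually loads the installation.

```python
from datetime import timedelta

from databricks.sdk.errors import NotFound
from databricks.sdk.retries import retried


@retried(on=[NotFound], timeout=timedelta(minutes=2))
def wait_for_installation_to_finish(load_installation) -> None:
    # load_installation() raises NotFound until the first install has written
    # its state; retried() keeps re-trying for up to two minutes.
    load_installation()
```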
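
And a standalone sketch of the `python-parse-error` detection from #3498. `Tree.maybe_parse()` is internal to ucx, so this only shows the underlying Astroid pattern under assumed return conventions.

```python
from astroid import parse  # pip install astroid
from astroid.exceptions import AstroidSyntaxError


def maybe_parse(code: str):
    """Return (tree, failure), where failure carries a python-parse-error style message."""
    try:
        return parse(code), None
    except AstroidSyntaxError as e:
        return None, f"python-parse-error: {e}"


tree, failure = maybe_parse("def broken(:\n    pass")
assert tree is None and failure.startswith("python-parse-error")
```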
v0.53.1
- Removed `packaging` package dependency (#3469). In this release, we have removed the dependency on the `packaging` package to address a release issue. The import statements for `packaging.version.Version` and `packaging.version.InvalidVersion` have been removed, and the `_external_hms` function in the `federation.py` file has been updated to retrieve the Hive Metastore version using the `spark.sql.hive.metastore.version` configuration key and validate it using a regular expression pattern (see the sketch below). If the version is not valid, the function logs an informational message and returns `None`. This change modifies the Hive Metastore version validation logic and improves the overall reliability and maintainability of the library.
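
A sketch of the regex-based validation that replaced `packaging.version` here. The configuration key is from the changelog; the exact pattern is an assumption.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern for versions such as "2.3.9" read from the
# "spark.sql.hive.metastore.version" Spark configuration key.
HMS_VERSION = re.compile(r"^\d+\.\d+(\.\d+)?$")


def parse_hms_version(raw: str) -> str | None:
    if not HMS_VERSION.match(raw):
        logger.info(f"Unsupported Hive Metastore version: {raw}")
        return None
    return raw


assert parse_hms_version("2.3.9") == "2.3.9"
assert parse_hms_version("not-a-version") is None
```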
Contributors: @FastLee
v0.53.0
- Added dashboard crawlers (#3397). The library has been updated with new dashboard crawlers for the assessment workflow, Redash migration, and `QueryLinter`. These crawlers crawl and persist dashboards, migrate or revert them during Redash migration, and lint the queries of the crawled dashboards using `QueryLinter`. This change resolves issues #3366 and #3367, and progresses #2854. The `databricks labs ucx {migrate-dbsql-dashboards|revert-dbsql-dashboards}` command and the `assessment` workflow have been modified to incorporate these new features, and unit and integration tests have been added to ensure proper functionality of the new dashboard crawlers. Additionally, two new tables, `$inventory.redash_dashboards` and `$inventory.lakeview_dashboards`, have been introduced to hold a list of all Redash or Lakeview dashboards; they are used by the `QueryLinter` and the Redash migration. These changes improve the assessment, migration, and linting processes for dashboards in the library.
- DBFS Root Support for HMS Federation (#3425). The commit "DBFS Root Support for HMS Federation" introduces changes to support the DBFS root location in HMS federation. A new method, `external_locations_with_root`, is added to the `ExternalLocations` class to return a list of external locations that includes the DBFS root location. This method is used in various functions and test cases, such as `test_create_uber_principal_no_storage`, `test_create_uc_role_multiple_raises_error`, `test_create_uc_no_roles`, `test_save_spn_permissions`, and `test_create_access_connectors_for_storage_accounts`, to ensure that the DBFS root location is correctly identified and tested in different scenarios. Additionally, `external_locations.snapshot.return_value` is changed to `external_locations.external_locations_with_root.return_value` in the test functions `test_create_federated_catalog` and `test_already_existing_connection` to retrieve a list of external locations including the DBFS root location. This commit closes issue #3406. Overall, these changes improve the handling and testing of the DBFS root location in HMS federation.
- Log message as error when legacy permissions API is enabled/disabled depending on the workflow ran (#3443). In this release, logging behavior has been updated in several methods in the `workflows.py` file. When the `use_legacy_permission_migration` configuration is set to `False` and specific conditions are met, error messages are now logged instead of info messages for the methods `verify_metastore_attached`, `rename_workspace_local_groups`, `reflect_account_groups_on_workspace`, `apply_permissions_to_account_groups`, `apply_permissions`, and `validate_groups_permissions`. This change addresses issue #3388 and provides clearer guidance when the legacy permissions API is not functioning as expected: users will now see an error message advising them to run the `migrate-groups` job or set `use_legacy_permission_migration` to `True` in the `config.yml` file. These updates ensure smoother workflow runs and more accurate logging for better troubleshooting.
- MySQL External HMS Support for HMS Federation (#3385). This commit adds support for MySQL-based Hive Metastore (HMS) in HMS Federation, enhances the CLI for creating a federated catalog, and improves external HMS functionality. It introduces a new parameter `enable_hms_federation` in the `Locations` class constructor, allowing users to enable or disable MySQL-based HMS federation. The `external_locations` method in `application.py` now accepts `enable_hms_federation` as a parameter, enabling more granular control of the federation feature, and the CLI for creating a federated catalog has been updated to accept a `prompts` parameter, providing more flexibility. The commit also introduces a new dataclass `ExternalHmsInfo` for external HMS connection information and updates the `HiveMetastoreFederationEnabler` and `HiveMetastoreFederation` classes to support non-Glue external metastores. Furthermore, it adds methods to handle the creation of a federated catalog from the command-line interface, split JDBC URLs, and manage external connections and permissions.
- Skip listing built-in catalogs to update table migration process (#3464). In this release, the migration process for updating tables in the Hive Metastore has been optimized with the introduction of the `TableMigrationStatusRefresher` class, which inherits from `CrawlerBase`. Its `_iter_schemas` method now filters out built-in catalogs and schemas when listing them, skipping unnecessary processing during the table migration process (see the sketch at the end of this section). Additionally, the `get_seen_tables` method has been updated to include checks for `schema.name` and `schema.catalog_name`, and the `_crawl` and `_try_fetch` methods have been modified to reflect changes in the `TableMigrationStatus` constructor. The release also modifies the existing `migrate-tables` workflow and adds unit tests that demonstrate the exclusion of built-in catalogs during the table migration status update; the test case uses the `CatalogInfoSecurableKind` enumeration to specify the kind of catalog and verifies that the seen tables only include non-built-in catalogs. These changes prevent unnecessary processing of built-in catalogs and schemas during table migration, improving efficiency and performance.
- Updated databricks-sdk requirement from <0.39,>=0.38 to >=0.39,<0.40 (#3434). In this release, the required version of the `databricks-sdk` package has been updated in the `pyproject.toml` file to be greater than or equal to 0.39 and less than 0.40, allowing the use of the latest version of the package while preventing the use of versions 0.40 and above. This change is based on the release notes and changelog for version 0.39 of the package, which includes bug fixes, internal changes, and API changes such as the addition of the `cleanrooms` package, the `delete()` method for workspace-level services, and fields for various request and response objects. The commit history for the package is also provided. Dependabot has been configured to resolve any conflicts with this PR and can be manually triggered to perform various actions as needed; it can also be used to ignore specific dependency versions or close the PR.
- Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 (#3456). In this pull request, the version range of the `databricks-sdk` dependency has been updated from `>=0.39,<0.40` to `>=0.39,<0.41`, allowing the use of the latest version of the `databricks-sdk` while ensuring that it is less than 0.41. The pull request also includes release notes detailing the API changes in version 0.40.0, such as the addition of new fields to various compute, dashboard, job, and pipeline services. A changelog is provided, outlining the bug fixes, internal changes, new features, and improvements in versions 0.38.0, 0.39.0, and 0.40.0, along with a list of commits showing the development progress of these versions.
- Use LTS Databricks runtime version (#3459). This release changes the Databricks runtime version to a Long-Term Support (LTS) release to address issues encountered during the migration to external tables: the previous runtime version caused the `convert to external table` migration strategy to fail, and this change serves as a temporary solution. The `migrate-tables` workflow has been modified, and existing integration tests have been reused to ensure functionality. The `test_job_cluster_policy` function now uses the LTS version instead of the latest version, ensuring a specified Spark version for the cluster policy; the function also checks for a matching node type ID, Spark version, and necessary resources. However, users may still encounter problems with the latest UCX release. The `_convert_hms_table_to_external` method in the `table_migrate.py` file has been updated to return a boolean value, with a new TODO comment about a possible failure with Databricks Runtime 16.0 due to a JDK update.
- Use `CREATE_FOREIGN_CATALOG` instead of `CREATE_FOREIGN_SECURABLE` with HMS federation enablement commands (#3309). A change has been made to update the `databricks-sdk` dependency version from `>=0.38,<0.39` to `>=0.39` in the `pyproject.toml` file, which may affect the project's functionality related to the `databricks-sdk` library. In the Hive Metastore Federation codebase, `CREATE_FOREIGN_CATALOG` is now used instead of `CREATE_FOREIGN_SECURABLE` for HMS federation enablement commands, in line with issue #3308. The `_add_missing_permissions_if_needed` method has been updated to check for `CREATE_FOREIGN_SECURABLE` instead of `CREATE_FOREIGN_CATALOG` when granting permissions. Additionally, a unit test file for HiveMetastore Federation has ...
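
For the built-in catalog filtering in #3464, a sketch along these lines; which `CatalogInfoSecurableKind` members count as built-in is an assumption based on the changelog's description.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import CatalogInfoSecurableKind

# Assumed set of built-in kinds to skip; the changelog only says built-in
# catalogs are identified via the CatalogInfoSecurableKind enumeration.
BUILTIN_KINDS = {
    CatalogInfoSecurableKind.CATALOG_INTERNAL,
    CatalogInfoSecurableKind.CATALOG_SYSTEM,
}


def iter_user_catalogs(ws: WorkspaceClient):
    for catalog in ws.catalogs.list():
        if catalog.securable_kind in BUILTIN_KINDS:
            continue  # skip built-in catalogs during the migration-status refresh
        yield catalog
```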
v0.52.0
- Added handling for Databricks errors during workspace listings in the table migration status refresher (#3378). In this release, we have implemented changes to enhance error handling and improve the stability of the table migration status refresher. We have resolved issue #3262, which concerned Databricks errors during workspace listings. The `assessment` workflow has been updated, and new unit tests have been added to ensure proper error handling. The changes include importing `DatabricksError` from the `databricks.sdk.errors` module and adding a new method `_iter_catalogs` that lists catalogs with error handling for `DatabricksError` (see the first sketch at the end of this section). The `_iter_schemas` method now replaces `_ws.catalogs.list()` with `self._iter_catalogs()`, also including error handling for `DatabricksError`. Furthermore, new unit tests check the logging of the `TableMigration` class when listing tables in the Databricks workspace, focusing on handling errors during catalog, schema, and table listings. These changes improve the library's robustness and ensure that it can gracefully handle errors during the table migration status refresh.
- Convert READ_METADATA to UC BROWSE permission for tables, views and database (#3403). The `uc_grant_sql` method in the `grants.py` file has been modified to convert `READ_METADATA` permissions to `BROWSE` permissions for tables, views, and databases (see the second sketch at the end of this section). This change adds new entries to the dictionary used to map permission types to their corresponding UC actions and has been manually tested. The behavior of the `grant_loader` function in the `hive_metastore` module has also been modified to change the action type of a grant from `READ_METADATA` to `EXECUTE` for a specific case. Additionally, the `test_grants.py` unit test file has been updated to include a new test case that verifies the conversion of `READ_METADATA` to `BROWSE` for a grant on a database and handles the conversion for a new `udf="function"` parameter. These changes resolve issue #2023 and have been verified through manual testing and unit tests. No new methods have been added, and existing functionality has been changed only in a limited scope.
- Migrates Pipelines crawled during the assessment phase (#2778). A new utility class, `PipelinesMigrator`, has been introduced to facilitate the migration of Delta Live Tables (DLT) pipelines. This class is used in a new workflow that tests pipeline migration, which clones DLT pipelines crawled in the assessment phase, with specific configurations, to new Unity Catalog (UC) pipelines. The migration can be skipped for certain pipelines by specifying their pipeline IDs in a list. Three test scenarios, each with different pipeline specifications, are defined to ensure the proper functioning of the migration process under various conditions. The class and the migration process are thoroughly tested with manual testing, unit tests, and integration tests, with no reliance on a staging environment. The migration process takes into account the `WorkspaceClient`, `WorkspaceContext`, `AccountClient`, and a flag for running the command as a collection. The `PipelinesMigrator` class uses a `PipelinesCrawler` and a `JobsCrawler` to perform the migration and provides additional parameters for better functionality. The commit also introduces a new command, `migrate_dlt_pipelines`, to the ucx CLI, which helps migrate DLT pipelines. The migration process is tested using a mock installation, unit tests, and integration tests; the tests cover a scenario where the installation has two jobs, `test` and `assessment`, with job IDs `123` and `456` respectively, the state of the installation is recorded in a `state.json` file, and a configuration file `pipeline_mapping.csv` maps the source pipeline ID to the target catalog, schema, pipeline, and workspace names.
- Removed `try-except` around verifying the migration progress prerequisites in the `migrate-tables` CLI command (#3439). In the latest release, the `ucx` package's `migrate-tables` CLI command has changed how it handles progress tracking prerequisites: the try-except block around the verification has been removed, so the `RuntimeWarning` is now propagated, providing a more specific and helpful error message. If the prerequisites are not met, the `verify` method raises an exception and the migration does not proceed. This change improves the accuracy of error messages for users and ensures that the prerequisites for migration are properly met. The tests for `migrate_tables` have been updated accordingly, including a new test case `test_migrate_tables_errors_out_before_assessment` that checks that the migration does not proceed when the verification fails. This change affects the existing `databricks labs ucx migrate-tables` command and brings improved precision and reliability to the migration process.
- Removed redundant internal methods from create_account_group (#3395). In this change, the `create_account_group` function's internal methods have been removed, and its signature has been modified to retrieve the workspace ID from `accountworkspace._workspaces()` instead of taking it as a parameter. This resolves issue #3170 and improves code efficiency by removing unnecessary parameters and methods. The `AccountWorkspaces` class now accepts a list of workspace IDs upon instantiation, enhancing code readability and eliminating redundancy. The function has been covered with unit tests, ensuring it creates a group if it doesn't exist, throws an exception if the group already exists, filters system groups, and handles cases where a group already has the required number of members in a workspace. These changes simplify the codebase, eliminate redundancy, and improve the maintainability of the project.
- Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.34 (#3407). In this release, we have widened the allowed `sqlglot` version range from `>=25.5.0,<25.33` to `>=25.5.0,<25.34`, allowing us to utilize the latest version of sqlglot, which includes various bug fixes and new features. In v25.33.0, there were two breaking changes: the TIMESTAMP data type now maps to Type.TIMESTAMPTZ, and the NEXT keyword is now treated as a function keyword. Several new features were also introduced, including support for generated columns in PostgreSQL and the ability to preserve tables in the replace_table method, along with several bug fixes for issues related to BigQuery, Presto, and Spark. The v25.32.1 release contained two bug fixes related to BigQuery and one related to Presto. Furthermore, v25.32.0 had three breaking changes: support for ATTACH/DETACH statements, tokenization of hints as comments, and a fix to datetime coercion in the canonicalize rule. That release also introduced new features, such as support for TO_TIMESTAMP* variants in Snowflake and improved error messages in the Redshift transpiler, plus several bug fixes for issues related to SQL Server, MySQL, and PostgreSQL.
- Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.35 (#3413). In this release, the `sqlglot` version range has been updated from one that excludes `25.34` and above to one that allows `25.5.0` and above but excludes `25.35`. This update was made to enable the latest version of `sqlglot`, which includes one breaking change related to the alias expansion of USING STRUCT fields. This version also introduces two new features: an optimization for alias expansion of USING STRUCT fields, and support for generated columns in PostgreSQL. Additionally, two bug fixes were implemented, addressing proper consumption of dashed table parts and removal of parentheses from CURRENT_USER in Presto. The update also includes a fix to make TIMESTAMP map to Type.TIMESTAMPTZ, a fix to parse DEFAULT in a VALUES clause into a Var, and changes to the BigQuery and Snowflake dialects to improve transpilation and JSONPathTokenizer leniency. The commit message includes a reference to issue [#3413](https://github.com/databrickslabs/ucx/issues/3413) and a link to the `sqlglot` changelog for further reference.
- Updated sqlglot requirement from <25.35,>=25.5.0 to >=25.5.0,<26.1 (#3433). In this release, we have updated the required version of the `sqlglot` library to a range that includes version 25.5.0 but excludes version 26.1. This change is needed because of the breaking changes introduced in `sqlglot` v26.0.0, which are not yet compatible with our project. The commit message includes the changelog for `sqlglot` v26.0.0, which highlights the breaking changes, new features, bug fixes, and other modifications in this version, as well as a list of commits merged into the `sqlglot` repository for a comprehensive understanding of the changes. As a software engineer, I recommend approving this change to maintain compatibility with `sqlglot`. However, I advise thorough testing to ensure the updated version does n...
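
A minimal sketch of the `_iter_catalogs` error handling described in #3378 above; the method is internal to the table migration status refresher, so the wrapper below is illustrative.

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import DatabricksError

logger = logging.getLogger(__name__)


def iter_catalogs(ws: WorkspaceClient):
    try:
        yield from ws.catalogs.list()
    except DatabricksError as e:
        # Log and stop iterating instead of failing the whole refresh.
        logger.error("Cannot list catalogs", exc_info=e)
```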
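
And a toy version of the `READ_METADATA` to `BROWSE` mapping from #3403. The dictionary shape and the SQL template are assumptions; the action mapping itself is from the changelog.

```python
# Assumed shape: map (legacy Hive action, object type) to the UC action.
UC_ACTION = {
    ("READ_METADATA", "TABLE"): "BROWSE",
    ("READ_METADATA", "VIEW"): "BROWSE",
    ("READ_METADATA", "DATABASE"): "BROWSE",
}


def uc_grant_sql(action: str, object_type: str, object_key: str, principal: str) -> str | None:
    uc_action = UC_ACTION.get((action.upper(), object_type.upper()))
    if uc_action is None:
        return None  # no UC equivalent for this legacy Hive action
    return f"GRANT {uc_action} ON {object_type} {object_key} TO `{principal}`"


print(uc_grant_sql("READ_METADATA", "TABLE", "main.sales.orders", "data-eng"))
# GRANT BROWSE ON TABLE main.sales.orders TO `data-eng`
```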
v0.51.0
- Added `assign-owner-group` command (#3111). The Databricks Labs Unity Catalog Exporter (UCX) tool now includes a new `assign-owner-group` command, allowing users to assign an owner group to the workspace. This group will be designated as the owner for all migrated tables and views, providing better control and organization of resources. The command can be executed in the context of a specific workspace or across multiple workspaces. The implementation includes new classes, methods, and attributes in various files, such as `cli.py`, `config.py`, and `groups.py`, enhancing ownership management functionality. The `assign-owner-group` command replaces the functionality of issue #3075 and addresses issue #2890, ensuring proper schema ownership and handling of crawled grants. Developers should be aware that running the `migrate-tables` workflow will result in assigning a new owner group for the Hive Metastore instance in the workspace installation.
- Added `opencensus` to known list (#3052). In this release, we have added OpenCensus to the list of known libraries in our configuration file. OpenCensus is a popular set of tools for distributed tracing and monitoring, and its inclusion will improve support and integration for users who utilize it. This change does not affect existing functionality; it only adds a new entry in the configuration file, allowing the library to better recognize and work with OpenCensus.
- Added default owner group selection to the installer (#3370). A new class, `AccountGroupLookup`, has been added to select the default owner group during the installation process, following up on #3111. This class uses the `workspace_client` to determine the owner group and a `pick_owner_group` method to prompt the user for a selection if necessary. The ownership selection process has been improved with a check in the installer's `_static_owner` method to determine whether the current user is part of the default owner group. The `GroupManager` class has been updated to use the new `AccountGroupLookup` class and its `pick_owner_group` and `validate_owner_group` methods. A new variable, `default_owner_group`, is introduced in the `ConfigureGroups` class to configure groups during installation based on user input, and a new unit test, `test_configure_with_default_owner_group`, demonstrates how the installer sets the expected workspace configuration values when a default owner group is specified during installation.
- Added handling for non-UTF-8 encoded notebook error explicitly (#3376). Non-UTF-8 encoded notebooks previously failed to load; explicit error handling has now been introduced for this case. A `UnicodeDecodeError` exception is now caught and logged as a warning, while the notebook is skipped and returned as `None` (see the second sketch at the end of this section). This change is implemented in the `load_dependency` method in the `loaders.py` file, which is part of the assessment workflow. Additionally, a new unit test in `test_loaders.py` verifies this behavior, checking for different types of exceptions, specifically `PermissionError` and `UnicodeDecodeError`, ensuring that the system handles notebooks with non-UTF-8 encoding gracefully. This enhancement resolves issue #3374 and improves the overall robustness of the application.
- Added migration progress documentation (#3333). In this release, we have updated the `migration-progress-experimental` workflow to track the migration progress of a subset of inventory tables related to workspace resources being migrated to Unity Catalog (UC). The workflow updates the inventory tables and tracks the migration progress in the UCX catalog tables. To use this workflow, users must attach a UC metastore to the workspace, create a UCX catalog, and ensure that the assessment job has run successfully. The `Migration Progress` section in the documentation has been updated with a new markdown file that provides details about the migration progress, including a migration progress dashboard and the experimental migration progress workflow that generates historical records of inventory objects relevant to the migration. These records are stored in the UCX UC catalog, which contains a historical table with information about the object type, object ID, data, failures, owner, and UCX version. The migration process also tracks dangling Hive or workspace objects that are not referenced by business resources, and the progress is persisted in the UCX UC catalog, allowing for cross-workspace tracking of migration progress.
- Added note about running assessment once (#3398). In this release, the UCX assessment workflow is executed only once and does not update existing results in repeated runs. The README has been updated with a note clarifying that the assessment workflow is a one-time process, along with instructions on how to refresh the inventory and findings by uninstalling and reinstalling UCX. This ensures that the inventory and findings for a workspace stay up to date and accurate, and software engineers should follow the updated instructions when using the assessment workflow.
- Allowing skipping TACLs migration during table migration (#3384). A new optional flag, `skip_tacl_migration`, has been added to the configuration file, providing users with more flexibility during migration. This flag controls whether the table access control (TACL) migration is skipped during table migrations. It can be set when creating catalogs and schemas, as well as when migrating tables or using the `migrate_grants` method in `application.py`. Additionally, the `install.py` file now includes a new variable, `skip_tacl_migration`, which can be set to `True` during the installation process to skip TACL migration. New test cases verify the behavior of skipping TACL migration during grants management and table migration. These changes enhance the flexibility of the system for users managing table migrations and TACL operations in their infrastructure, addressing issues #3384 and #3042.
- Bump `databricks-sdk` and `databricks-labs-lsql` dependencies (#3332). In this update, the `databricks-sdk` and `databricks-labs-lsql` dependencies are upgraded to versions 0.38 and 0.14.0, respectively. The `databricks-sdk` update addresses conflicts and bug fixes, and introduces API additions and changes, notably impacting methods like `create()` and `execute_message_query()` in workspace-level services. The `databricks-labs-lsql` update ensures compatibility, though its changelog and specific commits are not provided. This pull request also includes ignore conditions for the `databricks-sdk` dependency to prevent future Dependabot requests. It is strongly advised to rigorously test these updates to avoid any compatibility issues or breaking changes with the existing codebase. This pull request mirrors another (#3329), resolving integration CI issues that prevented the original from merging.
- Explain failures when cluster encounters Py4J error (#3318). In this release, we have improved the handling of Py4J errors in the `databricks/labs/ucx/hive_metastore/tables.py` file, raising noisy failures instead of swallowing the error with a warning. The functions `_all_databases()` and `_list_tables()` have been updated to check whether the error message contains `py4j.security.Py4JSecurityException`; if so, they log an error message with instructions to update or reinstall UCX, and otherwise they log a warning message and return an empty list (see the first sketch at the end of this section). These changes also resolve the linked issue #3271. The functionality has been thoroughly tested and verified in the labs environment. These improvements provide more informative error messages and enhance the overall reliability of the library.
- Rearranged job summary dashboard columns and make job_name clickable (#3311). In this update, the job summary dashboard columns have been rearranged, and the `30_3_job_details.sql` file, which contained a SQL query for selecting job details from the `inventory.jobs` table, has been eliminated. The `job_name` column is now clickable, providing easy access to job details via the corresponding job ID. The changes include modifying the...
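
The Py4J handling from #3318 above boils down to string-matching the exception text, since Py4J errors surface in Python without a useful common base class. This sketch assumes a `spark` session and mirrors the described behavior; whether the original re-raises after logging is implied by "raise noisy failures" rather than stated outright.

```python
import logging

logger = logging.getLogger(__name__)


def list_databases(spark) -> list[str]:
    try:
        return [row.databaseName for row in spark.sql("SHOW DATABASES").collect()]
    except Exception as e:
        if "py4j.security.Py4JSecurityException" in str(e):
            # Noisy failure: tell the user how to fix it, then re-raise.
            logger.error("Py4J security exception: please update or reinstall UCX", exc_info=e)
            raise
        logger.warning("Failed to list databases", exc_info=e)
        return []
```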
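
And a sketch of the non-UTF-8 notebook handling from #3376. The loader shape is assumed; the caught exception and the skip-with-warning behavior are from the changelog.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_notebook_source(path: Path) -> str | None:
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError as e:
        # Skip notebooks that are not valid UTF-8 instead of failing the crawl.
        logger.warning(f"Notebook is not UTF-8 encoded, skipping: {path}", exc_info=e)
        return None
```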
v0.50.0
- Added
pytesseract
to known list (#3235). A new addition has been made to theknown.json
file, which tracks packages with native code, to includepytesseract
, an Optical Character Recognition (OCR) tool for Python. This change improves the handling ofpytesseract
within the codebase and addresses part of issue #1931, likely concerning the seamless incorporation ofpytesseract
and its native components. However, specific details on the usage ofpytesseract
within the project are not provided in the diff. Thus, further context or documentation may be necessary for a complete understanding of the integration. Nonetheless, this commit simplifies and clarifies the codebase's treatment ofpytesseract
and its native dependencies, making it easier to work with. - Added hyperlink to database names in database summary dashboard (#3310). The recent change to the
- Added hyperlink to database names in database summary dashboard (#3310). The recent change to the `Database Summary` dashboard makes database names clickable, opening a new tab with the corresponding database page. This has been accomplished by adding a `linkUrlTemplate` property to the `database` field in the `encodings` object within the `overrides` property of the dashboard configuration. The commit also includes tests verifying the new functionality in the labs environment and addresses issue #3258. Furthermore, the display of various other statistics, such as the number of tables, views, and grants, has been improved by converting them to links, enhancing the overall usability and navigation of the dashboard.
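  For illustration, such an override might look roughly like the following Python dictionary. Only `overrides`, `encodings`, `database`, and `linkUrlTemplate` are named in the change; the nesting and the URL template here are assumptions:

  ```python
  # Hypothetical shape of the dashboard-configuration override; only the key
  # names mentioned in the release note are taken from the actual change.
  overrides = {
      "encodings": {
          "database": {
              # Open the database page in a new tab when the name is clicked.
              "linkUrlTemplate": "/explore/data/hive_metastore/{{ value }}",
          },
      },
  }
  ```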
- Bump codecov/codecov-action from 4 to 5 (#3316). In this release, the version of the `codecov/codecov-action` dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including `binary`, `gcov_args`, `gcov_executable`, `gcov_ignore`, `gcov_include`, `report_type`, `skip_validation`, and `swift_project`. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code-coverage tracking.
- Depend on a Databricks SDK release compatible with 0.31.0 (#3273). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0, because the new `InvalidState` error class is not available in the previously declared minimum version, 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and, unlike previous DBR versions, does not upgrade to the latest SDK during installation. The project's dependencies in the `pyproject.toml` file have been updated accordingly. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
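  To illustrate why the pin matters: `InvalidState` only exists from SDK 0.31.0 onwards, so code that catches it fails at import time on the SDK 0.30.0 bundled with DBR 16. A sketch, not UCX's actual code:

  ```python
  # Requires databricks-sdk >= 0.31.0; this import fails on 0.30.x.
  from databricks.sdk.errors import InvalidState

  def run_with_state_check(operation):
      try:
          return operation()
      except InvalidState as exc:
          # Surface a clearer message when the resource is in an invalid state.
          raise RuntimeError("resource is in an invalid state; retry later") from exc
  ```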
- Eliminate redundant migration-index refresh and loads during view migration (#3223). In this pull request, we have optimized the view migration process in the `databricks/labs/ucx/hive_metastore/table_metastore.py` file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of the migration status for all tables and views at the end of view migration, and we no longer reload the migration-status snapshot for every view when checking whether it can be migrated and again before migrating it. We have introduced a new class, `TableMigrationIndex`, and imported the `TableMigrationStatusRefresher` class. The `_migrate_views` method now takes an additional argument, `migration_index`, which is used in the `ViewsMigrationSequencer` and in the `_migrate_view` method. The `_view_can_be_migrated` and `_sql_migrate_view` methods now also take `migration_index` as an argument, which is used to determine whether a view can be migrated. These changes make the view migration process faster and more resource-friendly.
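  A simplified sketch of the idea: build the migration-status index once, then pass it to the view-migration helpers instead of reloading a snapshot per view. The class and argument names follow the description above; the fields and signatures are assumptions:

  ```python
  from dataclasses import dataclass

  @dataclass
  class TableMigrationStatus:
      src_schema: str
      src_table: str
      dst_table: str | None = None  # set once the table or view has migrated

  class TableMigrationIndex:
      """Index migration statuses once so later lookups are cheap dict hits."""

      def __init__(self, statuses: list[TableMigrationStatus]):
          self._index = {(s.src_schema, s.src_table): s for s in statuses}

      def is_migrated(self, schema: str, table: str) -> bool:
          status = self._index.get((schema, table))
          return status is not None and status.dst_table is not None

  def _view_can_be_migrated(view_dependencies: list[tuple[str, str]],
                            migration_index: TableMigrationIndex) -> bool:
      # A view can migrate only after everything it references has migrated,
      # checked against the shared index instead of a freshly loaded snapshot.
      return all(migration_index.is_migrated(schema, table)
                 for schema, table in view_dependencies)
  ```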
- Fixed backwards compatibility breakage from Databricks SDK (#3324). In this release, we have addressed a backwards-compatibility break (#3324) caused by an update to the Databricks SDK, which introduced new methods for interacting with dashboards in the `databricks.sdk.service` module. Additionally, we have fixed bug #3322 and updated the `create` function in the `conftest.py` file to use the new `dashboards` module and its `Dashboard` class: the function now returns the dashboard object as a dictionary and calls the `publish` method on this object to publish the dashboard. These changes also include an update to the `pyproject.toml` file, which affects the test and coverage scripts used in the default environment. The test-coverage threshold has been lowered from 90% to 89% to keep overall coverage high while accommodating the new code, and the test command now includes the `--cov-fail-under=89` flag to ensure that coverage remains above this threshold as part of our continuous integration and testing process.
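  As a rough sketch of the flow the updated fixture follows (create a dashboard via the SDK's dashboards service, then publish it). The exact SDK calls and keyword arguments changed across SDK releases, which is precisely what this fix works around, so treat the names below as assumptions:

  ```python
  from databricks.sdk import WorkspaceClient
  from databricks.sdk.service.dashboards import Dashboard

  w = WorkspaceClient()
  # Newer SDKs take a Dashboard object rather than bare keyword arguments.
  created = w.lakeview.create(dashboard=Dashboard(display_name="ucx-test-dashboard"))
  w.lakeview.publish(created.dashboard_id)
  ```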
- Fixed issue with cleanup of failed `create-missing-principals` command (#3243). In this update, we have improved the `create_uc_roles` method in the `access.py` file of the `databricks/labs/ucx/aws` directory to handle failures during role creation caused by permission issues. If a failure occurs, the method now deletes any roles it has already created before re-raising the exception, restoring the system to its initial state. This keeps the system consistent and prevents the accumulation of partially created roles. The update wraps the code that creates a role and attaches a policy to it in a try-except block: if a `PermissionDenied` or `NotFound` exception is raised, it logs an error message, deletes any previously created roles, and re-raises the exception. We have also added unit tests covering the scenario where a failure occurs and the roles are successfully deleted. These changes improve the robustness of the `databricks labs ucx create-missing-principals` command by handling permission errors and restoring the system to its initial state.
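  A sketch of the rollback pattern described above, assuming hypothetical `create_role` and `delete_role` callables; the real `create_uc_roles` works against AWS IAM through UCX's own resource classes:

  ```python
  from databricks.sdk.errors import NotFound, PermissionDenied

  def create_uc_roles(role_specs, create_role, delete_role) -> list:
      """Create all roles, or delete the partial set and re-raise on failure."""
      created = []
      try:
          for spec in role_specs:
              created.append(create_role(spec))  # may raise PermissionDenied
      except (PermissionDenied, NotFound):
          # Restore the initial state so no partially created roles linger.
          for role_name in created:
              delete_role(role_name)
          raise
      return created
  ```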
- Improve error handling for `assess_workflows` task (#3255). This pull request improves error handling and logging in the `assess_workflows` task of the `databricks/labs/ucx` module. A new error type, `DatabricksError`, is now handled in the `_temporary_copy` method, ensuring that Databricks-specific errors are properly caught and re-raised as `InvalidPath` exceptions. Additionally, log levels for several errors have been adjusted to better reflect their severity: recursion errors, Unicode decode errors, schema-determination errors, and dashboard-listing errors are now logged at `warning` instead of `error`. These adjustments provide finer-grained control over message severity and avoid unnecessary alarm when such issues occur, improving the robustness, error handling, and logging of the `assess_workflows` task.
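  A hedged sketch of the `_temporary_copy` behaviour described above; `_download_to_tmp` is a hypothetical stand-in for the real copying logic:

  ```python
  from contextlib import contextmanager
  from databricks.sdk.errors import DatabricksError

  class InvalidPath(ValueError):
      """Raised when a workspace path cannot be copied for analysis."""

  def _download_to_tmp(path: str) -> str:
      # Hypothetical stand-in for the real download/copy logic.
      return "/tmp/" + path.rsplit("/", 1)[-1]

  @contextmanager
  def _temporary_copy(path: str):
      try:
          yield _download_to_tmp(path)
      except DatabricksError as e:
          # Re-raise Databricks-specific failures as a path problem that the
          # workflow assessor knows how to report.
          raise InvalidPath(f"Cannot access: {path}") from e
  ```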
- Require at least 4 cores for UCX VMs (#3229). In this release, the selection of `node_type_id` in the `policy.py` file has been updated to require a minimum of 4 cores for UCX VMs, in addition to the existing requirements of a local disk and at least 32 GB of memory. This change modifies the instance-pool definition by altering the `node_type_id` parameter, so that only VM types with at least 4 cores can be used for UCX, enhancing performance and reliability.
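  The constraint can be expressed through the Databricks SDK's node-type picker; a sketch, assuming the same three requirements named above:

  ```python
  from databricks.sdk import WorkspaceClient

  w = WorkspaceClient()
  # Pick the smallest node type with a local disk, >= 32 GB memory, >= 4 cores.
  node_type_id = w.clusters.select_node_type(
      local_disk=True,
      min_memory_gb=32,
      min_cores=4,
  )
  ```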
- Skip `test_feature_tables` integration test (#3326). This release introduces new features to improve the functionality and usability of the library. The team has implemented a new algorithm that reduces computational complexity, benefiting users who need to process large datasets efficiently. Additionally, a new module enables seamless integration with popular machine-learning frameworks, giving developers more flexibility for building data-driven applications. These enhancements resolve issues #3304 and #3, addressing the community's requests for improved performance and integration capabilities. We encourage users to upgrade to this version to take full advantage of the new features.
- Speed up `update_migration_status` jobs by eliminating lots of redundant SQL queries (#3200). In this relea...
v0.49.0
- Added `MigrationSequencer` for jobs (#3008). In this commit, a `MigrationSequencer` class has been added to manage the migration sequence for various resources, including jobs, job tasks, job-task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of `MigrationStep` objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality works correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, as well as new cluster definitions. This change is linked to issue #1415 and supersedes issue #2980. Additionally, the commit removes some unnecessary imports and fixtures from a test file.
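  An illustrative sketch of turning a dependency graph into ordered steps. The `MigrationStep` fields follow the description above, while the graph construction in the real `MigrationSequencer` is considerably more involved:

  ```python
  from dataclasses import dataclass
  from graphlib import TopologicalSorter

  @dataclass
  class MigrationStep:
      object_type: str
      object_id: str
      object_name: str
      object_owner: str
      required_step_ids: list
      step_number: int

  def sequence_steps(dependencies: dict, metadata: dict):
      """Yield MigrationStep objects in an order that respects dependencies."""
      # dependencies maps object_id -> set of object_ids that must migrate first;
      # metadata maps object_id -> (object_type, name, owner).
      for number, object_id in enumerate(TopologicalSorter(dependencies).static_order()):
          object_type, name, owner = metadata[object_id]
          yield MigrationStep(object_type, object_id, name, owner,
                              sorted(dependencies.get(object_id, ())), number)
  ```

  For example, with `{"job:1": {"cluster:7"}, "cluster:7": set()}` the cluster step is emitted before the job step that depends on it.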
- Added `phik` to known list (#3198). In this release, we have added `phik` to the known list in the `known.json` file, addressing part of issue #1931 as outlined in the linked issues. The `phik` key has been added with an empty list as its value, consistent with the structure of other keys in the file. No existing functionality has been altered and no new methods have been introduced; the scope of the change is confined to adding the `phik` key to the known list.
- Added `pmdarima` to known list (#3199). In this release, we are excited to announce the addition of support for the `pmdarima` library, an open-source Python library for automatic ARIMA model selection and time-series forecasting. With this commit, we have added `pmdarima` to our known list of libraries, providing our users with access to its methods for data preprocessing, model selection, and visualization. The library is particularly useful for fitting ARIMA models and testing for seasonality. By integrating `pmdarima`, users can now perform time-series analysis and forecasting with greater ease and efficiency. This change partly resolves issue #1931 and underscores our commitment to providing access to the latest open-source libraries.
- Added `preshed` to known list (#3220). A new library, `preshed`, has been added to our project's supported libraries, enhancing compatibility and enabling efficient use of its capabilities. Written in Cython, `preshed` provides fast hash tables that assume keys are pre-hashed, and it is used by natural-language-processing libraries such as spaCy. With the inclusion of its two modules, `preshed` and `preshed.about`, this addition partially resolves issue #1931. Software engineers can now leverage the `preshed` library's optimized routines in their projects, reducing development time and increasing efficiency.
- Added `py-cpuinfo` to known list (#3221). In this release, we have added support for the `py-cpuinfo` library to our project, enabling use of the `cpuinfo` functionality that it provides. With this addition, developers can now access detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue #1931 and does not affect any existing functionality or add new methods to the codebase. We believe that this improvement will enhance the capabilities of our project and enable more efficient use of CPU resources.
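  For context, `py-cpuinfo` exposes a single documented entry point that returns the details mentioned above:

  ```python
  import cpuinfo

  info = cpuinfo.get_cpu_info()
  # Typical keys include the brand string, core count, and advertised frequency.
  print(info["brand_raw"], info["count"], info.get("hz_advertised_friendly"))
  ```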
- Cater for empty python cells (#3212). In this release, we have resolved an issue where certain notebook cells were causing crashes in the dependency builder; empty or comment-only cells were identified as the source of the problem. To address this, we have implemented a check that stores an empty tree in the `_python_trees` dictionary when an input cell does not produce a valid tree, and we have added a test verifying the fix on a previously failing repository. If a cell does not produce a tree, the `_load_children_from_tree` method is not executed for that cell, skipping the loading of any children trees. This enhancement prevents crashes caused by empty or comment-only cells and improves the overall stability and reliability of the library.
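  A minimal sketch of the guard, assuming `ast`-based parsing; the real dependency builder uses its own tree types, so the names here are illustrative:

  ```python
  import ast

  _python_trees: dict = {}

  def register_cell(cell_id: str, source: str) -> None:
      try:
          tree = ast.parse(source)  # empty/comment-only cells yield an empty body
      except SyntaxError:
          tree = ast.Module(body=[], type_ignores=[])  # store an empty tree, don't crash
      _python_trees[cell_id] = tree
      if not tree.body:
          return  # nothing to walk: skip loading any children trees
      # ... only non-empty cells proceed to child-tree loading
  ```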
- Create `TODO` issues every nightly run (#3196). A commit has been made to update the `acceptance` repository version in the `acceptance.yml` GitHub workflow from `acceptance/v0.4.0` to `acceptance/v0.4.2`, which affects the integration tests. The `Run nightly tests` step in the GitHub repository's workflow has also been updated to use a newer version of the `databrickslabs/sandbox/acceptance` action, from `v0.3.1` to `v0.4.2`. Software engineers should verify that the new version of the `acceptance` repository contains all necessary updates and fixes, and that the integration tests continue to function as expected. Additionally, testing the updated action is important to ensure that the nightly tests run successfully with up-to-date code and can catch potential issues.
- Fixed Integration test failure of migration_tables (#3108). This release includes a fix for two integration tests (`test_migrate_managed_table_to_external_table_without_conversion` and `test_migrate_managed_table_to_external_table_with_clone`) related to Hive Metastore table migration, addressing issues #3054 and #3055. Previously skipped due to underlying problems, these tests have now been unskipped, enhancing the migration feature's test coverage. No changes have been made to existing functionality; the focus is solely on including the previously skipped tests in the test suite. The change removes the `@pytest.mark.skip` markers from the test functions, ensuring they run and provide more comprehensive coverage of the Hive Metastore migration feature. In addition, this release updates the DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types, with no modifications made to other parts of the codebase.
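  The mechanics of the unskip are as simple as deleting the marker; schematically (the skip reason shown is hypothetical):

  ```python
  # Schematic only: removing the skip marker re-enables the test.
  # @pytest.mark.skip(reason="underlying migration issue")   # <- removed by #3108
  def test_migrate_managed_table_to_external_table_with_clone():
      ...  # the integration test body now runs again
  ```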
- Replace MockInstallation with MockPathLookup for testing fixtures (#3215). In this release, we have updated the unit-test fixtures by replacing the `MockInstallation` class with `MockPathLookup`. Specifically, the `_load_sources` function now uses `MockPathLookup` instead of `MockInstallation` for loading sources. This change enhances the module's testing capabilities and introduces a new module-level `logger` for more precise logging. Additionally, the `_load_sources` calls in the `test_notebook.py` file now pass the file path directly instead of a `SourceContainer` object. This modification allows more flexible and straightforward testing of file-related functionality, thereby fixing issue #3115.
- Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 (#3224). The open-source library `sqlglot` has been updated to version 25.29.0 with this release, incorporating several breaking changes, new features, and bug fixes. The breaking changes include transpiling `ANY` to `EXISTS`, supporting the `MEDIAN()` function, wrapping values in `NOT value IS ...`, and parsing information-schema views into a single identifier. New features include support for the `JSONB_EXISTS` function in PostgreSQL, transpiling `ANY` to `EXISTS` in Spark, transpiling Snowflake's `TIMESTAMP()` function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a `Move` edge case in the semantic differ, adding a `NULL` filter on `ARRAY_AGG` only for columns, improving parsing of `WITH FILL ... INTERPOLATE` in ClickHouse, generating `LOG(...)` for `exp.Ln` in TSQL, and optionally parsing a `Stream` expression. The full changelog can be found in the pull request, which also includes a list of the commits included in this release.
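  One of the named changes (transpiling `ANY` to `EXISTS` for Spark) can be exercised through sqlglot's public API; a sketch, with the exact rendering depending on the installed sqlglot version:

  ```python
  import sqlglot

  sql = "SELECT * FROM t WHERE x = ANY (SELECT y FROM u)"
  # Recent sqlglot versions rewrite the quantified comparison for Spark.
  print(sqlglot.transpile(sql, read="postgres", write="spark")[0])
  ```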
- Use acceptance/v0.4.0 (#3192). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the `databrickslabs/sandbox/acceptance` runner to `acceptance/v0.4.0` and granting write permissions for the `issues` field in the `permissions` section. These updates allow the use of the latest version of the acceptance tests and provide the permissions needed to interact with issues. A `TODO` comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change ensures that the acceptance tests are up to date and functioning properly.
- Warn about errors instead to a...
comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly. - Warn about errors instead to a...