
Releases: databrickslabs/ucx

v0.57.0

05 Mar 03:47
d0bcfc5
  • Convert UCX job ids to int before passing to JobsCrawler (#3816). In this release, we have addressed issue #3722 and improved the robustness of the open-source library by modifying the jobs_crawler method to handle job IDs more effectively. Previously, job IDs were passed directly to the exclude_job_ids parameter, which could cause issues if they were not integers. To address this problem, we have updated the jobs_crawler method to convert all job IDs to integers using a list comprehension before passing them to the method. This change ensures that only valid integer job IDs reach JobsCrawler, enhancing the reliability of the method. The commit includes a manual test to confirm the correct behavior of this modification.
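The conversion described above can be sketched in a few lines; the helper name below is illustrative and not UCX's actual API:

```python
# Hypothetical sketch of the fix above: job IDs arriving as strings (e.g. from
# configuration) are normalized to int before reaching the crawler's
# exclude_job_ids parameter. normalize_job_ids is an illustrative name.
def normalize_job_ids(job_ids):
    """Convert job IDs to integers, since the crawler compares them as ints."""
    return [int(job_id) for job_id in job_ids]
```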
  • Exclude UCX jobs from crawling (#3733). In this release, we have made modifications to the JobsCrawler and the existing assessment workflow to exclude UCX jobs from crawling, avoiding confusion for users when they appear in assessment reports. This change addresses issues #3656 and #3722, and is a follow-up to previous issue #3732. We have also incorporated updates from pull requests #3767 and #3759 to improve integration tests and linting. Additionally, a retry mechanism has been added to wait for grants to exist before crawling, addressing issue #3758. The changes include the addition of unit and integration tests to ensure the correctness of the modifications. A new exclude_job_ids parameter has been added to the JobsCrawler constructor, which is initialized with the list of UCX job IDs, ensuring that UCX jobs are not included in the assessment report. The _list_jobs method now excludes jobs based on the provided exclude_job_ids and include_job_ids arguments. The _crawl method now uses the _list_jobs method to list the jobs to be crawled. The _assess_jobs method has been updated to take into account the exclusion of specific job IDs. The test_grant_detail file, an integration test for the Hive Metastore grants functionality, has been updated to include a retry mechanism to wait for grants to exist before crawling and to check if the SELECT permission on ANY FILE is present in the grants.
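A minimal sketch of the exclusion logic, assuming an in-memory mapping of jobs; the class and attribute names are illustrative, not UCX's actual implementation:

```python
class JobsCrawlerSketch:
    """Illustrative stand-in for a crawler that skips its own jobs."""

    def __init__(self, jobs, exclude_job_ids=None, include_job_ids=None):
        self._jobs = jobs  # mapping of job_id -> job payload
        self._exclude_job_ids = set(exclude_job_ids or [])
        self._include_job_ids = set(include_job_ids) if include_job_ids else None

    def _list_jobs(self):
        for job_id, job in self._jobs.items():
            if job_id in self._exclude_job_ids:
                continue  # UCX's own jobs never reach the assessment report
            if self._include_job_ids is not None and job_id not in self._include_job_ids:
                continue
            yield job_id, job
```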
  • Let WorkflowLinter.refresh_report lint jobs from JobsCrawler (#3732). In this release, the WorkflowLinter.refresh_report method has been updated to lint jobs from the JobsCrawler class, ensuring that only jobs within the scope of the crawler are processed. This change resolves issue #3662 and progresses issue #3722. The workflow linting code, the assessment workflow, and the JobsCrawler class have been modified. The JobsCrawler class now includes a snapshot method, which is used in the WorkflowLinter.refresh_report method to retrieve necessary data about jobs. Unit and integration tests have been updated correspondingly, with the integration test for workflows now verifying that all rows returned from a query to the workflow_problems table have a valid path field. The WorkflowLinter constructor now includes an instance of JobsCrawler, allowing for more targeted linting of jobs. The introduction of the JobsCrawler class enables more efficient and precise linting of jobs, improving the overall accuracy of workflow assessment.
  • Let dashboard name adhere to naming convention (#3789). In this release, the naming convention for dashboard names in the ucx library has been enforced, restricting them to alphanumeric characters, hyphens, and underscores. This change replaces any non-conforming characters in existing dashboard names with hyphens or underscores, addressing several issues (#3761 through #3788). A temporary fix has been added to the _create_dashboard method to ensure newly created dashboard names adhere to the new naming convention, indicated by a TODO comment. This release also resolves a test failure in a specific GitHub Actions run and addresses a total of 29 issues. The specifics of the modification made to the databricks labs install ucx command and the changes to existing functionality are not detailed, making it difficult to assess their scope. The commit includes the deletion of a file called 02_0_owner.filter.yml, and all changes have been manually tested. For future reference, it would be helpful to include more information about the changes made, their impact, and the reason for deleting the specified file.
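The naming rule described above can be sketched with a single substitution; the replacement character is an assumption here, since the bullet mentions both hyphens and underscores:

```python
import re

# Hedged sketch of the naming convention above: any character outside
# alphanumerics, hyphens, and underscores is replaced with an underscore.
# The exact replacement character UCX uses may differ.
def sanitize_dashboard_name(name: str) -> str:
    return re.sub(r"[^a-zA-Z0-9_-]", "_", name)
```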
  • Partial revert Let dashboard name adhere to naming convention (#3794). In this release, we have partially reverted a previous change to the migration progress dashboard, reintroducing the owner filter. This change was made in response to feedback from users who found the previous modification to the dashboard less intuitive. The new owner filter has been defined in a new file, '02_0_owner.filter.yml', which includes the title, column name, type, and width of the filter. To ensure proper functionality, this change requires the release of lsql after merging. The change has been thoroughly tested to guarantee its correct operation and to provide the best possible user experience.
  • Partial revert Let dashboard name adhere to naming convention (#3795). In this release, we have partially reversed a previous change that enforced a naming convention for dashboard names, allowing the use of special characters such as spaces and brackets again. The _create_dashboard method in the install.py file and the _name method in the mixins.py file have been updated to reflect this change, affecting the migration progress dashboard. The display_name attribute of the metadata object has been updated to use the original format, which may include special characters. The reference variable has also been updated accordingly. The functions created_job_tasks and created_job have been updated to use the new naming convention when retrieving installation jobs with specific names. These changes have been manually tested and the tests have been verified to work correctly after the reversion. This change is related to issues #3799, #3789, and reverts commit 048bc8f.
  • Put back dashboard names (#3808). In the lsql release v0.16.0, the naming convention for dashboards has been updated to support non-alphanumeric characters in the dashboard names. This change modifies the _create_dashboard function in install.py and the _name method in mixins.py to create dashboard names with a format like [UCX] assessment (Main), which includes parent and child folder names. This update addresses issues reported in tickets #3797 and #3790, and partially reverses previous changes made in commits 4017a25 and 834ef14. The functionality of other methods remains unchanged. With this release, the created_job_tasks and created_job functions now accept dashboard names with non-alphanumeric characters as input.
  • Updated databricks-labs-lsql requirement from <0.15,>=0.14.0 to >=0.14.0,<0.17 (#3801). In this update, we have relaxed the upper bound of the databricks-labs-lsql dependency: the requirement changes from greater than or equal to 0.14.0 and less than 0.15.0 to greater than or equal to 0.14.0 and less than 0.17.0. This change allows for the use of the latest version of the package, which includes various bug fixes and dependency updates. The package is utilized in the acceptance tests that are run as part of the CI/CD pipeline. With this update, the acceptance tests can now be executed using the most recent version of the package, resulting in enhanced functionality and reliability.
  • Updated databricks-sdk requirement from <0.42,>=0.40 to >=0.44,<0.45 (#3686). In this release, we have updated the version requirement for the databricks-sdk package to be greater than or equal to 0.44.0 and less than 0.45.0. This update allows for the use of the latest version of the databricks-sdk, which includes new methods, fields, and bug fixes. For instance, the get_message_query_result_by_attachment method has been added for the w.genie.workspace_level_service, and several fields such as review_state, reviews, and runner_collaborators have been removed for the databricks.sdk.service.clean_rooms.CleanRoomAssetNotebook object. Additionally, the securable_kind field has been removed for various objects such as CatalogInfo and ConnectionInfo. We recommend thoroughly testing this update to ensure compatibility with your project. The release notes for versions 0.44.0 and 0.43.0 can be found in the commit history. Please note that there are several backward-incompatible changes listed in the changelog for bot...

v0.56.0

25 Feb 03:53
05c2d6a
  • Added documentation to use Delta Live Tables migration (#3587). In this documentation update, we introduce a new section for migrating Delta Live Table pipelines to the Unity Catalog as part of the migration process. This workflow allows for the original and cloned pipelines to run independently after the cloned pipeline reaches the RUNNING state. The update includes an example of stopping and renaming an existing HMS DLT pipeline, and creating a new cloned pipeline. Additionally, known issues and limitations are outlined, such as supported streaming sources, maintenance pausing, and querying by timestamp. To streamline the migration process, the migrate-dlt-pipelines command is introduced with optional parameters for including or excluding specific pipeline IDs. This feature is intended for developers and administrators managing data pipelines and handling table aliasing issues. Relevant user documentation has been added and the changes have been manually tested.
  • Added support for MSSQL and POSTGRESQL to HMS Federation (#3701). In this enhancement, the open-source library now supports Microsoft SQL Server (MSSQL) and PostgreSQL databases in the Hive Metastore Federation (HMS Federation) feature. This update introduces classes for handling external Hive Metastore instances and their versions, and refactors a regex pattern for better support of various JDBC URL formats. A new supported_databases_port class variable is added to map supported databases to default ports, allowing the code to handle SQL Server's distinct default port. Additionally, a supported_hms_versions class variable is created, outlining supported Hive Metastore versions. The _external_hms method is updated to extract HMS version information more accurately, and the _split_jdbc_url method is refactored for better URL format compatibility and parameter extraction. The test file test_federation.py has been updated with new unit tests for external catalog creation with MSSQL and PostgreSQL, further enhancing compatibility with various databases and expanding HMS Federation's capabilities.
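The default-port handling and URL splitting described above can be sketched as follows; the URL shape, the pattern, and the default-port table are assumptions for illustration, not UCX's real _split_jdbc_url:

```python
import re

# Illustrative JDBC URL parser. Assumes URLs of the form
# jdbc:<scheme>://<host>[:<port>]/<database>; real-world JDBC URLs (notably
# SQL Server's semicolon-delimited form) can be more varied.
_DEFAULT_PORTS = {"mysql": 3306, "sqlserver": 1433, "postgresql": 5432}
_JDBC_PATTERN = re.compile(
    r"jdbc:(?P<scheme>\w+)://(?P<host>[\w.-]+)(?::(?P<port>\d+))?/(?P<database>\w+)"
)

def split_jdbc_url(url: str):
    match = _JDBC_PATTERN.match(url)
    if not match:
        raise ValueError(f"unsupported JDBC URL: {url}")
    scheme = match.group("scheme")
    # Fall back to the database's default port when the URL omits one,
    # mirroring the supported_databases_port idea described above.
    port = int(match.group("port") or _DEFAULT_PORTS[scheme])
    return scheme, match.group("host"), port, match.group("database")
```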
  • Added the CLI command for migrating DLT pipelines (#3579). A new CLI command, "migrate-dlt-pipelines," has been added for migrating DLT pipelines from HMS to UC using the DLT Migration API. This command allows users to include or exclude specific pipeline IDs during migration using the --include-pipeline-ids and --exclude-pipeline-ids flags, respectively. The change impacts the PipelinesMigrator class, which has been updated to accept and use these new parameters. Currently, there is no information available about testing, but the changes are expected to be manually tested and accompanied by corresponding unit and integration tests in the future. The changes are isolated to the PipelinesMigrator class and related functionality, with no impact on existing methods or functionality.
  • Addressed Bug with Dashboard migration (#3663). In this release, the _crawl method in dashboards.py has been enhanced to exclude SDK dashboards that lack IDs during the dashboard migration process. This modification enhances migration efficiency by avoiding unnecessary processing of incomplete dashboards. Additionally, the _list_dashboards method now includes a check for dashboards with no IDs while iterating through the dashboards_iterator. If a dashboard with no ID is found, the method fetches the dashboard details using the _get_dashboard method and adds them to the dashboards list, ensuring proper processing. Furthermore, a bug fix for issue #3663 has been implemented in the RedashDashboardCrawler class in assessment/test_dashboards.py. The get method has been added as a side effect to the WorkspaceClient mock's dashboards attribute, enabling the retrieval of individual dashboard objects by their IDs. This modification ensures that the RedashDashboardCrawler can correctly retrieve and process dashboard objects from the WorkspaceClient mock, preventing errors due to missing dashboard objects.
  • Broaden safe read text caught exception scope (#3705). In this release, the safe_read_text function has been enhanced to handle a broader range of exceptions that may occur while reading a text file, including OSError and UnicodeError, making it more robust and safe. The function previously caught specific exceptions such as FileNotFoundError, UnicodeDecodeError, and PermissionError. Additionally, the codebase has been improved with updated unit tests, ensuring that the new functionality works correctly. The linting parts of the code have also been updated, enhancing the readability and maintainability of the project for other software engineers. A new method, safe_read_text, has been added to the source_code module, with several new test cases designed to ensure that the method handles edge cases correctly, such as when the file does not exist, when the path is a directory, or when an OSError occurs. These changes make the open-source library more reliable and robust for various use cases.
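The broadened exception scope can be sketched like this; the signature is illustrative rather than UCX's exact one. OSError already covers FileNotFoundError, PermissionError, and IsADirectoryError, and UnicodeError covers UnicodeDecodeError:

```python
from pathlib import Path

# Hedged sketch of the broadened guard described above: any I/O or decoding
# failure yields None instead of propagating.
def safe_read_text(path: Path, size: int = -1):
    try:
        with path.open("r", encoding="utf-8") as f:
            return f.read(size)
    except (OSError, UnicodeError):
        return None
```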
  • Case sensitive/insensitive table validation (#3580). In this release, the library has been updated to enable more flexible and customizable metadata comparison for tables. A case sensitive flag has been introduced for metadata comparison, which allows for consideration or ignoring of column name case during validation. The TableMetadataRetriever abstract base class now includes a new parameter column_name_transformer in the get_metadata method, which is a callable that can be used to transform column names as needed for comparison. Additionally, a new case_sensitive parameter has been added to the StandardSchemaComparator constructor to determine whether column names should be compared case sensitively or not. A new parametrized test function test_schema_comparison_case has also been included to ensure that this functionality works as expected. These changes provide users with more control over the metadata comparison process and improve the library's handling of cases where column names in the source and target tables may have different cases.
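The core idea, a transformer callable applied to column names before comparison, can be sketched as below; the function name and shape are illustrative, not the StandardSchemaComparator API:

```python
# Hedged sketch: when case_sensitive is False, column names are lowercased
# (an assumed transformer) before the two sides are compared.
def compare_columns(source_cols, target_cols, case_sensitive=True):
    transformer = (lambda name: name) if case_sensitive else str.lower
    return sorted(map(transformer, source_cols)) == sorted(map(transformer, target_cols))
```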
  • Catch AttributeError in InferredValue._safe_infer_internal (#3684). In this release, we have implemented a change to the _safe_infer_internal method in the InferredValue class to catch AttributeError. This change addresses an issue in the Astroid library reported in their GitHub repository (pylint-dev/astroid#2683) and resolves issue #3659 in our project. By handling AttributeError during the inference process, we have made the code more robust and safer. When an exception occurs, an error message is logged with debug-level logging, and the method yields the Uninferable sentinel value to indicate that inference failed for the node. This enhancement strengthens the source code linting code through value inference in our open-source library.
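The guard-and-sentinel pattern described above can be sketched without astroid; Uninferable below is a stand-in for astroid's sentinel, and the function shape is illustrative:

```python
import logging

logger = logging.getLogger(__name__)

# Stand-in for astroid's Uninferable sentinel (an assumption for
# self-containment; the real code yields astroid's own sentinel).
Uninferable = object()

def safe_infer(infer):
    """Yield inferred values; on AttributeError, log and yield the sentinel."""
    try:
        yield from infer()
    except AttributeError as e:
        logger.debug(f"Inference failed: {e}")
        yield Uninferable
```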
  • Document to run validate-groups-membership before groups migration, not after (#3631). In this release, we have updated the order of executing the validate-groups-membership command in the group migration process. Previously, the command was recommended to be run after the groups migration, but it has been updated to be executed before the migration. This change ensures that the groups have the correct membership and the number of groups and users in the workspace and account are the same before migration, providing an extra level of safety. Additionally, we have updated the remove-workspace-local-backup-groups command to remove workspace-level backup groups and their permissions only after confirming the successful migration of all groups. We have also updated the spelling of the validate-group-membership command to validate-groups-membership in a documentation file. This release is aimed at software engineers who are adopting the project and looking to migrate their groups to the account level.
  • Extend code migration progress documentation (#3588). In this documentation update, we have added two new sections, Code Migration and "Final details," to the open-source library's migration process documentation. The Code Migration section provides a detailed walkthrough of the steps to migrate code after completing table migration and data reconciliation, including using the linter to investigate compatibility issues and linted workspace resources. The linter advice provides codes and messages on detected issues and resolution methods. The migrated code can then be prioritized and tracked using the migration-progress dashboard, and migrated using the migrate- commands. The Final details section outlines the steps to take once code migration is complete, including running the cluster-remap command to remap clusters to be Unity Catalog compatible. This update resolves issue #2231 and includes updated user documentation, with new methods for linting and migrating local code, managing dashboard migrations, and syncing workspace information. Additional commands for creating and validating table mappings, migrating locations, and assigning metastores are also included, with the aim of improving the code migration process by providing more detailed documentation and new commands for managing the migration.
  • Fixed Skip/Unskip sch...

v0.55.0

24 Jan 15:36
c3ad142
  • Introducing UCX docs! (#3458). In this release, we introduce the new documentation for UCX; you can find it here: https://databrickslabs.github.io/ucx/
  • Hosted Runner for release (#3532). In this release, we have made improvements to the release job's security and control by moving the release.yml file to a new location within a hosted runner group labeled "linux-ubuntu-latest." This change ensures that the release job now runs in a protected runner group, enhancing the overall security and reliability of the release process. The job's environment remains set to "release," and it retains the same authentication and artifact signing permissions as before the move, ensuring a seamless transition while improving the security and control of the release process.

Contributors: @sundarshankar89, @renardeinside

v0.54.0

23 Jan 22:16
ebe97e0
  • Implement disposition field in SQL backend (#3477). This commit adds a query_statement_disposition configuration option for the SQL backend in the UCX tool, allowing users to specify the disposition of SQL statements during assessment results export and preventing failures when dealing with large workspaces and a large number of findings. The new configuration option is added to the config.yml file and used by the SqlBackend definition. The databricks labs install ucx and databricks labs ucx export-assessment commands have been modified to support this new functionality. A new Disposition enum has been added to the databricks.sdk.service.sql module. This change resolves issue #3447 and is related to pull request #3455. The functionality has been manually tested.

  • AWS role issue with external locations pointing to the root of a storage account (#3510). The AWSResources class in the aws.py file has been updated to enhance the regular expression pattern for matching S3 bucket names, now including an optional group for trailing slashes and any subsequent characters. This allows for recognition of external locations pointing to the root of a storage account, addressing issue #3505. The access.py file within the AWS module has also been updated, introducing a new path variable and updating a for loop condition to accurately identify missing paths in external locations referencing the root of a storage account. New unit tests have been added to tests/unit/aws/test_access.py, including a test_uc_roles_create_all_roles method that checks the creation of all possible UC roles when none exist and external locations with and without folders. Additionally, the backend fixture has been updated to include a new external location s3://BUCKET4, and various tests have been updated to incorporate this location and handle errors appropriately.
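The regex change described above can be sketched as follows; the pattern is a hedged reconstruction, not UCX's exact expression in aws.py:

```python
import re

# Illustrative S3 location pattern: the trailing slash-and-path group is
# optional, so s3://bucket (the root of a storage account) matches too.
_S3_PATTERN = re.compile(r"^s3a?://(?P<bucket>[a-z0-9.-]+)(?:/(?P<path>.*))?$")

def parse_s3_location(location: str):
    match = _S3_PATTERN.match(location)
    if not match:
        return None
    return match.group("bucket"), match.group("path") or ""
```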

  • Added assert to make sure installation is finished before re-installation (#3546). In this release, we have added an assertion to ensure that the installation process is completed before attempting to reinstall, addressing a previous issue where the reinstallation was starting before the first installation was finished, causing a warning to not be raised and resulting in a test failure. We have introduced a new function wait_for_installation_to_finish(), which retries loading the installation if it is not found, with a timeout of 2 minutes. This function is utilized in the test_compare_remote_local_install_versions test to ensure that the installation is finished before proceeding. Furthermore, we have extracted the warning message to a variable error_message for better readability. This change enhances the reliability of the installation process.
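A retry loop in the spirit of wait_for_installation_to_finish can be sketched as below; the loader callable and the not-found exception type are parameterized here as assumptions, since the real helper calls UCX's installation loader:

```python
import time

# Illustrative retry: keep calling the loader until it succeeds or the
# timeout (2 minutes in the PR described above) expires.
def wait_for_installation_to_finish(load, timeout=120.0, interval=1.0, not_found=FileNotFoundError):
    deadline = time.monotonic() + timeout
    while True:
        try:
            return load()
        except not_found:
            if time.monotonic() >= deadline:
                raise  # installation never appeared within the timeout
            time.sleep(interval)
```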

  • Added dashboards to migration progress dashboard (#3314). This commit introduces significant updates to the migration progress dashboard, adding dashboards, linting resources, and modifying existing components. The changes include a new dashboard displaying the number of dashboards pending migration, with the data sourced from the ucx_catalog.multiworkspace.objects_snapshot table. The existing 'Migration [main]' dashboard has been updated, and unit and integration tests have been adapted accordingly. The commit also renames several SQL files, updates the percentage UDF, grant, job, cluster, table, and pipeline migration progress queries, and resolves linting compatibility issues related to Unity Catalog. The changes depend on issue #3424, progress issue #3045, and break up issue #3112. The new dashboard aims to enhance the migration process and ensure a smooth transition to the Unity Catalog.

  • Added history log encoder for dashboards (#3424). A new history log encoder for dashboards has been added, addressing issues #3368 and #3369, and modifying the existing experimental-migration-progress workflow. This update includes the addition of the DashboardOwnership class, used to generate ownership information for dashboards, and the DashboardProgressEncoder class, responsible for encoding progress data related to dashboards. The new functionality is tested through manual, unit, and integration testing. In the Table class, the from_table_info and from_historical_data methods have been added, allowing for the creation of Table instances from TableInfo objects and historical data dictionaries with more flexibility and safety. The test_tables.py file in the integration/progress directory has also been updated to include a new test function for checking table failures. These changes improve the tracking and management of dashboard IDs, enhance user name retrieval, and ensure the accurate determination of object ownership.

  • Create specific failure for Python syntax error while parsing with Astroid (#3498). This commit enhances the Python linting functionality in our open-source library by introducing a specific failure message, python-parse-error, for syntax errors encountered during code parsing using Astroid. Previously, a generic system-error message was used, which has been renamed to maintain consistency with the existing sql-parse-error message. This change provides clearer failure indicators and includes more detailed information about the error location. Additionally, modifications to Python linting-related code, unit test additions, and updates to the README guide users on handling these new error types have been implemented. A new method, Tree.maybe_parse(), has been introduced to parse Python code and detect syntax errors, ensuring more precise error handling for users.
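The specific-failure idea can be sketched with the stdlib ast module standing in for Astroid (an assumption for self-containment); the failure shape below is illustrative, not UCX's actual Failure type:

```python
import ast

# Hedged sketch of Tree.maybe_parse(): parsing failures produce a specific
# python-parse-error record with the error location, not a generic error.
def maybe_parse(source: str):
    try:
        return ast.parse(source), None
    except SyntaxError as e:
        failure = {
            "code": "python-parse-error",
            "message": f"Failed to parse code due to invalid syntax: {e.msg}",
            "line": e.lineno or 0,
        }
        return None, failure
```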

  • DBR 16 and later support (#3481). This pull request introduces support for Databricks Runtime (DBR) 16 and later in the code that converts Hive Metastore (HMS) tables to external tables within the migrate-tables workflow. The changes include the addition of a new static method _get_entity_storage_locations to handle the new entityStorageLocations property in DBR16 and the modification of the _convert_hms_table_to_external method to account for this property. Additionally, the run_workflow function in the assessment workflow now has the skip_job_wait parameter set to True, which allows the workflow to continue running even if a job within it fails. The changes have been manually tested for DBR16, verified in a staging environment, and existing integration tests have been run for DBR 15. The diff also includes updates to the test_table_migration_convert_manged_to_external method to skip job waiting during testing, enabling the test to run successfully on DBR 16.

  • Delete stale code: NotebookLinter._load_source_from_run_cell (#3529). In this update, we have removed the stale code NotebookLinter._load_source_from_run_cell, which was responsible for loading the source code from a run cell in a notebook. This change is a part of the ongoing effort to address issue #3514 and enhances the overall codebase. Additionally, we have modified the existing databricks labs ucx lint-local-code command to update the code linting functionality. We have conducted manual testing to ensure that the changes function as intended and have added and modified several unit tests. The _load_source_from_run_cell method is no longer needed, as it was part of a deprecated functionality. The modifications to the databricks labs ucx lint-local-code command impact the way code linting is performed, ultimately improving the efficiency and maintainability of the codebase.

  • Exclude ucx dashboards from Lakeview dashboard crawler (#3450). In this release, we have enhanced the lakeview_crawler method in the open-source library to exclude Ucx dashboards and prevent false positives. This has been achieved by adding a new optional argument, exclude_dashboard_ids, to the init method, which takes a list of dashboard IDs to exclude from the crawler. The _crawl method has been updated to skip dashboards whose IDs match the ones in the exclude_dashboard_ids list. The change includes unit tests and manual testing to ensure proper functionality and has been verified on the staging environment. These updates improve the accuracy and reliability of the dashboard crawler, providing better results for software engineers utilizing this library.

  • Fixed issue in installing UCX on UC enabled workspace (#3501). This PR introduces changes to the ClusterPolicyInstaller class, updating the spark_version policy definition from a fixed value to an allowlist with a default value. This resolves an issue where, when UC is enabled on a workspace, the cluster definition takes on single_user and user_isolation values instead of Legacy_Single_User and Legacy_Table_ACL. The job definition is also updated to use the default value when not explicitly provided. These changes improve compatibility with UC-enabled workspaces, ensuring the correct values for spark_version in the cluster definition. The PR includes updates to unit tests and installation tests, addressing issue #3420.

  • Fixed typo in workflow name (in error message) (#3491). This PR (Pull Request) addresses a minor typo in the error message displayed by the validate_groups_permissions method in the workflows.py file. The typo occurred in the workflow name mentioned in the error message, where groups was incorrectly spelled as "group". The corrected name is now validate-groups-permissions. This change does not introduce any new methods or modify any existing functionality, but instead focuses on enhancing the...


v0.53.1

30 Dec 16:57
a77ca8b
  • Removed packaging package dependency (#3469). In this release, we have removed the dependency on the packaging package in the open-source library to address a release issue. The import statements for "packaging.version.Version" and "packaging.version.InvalidVersion" have been removed. The function _external_hms in the federation.py file has been updated to retrieve the Hive Metastore version using the "spark.sql.hive.metastore.version" configuration key and validate it using a regular expression pattern. If the version is not valid, the function logs an informational message and returns None. This change modifies the Hive Metastore version validation logic and improves the overall reliability and maintainability of the library.
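The packaging-free version check can be sketched as follows; the exact pattern UCX uses may differ, so treat this as an assumed shape:

```python
import re

# Hedged sketch of the validation described above: a regex over
# "major.minor[.patch...]" replaces packaging.version.Version.
_HMS_VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)(\.\d+.*)?$")

def parse_hms_version(raw: str):
    match = _HMS_VERSION_PATTERN.match(raw)
    if not match:
        return None  # the real code logs an informational message here
    return f"{match.group(1)}.{match.group(2)}"
```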

Contributors: @FastLee

v0.53.0

23 Dec 18:28
dcfe27e
  • Added dashboard crawlers (#3397). The open-source library has been updated with new dashboard crawlers for the assessment workflow, Redash migration, and QueryLinter. These crawlers are responsible for crawling and persisting dashboards, as well as migrating or reverting them during Redash migration. They also lint the queries of the crawled dashboards using QueryLinter. This change resolves issues #3366 and #3367, and progresses #2854. The 'databricks labs ucx {migrate-dbsql-dashboards|revert-dbsql-dashboards}' command and the assessment workflow have been modified to incorporate these new features. Unit tests and integration tests have been added to ensure proper functionality of the new dashboard crawlers. Additionally, two new tables, $inventory.redash_dashboards and $inventory.lakeview_dashboards, have been introduced to hold a list of all Redash or Lakeview dashboards and are used by the QueryLinter and Redash migration. These changes improve the assessment, migration, and linting processes for dashboards in the library.
  • DBFS Root Support for HMS Federation (#3425). The commit DBFS Root Support for HMS Federation introduces changes to support the DBFS root location for HMS federation. A new method, external_locations_with_root, is added to the ExternalLocations class to return a list of external locations including the DBFS root location. This method is used in various functions and test cases, such as test_create_uber_principal_no_storage, test_create_uc_role_multiple_raises_error, test_create_uc_no_roles, test_save_spn_permissions, and test_create_access_connectors_for_storage_accounts, to ensure that the DBFS root location is correctly identified and tested in different scenarios. Additionally, the external_locations.snapshot.return_value is changed to external_locations.external_locations_with_root.return_value in test functions test_create_federated_catalog and test_already_existing_connection to retrieve a list of external locations including the DBFS root location. This commit closes issue #3406, which was related to this functionality. Overall, these changes improve the handling and testing of DBFS root location in HMS federation.
  • Log message as error when legacy permissions API is enabled/disabled depending on the workflow ran (#3443). In this release, logging behavior has been updated in several methods in the 'workflows.py' file. When the use_legacy_permission_migration configuration is set to False and specific conditions are met, error messages are now logged instead of info messages for the methods 'verify_metastore_attached', 'rename_workspace_local_groups', 'reflect_account_groups_on_workspace', 'apply_permissions_to_account_groups', 'apply_permissions', and 'validate_groups_permissions'. This change is intended to address issue #3388 and provides clearer guidance to users when the legacy permissions API is not functioning as expected. Users will now see an error message advising them to run the migrate-groups job or set use_legacy_permission_migration to True in the config.yml file. These updates will help ensure smoother workflow runs and more accurate logging for better troubleshooting.
  • MySQL External HMS Support for HMS Federation (#3385). This commit adds support for MySQL-based Hive Metastore (HMS) in HMS Federation, enhances the CLI for creating a federated catalog, and improves external HMS functionality. It introduces a new parameter enable_hms_federation in the Locations class constructor, allowing users to enable or disable MySQL-based HMS federation. The external_locations method in application.py now accepts enable_hms_federation as a parameter, enabling more granular control of the federation feature. Additionally, the CLI for creating a federated catalog has been updated to accept a prompts parameter, providing more flexibility. The commit also introduces a new dataclass ExternalHmsInfo for external HMS connection information and updates the HiveMetastoreFederationEnabler and HiveMetastoreFederation classes to support non-Glue external metastores. Furthermore, it adds methods to handle the creation of a Federated Catalog from the command-line interface, split JDBC URLs, and manage external connections and permissions.
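The JDBC URL splitting mentioned above can be sketched like this. The field names on `ExternalHmsInfo` are assumptions for illustration; the real dataclass in UCX may differ:

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class ExternalHmsInfo:
    # Hypothetical field names; the real UCX dataclass may differ.
    database_type: str
    host: str
    port: int
    database: str


def split_jdbc_url(jdbc_url: str) -> ExternalHmsInfo:
    """Split a JDBC URL such as jdbc:mysql://host:3306/metastore into parts."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError(f"not a JDBC URL: {jdbc_url}")
    # Strip the jdbc: prefix so urlparse sees a regular scheme://host URL.
    parsed = urlparse(jdbc_url[len("jdbc:"):])
    return ExternalHmsInfo(
        database_type=parsed.scheme,
        host=parsed.hostname or "",
        port=parsed.port or 3306,
        database=parsed.path.lstrip("/"),
    )


info = split_jdbc_url("jdbc:mysql://hms.example.com:3306/metastore")
print(info)
```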
  • Skip listing built-in catalogs to update table migration process (#3464). In this release, the migration process for updating tables in the Hive Metastore has been optimized with the introduction of the TableMigrationStatusRefresher class, which inherits from CrawlerBase. This new class includes modifications to the _iter_schemas method, which now filters out built-in catalogs and schemas when listing catalogs and schemas, thereby skipping unnecessary processing during the table migration process. Additionally, the get_seen_tables method has been updated to include checks for schema.name and schema.catalog_name, and the _crawl and _try_fetch methods have been modified to reflect changes in the TableMigrationStatus constructor. These changes aim to improve the efficiency and performance of the migration process by skipping built-in catalogs and schemas. The release also includes modifications to the existing migrate-tables workflow and adds unit tests that demonstrate the exclusion of built-in catalogs during the table migration status update process. The test case utilizes the CatalogInfoSecurableKind enumeration to specify the kind of catalog and verifies that the seen tables only include the non-builtin catalogs. These changes should prevent unnecessary processing of built-in catalogs and schemas during the table migration process, leading to improved efficiency and performance.
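A minimal sketch of the catalog filter is below. The real code inspects `CatalogInfoSecurableKind` on the `CatalogInfo` objects; filtering by name here is an assumption that keeps the sketch self-contained, and the set of built-in names is illustrative:

```python
# Illustrative list of built-in catalog names; the real filter checks the
# securable kind reported by the Databricks SDK rather than the name.
BUILTIN_CATALOGS = {"system", "samples", "__databricks_internal"}


def user_catalogs(catalog_names: list[str]) -> list[str]:
    # Built-in catalogs ship with every workspace and never hold migrated
    # tables, so they are skipped before listing schemas and tables.
    return [name for name in catalog_names if name not in BUILTIN_CATALOGS]


print(user_catalogs(["system", "samples", "main", "finance"]))
# → ['main', 'finance']
```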
  • Updated databricks-sdk requirement from <0.39,>=0.38 to >=0.39,<0.40 (#3434). In this release, the databricks-sdk requirement in the pyproject.toml file has been updated to at least 0.39 and below 0.40, picking up the latest 0.39.x release while excluding 0.40 and above. This change is based on the release notes and changelog for version 0.39 of the package, which includes bug fixes, internal changes, and API changes such as the addition of the cleanrooms package, a delete() method for workspace-level services, and new fields on various request and response objects. The commit history for the package is also provided. Dependabot has been configured to resolve any conflicts with this PR and can be triggered manually as needed, including to ignore specific dependency versions or close the PR.
  • Updated databricks-sdk requirement from <0.40,>=0.39 to >=0.39,<0.41 (#3456). In this pull request, the version range of the databricks-sdk dependency has been updated from '<0.40,>=0.39' to '>=0.39,<0.41', allowing the use of the latest version of the databricks-sdk while ensuring that it is less than 0.41. The pull request also includes release notes detailing the API changes in version 0.40.0, such as the addition of new fields to various compute, dashboard, job, and pipeline services. A changelog is provided, outlining the bug fixes, internal changes, new features, and improvements in versions 0.39.0, 0.40.0, and 0.38.0. A list of commits is also included, showing the development progress of these versions.
  • Use LTS Databricks runtime version (#3459). This release switches the Databricks runtime to a Long-Term Support (LTS) release to address issues encountered during the migration to external tables. The previous runtime version caused the convert-to-external-table migration strategy to fail, and this change serves as a temporary workaround. The migrate-tables workflow has been modified, and existing integration tests have been reused to verify the behaviour. The test_job_cluster_policy function now uses the LTS version instead of the latest version, ensuring a specified Spark version for the cluster policy, and checks for a matching node type ID, Spark version, and necessary resources. However, users may still encounter problems with the latest UCX release. The _convert_hms_table_to_external method in the table_migrate.py file has been updated to return a boolean value, with a new TODO comment about a possible failure with Databricks Runtime 16.0 due to a JDK update.
  • Use CREATE_FOREIGN_CATALOG instead of CREATE_FOREIGN_SECURABLE with HMS federation enablement commands (#3309). A change has been made to update the databricks-sdk dependency version from >=0.38,<0.39 to >=0.39 in the pyproject.toml file, which may affect the project's functionality related to the databricks-sdk library. In the Hive Metastore Federation codebase, CREATE_FOREIGN_CATALOG is now used instead of CREATE_FOREIGN_SECURABLE for HMS federation enablement commands, aligned with issue #3308. The _add_missing_permissions_if_needed method has been updated to check for CREATE_FOREIGN_SECURABLE instead of CREATE_FOREIGN_CATALOG when granting permissions. Additionally, a unit test file for HiveMetastore Federation has ...

v0.52.0

12 Dec 14:42
136c536
  • Added handling for Databricks errors during workspace listings in the table migration status refresher (#3378). In this release, we have implemented changes to enhance error handling and improve the stability of the table migration status refresher in the open-source library. We have resolved issue #3262, which addressed Databricks errors during workspace listings. The assessment workflow has been updated, and new unit tests have been added to ensure proper error handling. The changes include the import of DatabricksError from the databricks.sdk.errors module and the addition of a new method _iter_catalogs to list catalogs with error handling for DatabricksError. The _iter_schemas method now replaces _ws.catalogs.list() with self._iter_catalogs(), also including error handling for DatabricksError. Furthermore, new unit tests have been developed to check the logging of the TableMigration class when listing tables in the Databricks workspace, focusing on handling errors during catalog, schema, and table listings. These changes improve the library's robustness and ensure that it can gracefully handle errors during the table migration status refresher process.
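The error-handling pattern described above can be sketched as a generator that swallows platform errors while listing; the `DatabricksError` class below is a stand-in for `databricks.sdk.errors.DatabricksError`, and the function shape is illustrative:

```python
import logging

logger = logging.getLogger("ucx.migration_status")


class DatabricksError(Exception):
    """Stand-in for databricks.sdk.errors.DatabricksError."""


def iter_catalogs(list_catalogs):
    # Sketch of the handling: a platform error while listing catalogs is
    # logged and swallowed so the status refresher keeps running.
    try:
        yield from list_catalogs()
    except DatabricksError:
        logger.error("Cannot list catalogs", exc_info=True)


def failing():
    raise DatabricksError("workspace unavailable")


print(list(iter_catalogs(failing)))  # → []
```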
  • Convert READ_METADATA to UC BROWSE permission for tables, views and database (#3403). The uc_grant_sql method in the grants.py file has been modified to convert READ_METADATA permissions to BROWSE permissions for tables, views, and databases. This change involves adding new entries to the dictionary used to map permission types to their corresponding UC actions and has been manually tested. The behavior of the grant_loader function in the hive_metastore module has also been modified to change the action type of a grant from READ_METADATA to EXECUTE for a specific case. Additionally, the test_grants.py unit test file has been updated to include a new test case that verifies the conversion of READ_METADATA to BROWSE for a grant on a database and handles the conversion of READ_METADATA permission to UC BROWSE for a new udf="function" parameter. These changes resolve issue #2023 and have been tested through manual testing and unit tests. No new methods have been added, and existing functionality has been changed in a limited scope. No new unit or integration tests have been added as it is assumed that the existing tests will continue to pass after these changes have been made.
  • Migrates Pipelines crawled during the assessment phase (#2778). A new utility class, PipelinesMigrator, has been introduced in this release to facilitate the migration of Delta Live Tables (DLT) pipelines. This class is used in a new workflow that clones DLT pipelines crawled during the assessment phase, with specific configurations, to new Unity Catalog (UC) pipelines. The migration can be skipped for certain pipelines by specifying their pipeline IDs in a list. Three test scenarios, each with different pipeline specifications, are defined to ensure the migration works under various conditions; the class and the migration process are tested with manual testing, unit tests, and integration tests, with no reliance on a staging environment. The migration process takes into account the WorkspaceClient, WorkspaceContext, AccountClient, and a flag for running the command as a collection, and the PipelinesMigrator class uses a PipelinesCrawler and JobsCrawler to perform the migration. The commit also introduces a new CLI command, migrate_dlt_pipelines, to the ucx package. The tests cover an installation with two jobs, test and assessment, with job IDs 123 and 456 respectively; the state of the installation is recorded in a state.json file, and a configuration file, pipeline_mapping.csv, maps the source pipeline ID to the target catalog, schema, pipeline, and workspace names.
  • Removed try-except around verifying the migration progress prerequisites in the migrate-tables cli command (#3439). In the latest release, the ucx package's migrate-tables CLI command handles progress-tracking prerequisites differently: the try-except block around the verification has been removed, so the RuntimeWarning is now propagated, providing a more specific and helpful error message. If the prerequisites are not met, the verify method raises an exception and the migration does not proceed. This change enhances the accuracy of error messages for users and ensures that the prerequisites for migration are properly met. The tests for migrate_tables have been updated accordingly, including a new test case, test_migrate_tables_errors_out_before_assessment, that checks that the migration does not proceed when verification fails. This change affects the existing databricks labs ucx migrate-tables command and brings improved precision and reliability to the migration process.
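The propagation behaviour can be sketched as follows; `VerifyProgressTracking` here is a hypothetical stand-in for UCX's prerequisites verifier, and the message wording is illustrative:

```python
class VerifyProgressTracking:
    """Hypothetical stand-in for UCX's migration-progress prerequisites check."""

    def __init__(self, metastore_attached: bool):
        self._metastore_attached = metastore_attached

    def verify(self) -> None:
        if not self._metastore_attached:
            raise RuntimeWarning(
                "Attach a UC metastore and create a UCX catalog before migrating tables"
            )


def migrate_tables(verifier: VerifyProgressTracking) -> str:
    # The exception now propagates to the caller instead of being swallowed,
    # so the CLI surfaces the specific prerequisite that failed.
    verifier.verify()
    return "migration started"


try:
    migrate_tables(VerifyProgressTracking(metastore_attached=False))
except RuntimeWarning as e:
    print(f"blocked: {e}")
```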
  • Removed redundant internal methods from create_account_group (#3395). In this change, the create_account_group function's redundant internal methods have been removed, and its signature has been modified to retrieve the workspace IDs from AccountWorkspaces._workspaces() instead of receiving them as a parameter. This resolves issue #3170 and improves code efficiency by removing unnecessary parameters and methods. The AccountWorkspaces class now accepts a list of workspace IDs upon instantiation, enhancing code readability and eliminating redundancy. The function has been tested with unit tests, ensuring it creates a group if it doesn't exist, throws an exception if the group already exists, filters system groups, and handles cases where a group already has the required number of members in a workspace. These changes simplify the codebase, eliminate redundancy, and improve the maintainability of the project.
  • Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.34 (#3407). In this release, the sqlglot requirement has been widened from >=25.5.0,<25.33 to >=25.5.0,<25.34, allowing the 25.33.x line with its bug fixes and new features. In v25.33.0, there were two breaking changes: the TIMESTAMP data type now maps to Type.TIMESTAMPTZ, and the NEXT keyword is now treated as a function keyword. Several new features were also introduced, including support for generated columns in PostgreSQL and the ability to preserve tables in the replace_table method, along with bug fixes for issues related to BigQuery, Presto, and Spark. The v25.32.1 release contained two bug fixes related to BigQuery and one related to Presto. Furthermore, v25.32.0 had three breaking changes: support for ATTACH/DETACH statements, tokenization of hints as comments, and a fix to datetime coercion in the canonicalize rule. It also introduced new features, such as support for TO_TIMESTAMP* variants in Snowflake and improved error messages in the Redshift transpiler, plus several bug fixes for issues related to SQL Server, MySQL, and PostgreSQL.
  • Updated sqlglot requirement from <25.33,>=25.5.0 to >=25.5.0,<25.35 (#3413). In this release, the sqlglot dependency has been updated from a version range that allows up to 25.33, but excludes 25.34, to a version range that allows 25.5.0 and above, but excludes 25.35. This update was made to enable the latest version of sqlglot, which includes one breaking change related to the alias expansion of USING STRUCT fields. This version also introduces two new features, an optimization for alias expansion of USING STRUCT fields, and support for generated columns in PostgreSQL. Additionally, two bug fixes were implemented, addressing proper consumption of dashed table parts and removal of parentheses from CURRENT_USER in Presto. The update also includes a fix to make TIMESTAMP map to Type.TIMESTAMPTZ, a fix to parse DEFAULT in VALUES clause into a Var, and changes to the BigQuery and Snowflake dialects to improve transpilation and JSONPathTokenizer leniency. The commit message includes a reference to issue [#3413](https://github.com/databrickslabs/ucx/issues/3413) and a link to the sqlglot changelog for further reference.
  • Updated sqlglot requirement from <25.35,>=25.5.0 to >=25.5.0,<26.1 (#3433). In this release, the required version of the sqlglot library has been updated to a range that includes version 25.5.0 but excludes version 26.1. This change is needed because of the breaking changes introduced in sqlglot v26.0.0, which are not yet compatible with this project. The commit message includes the changelog for sqlglot v26.0.0, highlighting the breaking changes, new features, bug fixes, and other modifications in that version, along with the list of commits merged into the sqlglot repository. This change maintains compatibility with sqlglot, but thorough testing is advised to ensure the updated version does n...

v0.51.0

02 Dec 20:39
b422e78
  • Added assign-owner-group command (#3111). The Databricks Labs UCX tool now includes a new assign-owner-group command, allowing users to assign an owner group to the workspace. This group will be designated as the owner for all migrated tables and views, providing better control and organization of resources. The command can be executed in the context of a specific workspace or across multiple workspaces. The implementation includes new classes, methods, and attributes in various files, such as cli.py, config.py, and groups.py, enhancing ownership management functionality. The assign-owner-group command replaces the functionality of issue #3075 and addresses issue #2890, ensuring proper schema ownership and handling of crawled grants. Developers should be aware that running the migrate-tables workflow will result in assigning a new owner group for the Hive Metastore instance in the workspace installation.
  • Added opencensus to known list (#3052). In this release, we have added OpenCensus to the list of known libraries in our configuration file. OpenCensus is a popular set of tools for distributed tracing and monitoring, and its inclusion in our system will enhance support and integration for users who utilize this tool. This change does not affect existing functionality, but instead adds a new entry in the configuration file for OpenCensus. This enhancement will allow our library to better recognize and work with OpenCensus, enabling improved performance and functionality for our users.
  • Added default owner group selection to the installer (#3370). A new class, AccountGroupLookup, has been introduced to select the default owner group during the installer process, following up on #3111. This class uses the workspace_client to determine the owner group, and a pick_owner_group method prompts the user for a selection when necessary. The ownership selection process has been improved with the addition of a check in the installer's _static_owner method to determine whether the current user is part of the default owner group. The GroupManager class has been updated to use the new AccountGroupLookup class and its methods, pick_owner_group and validate_owner_group. A new variable, default_owner_group, is introduced in the ConfigureGroups class to configure groups during installation based on user input. The installer now includes a unit test, test_configure_with_default_owner_group, to demonstrate how it sets expected workspace configuration values when a default owner group is specified during installation.
  • Added handling for non UTF-8 encoded notebook error explicitly (#3376). A new enhancement has been implemented to address the issue of non-UTF-8 encoded notebooks failing to load by introducing explicit error handling for this case. A UnicodeDecodeError exception is now caught and logged as a warning, while the notebook is skipped and returned as None. This change is implemented in the load_dependency method in the loaders.py file, which is a part of the assessment workflow. Additionally, a new unit test has been added to verify the behavior of this change, and the assessment workflow has been updated accordingly. The new test function in test_loaders.py checks for different types of exceptions, specifically PermissionError and UnicodeDecodeError, ensuring that the system can handle notebooks with non-UTF-8 encoding gracefully. This enhancement resolves issue #3374, thereby improving the overall robustness of the application.
  • Added migration progress documentation (#3333). In this release, we have updated the migration-progress-experimental workflow to track the migration progress of a subset of inventory tables related to workspace resources being migrated to Unity Catalog (UCX). The workflow updates the inventory tables and tracks the migration progress in the UCX catalog tables. To use this workflow, users must attach a UC metastore to the workspace, create a UCX catalog, and ensure that the assessment job has run successfully. The Migration Progress section in the documentation has been updated with a new markdown file that provides details about the migration progress, including a migration progress dashboard and an experimental migration progress workflow that generates historical records of inventory objects relevant to the migration progress. These records are stored in the UCX UC catalog, which contains a historical table with information about the object type, object ID, data, failures, owner, and UCX version. The migration process also tracks dangling Hive or workspace objects that are not referenced by business resources, and the progress is persisted in the UCX UC catalog, allowing for cross-workspace tracking of migration progress.
  • Added note about running assessment once (#3398). In this release, we have introduced an update to the UCX assessment workflow, which will now only be executed once and will not update existing results in repeated runs. To accommodate this change, we have updated the README file with a note clarifying that the assessment workflow is a one-time process. Additionally, we have provided instructions on how to update the inventory and findings by uninstalling and reinstalling the UCX. This will ensure that the inventory and findings for a workspace are up-to-date and accurate. We recommend that software engineers take note of this change and follow the updated instructions when using the UCX assessment workflow.
  • Allowing skipping TACLs migration during table migration (#3384). A new optional flag, "skip_tacl_migration", has been added to the configuration file, providing users with more flexibility during migration. This flag allows users to control whether or not to skip the Table Access Control Language (TACL) migration during table migrations. It can be set when creating catalogs and schemas, as well as when migrating tables or using the migrate_grants method in application.py. Additionally, the install.py file now includes a new variable, skip_tacl_migration, which can be set to True during the installation process to skip TACL migration. New test cases have been added to verify the functionality of skipping TACL migration during grants management and table migration. These changes enhance the flexibility of the system for users managing table migrations and TACL operations in their infrastructure, addressing issues #3384 and #3042.
  • Bump databricks-sdk and databricks-labs-lsql dependencies (#3332). In this update, the databricks-sdk and databricks-labs-lsql dependencies are upgraded to versions 0.38 and 0.14.0, respectively. The databricks-sdk update addresses conflicts, bug fixes, and introduces new API additions and changes, notably impacting methods like create(), execute_message_query(), and others in workspace-level services. While databricks-labs-lsql updates ensure compatibility, its changelog and specific commits are not provided. This pull request also includes ignore conditions for the databricks-sdk dependency to prevent future Dependabot requests. It is strongly advised to rigorously test these updates to avoid any compatibility issues or breaking changes with the existing codebase. This pull request mirrors another (#3329), resolving integration CI issues that prevented the original from merging.
  • Explain failures when cluster encounters Py4J error (#3318). In this release, we have made significant improvements to the error handling mechanism in our open-source library. Specifically, we have addressed issue #3318, which involved handling failures when the cluster encounters Py4J errors in the databricks/labs/ucx/hive_metastore/tables.py file. We have added code to raise noisy failures instead of swallowing the error with a warning when a Py4J error occurs. The functions _all_databases() and _list_tables() have been updated to check if the error message contains "py4j.security.Py4JSecurityException", and if so, log an error message with instructions to update or reinstall UCX. If the error message does not contain "py4j.security.Py4JSecurityException", the functions log a warning message and return an empty list. These changes also resolve the linked issue #3271. The functionality has been thoroughly tested and verified on the labs environment. These improvements provide more informative error messages and enhance the overall reliability of our library.
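The distinction between "fail loudly" and "warn and continue" can be sketched like this; the function shape is illustrative, and the real code catches the Spark/Py4J exception type rather than a generic `Exception`:

```python
import logging

logger = logging.getLogger("ucx.tables")


def tables_or_raise(list_tables, database: str) -> list[str]:
    # Sketch: a Py4J security error means UCX lacks cluster permissions and
    # should fail loudly; any other listing error is logged and yields [].
    try:
        return list_tables(database)
    except Exception as e:  # the real code catches the Spark/Py4J error type
        if "py4j.security.Py4JSecurityException" in str(e):
            logger.error("Cannot list tables, please update or reinstall UCX")
            raise
        logger.warning(f"Failed to list tables in {database}: {e}")
        return []


def denied(_):
    raise RuntimeError("py4j.security.Py4JSecurityException: access denied")


try:
    tables_or_raise(denied, "default")
except RuntimeError as e:
    print("raised:", e)
```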
  • Rearranged job summary dashboard columns and make job_name clickable (#3311). In this update, the job summary dashboard columns have been improved and the need for the 30_3_job_details.sql file, which contained a SQL query for selecting job details from the inventory.jobs table, has been eliminated. The dashboard columns have been rearranged, and the job_name column is now clickable, providing easy access to job details via the corresponding job ID. The changes include modifying the...

v0.50.0

18 Nov 14:48
@nfx
2483f3f
  • Added pytesseract to known list (#3235). A new addition has been made to the known.json file, which tracks packages with native code, to include pytesseract, an Optical Character Recognition (OCR) tool for Python. This change improves the handling of pytesseract within the codebase and addresses part of issue #1931, likely concerning the seamless incorporation of pytesseract and its native components. However, specific details on the usage of pytesseract within the project are not provided in the diff. Thus, further context or documentation may be necessary for a complete understanding of the integration. Nonetheless, this commit simplifies and clarifies the codebase's treatment of pytesseract and its native dependencies, making it easier to work with.
  • Added hyperlink to database names in database summary dashboard (#3310). The recent change to the Database Summary dashboard includes the addition of clickable database names, opening a new tab with the corresponding database page. This has been accomplished by adding a linkUrlTemplate property to the database field in the encodings object within the overrides property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue #3258. Furthermore, the display of various other statistics, such as the number of tables, views, and grants, have been improved by converting them to links, enhancing the overall usability and navigation of the dashboard.
  • Bump codecov/codecov-action from 4 to 5 (#3316). In this release, the version of the codecov/codecov-action dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including binary, gcov_args, gcov_executable, gcov_ignore, gcov_include, report_type, skip_validation, and swift_project. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking.
  • Depend on a Databricks SDK release compatible with 0.31.0 (#3273). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new InvalidState error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the pyproject.toml file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
  • Eliminate redundant migration-index refresh and loads during view migration (#3223). In this pull request, we have optimized the view migration process in the databricks/labs/ucx/hive_metastore/table_metastore.py file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class TableMigrationIndex and imported the TableMigrationStatusRefresher class. The _migrate_views method now takes an additional argument migration_index, which is used in the ViewsMigrationSequencer and in the _migrate_view method. The _view_can_be_migrated and _sql_migrate_view methods now also take migration_index as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly.
  • Fixed backwards compatibility breakage from Databricks SDK (#3324). In this release, we have addressed a backwards compatibility issue (Issue #3324) that was caused by an update to the Databricks SDK, by using the new methods in the databricks.sdk.service module to interact with dashboards. Additionally, we have fixed bug #3322 and updated the create function in the conftest.py file to utilize the new dashboards module and its Dashboard class. The function now returns the dashboard object as a dictionary and calls the publish method on this object to publish the dashboard. These changes also include an update to the pyproject.toml file, which affects the test and coverage scripts used in the default environment: the required test-coverage threshold has been lowered from 90% to 89%, and the test command now includes the --cov-fail-under=89 flag to keep coverage above that threshold, as part of our continuous integration and testing process to maintain a high level of code quality.
  • Fixed issue with cleanup of failed create-missing-principals command (#3243). In this update, we have improved the create_uc_roles method within the access.py file of the databricks/labs/ucx/aws directory to handle failures during role creation caused by permission issues. If a failure occurs, the method now deletes any created roles before raising the exception, restoring the system to its initial state. This ensures that the system remains consistent and prevents the accumulation of partially created roles. The update includes a try-except block around the code that creates the role and adds a policy to it, and it logs an error message, deletes any previously created roles, and raises the exception again if a PermissionDenied or NotFound exception is raised during this process. We have also added unit tests to verify the behavior of the updated method, covering the scenario where a failure occurs and the roles are successfully deleted. These changes aim to improve the robustness of the databricks labs ucx create-missing-principals command by handling permission errors and restoring the system to its initial state.
  • Improve error handling for assess_workflows task (#3255). This pull request introduces improvements to the assess_workflows task in the databricks/labs/ucx module, focusing on error handling and logging. A new error type, DatabricksError, has been added to handle Databricks-specific exceptions in the _temporary_copy method, ensuring proper handling and re-raising of Databricks-related errors as InvalidPath exceptions. Additionally, log levels for various errors have been updated to better reflect their severity. Recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors now have their log levels changed from error to warning. These adjustments provide more fine-grained control over error messages' severity and help avoid unnecessary alarm when these issues occur. These changes improve the robustness, error handling, and logging of the assess_workflows task, ensuring appropriate handling and logging of any errors that may occur during execution.
  • Require at least 4 cores for UCX VMs (#3229). In this release, the selection of node_type_id in the policy.py file has been updated to consider a minimum of 4 cores for UCX VMs, in addition to requiring local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the node_type_id parameter. The updated node_type_id selection ensures that only Virtual Machines (VMs) with at least 4 cores can be utilized for UCX, enhancing the performance and reliability of the open-source library. This improvement requires a minimum of 4 cores to function properly.
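The node-type selection rule can be sketched like this; the `NodeType` shape mirrors the SDK's node-type listing loosely, and the tie-breaking key is an assumption:

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    node_type_id: str
    num_cores: int
    memory_mb: int
    local_disk: bool


def smallest_ucx_node(node_types: list[NodeType]) -> NodeType:
    # Sketch of the policy: require local disk, at least 32 GB of memory and
    # at least 4 cores, then pick the smallest (fewest cores, least memory).
    candidates = [
        nt for nt in node_types
        if nt.local_disk and nt.memory_mb >= 32 * 1024 and nt.num_cores >= 4
    ]
    if not candidates:
        raise ValueError("no node type satisfies the UCX minimums")
    return min(candidates, key=lambda nt: (nt.num_cores, nt.memory_mb))


nodes = [
    NodeType("small-2c", 2, 32768, True),
    NodeType("mid-4c", 4, 32768, True),
    NodeType("big-8c", 8, 65536, True),
]
print(smallest_ucx_node(nodes).node_type_id)  # → mid-4c
```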
  • Skip test_feature_tables integration test (#3326). This release temporarily skips the test_feature_tables integration test, referencing issues #3304 and #3. The test is disabled until the underlying problem is resolved; no library functionality is changed by this commit.
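Skipping an integration test in this codebase is typically done with a pytest marker; the sketch below is illustrative, and the reason string is an assumption rather than the exact text used in #3326.

```python
# Illustration of disabling a test with pytest's skip marker.
import pytest


@pytest.mark.skip(reason="temporarily disabled, see issue #3304")
def test_feature_tables():
    # this body never runs while the skip marker is in place
    raise AssertionError("should not execute")
```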
  • Speed up update_migration_status jobs by eliminating lots of redundant SQL queries (#3200). In this relea...

v0.49.0

08 Nov 15:37
@nfx
f97883e
  • Added MigrationSequencer for jobs (#3008). In this commit, a MigrationSequencer class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of MigrationStep objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality is working correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, and new cluster definitions. This change is linked to issue #1415 and supersedes issue #2980. Additionally, the commit removes some unnecessary imports and fixtures from a test file.
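Dependency-ordered sequencing of this kind can be sketched with a topological sort over the object graph. The graph shape and step fields below are assumptions for illustration; the real MigrationSequencer builds its graph from jobs, tasks, and clusters and emits richer MigrationStep objects.

```python
# Sketch: emit migration steps so every object comes after what it requires.
from graphlib import TopologicalSorter


def migration_sequence(dependencies):
    """dependencies maps object_id -> set of required object_ids."""
    order = TopologicalSorter(dependencies).static_order()
    return [
        {"step": n, "object_id": obj, "requires": sorted(dependencies.get(obj, ()))}
        for n, obj in enumerate(order, start=1)
    ]
```

For example, a task that references a job which in turn references a cluster would be sequenced cluster first, then job, then task.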
  • Added phik to known list (#3198). In this release, we have added phik to the known list in the provided JSON file. This change addresses part of issue #1931, as outlined in the linked issues. The phik key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. It is important to note that no existing functionality has been altered and no new methods have been introduced in this commit. The scope of the change is confined to updating the known list in the JSON file by adding the phik key.
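The "known list" updates in this release all follow the same mechanical pattern the entry describes: add the package key with an empty list as its value. A hedged sketch of that update, with the file name and flat structure taken from the summary rather than verified against the repository:

```python
# Sketch: register a package in a known-list JSON file with an empty
# list of findings, leaving existing entries untouched.
import json
from pathlib import Path


def add_known_package(path: Path, package: str) -> None:
    known = json.loads(path.read_text())
    known.setdefault(package, [])  # empty list: no known findings yet
    path.write_text(json.dumps(known, indent=2, sort_keys=True) + "\n")
```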
  • Added pmdarima to known list (#3199). In this release, we have added pmdarima to the known list. pmdarima is an open-source Python library that brings the functionality of R's auto.arima to Python, providing methods for fitting ARIMA models, testing for seasonality and stationarity, and time-series preprocessing and forecasting. This change partly resolves issue #1931.
  • Added preshed to known list (#3220). A new library, preshed, has been added to the known list. preshed is a Cython package that provides hash tables optimized for keys that are pre-hashed machine integers, and is a low-level dependency of the spaCy ecosystem. With the inclusion of two modules, preshed and preshed.about, this addition partially resolves issue #1931.
  • Added py-cpuinfo to known list (#3221). In this release, we have added support for the py-cpuinfo library to our project, enabling the use of the cpuinfo functionality that it provides. With this addition, developers can now access detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue #1931 and does not affect any existing functionality or add new methods to the codebase. We believe that this improvement will enhance the capabilities of our project and enable more efficient use of CPU resources.
  • Cater for empty python cells (#3212). In this release, we have resolved an issue where certain notebook cells in the dependency builder were causing crashes. Specifically, empty or comment-only cells were identified as the source of the problem. To address this, we have implemented a check to account for these cases, ensuring that an empty tree is stored in the _python_trees dictionary if the input cell does not produce a valid tree. This change helps prevent crashes in the dependency builder caused by empty or comment-only cells. Furthermore, we have added a test to verify the fix on a failed repository. If a cell does not produce a tree, the _load_children_from_tree method will not be executed for that cell, skipping the loading of any children trees. This enhancement improves the overall stability and reliability of the library by preventing crashes caused by invalid input.
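The guard described above can be sketched with the standard ast module: an empty or comment-only cell parses to a module with no body, and anything that fails to parse falls back to an empty tree instead of crashing. The function name is illustrative; the real code stores results in a _python_trees dictionary.

```python
# Sketch: always return a tree for a cell, even when the source is empty,
# comment-only, or unparseable.
import ast


def parse_cell(source: str) -> ast.Module:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # store an empty tree rather than propagating the failure
        return ast.Module(body=[], type_ignores=[])
    return tree  # empty/comment-only cells yield a module with an empty body
```

Downstream code can then skip child-loading for any tree whose body is empty, matching the behavior the entry describes.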
  • Create TODO issues every nightly run (#3196). A commit has been made to update the acceptance repository version in the acceptance.yml GitHub workflow from acceptance/v0.4.0 to acceptance/v0.4.2, which affects the integration tests. The Run nightly tests step in the GitHub repository's workflow has also been updated to use a newer version of the databrickslabs/sandbox/acceptance action, from v0.3.1 to v0.4.2. Software engineers should verify that the new version of the acceptance repository contains all necessary updates and fixes, and that the integration tests continue to function as expected. Additionally, testing the updated action is important to ensure that the nightly tests run successfully with up-to-date code and can catch potential issues.
  • Fixed Integration test failure of migration_tables (#3108). This release includes a fix for two integration tests (test_migrate_managed_table_to_external_table_without_conversion and test_migrate_managed_table_to_external_table_with_clone) related to Hive Metastore table migration, addressing issues #3054 and #3055. Previously skipped due to underlying problems, these tests have now been unskipped, enhancing the migration feature's test coverage. No changes have been made to the existing functionality, as the focus is solely on including the previously skipped tests in the testing suite. The changes involve removing @pytest.mark.skip markers from the test functions, ensuring they run and provide a more comprehensive test coverage for the Hive Metastore migration feature. In addition, this release includes an update to DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types, with no modifications made to other parts of the codebase.
  • Replace MockInstallation with MockPathLookup for testing fixtures (#3215). In this release, we have updated the testing fixtures in our unit tests by replacing the MockInstallation class with MockPathLookup. Specifically, the _load_sources function now uses MockPathLookup to load sources, and a module-level logger has been introduced for more precise logging within the module. Additionally, the _load_sources calls in the test_notebook.py file now pass the file path directly instead of a SourceContainer object, allowing more flexible and straightforward testing of file-related functionality and fixing issue #3115.
  • Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 (#3224). The open-source library sqlglot has been updated to version 25.29.0 with this release, incorporating several breaking changes, new features, and bug fixes. The breaking changes include transpiling ANY to EXISTS, supporting the MEDIAN() function, wrapping values in NOT value IS ..., and parsing information schema views into a single identifier. New features include support for the JSONB_EXISTS function in PostgreSQL, transpiling ANY to EXISTS in Spark, transpiling Snowflake's TIMESTAMP() function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a Move edge case in the semantic differ, adding a NULL filter on ARRAY_AGG only for columns, improving parsing of WITH FILL ... INTERPOLATE in Clickhouse, generating LOG(...) for exp.Ln in TSQL, and optionally parsing a Stream expression. The full changelog can be found in the pull request, which also includes a list of the commits included in this release.
  • Use acceptance/v0.4.0 (#3192). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the databrickslabs/sandbox/acceptance runner to acceptance/v0.4.0 and granting write permissions for the issues field in the permissions section. These updates will allow for the use of the latest version of the acceptance tests and provide the necessary permissions to interact with issues. A TODO comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly.
  • Warn about errors instead to a...